research scientist, generative video at sync. (W24)
₹3M - ₹10M INR  •  0.15% - 0.75%
AI lipsync tool for video content creators
Hyderabad, TS, IN / Bengaluru, KA, IN / Remote (Hyderabad, TS, IN; Bengaluru, KA, IN; Mumbai, MH, IN)
Full-time
3+ years
About sync.

at sync. we're making video as fluid and editable as a word document.

how much time would you save if you could record every video in a single take?

no more re-recording yourself because you didn't like what you said, or how you said it.

just shoot once, revise yourself to do exactly what you want, and post. that's all.

this is the future of video: AI modified >> AI generated

we're playing at the edge of science + fiction.

our team is young, hungry, uniquely experienced, and advised by some of the greatest research minds + startup operators in the world. we're driven to solve impossible problems, impossibly fast.

our founders are the original team behind the open-source wav2lip — the most prolific lip-sync model to date, w/ 9k+ GitHub stars.

we're at the same stage in computer vision today that NLP was at two years ago — a bunch of disparate, specialized models (e.g. sentiment classification, translation, summarization, etc.), all of which were displaced by a single generalized large language model: the LLM.

we're taking the same approach – curating high-quality datasets + training a series of specialized models to accomplish specific tasks, while building up towards a more generalized model: one model to rule them all.

post-batch our growth is e^x – we need help asap to scale up our infra, training, and product velocity.

we look for the following: [1] raw intelligence [2] boundless curiosity [3] exceptional resolve [4] high agency [5] outlier hustle

About the role
Skills: Torch/PyTorch, Python, GPU Programming, Computer Vision

About sync.

We're a team of artists, engineers, and researchers building tools to understand and modify people in video — in the last year we graduated at the top of our YC batch (W24), raised a $5.5M seed backed by GV, won the AI Grant from Nat Friedman and Daniel Gross, and scaled to $2M+ ARR from $0.

We’re building a zero-shot generalized video model to understand, generate, and gain fine-grained control over any human in any video. We’ve already released a state-of-the-art generalized lip-syncing model for content translation and word-level video editing — you can access our models through our developer playground and API.

We've had a breakthrough: we learned a highly accurate generalized representation of a human face — this unlocks many editing tasks that were never possible before.

We’re expanding our research team – we’re bringing on an outstanding Research Scientist to design novel experiments, unlock new model capabilities, and push the boundaries of science to achieve extraordinary results.

What are we working on?

We live in an extraordinary time. 

Video generation is world modeling. Deep learning unlocked the ability to decipher the world around us, to understand it, to compress its data and information into the weights of a 70b-parameter network.

By simply changing these underlying numbers – these latent representations — we can reimagine and reconstruct reality in any way we see fit.

This is profound. A high schooler can craft a masterpiece with an iPhone. A studio can produce a movie at a tenth of the cost 10x faster. Every video can be distributed worldwide in any language with perfect preservation of meaning, instantly. Video becomes as malleable as text.
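
To make the latent-editing idea above concrete, here is a minimal, hypothetical PyTorch sketch (a toy frame autoencoder, not our actual model): a video frame is compressed into a latent vector, the latent is nudged, and the decoder reconstructs a modified frame.

```python
# Toy illustration of "edit the underlying numbers": encode a frame into a
# latent vector, perturb the latent, and decode it back into pixels.
# Hypothetical example for intuition only, not our production model.
import torch
import torch.nn as nn

class ToyFrameAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        # Encoder: 3x64x64 frame -> latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 64x16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Decoder: latent vector -> reconstructed 3x64x64 frame
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = ToyFrameAutoencoder()
frame = torch.rand(1, 3, 64, 64)           # stand-in for one video frame

z = model.encoder(frame)                   # compress the frame into "underlying numbers"
z_edited = z + 0.1 * torch.randn_like(z)   # nudge the latent representation
edited_frame = model.decoder(z_edited)     # reconstruct a (slightly) different frame
```

The recipe is the point: learn a representation, then edit in representation space rather than pixel space. That is what makes precise, intentional video edits tractable.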

But we have two fundamental problems to tackle before this is a reality:
[1] Large models are great at generating entirely new scenes and worlds, but struggle with precise control and fine-grained edits. The ability to make subtle, intentional adjustments – the kind that separates good content from great content – doesn't exist.

[2] If video generation is world modeling, each human is a world unto themselves. We each have our idiosyncrasies that make us unique — creating primitives to capture, express, and modify them is the key to breaking through the uncanny valley.

Our focus in research is to push the boundary on what’s possible by unlocking new capabilities, scaling up models, and optimizing pipelines to serve users faster, more reliably, and with more consistency across generations.

Key Responsibilities:

  • Devour cutting-edge research papers on audio/visual generative models, distilling insights and techniques to fuel our own research
  • Implement and experiment with state-of-the-art architectures like GANs, VAEs, and Transformers, adapting them to push the boundaries of video generation
  • Architect and optimize training pipelines for massive-scale generative models, leveraging your expertise in distributed training and GPU optimization
  • Conduct rigorous experiments to evaluate and improve model performance, digging deep into metrics like FID, IS, and LPIPS (see the sketch after this list)
  • Collaborate with research and engineering to dream up novel approaches and architectures that break free from the limitations of current techniques
  • Prototype and scale promising models with breakneck speed, iterating rapidly to achieve record-breaking results
  • Dive deep into the latest techniques for audio/visual synchronization, temporal consistency, and controllable generation, pioneering new approaches that set the standard for the industry
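
For a sense of what "digging deep into metrics" looks like in code, here is a minimal sketch of the Fréchet Inception Distance (FID) computed from pre-extracted feature vectors. In practice the features come from an Inception (or video) backbone and a library implementation such as torchmetrics would usually be used; this only shows the underlying arithmetic, and the feature dimension and data below are made-up stand-ins.

```python
# Illustrative FID from pre-extracted features (hypothetical stand-in data).
#   FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: arrays of shape (num_samples, feature_dim)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; tiny imaginary parts
    # caused by numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random 64-dim features (real Inception pool features are 2048-dim).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))
fake = rng.normal(loc=0.1, size=(500, 64))
print(frechet_distance(real, fake))
```

Lower FID means the generated distribution sits closer to the real one; IS and LPIPS similarly reduce perceptual quality and similarity to numbers we can track across experiments.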

Required Skills and Experience:

  • 4+ years of experience building and training deep learning models, with a focus on audio/visual domains
  • Extensive knowledge of state-of-the-art generative architectures and their applications in video synthesis, style transfer, and controllable generation
  • Mastery of deep learning frameworks like TensorFlow and PyTorch, with the ability to optimize performance and scalability at every level of the stack
  • Proven ability to implement and experiment with cutting-edge research papers, adapting techniques to novel problem domains
  • Exceptional coding skills in Python and/or C++, with a deep understanding of software engineering best practices for research code
  • Relentless curiosity and drive to push the boundaries of what's possible, never settling for "good enough"

Preferred Skills and Experience:

  • Track record of publications that showcase new ideas and applications. We want to read your work if you think it is interesting. 
  • Research history in at least one of these: (i) generative AI, (ii) self-supervised representation learning, (iii) video understanding, (iv) multimodal generation/understanding.
  • Open-source projects that demonstrate your ability to implement and train generative models.
  • Experience in training deep learning models with several thousands of hours of video data.
  • Previous experience in understanding or editing humans/faces in video.

Outcomes:

  • Develop and scale generative models that achieve unprecedented levels of realism, diversity, and control in video generation
  • Foster a culture of relentless experimentation, rapid iteration, and knowledge sharing within the research team
  • Collaborate with the engineering team to bring research breakthroughs to production at record speed, unlocking new possibilities for our users
  • Establish sync. as the undisputed leader in AI-powered video generation and editing, attracting top talent and partnerships from across the industry

Our goal is to keep the team lean, hungry, and shipping fast.

These are the qualities we embody and look for:

[1] Raw intelligence: We tackle complex problems and push the boundaries of what's possible.

[2] Boundless curiosity: We're always learning, exploring new technologies, and questioning assumptions.

[3] Exceptional resolve: We persevere through challenges and never lose sight of our goals.

[4] High agency: We take ownership of our work and drive initiatives forward autonomously.

[5] Outlier hustle: We work smart and hard, going above and beyond to achieve extraordinary results.

[6] Obsessively data-driven: We base our decisions on solid data and measurable outcomes.

[7] Radical candor: We communicate openly and honestly, providing direct feedback to help each other grow.

Technology

next.js, nest.js, python, pytorch, aws/gcp/azure, kubernetes

Interview Process

We're a small team who works hard to create outsized impact.

We expect whoever we hire into this role to have a high degree of agency and maniacal urgency.

Our interview process is grounded in reality: it's hard to get a sense of how well we'd work together from a traditional interview question or a take-home test.

Here is our process:

[1] 30 mins, intro call to understand goals, evaluate mutual fit, and set up next steps.

[2] 3 hrs, technical assessment and interview loop. You can solve it live with us (pair-programming) or offline, whichever you prefer. We understand everyone is different, and solving a real-world problem in an environment and at a pace similar to the role itself gives you the best chance to succeed.

[3] 4 hrs, in-person on-site interview. We'll fly you to Bangalore, where we'll work on a problem together.

From there, you’ll get an offer or decision in <24 hrs.

We want to set expectations – we work hard, work fast, and do a lot with very little. You'd be joining an outlier team at the ground floor, and a culture of obsession is what we care about most.

Other jobs at sync.

Full-time • San Francisco, CA, US • Product design • $120K - $180K • 0.30% - 1.00% • 6+ years

Full-time • San Francisco, CA, US / Remote • Full stack • $130K - $200K • 0.30% - 1.00% • 6+ years

Full-time • Hyderabad, TS, IN / Bengaluru, KA, IN / Remote (Hyderabad, TS, IN; Bengaluru, KA, IN; Mumbai, MH, IN) • Machine learning • ₹3M - ₹10M INR • 0.15% - 0.75% • 3+ years

Full-time • San Francisco, CA, US • Machine learning • $140K - $215K • 0.30% - 1.50% • 6+ years
