Machine Learning Performance Engineer at Replicate (W20)

$150K - $250K

Run machine learning models in the cloud

Remote (San Francisco, CA, US)

Full-time

3+ years

About Replicate

What we're doing

Machine learning can now do some extraordinary things: it can understand the world, drive cars, write code, make art.

But it is still extremely hard to use. Research is typically published as a PDF, with scraps of code on GitHub and weights on Google Drive (if you’re lucky!). It is near-impossible to take that work and apply it to a real-world problem unless you are an expert.

We’re making machine learning accessible to everyone. People creating machine learning models should be able to share them in a way that other people can use, and people who want to use machine learning should be able to do it without getting a PhD.

With great power also comes great responsibility. We believe that with better tools and safeguards, we will make this powerful technology safer and easier to understand.

How we work

We're a kind, creative, hard-working bunch. We care about our work and our users. We're humble. We're looking for the same in the people we work with.

When starting this company, we thought: instead of getting a job at the best place to work, let's make that best place to work. We want to work with the best people in an inclusive, supportive environment, and have fun while we're at it. You will help us make that place.

You can be located anywhere. We have a beautiful office in San Francisco, CA (specifically The Mission) where some of us work, but we operate as a remote-first company across American and European timezones.

We want our team to feel invested in what we're building. We pay market salary but well-above-market equity, plus all the usual things. (We're European so you'll get really good healthcare.)

About the role

Skills: C++, Python, Machine Learning, CUDA

You're an engineer who lives and breathes high-performance machine learning. You have a deep understanding of how to make AI models run faster and more efficiently, and you're excited about pushing the boundaries of what's possible with current hardware.

At Replicate, we're building the fastest way to deploy machine learning models. Your role will be crucial in optimizing the performance of the diverse range of models we host, ensuring they run as efficiently as possible on our infrastructure.

We're looking for the right person, not just someone who checks boxes, so you don't need to satisfy all of these things. But, you might have some of these qualities:

Strong applied engineering skills. You've deployed machine learning models in scaled-up production environments and know the challenges that come with it.
Deep expertise in CUDA programming and GPU acceleration techniques. You can write custom kernels in your sleep.
Proficiency in C++ and Python. You're comfortable diving deep into low-level optimizations and high-level model architectures alike.
Extensive experience with deep learning frameworks like Torch or JAX. You know their strengths, weaknesses, and how to squeeze every ounce of performance out of them.
A solid grasp of machine learning algorithms, especially diffusion models, large language models, or other generative AI techniques.
Familiarity with model quantization techniques, distillation, model pruning, etc. You understand the tradeoffs and know when to apply which technique.
You stay up-to-date with the latest developments in ML performance optimization. When a new technique drops, you're already thinking about how to implement it.

You might be particularly good for this job if:

You've written custom CUDA kernels to significantly improve model latency and can share war stories about the process.
You can discuss the tradeoffs between fp8 and int8 quantization in depth, and have applied either (or both) to whatever hot new model dropped last week.
You get excited about diving into academic papers on ML optimization techniques and turning them into practical, production-ready code.

Technology

We have a web product (currently React + Django), an open source CLI (Go + Python), and Kubernetes ML serving infrastructure.
