Replicate Version control for machine learning

Replicate versions all of the models you train and stores them on Amazon S3 or Google Cloud Storage, so you can pull down those models into production inference systems.

Load models within Python

Using the Replicate Python API, you can load a model directly from within your inference script. For example, if you did this in your training script:

    import torch
    import replicate

    def train():
        experiment = replicate.init(path=".", params={...})
        for epoch in range(num_epochs):
            # ...
            # Save the weights, then record them as a checkpoint with metrics
            torch.save(model, "model.pth")
            experiment.checkpoint(
                path="model.pth",
                metrics={"loss": loss},
                primary_metric=("loss", "minimize")
            )

Then you can use this in your inference script to get the model back:

    import torch
    import replicate

    experiment = replicate.experiments.get("e510303")
    # best() picks the checkpoint with the best primary metric
    checkpoint = experiment.best()
    model = torch.load(checkpoint.open("model.pth"))
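
To make the role of primary_metric concrete, the sketch below shows the kind of selection rule that experiment.best() implies: among the recorded checkpoints, pick the one whose primary metric is best for the declared goal. This is an illustration in plain Python, not Replicate's implementation. The Checkpoint dataclass and the best_checkpoint helper are hypothetical names, and "maximize" is assumed here as the counterpart of "minimize".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a saved checkpoint (not Replicate's class)."""
    id: str
    metrics: Dict[str, float] = field(default_factory=dict)


def best_checkpoint(checkpoints: List[Checkpoint],
                    primary_metric: Tuple[str, str]) -> Checkpoint:
    """Pick the checkpoint with the best value of the primary metric."""
    name, goal = primary_metric
    if goal not in ("minimize", "maximize"):
        raise ValueError(f"unknown goal: {goal!r}")
    # Ignore checkpoints that did not record the metric at all.
    candidates = [c for c in checkpoints if name in c.metrics]
    if not candidates:
        raise ValueError(f"no checkpoint recorded metric {name!r}")
    pick = min if goal == "minimize" else max
    return pick(candidates, key=lambda c: c.metrics[name])


# Made-up checkpoint IDs and loss values, for illustration only.
checkpoints = [
    Checkpoint("a1", {"loss": 0.9}),
    Checkpoint("b2", {"loss": 0.4}),
    Checkpoint("c3", {"loss": 0.6}),
]
best = best_checkpoint(checkpoints, ("loss", "minimize"))
print(best.id)
```

Declaring the goal alongside the metric name matters because the same comparison would pick the worst checkpoint for a metric such as accuracy, where higher is better.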

Load models from the CLI

You can also get files using the command-line interface. This might be useful if you want the model weights on disk, or if you're building a Docker image with the weights inside.

For example, if you run this for the example training script above:

replicate checkout e510303 -o weights/

Then the model weights will be written to weights/model.pth.
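
If the checkout runs as part of a deployment or image-build script, it can help to wrap the command in a small function so that a failed download stops the build. The sketch below assumes the replicate CLI is installed and on the PATH. Only the command itself comes from the example above; the checkout_command and checkout helpers are hypothetical names.

```python
import shlex
import subprocess
from typing import List


def checkout_command(identifier: str, output_dir: str) -> List[str]:
    """Build the argv for `replicate checkout <id> -o <dir>`."""
    if not identifier:
        raise ValueError("identifier must be a non-empty experiment or checkpoint ID")
    return ["replicate", "checkout", identifier, "-o", output_dir]


def checkout(identifier: str, output_dir: str) -> None:
    """Run the checkout.

    Raises subprocess.CalledProcessError if the CLI exits non-zero, and
    FileNotFoundError if the replicate executable is not installed.
    """
    subprocess.run(checkout_command(identifier, output_dir), check=True)


# Build (but do not run) the command from the example above.
cmd = checkout_command("e510303", "weights/")
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the output directory contains spaces.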

Note: You can pass either an experiment ID or a checkpoint ID to replicate checkout. The checkpoint ID makes a better versioning identifier because it pins one exact version of your model weights, whereas an experiment can contain many checkpoints.

Currently, the Python API only accepts an experiment ID. Support for checkpoint IDs is in progress. See this GitHub issue for more details.

Let’s build this together

Everyone uses version control for software, but it’s much less common in machine learning.

This causes all sorts of problems: people are manually keeping track of things in spreadsheets, model weights are scattered on S3, and results can’t be reproduced. Somebody who wrote a model has left the team? Bad luck – nothing’s written down and you’ve probably got to start from scratch.

So why isn’t everyone using Git? Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script. There are some solutions for these things, but they feel like band-aids.

We spent a year talking to people in the ML community about this, and this is what we found out:

  • We need a native version control system for ML. ML is sufficiently different from normal software that we can’t just put band-aids on existing systems.
  • It needs to be small, easy to use, and extensible. We found people struggling to migrate to “AI Platforms”. We believe tools should do one thing well and combine with other tools to produce the system you need.
  • It needs to be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.

We need your help to make this a reality. If you’ve built this for yourself, or are just interested in this problem, join us to help build a better system for everyone.

Join our Discord chat or get involved on GitHub.



Core team

Ben Firshman

Product at Docker, creator of Docker Compose.

Andreas Jansson

ML infrastructure and research at Spotify. PhD in ML for music.

We also built arXiv Vanity, which lets you read arXiv papers as responsive web pages.
