Reproducibility in ML: why your training runs don't match and how to fix it
2026-02-08 · Sofia Lindqvist
"Can you rerun this experiment from last quarter?" is the question that separates mature ML platforms from piles of notebooks. Reproducibility is almost always broken, and almost always for the same five reasons.
Source 1: data
The training CSV from three months ago is not the same one you have today. Maybe the upstream table has a new row. Maybe a backfill updated old values. Maybe someone "cleaned" a column.
Fix: immutable artifacts. Every training input is a pinned artifact with a content hash. Never s3://bucket/latest/.
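Content pinning is cheap to implement. A minimal sketch (the helper names and the `name@content=<hash>` convention are illustrative, not a specific tool's API):

```python
import hashlib
from pathlib import Path

def content_hash(path: str, algo: str = "sha256") -> str:
    """Stream the file through the hash so large artifacts never load into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def pinned_name(name: str, path: str) -> str:
    """Embed a hash prefix in the artifact name, e.g. features-v7@content=7f3a..."""
    return f"{name}@content={content_hash(path)[:12]}"
```

Store the artifact under the pinned name and record that name in the run metadata; any later change to the bytes produces a different name instead of silently replacing the old one.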
Source 2: code version
"Trained with main branch" is meaningless a month later. You need the exact commit, plus the state of any generated files.
Fix: record git SHA of the training code, plus the SHA of any generated artifacts (e.g. compiled feature definitions).
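Capturing the SHA at training time is one subprocess call; refusing to record a SHA from a dirty working tree is the part people skip. A sketch (helper name is hypothetical):

```python
import subprocess

def training_code_sha(repo_dir: str = ".") -> str:
    """Return the exact commit of the training code; fail loudly if the tree is dirty."""
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()
    dirty = subprocess.check_output(
        ["git", "status", "--porcelain"], cwd=repo_dir, text=True
    ).strip()
    if dirty:
        raise RuntimeError("uncommitted changes: the recorded SHA would be a lie")
    return sha
```

Without the dirty-tree check, "trained at commit a1b2c3d" can mean "a1b2c3d plus whatever edits were sitting in the working copy".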
Source 3: environment
Python 3.11.4 + torch 2.2.1 + CUDA 12.1 is different from Python 3.11.5 + torch 2.2.2 + CUDA 12.2. Subtle numerical differences, sometimes large ones.
Fix: container digest (not tag). mypipeline:latest changes under you. mypipeline@sha256:abc... doesn't.
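This is easy to enforce at pipeline-submission time with a validation check. A sketch, assuming image references follow the standard `name@sha256:<64 hex chars>` digest form:

```python
import re

# name@sha256:<64 lowercase hex chars> — a digest-pinned image reference
DIGEST_REF = re.compile(r"^[\w./:-]*[\w.-]@sha256:[0-9a-f]{64}$")

def assert_pinned(image_ref: str) -> None:
    """Reject tag-based references like mypipeline:latest; require a digest."""
    if not DIGEST_REF.match(image_ref):
        raise ValueError(f"image not pinned by digest: {image_ref!r}")
```

Rejecting tags at submission means nobody has to remember the rule; the pipeline simply won't start from a mutable reference.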
Source 4: randomness
```python
import random

import numpy as np
import torch

# Seed every RNG the training loop touches
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# CUDA non-determinism is still possible; pin cuDNN behavior too:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
Even with all four seeds set, some GPU operations are non-deterministic (e.g. atomicAdd in scatter ops). For strict reproducibility, use torch.use_deterministic_algorithms(True) — it raises an error if you hit an op with no deterministic implementation (on CUDA you will also need to set the CUBLAS_WORKSPACE_CONFIG environment variable).
Source 5: hardware
Same code, same data, different GPU architecture — different results. Rare, but real. For regulated environments (healthcare, finance), pin the hardware type.
A reproducibility record
What we persist for every production training run:
```json
{
  "run_id": "train-20260208-0341",
  "code": {"git_sha": "a1b2c3d", "repo": "models/"},
  "container": "mlpipeline-train@sha256:...",
  "data": {
    "features": "features-v7@content=7f3a...",
    "labels": "labels-v3@content=9b21..."
  },
  "hardware": {"gpu": "A100-80GB", "count": 4},
  "seeds": {"python": 42, "numpy": 42, "torch": 42},
  "hyperparams": {...},
  "output_model": "model-v7@content=..."
}
```
Given this record, a year later, we can rerun and get the same (or demonstrably similar) model.
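Writing the record is trivial; the useful part is refusing to finish a run without a complete one. A sketch using the field names from the example above (the function name is hypothetical):

```python
import json

REQUIRED_FIELDS = {
    "run_id", "code", "container", "data",
    "hardware", "seeds", "hyperparams", "output_model",
}

def write_run_record(record: dict, path: str) -> None:
    """Persist the reproducibility record; fail if any field is missing."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"incomplete reproducibility record, missing: {sorted(missing)}")
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
```

Making the record mandatory at write time is what turns "we usually log this" into "every run has one".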
How to test reproducibility
Don't just assume it works. Periodically rerun a known experiment and diff:
- Model weights hash
- Metrics on a fixed eval set
If they don't match — something escaped your pinning. Investigate before shipping more models.
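The check itself is a few lines. A sketch, assuming weights are saved to files and eval metrics are plain dicts of floats (function names hypothetical; the metric tolerance exists because "demonstrably similar" is sometimes the realistic bar on GPUs):

```python
import hashlib
import math

def weights_hash(path: str) -> str:
    """Bitwise hash of a saved weights file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def runs_match(weights_a: str, weights_b: str,
               metrics_a: dict, metrics_b: dict,
               tol: float = 1e-6) -> bool:
    """True iff weights are bitwise identical and eval metrics agree within tol."""
    if weights_hash(weights_a) != weights_hash(weights_b):
        return False
    if metrics_a.keys() != metrics_b.keys():
        return False
    return all(
        math.isclose(metrics_a[k], metrics_b[k], rel_tol=tol, abs_tol=tol)
        for k in metrics_a
    )
```

Run it on a schedule against one canonical experiment; a failure here is a cheap early warning that some input slipped out from under its pin.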
Conclusion
Reproducibility is a discipline, not a feature. Pin everything, record everything, and rerun periodically to verify. The day someone asks "why did we make this decision?" — you'll be glad you did.