Reproducibility in ML: why your training runs don't match and how to fix it
2026-02-08 · Sofia Lindqvist
"Can you rerun this experiment from last quarter?" is the question that separates mature ML platforms from piles of notebooks. Reproducibility is almost always broken, and almost always for the same five reasons.
Source 1: data
The training CSV from three months ago is not the same one you have today. Maybe the upstream table has a new row. Maybe a backfill updated old values. Maybe someone "cleaned" a column.
Fix: immutable artifacts. Every training input is a pinned artifact with a content hash. Never s3://bucket/latest/.
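Content pinning is cheap to implement. A minimal sketch (the helper names and the `name@content=<hash>` convention are illustrative, not a specific tool's API):

```python
import hashlib
from pathlib import Path

def content_hash(path: str, algo: str = "sha256") -> str:
    """Stream the file through the hash so large artifacts never load into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def pinned_name(name: str, path: str) -> str:
    """Embed a hash prefix in the artifact name, e.g. features-v7@content=7f3a..."""
    return f"{name}@content={content_hash(path)[:12]}"
```

Store the artifact under the pinned name and record that name in the run metadata; any later change to the bytes produces a different name instead of silently replacing the old one.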
Source 2: code version
"Trained with main branch" is meaningless a month later. You need the exact commit, plus the state of any generated files.
Fix: record git SHA of the training code, plus the SHA of any generated artifacts (e.g. compiled feature definitions).
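Capturing the SHA at training time is one subprocess call; refusing to record a SHA from a dirty working tree is the part people skip. A sketch (helper name is hypothetical):

```python
import subprocess

def training_code_sha(repo_dir: str = ".") -> str:
    """Return the exact commit of the training code; fail loudly if the tree is dirty."""
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()
    dirty = subprocess.check_output(
        ["git", "status", "--porcelain"], cwd=repo_dir, text=True
    ).strip()
    if dirty:
        raise RuntimeError("uncommitted changes: the recorded SHA would be a lie")
    return sha
```

Without the dirty-tree check, "trained at commit a1b2c3d" can mean "a1b2c3d plus whatever edits were sitting in the working copy".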
Source 3: environment
Python 3.11.4 + torch 2.2.1 + CUDA 12.1 is different from Python 3.11.5 + torch 2.2.2 + CUDA 12.2. Subtle numerical differences, sometimes large ones.
Fix: container digest (not tag). mypipeline:latest changes under you. mypipeline@sha256:abc... doesn't.
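This is easy to enforce at pipeline-submission time with a validation check. A sketch, assuming image references follow the standard `name@sha256:<64 hex chars>` digest form:

```python
import re

# name@sha256:<64 lowercase hex chars> — a digest-pinned image reference
DIGEST_REF = re.compile(r"^[\w./:-]*[\w.-]@sha256:[0-9a-f]{64}$")

def assert_pinned(image_ref: str) -> None:
    """Reject tag-based references like mypipeline:latest; require a digest."""
    if not DIGEST_REF.match(image_ref):
        raise ValueError(f"image not pinned by digest: {image_ref!r}")
```

Rejecting tags at submission means nobody has to remember the rule; the pipeline simply won't start from a mutable reference.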
Source 4: randomness
```python
import random

import numpy as np
import torch

# Seed every RNG the training loop touches
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# CUDA non-determinism is still possible; pin cuDNN behavior too:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
Even with all four seeds set, some GPU operations are non-deterministic (e.g. atomicAdd in scatter ops). For strict reproducibility, use torch.use_deterministic_algorithms(True) — it raises an error if you hit an op with no deterministic implementation (on CUDA you will also need to set the CUBLAS_WORKSPACE_CONFIG environment variable).
Source 5: hardware
Same code, same data, different GPU architecture — different results. Rare, but real. For regulated environments (healthcare, finance), pin the hardware type.
A reproducibility record
What we persist for every production training run:
```json
{
  "run_id": "train-20260208-0341",
  "code": {"git_sha": "a1b2c3d", "repo": "models/"},
  "container": "mlpipeline-train@sha256:...",
  "data": {
    "features": "features-v7@content=7f3a...",
    "labels": "labels-v3@content=9b21..."
  },
  "hardware": {"gpu": "A100-80GB", "count": 4},
  "seeds": {"python": 42, "numpy": 42, "torch": 42},
  "hyperparams": {...},
  "output_model": "model-v7@content=..."
}
```
Given this record, a year later, we can rerun and get the same (or demonstrably similar) model.
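Writing the record is trivial; the useful part is refusing to finish a run without a complete one. A sketch using the field names from the example above (the function name is hypothetical):

```python
import json

REQUIRED_FIELDS = {
    "run_id", "code", "container", "data",
    "hardware", "seeds", "hyperparams", "output_model",
}

def write_run_record(record: dict, path: str) -> None:
    """Persist the reproducibility record; fail if any field is missing."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"incomplete reproducibility record, missing: {sorted(missing)}")
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
```

Making the record mandatory at write time is what turns "we usually log this" into "every run has one".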
How to test reproducibility
Don't just assume it works. Periodically rerun a known experiment and diff:
- Model weights hash
- Metrics on a fixed eval set
If they don't match — something escaped your pinning. Investigate before shipping more models.
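The check itself is a few lines. A sketch, assuming weights are saved to files and eval metrics are plain dicts of floats (function names hypothetical; the metric tolerance exists because "demonstrably similar" is sometimes the realistic bar on GPUs):

```python
import hashlib
import math

def weights_hash(path: str) -> str:
    """Bitwise hash of a saved weights file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def runs_match(weights_a: str, weights_b: str,
               metrics_a: dict, metrics_b: dict,
               tol: float = 1e-6) -> bool:
    """True iff weights are bitwise identical and eval metrics agree within tol."""
    if weights_hash(weights_a) != weights_hash(weights_b):
        return False
    if metrics_a.keys() != metrics_b.keys():
        return False
    return all(
        math.isclose(metrics_a[k], metrics_b[k], rel_tol=tol, abs_tol=tol)
        for k in metrics_a
    )
```

Run it on a schedule against one canonical experiment; a failure here is a cheap early warning that some input slipped out from under its pin.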
Conclusion
Reproducibility is a discipline, not a feature. Pin everything, record everything, and rerun periodically to verify. The day someone asks "why did we make this decision?" — you'll be glad you did.