ML pipeline best practices: reproducibility as a first-class citizen
2026-03-28 · Sofia Lindqvist
After building ML platforms at three different companies, you start to see the same mistakes repeat. Here are the seven principles that changed how my team designs pipelines.
1. Pipelines are code, data is a parameter
A pipeline that hard-codes s3://prod-data/latest/ is not reusable, not testable, and not portable. Every pipeline takes its input datasets as parameters; dev, staging, and prod are just different parameter bindings of the same graph.
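A minimal sketch of this principle, in Python. All names here (featurize, train, run_pipeline) are hypothetical, standing in for real pipeline steps:

```python
# The pipeline graph is code; dataset locations are parameters.

def featurize(raw_events_uri: str) -> str:
    # A real step would read raw_events_uri and write features;
    # here we just derive an output URI to show the wiring.
    return raw_events_uri + "/features"

def train(features_uri: str) -> str:
    return features_uri + "/model"

def run_pipeline(params: dict) -> str:
    # The same graph runs in every environment.
    return train(featurize(params["raw_events"]))

# Dev and prod are just different parameter bindings of the same graph:
dev_params = {"raw_events": "file:///tmp/events-sample"}
prod_params = {"raw_events": "s3://prod-data/events"}
```

Because nothing in the graph knows which environment it runs in, the same code path gets exercised by tests, local runs, and production.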
2. Every step is idempotent
Re-running a step with the same inputs must produce byte-identical outputs (or declare the non-determinism up front, e.g. with a seed). This is the foundation of both caching and reproducibility. If step 7 fails, re-running from step 5 must give the same state.
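A sketch of "declare the non-determinism up front": the only randomness in the step flows through an explicit seed, so identical inputs always give identical outputs. The step name is hypothetical.

```python
import random

def sample_rows(rows: list, k: int, seed: int = 42) -> list:
    rng = random.Random(seed)  # step-local RNG; never touch global state
    return rng.sample(rows, k)

# Re-running with identical inputs gives an identical result,
# which is what makes caching and partial re-runs safe.
```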
3. Inputs and outputs are typed and versioned
steps:
  - name: featurize
    inputs:
      raw_events: s3://.../events@v12
    outputs:
      features:
        schema: features-v3
        partition: date
No "whatever columns happen to be there". A schema registry plus contract tests catch 80% of breakage before it hits training.
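A minimal contract test in this spirit. The schema dict and column names below are hypothetical stand-ins for a real schema-registry entry such as features-v3:

```python
# Hypothetical registry entry: column name -> expected Python type.
FEATURES_V3 = {"user_id": int, "event_count": int, "date": str}

def check_contract(rows: list, schema: dict) -> None:
    for row in rows:
        missing = schema.keys() - row.keys()
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                raise TypeError(f"{col}: expected {typ.__name__}, "
                                f"got {type(row[col]).__name__}")

# Run in CI against a sample of the producer's output, before training.
check_contract([{"user_id": 1, "event_count": 3, "date": "2026-03-28"}],
               FEATURES_V3)
```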
4. Caching is cheap, recomputation is expensive
Hash (step_code_version, input_artifact_hashes, params_hash) and skip the step if output already exists. On our main training pipeline, cache hit rate averages 64%, saving ~9 hours per full rerun.
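The cache key described above can be sketched like this; the function name is illustrative. Sorting the input hashes and serializing with sort_keys makes the payload, and therefore the key, stable:

```python
import hashlib
import json

def cache_key(step_code_version: str,
              input_artifact_hashes: list,
              params: dict) -> str:
    payload = json.dumps(
        {"code": step_code_version,
         "inputs": sorted(input_artifact_hashes),  # order-independent
         "params": params},
        sort_keys=True,  # stable serialization -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Skip the step if an output already exists under this key; any change
# to code, inputs, or params produces a new key and busts the cache.
```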
5. Metadata is as important as data
For every run we persist: git SHA, container digest, input artifact IDs, output artifact IDs, metrics, runtime, user. Three years later, someone will ask "why did model v4.2 predict X?" — and you need the receipts.
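One way to persist this is an append-only JSON line per run. Every value below is a placeholder; the fields mirror the list above:

```python
import json

record = {
    "run_id": "r-20260328-1412",
    "git_sha": "9f2c1ab",
    "container_digest": "sha256:abc123",
    "input_artifact_ids": ["events@v12"],
    "output_artifact_ids": ["features@v13"],
    "metrics": {"auc": 0.91},
    "runtime_s": 5400,
    "user": "sofia",
}
line = json.dumps(record, sort_keys=True)
# Append `line` to the run log; three years later, this is the receipt.
```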
6. Failure is a first-class output
Don't raise and crash. Emit a structured failure record:
{
  "run_id": "r-20260328-1412",
  "step": "train_model",
  "status": "failed",
  "error_type": "OOM",
  "attempt": 2,
  "retryable": true,
  "diagnostics": {...}
}
Your alerting, retry, and triage logic all read this same record. No more regex-parsing stack traces.
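A sketch of retry logic driven by that structured record. The field names match the failure record shown above; MAX_ATTEMPTS and the policy itself are illustrative:

```python
MAX_ATTEMPTS = 3

def should_retry(failure: dict) -> bool:
    # Decide from structured fields, not from parsing a stack trace.
    return failure["retryable"] and failure["attempt"] < MAX_ATTEMPTS
```

Alerting and triage read the same fields: page on non-retryable failures, group dashboards by error_type.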
7. Local execution must work
A pipeline that only runs in Kubernetes is a pipeline where iteration is measured in hours. Every step must be runnable on a laptop with a subset of the data — ideally via mlpipeline run --step featurize --sample 1%. This is a 10x productivity multiplier for debugging.
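One way to make the sample itself reproducible is to hash a stable row key instead of drawing random numbers, so every laptop sees exactly the same ~1% subset. Function and key names here are hypothetical:

```python
import hashlib

def keep_row(row_key: str, sample_fraction: float) -> bool:
    # Hash the key into one of 10,000 buckets; keep the low buckets.
    bucket = int(hashlib.sha256(row_key.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_fraction * 10_000

rows = [f"user-{i}" for i in range(10_000)]
subset = [r for r in rows if keep_row(r, 0.01)]
# Roughly 1% of the rows, and the identical subset on every machine.
```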
The meta-principle
When in doubt, make it boring. Fancy DAG abstractions and clever magic decorators age badly. A pipeline that's easy to read at 3am when it's broken is worth more than one that's elegant on slides.