Shadow deployments for ML: catching regressions before they hit users

2026-02-22 · Mikael Laakso

Canary deployments work well for services. For ML models, they have one awkward property: the canary's predictions go to real users, so a bad canary affects someone. Shadow deployment sidesteps this by routing a copy of production traffic to the new model without using its output.

The pattern

              ┌─── [Model v7] ─── served to user
request ──────┤
              └─── [Model v8] ─── logged, compared, discarded

Both models see the same input. Only v7's prediction is used. v8's prediction is logged alongside v7's. You now have a real-traffic A/B comparison without any user-visible risk.

What to compare

Metric                     | Reveals
---------------------------+---------------------------------
Prediction agreement rate  | How often models disagree
Confidence distribution    | Is new model more/less certain?
Latency distribution       | Production-load performance
Error rate                 | Hidden bugs under real inputs
Resource use               | CPU/GPU/memory under load
Segment-level metrics      | Differences on specific cohorts

The last one is what usually saves us. An aggregate 2% improvement can hide a 15% regression on one user segment.
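Computing agreement per segment instead of in aggregate is a few lines. A minimal sketch, assuming comparison logs have been joined into (segment, primary, shadow) tuples — the field names here are illustrative, not from our logging schema:

```python
from collections import defaultdict

def agreement_by_segment(comparisons):
    """comparisons: iterable of (segment, primary_pred, shadow_pred) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, primary, shadow in comparisons:
        totals[segment] += 1
        hits[segment] += int(primary == shadow)
    return {seg: hits[seg] / totals[seg] for seg in totals}

rows = [
    ("mobile", "approve", "approve"),
    ("mobile", "deny", "approve"),    # disagreement invisible in the aggregate
    ("desktop", "approve", "approve"),
    ("desktop", "deny", "deny"),
]
print(agreement_by_segment(rows))  # {'mobile': 0.5, 'desktop': 1.0}
```

The aggregate rate here is 75%, which looks fine until you see that all the disagreement is concentrated in one segment.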

Implementation

Simplest version: a teeing serving layer.

import asyncio

from fastapi import FastAPI, Request

app = FastAPI()

# Keep strong references to in-flight shadow tasks so the event loop
# doesn't garbage-collect them mid-flight.
_shadow_tasks: set[asyncio.Task] = set()


@app.post("/predict")
async def predict(req: Request):
    primary = await model_v7.predict(req.features)
    # Fire-and-forget: the shadow call never blocks the response
    task = asyncio.create_task(shadow_and_log(req, primary))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)
    return primary


async def shadow_and_log(req, primary_pred):
    try:
        shadow_pred = await model_v8.predict(req.features)
        await log_comparison(req.id, primary_pred, shadow_pred)
    except Exception:
        # Swallow shadow failures; they must never reach the primary path
        metrics.increment("shadow.error", tags={"model": "v8"})
Two things matter here:

  1. Shadow must not affect primary latency. If v8 blocks on I/O, it should not slow v7. Fire-and-forget or a separate worker pool.
  2. Shadow errors must not affect primary. Caught, logged, never raised up the primary path.
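The separate-worker-pool option can be sketched as a bounded queue that drops work under pressure — a hypothetical design, not our production code; `handler` stands in for the `shadow_and_log` coroutine above:

```python
import asyncio

def enqueue_shadow(queue: asyncio.Queue, req, primary_pred) -> bool:
    """Drop rather than block: backpressure must never reach the primary path."""
    try:
        queue.put_nowait((req, primary_pred))
        return True
    except asyncio.QueueFull:
        return False  # count this as a shadow.dropped metric in a real system

async def shadow_worker(queue: asyncio.Queue, handler):
    """Drain queued (request, primary_pred) pairs through `handler`
    off the request path."""
    while True:
        item = await queue.get()
        try:
            await handler(*item)
        except Exception:
            pass  # shadow failures never propagate
        finally:
            queue.task_done()
```

The bounded `maxsize` on the queue is the point: if the shadow model falls behind, you lose some comparisons rather than response time.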

How long to shadow

Depends on traffic volume and seasonality. Rules of thumb:

  • Statistical significance: until you have at least 10k comparison samples per relevant segment.
  • Temporal coverage: at least one full business cycle (for us, usually a week — Monday looks different from Saturday).
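Both rules can be combined into one back-of-the-envelope calculation. A sketch with made-up numbers (the 10k and one-week floors are from the rules above; the traffic figures are illustrative):

```python
import math

def shadow_days_needed(daily_requests, segment_share, sample_rate,
                       min_samples=10_000, min_days=7):
    """Days to shadow before the smallest relevant segment has enough comparisons."""
    per_day = daily_requests * segment_share * sample_rate
    return max(min_days, math.ceil(min_samples / per_day))

# 200k req/day, smallest segment is 5% of traffic, shadowing 25% of it:
print(shadow_days_needed(200_000, 0.05, 0.25))  # 7 (the business-cycle floor wins)
```

Note that the smallest segment, not total traffic, sets the duration: with a 1% segment and 10% sampling, the same service needs 50 days.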

Pitfalls

  • Non-idempotent features. If feature generation has side effects (logging, calls to external APIs), running it twice can double-count. Use read-only feature fetch for shadow.
  • Cost. You're doubling inference cost. On GPU-heavy models, shadowing at 100% traffic is expensive. Sample 10–25% instead.
  • Delayed labels. Prediction agreement ≠ quality. You still need delayed ground truth to evaluate which model is actually better.
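For the sampling point, hashing the request ID beats random sampling: the decision is deterministic per request, so retries don't double-shadow and the sample stays reproducible. A minimal sketch (the `request_id` format is hypothetical):

```python
import hashlib

def in_shadow_sample(request_id: str, rate: float = 0.25) -> bool:
    """Deterministic per-request sampling: the same ID always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes to a uniform float in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

sampled = sum(in_shadow_sample(f"req-{i}", 0.25) for i in range(100_000))
print(sampled)  # roughly 25,000
```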

Shadow vs canary vs A/B

Pattern | User impact      | Label needed for decision | Use when
--------+------------------+---------------------------+------------------------------
Shadow  | none             | for ground truth only     | pre-promotion validation
Canary  | small % affected | yes, from canary segment  | catch live regressions early
A/B     | measured cohort  | yes                       | compare two live models

Shadow is pre-promotion. Canary is post-promotion. Don't skip either.

Conclusion

Shadow deployment is the cheapest bug-catcher in an ML team's toolkit. It doesn't replace offline eval or A/B tests — it's the layer between them, and it catches the bugs that only show up in production traffic patterns.

