GPU scheduling on Kubernetes: bin-packing for ML workloads
2025-12-12 · Jan van der Berg
The default Kubernetes scheduler treats GPUs as indivisible integers. That works fine when every pod wants a full A100. It works terribly when your pods want 0.3 of one, and you have a fleet of them. Here's what we did.
The problem
ML training jobs have wildly different GPU profiles:

- Fine-tuning a small LLM: 1 GPU, full utilization
- Embedding service: 0.2 GPU, intermittent load
- Feature batch job: 0.5 GPU, 20-minute burst
- Evaluation runner: 0.1 GPU, constant
Naive scheduling gives every pod a whole GPU. Our average utilization was 31%.
Bin-packing: the concept
Instead of "does it fit", ask "where does it fit with the least waste". This is vector bin-packing with two dimensions per item: the bins are GPUs and each item is a (compute fraction, memory fraction) pair.
The Kubernetes scheduler supports scoring plugins. We wrote one that:
- Filters nodes with capacity for the request
- Scores each by how tightly the pod packs onto existing workloads
- Prefers nodes where the pod fits without fragmenting free space
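The filter-and-score idea can be sketched in a few lines. This is a simplified model, not the plugin itself (a real scheduler plugin implements the framework's Filter/Score interfaces against node objects); here a GPU's free capacity is just a (compute, memory) pair of fractions, and the score rewards the tightest feasible fit:

```python
def feasible(request, free):
    """Filter: does the fractional request fit on this GPU?"""
    return request[0] <= free[0] and request[1] <= free[1]

def score(request, free):
    """Score: prefer the tightest fit -- the less capacity left over
    after placement, the higher the score (best-fit packing)."""
    leftover = (free[0] - request[0]) + (free[1] - request[1])
    return 1.0 - leftover / 2.0  # leftover is at most 2, so score is in [0, 1]

def pick_gpu(request, gpus):
    """Return the index of the best-scoring feasible GPU, or None."""
    candidates = [(i, score(request, g)) for i, g in enumerate(gpus)
                  if feasible(request, g)]
    return max(candidates, key=lambda c: c[1])[0] if candidates else None
```

Given a 0.3/0.3 request and two GPUs with (1.0, 1.0) and (0.4, 0.4) free, this picks the second: packing onto the nearly-full GPU leaves the empty one unfragmented for a future full-GPU job.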
Fractional GPU via MIG and time-slicing
Two options, NVIDIA-side:
MIG (Multi-Instance GPU): A100/H100 can be partitioned into up to 7 isolated instances. Memory-isolated, compute-isolated, predictable.
Time-slicing: Multiple pods share a GPU and their work is time-multiplexed on it. Not isolated — noisy neighbors are possible.
We use MIG for training jobs (isolation matters) and time-slicing for inference (latency is bounded by service SLOs anyway).
```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: train
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # one MIG slice, 10 GB
```
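On the time-slicing side, the NVIDIA device plugin takes a sharing config via a ConfigMap. A sketch of the shape (the ConfigMap name, data key, and replica count here are our choices, not defaults — with `replicas: 4`, one physical GPU is advertised as four schedulable `nvidia.com/gpu` resources):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```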
Spot instances with graceful preemption
GPU spot instances are 60–75% cheaper than on-demand. The catch: they can be reclaimed with roughly 30 seconds' notice. Our job wrapper:
- Trap SIGTERM, checkpoint model state to S3
- Emit "preempted" metric with job ID
- Scheduler notices, replays job on next available GPU from the latest checkpoint
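The SIGTERM-trapping part of the wrapper can be sketched like this. The `save_checkpoint` and `emit_metric` callbacks are hypothetical stand-ins — in our setup they upload model state to S3 and push a counter to the metrics backend:

```python
import signal

class PreemptionHandler:
    """Traps SIGTERM so a spot reclaim (~30 s notice) checkpoints
    the job instead of losing its progress."""

    def __init__(self, save_checkpoint, emit_metric, job_id):
        # save_checkpoint/emit_metric are injected callbacks (assumptions,
        # not a real library API).
        self.save_checkpoint = save_checkpoint
        self.emit_metric = emit_metric
        self.job_id = job_id
        self.preempted = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.preempted = True
        self.save_checkpoint()                      # persist model state
        self.emit_metric("preempted", self.job_id)  # tell the scheduler
```

The replacement job then resumes from the latest checkpoint rather than from scratch, which is what makes spot viable for training.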
Net effect: spot for everything that can restart (training), on-demand for things that can't (realtime inference).
Priority queues
Not all jobs are equal. We run three queues:
| Queue | Priority | Preemptable | Use case |
|---|---|---|---|
| interactive | high | no | Notebook sessions, dev |
| batch | medium | by interactive | Scheduled training |
| backfill | low | by both | Historical recomputes |
When an interactive job arrives and no GPU is free, a backfill or batch job gets bumped. Fair, predictable, documented.
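The eviction rule from the table above is small enough to sketch directly. The data model here is hypothetical (jobs as `(id, queue)` pairs), not the real scheduler state:

```python
PRIORITY = {"interactive": 3, "batch": 2, "backfill": 1}

# Which queues each queue may preempt, per the table above.
MAY_PREEMPT = {
    "interactive": {"batch", "backfill"},
    "batch": {"backfill"},
    "backfill": set(),
}

def victim_for(new_queue, running):
    """Pick the job to evict so a `new_queue` job can run, or None.
    `running` is a list of (job_id, queue) pairs; the lowest-priority
    preemptable job is bumped first."""
    candidates = [(job, q) for job, q in running
                  if q in MAY_PREEMPT[new_queue]]
    if not candidates:
        return None
    return min(candidates, key=lambda jq: PRIORITY[jq[1]])[0]
```

So an arriving interactive job bumps a backfill before a batch job, and backfill jobs can never bump anyone — which is the "fair, predictable, documented" property.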
Results after 3 months
| Metric | Before | After |
|---|---|---|
| Average GPU utilization | 31% | 64% |
| Spot-to-on-demand ratio | 0% | 72% |
| GPU spend (monthly) | €84,200 | €52,100 |
| p95 job queue time | 14 min | 4 min |
What we got wrong initially
- Over-fragmentation. The early bin-packer split GPUs too aggressively: inference pods had sub-millisecond kernel launches, but the tight per-pod memory limits caused swapping. Fixed with a minimum slice size.
- Priority inversion. Low-priority backfills holding GPU for hours blocked medium-priority jobs. Added max-duration-before-preemption.
- MIG re-partitioning cost. Changing the MIG profile on a node takes ~90 s. Keep profiles stable; don't reshape per job.
Conclusion
Default Kubernetes + default NVIDIA device plugin is a good starting point. Getting to real efficiency needs a scheduler that understands fractional GPUs, spot-friendly checkpointing, and priority queues. The payback on engineering effort is measured in weeks, not months.