GPU scheduling on Kubernetes: bin-packing for ML workloads

2025-12-12 · Jan van der Berg

The default Kubernetes scheduler treats GPUs as indivisible integers. That works fine when every pod wants a full A100. It works terribly when your pods want 0.3 of one, and you have a fleet of them. Here's what we did.

The problem

ML workloads have wildly different GPU profiles:

  • Fine-tuning a small LLM: 1 GPU, full utilization
  • Embedding service: 0.2 GPU, intermittent load
  • Feature batch job: 0.5 GPU, 20-minute burst
  • Evaluation runner: 0.1 GPU, constant

Naive scheduling gives every pod a whole GPU. Our average utilization was 31%.

Bin-packing: the concept

Instead of "does it fit", ask "where does it fit with the least waste". Standard 2D bin-packing, where the bins are GPUs and the dimensions are (compute fraction, memory fraction).

The Kubernetes scheduler supports scoring plugins. We wrote one that:

  1. Filters nodes with capacity for the request
  2. Scores each by how tightly the pod packs onto existing workloads
  3. Prefers nodes where the pod fits without fragmenting free space
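The real plugin lives in the scheduler framework (a Go Score plugin), but the best-fit idea behind steps 2 and 3 can be sketched in a few lines. Everything here — node names, the two-dimensional (compute, memory) free-capacity tuples, the scoring formula — is illustrative, not our production code:

```python
def bin_packing_score(free_compute, free_mem, req_compute, req_mem):
    """Best-fit score: tighter fit -> higher score (0-100); None if infeasible."""
    if req_compute > free_compute or req_mem > free_mem:
        return None  # step 1: node filtered out, no capacity
    # Leftover free capacity after placement, averaged over both dimensions.
    # Less leftover means the pod packs tightly and fragments less free space.
    leftover = (free_compute - req_compute + free_mem - req_mem) / 2
    return round(100 * (1 - leftover))


def pick_node(nodes, req):
    """nodes: {name: (free_compute, free_mem)}; returns the best-fit node."""
    scores = {n: bin_packing_score(*free, *req) for n, free in nodes.items()}
    feasible = {n: s for n, s in scores.items() if s is not None}
    return max(feasible, key=feasible.get) if feasible else None


nodes = {"a100-1": (1.0, 1.0), "a100-2": (0.4, 0.5)}
pick_node(nodes, (0.3, 0.3))  # → "a100-2": the nearly-full node wins
```

The deliberate consequence: empty GPUs stay empty as long as possible, so a later request for a full GPU still has somewhere to land.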

Fractional GPU via MIG and time-slicing

Two options, NVIDIA-side:

MIG (Multi-Instance GPU): A100/H100 can be partitioned into up to 7 isolated instances. Memory-isolated, compute-isolated, predictable.

Time-slicing: Multiple pods share a GPU, scheduler multiplexes. Not isolated — noisy neighbors possible.

We use MIG for training jobs (isolation matters) and time-slicing for inference (latency is bounded by service SLOs anyway). A training pod requests a MIG slice as an ordinary extended resource:

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: train
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # one MIG slice, 10GB
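For the time-sliced inference pools, the NVIDIA device plugin can advertise one physical GPU as multiple schedulable replicas via its sharing config. The shape below follows the device plugin's time-slicing config format; the replica count of 4 is illustrative, not our exact setting:

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # one physical GPU shows up as 4 allocatable GPUs

Remember that time-sliced replicas share memory as well as compute, so pods still need their own memory discipline.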

Spot instances with graceful preemption

GPU spot prices are 60–75% cheaper. The catch: they die with 30s notice. Our wrapper:

  1. Trap SIGTERM, checkpoint model state to S3
  2. Emit "preempted" metric with job ID
  3. Scheduler notices and restarts the job on the next available GPU from the latest checkpoint

Net effect: spot for everything that can restart (training), on-demand for things that can't (realtime inference).

Priority queues

Not all jobs are equal. We run three queues:

Queue        Priority  Preemptable      Use case
interactive  high      no               Notebook sessions, dev
batch        medium    by interactive   Scheduled training
backfill     low       by both          Historical recomputes

When an interactive job arrives and no GPU is free, a backfill or batch job gets bumped. Fair, predictable, documented.
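The three queues map onto standard Kubernetes PriorityClass objects; preemption order falls out of the relative values. The names match the table above, but the specific values and descriptions here are illustrative:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive
value: 1000000
description: "Notebook sessions, dev; preempts batch and backfill."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 10000
description: "Scheduled training; preemptable by interactive."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: backfill
value: 100
preemptionPolicy: Never   # backfill waits for capacity, never evicts anyone
description: "Historical recomputes; preemptable by both."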

Results after 3 months

Metric                    Before    After
Average GPU utilization   31%       64%
Spot-to-on-demand ratio   0%        72%
GPU spend (monthly)       €84,200   €52,100
p95 job queue time        14 min    4 min

What we got wrong initially

  • Over-fragmentation. The early bin-packer split GPUs too aggressively: kernel launch latency stayed sub-millisecond, but the tight per-pod memory limits caused swapping. Fixed with a minimum slice size.
  • Priority inversion. Low-priority backfills holding GPU for hours blocked medium-priority jobs. Added max-duration-before-preemption.
  • MIG re-partitioning cost. Changing MIG profile on a node takes ~90s. Keep profile stable, don't reshape per-job.

Conclusion

Default Kubernetes + default NVIDIA device plugin is a good starting point. Getting to real efficiency needs a scheduler that understands fractional GPUs, spot-friendly checkpointing, and priority queues. The payback on engineering effort is measured in weeks, not months.

