Felix Pinkston
Apr 09, 2026 17:23
NVIDIA’s Slinky project enables running Slurm clusters on Kubernetes, and is already deployed in production on clusters with more than 8,000 GPUs for large-scale AI training infrastructure.

NVIDIA has launched Slinky, an open-source project that bridges the gap between Slurm (the job scheduler running on over 65% of TOP500 supercomputers) and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.
The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.
How Slinky Actually Works
Slinky’s slurm-operator represents each Slurm component (scheduling, accounting, compute workers, API access) as a Kubernetes Custom Resource Definition. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
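To make that concrete, here is a minimal sketch of applying such a custom resource with the official Kubernetes Python client. The API group, kind, and spec fields below are illustrative assumptions, not Slinky’s actual schema; the real field names come from the CRDs in the SlinkyProject repository.

```python
# Minimal sketch: creating a Slurm-cluster-style Custom Resource with the
# official Kubernetes Python client. Group, kind, and spec fields are
# hypothetical placeholders, not Slinky's real schema.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

slurm_cluster = {
    "apiVersion": "example.slinky.dev/v1alpha1",  # hypothetical group/version
    "kind": "SlurmCluster",                       # hypothetical kind
    "metadata": {"name": "demo-cluster", "namespace": "slurm"},
    "spec": {
        # Hypothetical fields: one controller pod plus a pool of GPU workers.
        "controller": {"replicas": 1},
        "workers": {"replicas": 4, "gpusPerNode": 8},
    },
}

api.create_namespaced_custom_object(
    group="example.slinky.dev",
    version="v1alpha1",
    namespace="slurm",
    plural="slurmclusters",
    body=slurm_cluster,
)
```

Once a resource like this is applied, the operator is responsible for reconciling it into running controller and worker pods.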
The high-availability story matters for production deployments. Slinky handles control-plane HA through pod regeneration rather than Slurm’s native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods, letting running workloads complete first.
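The drain-before-terminate pattern can be pictured with standard Slurm CLI commands. This is a sketch of the idea under those assumptions, not Slinky’s actual scale-in controller logic, and it assumes the Slurm client tools are available where it runs.

```python
# Sketch of drain-before-terminate: mark the node as draining so Slurm stops
# placing new work on it, then wait until its running jobs finish before the
# worker pod is removed. Illustrative only; node name is hypothetical.
import subprocess
import time

def drain_and_wait(node: str, poll_seconds: int = 30) -> None:
    # Put the Slurm node into DRAIN state (a reason string is required).
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", "Reason=scale-in"],
        check=True,
    )
    # Poll until no running jobs remain on the node.
    while True:
        out = subprocess.run(
            ["squeue", "--nodelist", node, "--noheader", "--states=RUNNING"],
            capture_output=True, text=True, check=True,
        )
        if not out.stdout.strip():
            break
        time.sleep(poll_seconds)

drain_and_wait("gpu-worker-3")  # hypothetical node name
# Only after this point would the corresponding worker pod be terminated.
```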
For NVIDIA’s GB200 NVL72 architecture, where GPUs communicate across nodes over multi-node NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs get full NVLink bandwidth across node boundaries.
Production Results at NVIDIA
NVIDIA reports that GPU communication benchmarks (NCCL all-reduce and all-gather) match non-containerized Slurm deployments, with no measurable impact from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, its state syncs automatically between the two systems. Rolling updates proceed while training jobs continue on remaining capacity.
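As a rough illustration of what “Slurm metrics alongside Kubernetes metrics” means in practice, the sketch below queries Prometheus’s HTTP API for one of each. The Prometheus address and the Slurm metric name are assumptions; the actual names depend on the exporters in your deployment.

```python
# Sketch: query Prometheus for a Slurm-side metric and a Kubernetes-side metric
# from the same endpoint. Metric names and the Prometheus URL are assumptions.
import requests

PROM = "http://prometheus.monitoring.svc:9090"  # hypothetical Prometheus address

def query(promql: str):
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Assumed Slurm exporter metric (pending jobs in the queue).
slurm_pending = query("slurm_queue_pending")
# Standard kube-state-metrics: GPUs requested across all pods.
gpu_requests = query('sum(kube_pod_container_resource_requests{resource="nvidia.com/gpu"})')

print("Pending Slurm jobs:", slurm_pending)
print("Requested GPUs across pods:", gpu_requests)
```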
One constraint worth noting: Slinky currently assumes one worker pod per node. If you’re running only single-node Slurm jobs, this over-provisions relative to what you need.
What’s New in v1.1.0
The recently released slurm-operator v1.1.0 adds dynamic topology support: worker pods now register their topology based on the Kubernetes node they land on, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
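A simplified sketch of the idea behind node-based topology registration: read a placement label from the Kubernetes node a worker landed on and emit a Slurm topology.conf line for it. The label key and node name are assumptions for illustration; Slinky’s actual registration logic differs.

```python
# Sketch: derive a Slurm topology.conf entry from the Kubernetes node a worker
# pod is scheduled on. Node name and label key are hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node = v1.read_node("gpu-node-7")  # hypothetical node hosting the worker pod
labels = node.metadata.labels or {}
rack = labels.get("topology.kubernetes.io/zone", "default-rack")

# Slurm topology.conf groups nodes under switches; one switch per rack here.
print(f"SwitchName={rack} Nodes={node.metadata.name}")
```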
The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a meaningful option that did not exist a year ago. The code is available on GitHub under the SlinkyProject organization.
Image source: Shutterstock
