Kubernetes Scheduler Plugins: Optimizing AI/ML Workloads
Custom Kubernetes scheduler plugins improve GPU utilization by understanding GPU topology, workload types, and gang scheduling requirements.
Picture this: Enterprises burn $400K monthly on GPU clusters humming at 35% capacity while workloads queue endlessly. Why? The stock scheduler treats GPUs as interchangeable tokens to count — oblivious to silicon geography, workload personality, or the cost-per-second of idle accelerators.
What follows dissects how purpose-built scheduler plugins flip that equation. We're talking technical guts: architectural decisions, deployment mechanics, working code that actually ships. No hand-waving. Just the machinery needed to make GPUs earn their keep.
The Hidden Flaw in Standard Scheduling for AI and ML
Stock Kubernetes scheduling runs on arithmetic: a pod demands nvidia.com/gpu: 2, the scheduler locates nodes advertising two vacant slots, done. AI/ML disrupts that simplicity: hardware topology, workload characteristics, and memory arrangement all demand attention, yet the scheduler remains oblivious to each.
GPU Topology Blindness
The interconnect speeds differ significantly — NVLink achieves 600 GB/s, while PCIe operates at a mere 64 GB/s. Request 8 GPUs for distributed training and performance craters or soars depending on which backplane they share. Scheduler sees matching quantities, ships the pod, ignores the physics.
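That arithmetic is worth making concrete. A back-of-the-envelope sketch using the figures above (the 10 GB shard size is a hypothetical, not a measurement):

```go
package main

// transferSeconds returns how long a payload of payloadGB gigabytes
// takes to cross an interconnect sustaining bandwidthGBps gigabytes/sec.
func transferSeconds(payloadGB, bandwidthGBps float64) float64 {
	return payloadGB / bandwidthGBps
}
```

At those rates, a 10 GB gradient shard crosses NVLink in roughly 17 ms but needs about 156 ms over PCIe, and the 9.4x gap compounds on every all-reduce step of a training run.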
Workload Characteristic Ignorance
100-millisecond inference hits receive the same treatment as 72-hour training marathons. Kill the wrong one mid-flight, and thousands of dollars of compute evaporate into heat and regret.
Gang Scheduling Absence
Distributed training demands simultaneous ignition — all replicas or none. When 7 of 8 pods land successfully but the eighth starves indefinitely, the whole job fossilizes while hoarding resources.
Memory Fragmentation
GPU VRAM ships in fixed denominations — 40GB or 80GB blocks. Pack three 15GB models onto one die, and suddenly 35GB sits stranded, unreachable by a 30GB model circling hungrily overhead.
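The stranding effect falls out of any naive first-fit placement. A minimal, dependency-free sketch (sizes in GB; the two-GPU layout is illustrative):

```go
package main

// firstFit places each model on the first GPU with room, returning
// per-GPU free VRAM afterward plus the models that found no home
// even though total free VRAM would have sufficed.
func firstFit(gpuFreeGB []int, modelsGB []int) ([]int, []int) {
	free := append([]int(nil), gpuFreeGB...) // copy; don't mutate input
	var unplaced []int
	for _, m := range modelsGB {
		placed := false
		for i := range free {
			if free[i] >= m {
				free[i] -= m
				placed = true
				break
			}
		}
		if !placed {
			unplaced = append(unplaced, m)
		}
	}
	return free, unplaced
}
```

With two 40 GB GPUs and models of 15, 15, 15, and 30 GB, first-fit leaves 10 GB free on one die and 25 GB on the other: 35 GB free in total, yet the 30 GB model has no single home.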
The Kubernetes Scheduler Framework
Kubernetes 1.19+ shipped an extensible scaffold—injection points where custom logic can intercept decisions mid-flight:
Pod → PreFilter → Filter → PostFilter → PreScore → Score → Reserve → Permit → PreBind → Bind → PostBind
Key extension points:
- Filter: Binary guillotine — node passes or gets axed immediately
- Score: Numeric beauty contest — rank survivors 0-100
- Permit: Final airlock before commitment — where gang coordination happens
GPU Topology-Aware Filtering Implementation
Watch how a Filter plugin reads silicon geography. Pods demanding multi-GPU placement only land where high-bandwidth pathways physically exist:
package main
import (
"context"
"fmt"
v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type TopologyAwareGPUGate struct {
handle framework.Handle
}
func (t *TopologyAwareGPUGate) Name() string {
return "TopologyAwareGPUGate"
}
func (t *TopologyAwareGPUGate) Filter(ctx context.Context, state *framework.CycleState,
pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
// Sniff how many accelerators this pod craves
acceleratorDemand := extractGPUDemand(pod)
if acceleratorDemand <= 1 {
return nil // Solo GPU? Topology irrelevant
}
// Pull node's physical interconnect map from labels
interconnectType := nodeInfo.Node().Labels["nvidia.com/gpu-topology"]
nvlinkIslands := parseInterconnectIslands(nodeInfo.Node().Labels)
// Distributed training? Verify all GPUs share an NVLink island
if detectsPytorchDDP(pod) && acceleratorDemand > 1 {
if !hasUnifiedIsland(nvlinkIslands, acceleratorDemand) {
return framework.NewStatus(framework.Unschedulable,
fmt.Sprintf("Node lacks %d GPUs in unified NVLink island", acceleratorDemand))
}
}
// Tensor parallelism? Demand premium interconnect
if requiresTensorSplitting(pod) && interconnectType != "nvlink" && interconnectType != "nvswitch" {
return framework.NewStatus(framework.Unschedulable,
"Tensor parallelism demands NVLink/NVSwitch fabric")
}
return nil
}
func hasUnifiedIsland(islands map[string]int, demand int) bool {
for _, capacity := range islands {
if capacity >= demand {
return true
}
}
return false
}
This plugin reads silicon topology from node labels (DaemonSet continuously probes nvidia-smi topo -m) and guillotines nodes lacking adequate bandwidth. Core insight: Identical GPU counts mean nothing — physical placement determines if you're getting 600 GB/s or a crawling 64 GB/s.
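Helpers such as parseInterconnectIslands are left to the implementer. One plausible sketch, assuming the topology DaemonSet publishes labels like nvidia.com/nvlink-island-0: "4" (this label schema is an assumption for illustration, not an NVIDIA convention):

```go
package main

import (
	"strconv"
	"strings"
)

// parseInterconnectIslands extracts NVLink island sizes from node labels.
// A label such as "nvidia.com/nvlink-island-0": "4" means island 0
// contains 4 fully connected GPUs.
func parseInterconnectIslands(labels map[string]string) map[string]int {
	const prefix = "nvidia.com/nvlink-island-"
	islands := make(map[string]int)
	for key, value := range labels {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		count, err := strconv.Atoi(value)
		if err != nil {
			continue // Malformed label: skip rather than fail the node
		}
		islands[strings.TrimPrefix(key, prefix)] = count
	}
	return islands
}
```

The map it returns is exactly what hasUnifiedIsland above iterates over when checking whether any single island can absorb the pod's full GPU demand.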
Scoring: Intelligent Bin-Packing for Mixed Workloads
After filtering survivors, scoring applies strategy — training versus inference workloads want opposite placement philosophies:
type WorkloadStrategyScorer struct {
handle framework.Handle
}
func (w *WorkloadStrategyScorer) Score(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeName string,
) (int64, *framework.Status) {
nodeInfo, err := w.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.AsStatus(err)
}
workloadPersonality := detectWorkloadPersonality(pod)
switch workloadPersonality {
case "training":
return scoreTrainingPlacement(pod, nodeInfo), nil
case "inference":
return scoreInferencePacking(pod, nodeInfo), nil
default:
return 50, nil
}
}
func scoreTrainingPlacement(pod *v1.Pod, nodeInfo *framework.NodeInfo) int64 {
// Training craves: isolation, predictable performance, no neighbors
acceleratorDemand := extractGPUDemand(pod)
// Heavily penalize mixed training/inference cohabitation
if detectsInferenceNeighbors(nodeInfo) {
return 10 // Terrible fit—avoid contamination
}
// Reward contiguous GPU ID allocation
if offersContiguousIDs(nodeInfo, acceleratorDemand) {
return 90
}
return 50
}
func scoreInferencePacking(pod *v1.Pod, nodeInfo *framework.NodeInfo) int64 {
// Inference loves: density, memory tetris, cohabitation
vramDemand := extractVRAMDemand(pod)
// Boost score for nodes already running inference (consolidate!)
existingVRAMLoad := measureInferenceVRAMLoad(nodeInfo)
// Target 70-80% VRAM saturation—headroom matters
projectedSaturation := float64(existingVRAMLoad+vramDemand) / float64(totalNodeVRAM(nodeInfo)) // float math; integer division would truncate to 0
switch {
case projectedSaturation >= 0.7 && projectedSaturation <= 0.8:
return 95 // Goldilocks zone
case projectedSaturation > 0.8:
return 30 // Too tight—OOM risk looms
default:
return 60 // Wasteful but acceptable
}
}
This scoring embeds a fundamental tension: training jobs demand hermetic isolation and stable performance; inference workloads thrive on aggressive bin-packing to maximize VRAM efficiency. Organizations routinely time-slice inference models (multiple models per GPU) since inference bursts sporadically rather than sustaining 100% saturation.
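The detectWorkloadPersonality helper can be as simple as a label lookup with a naming heuristic as fallback. A sketch that operates on the raw label map and pod name rather than the full *v1.Pod, to stay dependency-free (the label key is an invented example):

```go
package main

import "strings"

// detectWorkloadPersonality classifies a pod as "training", "inference",
// or "unknown" from its labels, falling back to name heuristics.
func detectWorkloadPersonality(labels map[string]string, podName string) string {
	// Explicit label wins: teams opt in deliberately.
	if t, ok := labels["ai.example.com/workload-type"]; ok {
		return t
	}
	// Heuristic fallback on common naming patterns.
	name := strings.ToLower(podName)
	switch {
	case strings.Contains(name, "train"):
		return "training"
	case strings.Contains(name, "serve"), strings.Contains(name, "infer"):
		return "inference"
	default:
		return "unknown"
	}
}
```

The "unknown" branch matters: it maps to the neutral score of 50 in the Score plugin above, so unclassified pods are neither favored nor punished.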
Gang Scheduling With Permit Plugins
The most surgical intervention for distributed training: gang scheduling enforces atomic job placement—every replica launches simultaneously, or the entire job aborts. Absent this, partial deployments consume resources indefinitely waiting for stragglers that never arrive:
import (
"context"
"sync"
"time"
v1 "k8s.io/api/core/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type ReplicaSetState struct {
requiredCount int
waitingCount int
readyCount int
mutex sync.Mutex
}
type AtomicReplicaCoordinator struct {
handle framework.Handle
pendingJobs sync.Map // jobName -> *ReplicaSetState
}
func (a *AtomicReplicaCoordinator) Name() string {
return "AtomicReplicaCoordinator"
}
func (a *AtomicReplicaCoordinator) Permit(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeName string,
) (*framework.Status, time.Duration) {
jobIdentifier := pod.Labels["job-name"]
if jobIdentifier == "" {
return framework.NewStatus(framework.Success), 0
}
minimumReplicas := extractMinimumReplicaCount(pod)
if minimumReplicas <= 1 {
return framework.NewStatus(framework.Success), 0
}
// Track replica arrival
rawState, _ := a.pendingJobs.LoadOrStore(jobIdentifier, &ReplicaSetState{
requiredCount: minimumReplicas,
waitingCount: 0,
readyCount: 0,
})
replicaState := rawState.(*ReplicaSetState)
replicaState.mutex.Lock()
replicaState.waitingCount++
currentCount := replicaState.waitingCount
replicaState.mutex.Unlock()
if currentCount < minimumReplicas {
// Gang incomplete—hold at airlock
return framework.NewStatus(framework.Wait), 30 * time.Second
}
// Gang assembled—release all simultaneously
a.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
if wp.GetPod().Labels["job-name"] == jobIdentifier {
wp.Allow(a.Name())
}
})
// Clear the bookkeeping so a future job with the same name starts fresh
a.pendingJobs.Delete(jobIdentifier)
return framework.NewStatus(framework.Success), 0
}
The Permit plugin acts as a coordination airlock. Pods reaching this stage already passed filtering/scoring — resources reserved, placement decided. The plugin holds each pod in suspension until siblings arrive. Once the gang completes, all doors open simultaneously.
This mechanism eliminates the classic pathology: 15 of 16 PyTorch DDP workers successfully reserve GPUs, the 16th starves eternally, and the cluster remains deadlocked on incomplete jobs, consuming resources without progressing.
Priority and Preemption for Cost Optimization
GPU time hemorrhages money — 8xA100 nodes burn roughly $24/hour on major clouds. Interrupt a 48-hour training run near completion, and you've torched $1,000+ in wasted cycles. Strategic preemption logic slashes this waste:
import (
"time"
v1 "k8s.io/api/core/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type CostAwarePreemptor struct {
handle framework.Handle
}
func (c *CostAwarePreemptor) evaluateVictims(pod *v1.Pod, nodeInfo *framework.NodeInfo) []*v1.Pod {
incomingPodMetadata := extractPodCharacteristics(pod)
var candidateVictims []*v1.Pod
for _, existingPod := range nodeInfo.Pods {
victimMetadata := extractPodCharacteristics(existingPod.Pod)
// Sacred cows: never preempt high-priority training
if victimMetadata.workloadType == "training" && victimMetadata.priority >= 1000 {
continue
}
// Low-hanging fruit: preempt stateless inference (restarts cheaply)
if victimMetadata.workloadType == "inference" {
candidateVictims = append(candidateVictims, existingPod.Pod)
continue
}
// Training victims: surgical decisions required
if victimMetadata.workloadType == "training" {
elapsedHours := time.Since(victimMetadata.startTime).Hours()
// Proximity to completion? Hands off
if victimMetadata.estimatedCompletion > 0 &&
victimMetadata.estimatedCompletion-elapsedHours < 2 {
continue
}
// Checkpointing absent? Don't torch irreplaceable work
if !hasCheckpointingEnabled(existingPod.Pod) {
continue
}
// Lower priority + recent checkpoint? Acceptable victim
if victimMetadata.priority < incomingPodMetadata.priority &&
victimMetadata.timeSinceCheckpoint < 30*time.Minute {
candidateVictims = append(candidateVictims, existingPod.Pod)
}
}
}
return candidateVictims
}
This preemption logic encodes workload economics: Inference services carry no state between invocations and restart instantly; training jobs accumulate state, and bleeding one mid-run destroys significant investment. By interrogating checkpoint freshness and elapsed time, the scheduler minimizes squandered compute.
Deployment Architecture
Custom scheduler plugins demand thoughtful deployment. The recommended pattern is a secondary scheduler running alongside the default:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: ai-scheduler
        plugins:
          filter:
            enabled:
              - name: TopologyAwareGPUGate
          score:
            enabled:
              - name: WorkloadStrategyScorer
                weight: 10
          permit:
            enabled:
              - name: AtomicReplicaCoordinator
        pluginConfig:
          - name: TopologyAwareGPUGate
            args:
              requiredTopologies:
                - nvlink
                - nvswitch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ai-scheduler
  template:
    metadata:
      labels:
        component: ai-scheduler
    spec:
      serviceAccountName: ai-scheduler
      containers:
        - name: scheduler
          image: your-registry/ai-scheduler:v1.0
          command:
            - /usr/local/bin/kube-scheduler
            - --config=/etc/kubernetes/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes
      volumes:
        - name: config
          configMap:
            name: ai-scheduler-config
Workloads opt in via schedulerName: ai-scheduler in their pod specs. This enables incremental rollout and instant rollback if things blow up.
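Opting a workload in is then a one-line change in the pod spec. A minimal example (image name and labels are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker-0
  labels:
    job-name: llama-finetune   # Gang-scheduling key read by the Permit plugin
spec:
  schedulerName: ai-scheduler  # Route this pod to the custom scheduler
  containers:
    - name: trainer
      image: your-registry/trainer:v1.0
      resources:
        limits:
          nvidia.com/gpu: 2
```

Pods without the schedulerName field continue to flow through the default scheduler untouched.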
Real-World Impact
Organizations deploying custom scheduler plugins witness dramatic shifts:
- GPU utilization: Increased from 30-40% to 60-80% through enhanced bin-packing and topology awareness
- Queue time: Reduced from 4.6 hours to an average of 18 minutes by eliminating head-of-line blocking
- Cost savings: $180K annually from smarter preemption in one large ML org
- Developer productivity: Ticket floods about "stuck jobs" evaporate when gang scheduling guarantees atomic placement
Monitoring and Observability
Critical metrics worth tracking:
import (
"github.com/prometheus/client_golang/prometheus"
)
var (
scheduleAttempts = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "ai_scheduler_attempts_total",
},
[]string{"workload_type", "result"},
)
gangSchedulingWaitTime = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gang_scheduling_wait_seconds",
Buckets: []float64{10, 30, 60, 300, 600},
},
[]string{"job_name"},
)
)
Limitations and Drawbacks
- Operational complexity: You're now maintaining critical control plane machinery. One team invested 2 engineer-months hunting race conditions in gang scheduling.
- State management: Schedulers require state (gang membership, topology caches) while appearing stateless to Kubernetes. Most implementations use in-memory caches, which must be kept in sync with cluster state and rebuilt after a restart.
- Testing complexity: Unit tests are straightforward; integration tests are harder, since mocking realistic cluster states and cascading failures takes real effort.
- Performance impact: Each plugin can add 10-50 ms of latency to every scheduling decision. Instrument and budget accordingly:
func (t *TopologyAwareGPUGate) Filter(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeInfo *framework.NodeInfo,
) *framework.Status {
startTime := time.Now()
defer func() {
if duration := time.Since(startTime); duration > 50*time.Millisecond {
klog.Warningf("Filter consumed %v for pod %s", duration, pod.Name)
}
}()
// ... filter logic
}
- Version compatibility: Scheduler framework APIs evolve. Plugins targeting K8s 1.24 may require surgery for 1.28.
When NOT to Use Custom Scheduler Plugins
- Small GPU footprint: Under 10 GPU nodes, operational overhead dwarfs benefits. Standard node labels and priority classes suffice.
- Workload homogeneity: Pure inference or pure training? Default scheduling with quotas typically handles this gracefully.
- Existing solutions available: Verify whether your cloud provider or NVIDIA GPU Operator already solves your problem.
- Insufficient observability: Custom schedulers starve without real-time GPU telemetry. Build observability infrastructure first.
Alternate Approaches
- Volcano scheduler: An open-source batch scheduler that provides gang scheduling and is inherently aware of GPU topology:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4  # Gang scheduling
  schedulerName: volcano
  tasks:
    - replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
- Kueue: CNCF project delivering job queueing with multi-tenancy and quotas; cooperates with the default scheduler.
- Ray/Kubeflow/Flyte: Higher-level ML orchestrators managing resources above the Kubernetes abstraction layer.
- Multi-cluster: For massive GPU fleets, federate smaller clusters rather than engineering monolithic single-cluster complexity.
Making the Decision
Build custom schedulers when:
- GPU fleet exceeds 20 nodes with mixed workloads
- Measurable waste stems from topology bottlenecks
- Team possesses Kubernetes control plane expertise
- Robust GPU observability already operational
Use existing solutions when:
- Small footprint (under 10 nodes)
- Homogeneous workloads
- Limited SRE capacity
- Cloud provider solutions address majority requirements
Hybrid approach: Launch with Volcano/Kueue, inject targeted custom plugins only where gaps materialize.
Conclusion
Custom Kubernetes scheduler plugins unlock dormant GPU infrastructure potential for AI/ML workloads. By encoding domain expertise about topology constraints, workload characteristics, and economic realities into scheduling logic, SREs transform utilization from abysmal to exceptional.
Custom schedulers aren't a silver bullet. They add operational burden, demand ongoing upkeep, and require deep Kubernetes expertise. Before starting a custom build, organizations should verify whether Volcano, Kueue, or their cloud provider's offerings already cover the need.
For teams proceeding: Start minimal — implement topology filtering first, measure impact, then layer in scoring logic, and finally tackle gang scheduling and preemption. Each enhancement builds on the last, and at the right scale the investment pays for itself quickly through lower cloud bills and higher developer productivity.
The ideal scenario for custom scheduler plugins involves organizations managing over 20 GPU nodes with a combination of training and inference tasks, established observability systems, and committed platform engineering teams.
For these organizations, custom schedulers transition from optional optimization to essential infrastructure for serious ML platforms.
Start with existing tools, measure the pain points accurately, and build custom solutions only where clear gaps remain. GPU infrastructure is evolving quickly: today's custom code may become tomorrow's open-source baseline.