Kubernetes Scheduler Plugins: Optimizing AI/ML Workloads
Custom Kubernetes scheduler plugins improve GPU utilization by understanding GPU topology, workload types, and gang scheduling requirements.
Picture this: Enterprises burn $400K monthly on GPU clusters humming at 35% capacity while workloads queue endlessly. Why? The stock scheduler treats GPUs as interchangeable tokens to count — oblivious to silicon geography, workload personality, or the cost-per-second of idle accelerators.
What follows dissects how purpose-built scheduler plugins flip that equation. We're talking technical guts: architectural decisions, deployment mechanics, working code that actually ships. No hand-waving. Just the machinery needed to make GPUs earn their keep.
The Hidden Flaw in Standard Scheduling for AI and ML
Stock Kubernetes scheduling runs on arithmetic: a pod demands nvidia.com/gpu: 2, the scheduler locates nodes advertising two vacant slots, done. AI/ML disrupts that simplicity: hardware topology, workload characteristics, and memory arrangement all demand attention, yet the scheduler remains oblivious to each.
GPU Topology Blindness
The interconnect speeds differ significantly — NVLink achieves 600 GB/s, while PCIe operates at a mere 64 GB/s. Request 8 GPUs for distributed training and performance craters or soars depending on which backplane they share. Scheduler sees matching quantities, ships the pod, ignores the physics.
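That arithmetic is worth making concrete. A back-of-the-envelope sketch using the figures above (the 10 GB shard size is a hypothetical, not a measurement):

```go
package main

// transferSeconds returns how long a payload of payloadGB gigabytes
// takes to cross an interconnect sustaining bandwidthGBps gigabytes/sec.
func transferSeconds(payloadGB, bandwidthGBps float64) float64 {
	return payloadGB / bandwidthGBps
}
```

At those rates, a 10 GB gradient shard crosses NVLink in roughly 17 ms but needs about 156 ms over PCIe, and the 9.4x gap compounds on every all-reduce step of a training run.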
Workload Characteristic Ignorance
100-millisecond inference hits receive the same treatment as 72-hour training marathons. Kill the wrong one mid-flight, and thousands of dollars of compute evaporate into heat and regret.
Gang Scheduling Absence
Distributed training demands simultaneous ignition — all replicas or none. When 7 of 8 pods land successfully but the eighth starves indefinitely, the whole job fossilizes while hoarding resources.
Memory Fragmentation
GPU VRAM ships in fixed denominations — 40GB or 80GB blocks. Pack three 15GB models onto one die, and suddenly 35GB sits stranded, unreachable by a 30GB model circling hungrily overhead.
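The stranding effect falls out of any naive first-fit placement. A minimal, dependency-free sketch (sizes in GB; the two-GPU layout is illustrative):

```go
package main

// firstFit places each model on the first GPU with room, returning
// per-GPU free VRAM afterward plus the models that found no home
// even though total free VRAM would have sufficed.
func firstFit(gpuFreeGB []int, modelsGB []int) ([]int, []int) {
	free := append([]int(nil), gpuFreeGB...) // copy; don't mutate input
	var unplaced []int
	for _, m := range modelsGB {
		placed := false
		for i := range free {
			if free[i] >= m {
				free[i] -= m
				placed = true
				break
			}
		}
		if !placed {
			unplaced = append(unplaced, m)
		}
	}
	return free, unplaced
}
```

With two 40 GB GPUs and models of 15, 15, 15, and 30 GB, first-fit leaves 10 GB free on one die and 25 GB on the other: 35 GB free in total, yet the 30 GB model has no single home.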
The Kubernetes Scheduler Framework
Kubernetes 1.19+ shipped an extensible scaffold—injection points where custom logic can intercept decisions mid-flight:
Pod → PreFilter → Filter → PostFilter → PreScore → Score → Reserve → Permit → PreBind → Bind → PostBind
Key extension points:
- Filter: Binary guillotine — node passes or gets axed immediately
- Score: Numeric beauty contest — rank survivors 0-100
- Permit: Final airlock before commitment — where gang coordination happens
GPU Topology-Aware Filtering Implementation
Watch how a Filter plugin reads silicon geography. Pods demanding multi-GPU placement only land where high-bandwidth pathways physically exist:
package main
import (
"context"
"fmt"
v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type TopologyAwareGPUGate struct {
handle framework.Handle
}
func (t *TopologyAwareGPUGate) Name() string {
return "TopologyAwareGPUGate"
}
func (t *TopologyAwareGPUGate) Filter(ctx context.Context, state *framework.CycleState,
pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
// Sniff how many accelerators this pod craves
acceleratorDemand := extractGPUDemand(pod)
if acceleratorDemand <= 1 {
return nil // Solo GPU? Topology irrelevant
}
// Pull node's physical interconnect map from labels
interconnectType := nodeInfo.Node().Labels["nvidia.com/gpu-topology"]
nvlinkIslands := parseInterconnectIslands(nodeInfo.Node().Labels)
// Distributed training? Verify all GPUs share an NVLink island
if detectsPytorchDDP(pod) && acceleratorDemand > 1 {
if !hasUnifiedIsland(nvlinkIslands, acceleratorDemand) {
return framework.NewStatus(framework.Unschedulable,
fmt.Sprintf("Node lacks %d GPUs in unified NVLink island", acceleratorDemand))
}
}
// Tensor parallelism? Demand premium interconnect
if requiresTensorSplitting(pod) && interconnectType != "nvlink" && interconnectType != "nvswitch" {
return framework.NewStatus(framework.Unschedulable,
"Tensor parallelism demands NVLink/NVSwitch fabric")
}
return nil
}
func hasUnifiedIsland(islands map[string]int, demand int) bool {
for _, capacity := range islands {
if capacity >= demand {
return true
}
}
return false
}
This plugin reads silicon topology from node labels (DaemonSet continuously probes nvidia-smi topo -m) and guillotines nodes lacking adequate bandwidth. Core insight: Identical GPU counts mean nothing — physical placement determines if you're getting 600 GB/s or a crawling 64 GB/s.
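Helpers such as parseInterconnectIslands are left to the implementer. One plausible sketch, assuming the topology DaemonSet publishes labels like nvidia.com/nvlink-island-0: "4" (this label schema is an assumption for illustration, not an NVIDIA convention):

```go
package main

import (
	"strconv"
	"strings"
)

// parseInterconnectIslands extracts NVLink island sizes from node labels.
// A label such as "nvidia.com/nvlink-island-0": "4" means island 0
// contains 4 fully connected GPUs.
func parseInterconnectIslands(labels map[string]string) map[string]int {
	const prefix = "nvidia.com/nvlink-island-"
	islands := make(map[string]int)
	for key, value := range labels {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		count, err := strconv.Atoi(value)
		if err != nil {
			continue // Malformed label: skip rather than fail the node
		}
		islands[strings.TrimPrefix(key, prefix)] = count
	}
	return islands
}
```

The map it returns is exactly what hasUnifiedIsland above iterates over when checking whether any single island can absorb the pod's full GPU demand.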
Scoring: Intelligent Bin-Packing for Mixed Workloads
After filtering survivors, scoring applies strategy — training versus inference workloads want opposite placement philosophies:
type WorkloadStrategyScorer struct {
handle framework.Handle
}
func (w *WorkloadStrategyScorer) Score(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeName string,
) (int64, *framework.Status) {
nodeInfo, err := w.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.AsStatus(err)
}
workloadPersonality := detectWorkloadPersonality(pod)
switch workloadPersonality {
case "training":
return scoreTrainingPlacement(pod, nodeInfo), nil
case "inference":
return scoreInferencePacking(pod, nodeInfo), nil
default:
return 50, nil
}
}
func scoreTrainingPlacement(pod *v1.Pod, nodeInfo *framework.NodeInfo) int64 {
// Training craves: isolation, predictable performance, no neighbors
acceleratorDemand := extractGPUDemand(pod)
// Heavily penalize mixed training/inference cohabitation
if detectsInferenceNeighbors(nodeInfo) {
return 10 // Terrible fit—avoid contamination
}
// Reward contiguous GPU ID allocation
if offersContiguousIDs(nodeInfo, acceleratorDemand) {
return 90
}
return 50
}
func scoreInferencePacking(pod *v1.Pod, nodeInfo *framework.NodeInfo) int64 {
// Inference loves: density, memory tetris, cohabitation
vramDemand := extractVRAMDemand(pod)
// Boost score for nodes already running inference (consolidate!)
existingVRAMLoad := measureInferenceVRAMLoad(nodeInfo)
// Target 70-80% VRAM saturation—headroom matters
projectedSaturation := float64(existingVRAMLoad+vramDemand) / float64(totalNodeVRAM(nodeInfo)) // float math; integer division would truncate to 0
switch {
case projectedSaturation >= 0.7 && projectedSaturation <= 0.8:
return 95 // Goldilocks zone
case projectedSaturation > 0.8:
return 30 // Too tight—OOM risk looms
default:
return 60 // Wasteful but acceptable
}
}
This scoring embeds a fundamental tension: training jobs demand hermetic isolation and stable performance; inference workloads thrive on aggressive bin-packing to maximize VRAM efficiency. Organizations routinely time-slice inference models (multiple models per GPU) since inference bursts sporadically rather than sustaining 100% saturation.
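The detectWorkloadPersonality helper can be as simple as a label lookup with a naming heuristic as fallback. A sketch that operates on the raw label map and pod name rather than the full *v1.Pod, to stay dependency-free (the label key is an invented example):

```go
package main

import "strings"

// detectWorkloadPersonality classifies a pod as "training", "inference",
// or "unknown" from its labels, falling back to name heuristics.
func detectWorkloadPersonality(labels map[string]string, podName string) string {
	// Explicit label wins: teams opt in deliberately.
	if t, ok := labels["ai.example.com/workload-type"]; ok {
		return t
	}
	// Heuristic fallback on common naming patterns.
	name := strings.ToLower(podName)
	switch {
	case strings.Contains(name, "train"):
		return "training"
	case strings.Contains(name, "serve"), strings.Contains(name, "infer"):
		return "inference"
	default:
		return "unknown"
	}
}
```

The "unknown" branch matters: it maps to the neutral score of 50 in the Score plugin above, so unclassified pods are neither favored nor punished.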
Gang Scheduling With Permit Plugins
The most surgical intervention for distributed training: gang scheduling enforces atomic job placement—every replica launches simultaneously, or the entire job aborts. Absent this, partial deployments consume resources indefinitely waiting for stragglers that never arrive:
import (
"context"
"sync"
"time"
v1 "k8s.io/api/core/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type ReplicaSetState struct {
requiredCount int
waitingCount int
readyCount int
mutex sync.Mutex
}
type AtomicReplicaCoordinator struct {
handle framework.Handle
pendingJobs sync.Map // jobName -> *ReplicaSetState
}
func (a *AtomicReplicaCoordinator) Name() string {
return "AtomicReplicaCoordinator"
}
func (a *AtomicReplicaCoordinator) Permit(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeName string,
) (*framework.Status, time.Duration) {
jobIdentifier := pod.Labels["job-name"]
if jobIdentifier == "" {
return framework.NewStatus(framework.Success), 0
}
minimumReplicas := extractMinimumReplicaCount(pod)
if minimumReplicas <= 1 {
return framework.NewStatus(framework.Success), 0
}
// Track replica arrival
rawState, _ := a.pendingJobs.LoadOrStore(jobIdentifier, &ReplicaSetState{
requiredCount: minimumReplicas,
waitingCount: 0,
readyCount: 0,
})
replicaState := rawState.(*ReplicaSetState)
replicaState.mutex.Lock()
replicaState.waitingCount++
currentCount := replicaState.waitingCount
replicaState.mutex.Unlock()
if currentCount < minimumReplicas {
// Gang incomplete—hold at airlock
return framework.NewStatus(framework.Wait), 30 * time.Second
}
// Gang assembled—release all simultaneously
a.handle.IterateOverWaitingPods(func(wp framework.WaitingPod) {
if wp.GetPod().Labels["job-name"] == jobIdentifier {
wp.Allow(a.Name())
}
})
// Clear the bookkeeping so a future job with the same name starts fresh
a.pendingJobs.Delete(jobIdentifier)
return framework.NewStatus(framework.Success), 0
}
The Permit plugin acts as a coordination airlock. Pods reaching this stage already passed filtering/scoring — resources reserved, placement decided. The plugin holds each pod in suspension until siblings arrive. Once the gang completes, all doors open simultaneously.
This mechanism eliminates the classic pathology: 15 of 16 PyTorch DDP workers successfully reserve GPUs, the 16th starves eternally, and the cluster remains deadlocked on incomplete jobs, consuming resources without progressing.
Priority and Preemption for Cost Optimization
GPU time hemorrhages money — 8xA100 nodes burn roughly $24/hour on major clouds. Interrupt a 48-hour training run near completion, and you've torched $1,000+ in wasted cycles. Strategic preemption logic slashes this waste:
import (
"time"
v1 "k8s.io/api/core/v1"
"k8s.io/kubernetes/pkg/scheduler/framework"
)
type CostAwarePreemptor struct {
handle framework.Handle
}
func (c *CostAwarePreemptor) evaluateVictims(pod *v1.Pod, nodeInfo *framework.NodeInfo) []*v1.Pod {
incomingPodMetadata := extractPodCharacteristics(pod)
var candidateVictims []*v1.Pod
for _, existingPod := range nodeInfo.Pods {
victimMetadata := extractPodCharacteristics(existingPod.Pod)
// Sacred cows: never preempt high-priority training
if victimMetadata.workloadType == "training" && victimMetadata.priority >= 1000 {
continue
}
// Low-hanging fruit: preempt stateless inference (restarts cheaply)
if victimMetadata.workloadType == "inference" {
candidateVictims = append(candidateVictims, existingPod.Pod)
continue
}
// Training victims: surgical decisions required
if victimMetadata.workloadType == "training" {
elapsedHours := time.Since(victimMetadata.startTime).Hours()
// Proximity to completion? Hands off
if victimMetadata.estimatedCompletion > 0 &&
victimMetadata.estimatedCompletion-elapsedHours < 2 {
continue
}
// Checkpointing absent? Don't torch irreplaceable work
if !hasCheckpointingEnabled(existingPod.Pod) {
continue
}
// Lower priority + recent checkpoint? Acceptable victim
if victimMetadata.priority < incomingPodMetadata.priority &&
victimMetadata.timeSinceCheckpoint < 30*time.Minute {
candidateVictims = append(candidateVictims, existingPod.Pod)
}
}
}
return candidateVictims
}
This preemption logic encodes workload economics: Inference services carry no state between invocations and restart instantly; training jobs accumulate state, and bleeding one mid-run destroys significant investment. By interrogating checkpoint freshness and elapsed time, the scheduler minimizes squandered compute.
Deployment Architecture
Custom scheduler plugins demand thoughtful deployment. The recommended pattern is a secondary scheduler running alongside the default:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ai-scheduler-config
  namespace: kube-system
data:
  scheduler-config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: ai-scheduler
        plugins:
          filter:
            enabled:
              - name: TopologyAwareGPUGate
          score:
            enabled:
              - name: WorkloadStrategyScorer
                weight: 10
          permit:
            enabled:
              - name: AtomicReplicaCoordinator
        pluginConfig:
          - name: TopologyAwareGPUGate
            args:
              requiredTopologies:
                - nvlink
                - nvswitch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      component: ai-scheduler
  template:
    metadata:
      labels:
        component: ai-scheduler
    spec:
      serviceAccountName: ai-scheduler
      containers:
        - name: scheduler
          image: your-registry/ai-scheduler:v1.0
          command:
            - /usr/local/bin/kube-scheduler
            - --config=/etc/kubernetes/scheduler-config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/kubernetes
      volumes:
        - name: config
          configMap:
            name: ai-scheduler-config
Workloads opt in via schedulerName: ai-scheduler in their pod specs. This enables incremental rollout and instant rollback if things blow up.
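Opting a workload in is then a one-line change in the pod spec. A minimal example (image name and labels are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ddp-worker-0
  labels:
    job-name: llama-finetune   # Gang-scheduling key read by the Permit plugin
spec:
  schedulerName: ai-scheduler  # Route this pod to the custom scheduler
  containers:
    - name: trainer
      image: your-registry/trainer:v1.0
      resources:
        limits:
          nvidia.com/gpu: 2
```

Pods without the schedulerName field continue to flow through the default scheduler untouched.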
Real-World Impact
Organizations deploying custom scheduler plugins witness dramatic shifts:
- GPU utilization: Increased from 30-40% to 60-80% through enhanced bin-packing and topology awareness
- Queue time: Reduced from 4.6 hours to an average of 18 minutes by eliminating head-of-line blocking
- Cost savings: $180K annually from smarter preemption in one large ML org
- Developer productivity: Ticket floods about "stuck jobs" evaporate when gang scheduling guarantees atomic placement
Monitoring and Observability
Critical metrics worth tracking:
import (
"github.com/prometheus/client_golang/prometheus"
)
var (
scheduleAttempts = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "ai_scheduler_attempts_total",
},
[]string{"workload_type", "result"},
)
gangSchedulingWaitTime = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "gang_scheduling_wait_seconds",
Buckets: []float64{10, 30, 60, 300, 600},
},
[]string{"job_name"},
)
)
Limitations and Drawbacks
- Operational complexity: You're now maintaining critical control plane machinery. One team invested 2 engineer-months hunting race conditions in gang scheduling.
- State management: Schedulers require state (gang membership, topology caches) while appearing stateless to Kubernetes. Most implementations use in-memory caches, which must be kept in sync with cluster state and rebuilt after a restart.
- Testing complexity: Unit tests are straightforward; integration tests are harder, since mocking realistic cluster states and cascading failures takes real effort.
- Performance impact: Each plugin can add 10-50 ms of latency to every scheduling decision. Instrument and budget accordingly:
func (t *TopologyAwareGPUGate) Filter(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeInfo *framework.NodeInfo,
) *framework.Status {
startTime := time.Now()
defer func() {
if duration := time.Since(startTime); duration > 50*time.Millisecond {
klog.Warningf("Filter consumed %v for pod %s", duration, pod.Name)
}
}()
// ... filter logic
}
- Version compatibility: Scheduler framework APIs evolve. Plugins targeting K8s 1.24 may require surgery for 1.28.
When NOT to Use Custom Scheduler Plugins
- Small GPU footprint: Under 10 GPU nodes, operational overhead dwarfs benefits. Standard node labels and priority classes suffice.
- Workload homogeneity: Pure inference or pure training? Default scheduling with quotas typically handles this gracefully.
- Existing solutions available: Verify whether your cloud provider or NVIDIA GPU Operator already solves your problem.
- Insufficient observability: Custom schedulers starve without real-time GPU telemetry. Build observability infrastructure first.
Alternate Approaches
- Volcano scheduler: An open-source batch scheduler that provides gang scheduling and is inherently aware of GPU topology:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  minAvailable: 4  # Gang scheduling
  schedulerName: volcano
  tasks:
    - replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              resources:
                limits:
                  nvidia.com/gpu: 1
- Kueue: CNCF project delivering job queueing with multi-tenancy and quotas; cooperates with the default scheduler.
- Ray/Kubeflow/Flyte: Higher-level ML orchestrators managing resources above the Kubernetes abstraction layer.
- Multi-cluster: For massive GPU fleets, federate smaller clusters rather than engineering monolithic single-cluster complexity.
Making the Decision
Build custom schedulers when:
- GPU fleet exceeds 20 nodes with mixed workloads
- Measurable waste stems from topology bottlenecks
- Team possesses Kubernetes control plane expertise
- Robust GPU observability already operational
Use existing solutions when:
- Small footprint (under 10 nodes)
- Homogeneous workloads
- Limited SRE capacity
- Cloud provider solutions address majority requirements
Hybrid approach: Launch with Volcano/Kueue, inject targeted custom plugins only where gaps materialize.
Conclusion
Custom Kubernetes scheduler plugins unlock dormant GPU infrastructure potential for AI/ML workloads. By encoding domain expertise about topology constraints, workload characteristics, and economic realities into scheduling logic, SREs transform utilization from abysmal to exceptional.
Custom schedulers aren't a silver bullet. They add operational burden, demand ongoing upkeep, and require deep Kubernetes expertise. Before starting a custom build, organizations should verify whether Volcano, Kueue, or their cloud provider's offerings already cover the need.
For teams proceeding: Start minimal — implement topology filtering first, measure impact, then layer in scoring logic, and finally tackle gang scheduling and preemption. Each enhancement builds on the last, and at the right scale the investment pays for itself quickly through lower cloud bills and higher developer productivity.
The ideal scenario for custom scheduler plugins involves organizations managing over 20 GPU nodes with a combination of training and inference tasks, established observability systems, and committed platform engineering teams.
For these organizations, custom schedulers transition from optional optimization to essential infrastructure for serious ML platforms.
Start with existing tools, measure the pain points accurately, and build custom solutions only where clear gaps remain. GPU infrastructure is evolving quickly: today's custom code may become tomorrow's open-source baseline.