NVIDIA GPU Operator Explained: Simplifying GPU Workloads on Kubernetes

Learn how the NVIDIA GPU Operator simplifies GPU management in Kubernetes. Explore features, setup steps, and best practices for AI/ML workloads.

Sagar Parmar

Nov. 18, 25 · Tutorial


While GPUs have long been a staple in industries like gaming, video editing, CAD, and 3D rendering, their role has evolved dramatically over the years. Originally designed to handle graphics-intensive tasks, GPUs have proven to be powerful tools for a wide range of computationally demanding applications. Today, their capacity for massively parallel processing has made them indispensable in modern fields such as data science, artificial intelligence and machine learning (AI/ML), robotics, cryptocurrency mining, and scientific computing. This shift was catalysed by the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA in 2007, which unlocked the potential of GPUs for general-purpose computing. As a result, GPUs are no longer just graphics accelerators; they’re now at the heart of cutting-edge innovation across industries.

In this blog post, we will discuss the NVIDIA GPU Operator and how to deploy it on a Kubernetes cluster.

Why Run GPU Workloads on Kubernetes?

Running GPU workloads on Kubernetes offers significant advantages: it enables developers to schedule and run GPU-powered applications, and it simplifies the deployment and scaling of these workloads. With Kubernetes, workloads can be scaled up or down based on demand, while features like namespaces and role-based access control (RBAC) support isolation and multi-tenancy in secure, shared environments. Additionally, Kubernetes supports the creation of multi-cloud GPU clusters, allowing organizations to use GPU resources across different cloud providers with consistent orchestration and control.

In this article, we’ll explore the GPU-Kubernetes integration stack in depth with the help of NVIDIA GPU Operator. From the host operating system to the Kubernetes control plane, we’ll peel back each layer to understand the components required to make GPUs work seamlessly within a Kubernetes environment. More importantly, we’ll uncover why each component matters and how they interact with one another.

How GPUs Are Integrated in Kubernetes Without a GPU Operator

Kubernetes excels at managing standard compute workloads, but orchestrating high-performance hardware like GPUs introduces unique challenges. Before diving into the GPU Operator, it’s important to understand the three foundational layers required to run GPU workloads in Kubernetes. Think of it as a recipe; each step must be correctly configured for the GPU to function seamlessly within the cluster.

Step 1: The Host Operating System

Figure: CUDA Toolkit Version

Everything begins at the host level. The NVIDIA device driver is the critical software that communicates directly with the GPU hardware. A key requirement here is version compatibility between the driver and the CUDA toolkit embedded in your container image: each CUDA release needs a minimum driver version, so a host driver that is too old for the image’s CUDA version can break GPU functionality.
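To see what a host currently provides, we can query the driver with nvidia-smi, which is installed together with the NVIDIA driver. The helper below is a minimal sketch (the function name is only illustrative): it prints the driver version and model of each GPU, or a notice when no driver is present.

```shell
# Print "<driver_version>, <gpu_name>" for each GPU on this host.
# Falls back to a notice when nvidia-smi (installed with the driver) is absent.
show_gpu_driver() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
  else
    echo "nvidia-smi not found: no NVIDIA driver on this host"
  fi
}

# Usage: show_gpu_driver
```

The reported driver version can then be compared with the minimum driver version that NVIDIA lists for the CUDA release used in your image.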

Step 2: The Container Runtime (Docker, containerd, CRI-O, runc, etc.)

Next, we need a bridge between the container runtime (e.g., Docker, containerd, CRI-O) and the host GPU. This is where the NVIDIA Container Toolkit comes in.

Core Functions of the Toolkit

GPU access enablement: Provides the libnvidia-container library and the nvidia-container-cli utility, which configure runtimes for GPU access.
Runtime configuration: Injects GPU device files, driver libraries, and environment variables into containers via runtime hooks; the runtime itself is pointed at those hooks through its configuration (e.g., updates to /etc/containerd/config.toml).
Device plugin dependency: The NVIDIA Device Plugin relies on the toolkit to expose GPU resources to Kubernetes.
Abstraction layer: Allows containers to use GPUs without bundling drivers or CUDA libraries inside the image, keeping containers lightweight and portable.

Without this toolkit, containers remain unaware of the GPU hardware on the node.
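On a node that is set up by hand, the runtime configuration step is typically done with the nvidia-ctk CLI that ships with the toolkit. The sketch below assumes containerd managed by systemd; the function name and the DRY_RUN switch are illustrative additions, not part of the toolkit.

```shell
# Configure containerd to use the NVIDIA runtime, then restart it.
# Needs root on the GPU node. Set DRY_RUN=1 to print the commands instead.
configure_containerd_for_gpu() {
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  # Rewrites the containerd config (e.g., /etc/containerd/config.toml).
  $run sudo nvidia-ctk runtime configure --runtime=containerd
  # The new runtime entry only takes effect after a restart.
  $run sudo systemctl restart containerd
}

# Usage: DRY_RUN=1 configure_containerd_for_gpu
```

Running it with DRY_RUN=1 first is a cheap way to review what would change on the node.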

Step 3: The Kubernetes Orchestration Layer

Finally, Kubernetes needs to recognize and schedule GPU resources. This is achieved through the NVIDIA Device Plugin, which runs as a DaemonSet on GPU-enabled nodes.

Core Functions of the Device Plugin

GPU discovery and advertising: Detects available GPUs and registers them with the kubelet as extended resources (e.g., nvidia.com/gpu).
Resource allocation: When a pod requests a GPU, the plugin ensures the container receives the correct device files, drivers, and environment variables.
Health monitoring: Continuously checks GPU health and updates Kubernetes to prevent scheduling on faulty devices.
GPU sharing and partitioning: It maximizes utilization via advanced features:
- Time-slicing: Allows multiple containers to share a single GPU’s compute power.
- Multi-instance GPU (MIG): Partitions high-end GPUs (like the A100) into multiple, fully isolated hardware instances.
- Virtual GPU (vGPU): Enables the sharing of a single GPU among multiple virtual machines.
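Once the device plugin has registered GPUs with the kubelet, the extended resource shows up in each node's allocatable resources. A minimal sketch for listing it per node (the function name is illustrative):

```shell
# List every node with the number of GPUs it advertises as allocatable.
# Nodes without a registered device plugin show <none> in the GPU column.
list_node_gpus() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage: list_node_gpus
```

The backslash before ".com" keeps kubectl from treating the dot in the resource name as a path separator.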

Why Scaling GPU Workloads in Kubernetes Is Hard and How Operators Help

The three-layer setup we discussed works well on a single machine. But things get complicated when you scale to a production-grade Kubernetes cluster with hundreds or thousands of nodes. That’s when the manual approach starts to fall apart and the real operational pain begins.

Manually managing an entire fleet introduces a massive operational challenge that can bring projects to a grinding halt. You’re navigating a minefield of issues:

Driver compatibility: Different GPU models require different, specific driver versions.
Configuration drift: Nodes inevitably fall out of sync over time.
Risky upgrades: Upgrading drivers by hand across many nodes is error-prone and can disrupt running workloads.
Doubled workload: You often end up managing two completely separate software stacks, one for CPU nodes and another for GPU nodes, effectively doubling your workload.

To solve these scaling challenges, the Kubernetes community embraced a powerful cloud-native pattern: the Operator. Think of it as an automated expert, a robotic administrator that continuously monitors your cluster and handles the tedious, error-prone tasks for you. It brings consistency, reliability, and automation to GPU management at scale.

The GPU Operator works in a control loop, constantly observing the state of your nodes and ensuring they match the desired configuration you’ve defined. This means no more manual setup, no more configuration drift, and no more juggling separate software stacks for CPU and GPU nodes.

This shift from manual management to automated orchestration is what makes the Operator pattern so transformative. It turns GPU infrastructure from a fragile, high-maintenance setup into a resilient, self-healing system.
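The control loop described above can be illustrated with a toy sketch. This is not the GPU Operator's code; it only shows the observe, compare, act cycle. The component list mirrors the three layers discussed earlier, while the state file and the function name are hypothetical.

```shell
# Toy reconcile loop: desired state is a fixed list of components,
# observed state is a text file with one installed component per line.
DESIRED="driver toolkit device-plugin"

# One pass: "install" (append) every desired component that is missing.
reconcile_once() {
  state_file=$1
  for component in $DESIRED; do
    if ! grep -qx "$component" "$state_file" 2>/dev/null; then
      echo "$component" >> "$state_file"
      echo "reconciled: $component"
    fi
  done
}

# Usage: reconcile_once /tmp/node-state.txt
```

Running reconcile_once twice in a row makes the point: the second pass prints nothing, because observed and desired state already match. A real operator repeats such passes whenever cluster state changes.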

How NVIDIA GPU Operator Works

The Operator establishes a consistent, automated workflow for every node in your cluster, eliminating manual intervention. The process has three stages:

Discovery: It first identifies which nodes physically possess GPUs.
Installation and configuration: In the required order, it automatically installs the necessary containerised drivers, configures the Container Toolkit, and deploys the device plugin along with monitoring tools.
Validation: This final step is critical: the Operator validates that every component is working correctly before allowing Kubernetes to schedule any AI workloads on that node.

This process improves reliability and prevents misconfigured nodes from disrupting GPU-intensive applications.
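The result of the discovery stage is visible as node labels once the Operator is installed. A small sketch for listing the nodes it has identified, using the nvidia.com/gpu.present and nvidia.com/gpu.product labels that appear in the node description later in this post (the function name is illustrative):

```shell
# List nodes the Operator has identified as having a GPU,
# adding a column with the detected GPU model.
list_gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true -L nvidia.com/gpu.product
}

# Usage: list_gpu_nodes
```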

Installing the NVIDIA GPU Operator

Installing the NVIDIA GPU Operator in Kubernetes is straightforward with Helm. The Operator automates the deployment and configuration of all essential GPU components, including drivers, the container toolkit, and device plugins across your cluster. To ensure a smooth setup, follow a step-by-step approach.

Prerequisites

Before proceeding, please make sure that you have met the following prerequisites:

Operating System Requirements for the GPU Operator:
- To use the NVIDIA GPU Driver container for your workloads, all GPU-enabled worker nodes must share the same operating system version.
- If you need to mix different operating systems across GPU nodes, you must pre-install the NVIDIA GPU Driver manually on each respective node instead of using the containerized driver.
- CPU-only nodes have no OS restrictions, as the GPU Operator does not manage or configure them.
Helm is installed.
You have permission to execute kubectl commands against the target cluster.

Installation Steps

1. Add NVIDIA Helm Repository.

    Shell
   
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
     && helm repo update

2. Install GPU Operator.

    Shell
   
   helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --version=v25.10.0

If the NVIDIA driver or toolkit is already installed on your nodes, you can disable either or both during GPU Operator deployment by using the following flags:

    Shell
   
   --set driver.enabled=false
   --set toolkit.enabled=false
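These flags are appended to the install command from step 2. As a sketch, the wrapper below builds the full command for whichever components are already present on the nodes; the function name and its arguments are illustrative.

```shell
# Install the GPU Operator while reusing a driver and/or toolkit that is
# already present on the nodes. Arguments: "driver", "toolkit", or both.
install_gpu_operator_reusing() {
  flags=""
  for component in "$@"; do
    flags="$flags --set $component.enabled=false"
  done
  # $flags is intentionally unquoted so each flag becomes its own argument.
  helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.10.0 $flags
}

# Usage: install_gpu_operator_reusing driver toolkit
```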

3. Verify the installation by checking the status of the deployed resources.

    Shell
   
   kubectl get pods -n gpu-operator

You should see the GPU operator components running in the namespace.
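With many pods in the namespace, it can help to print only the ones that are not healthy yet. The filter below is a minimal sketch that assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and treats Completed as healthy, on the assumption that one-shot validation pods finish in that state; the function name is illustrative.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print every pod
# whose STATUS column (3rd field) is neither Running nor Completed.
pods_not_ready() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 ": " $3 }'
}

# Usage: kubectl get pods -n gpu-operator --no-headers | pods_not_ready
```

Empty output means every pod in the namespace is either running or has finished.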

4. Inspect the nodes to confirm that the GPU nodes are configured correctly.

    Shell
   
   kubectl describe nodes

Shell

Name:               <node-name>
Roles:              worker
Labels:             node-role.kubernetes.io/worker=true
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-PCIe
                    nvidia.com/gpu.replicas=1
...
Annotations:        nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 10.*.*.*/*
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.*.*.*
...
Capacity:
  cpu:                64
  ephemeral-storage:  32758Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527533864Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  32631789953
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527533864Ki
  nvidia.com/gpu:     1
  pods:               110

We can see that the node with GPU hardware attached has GPU-related labels and annotations added to it. Additionally, the GPU resources are visible under the Capacity and Allocatable sections.
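Because the full node description is long, a small filter makes the GPU-related parts easier to spot. This sketch simply keeps everything from "nvidia.com/" to the end of each matching line; the function name is illustrative.

```shell
# Keep only the NVIDIA-related entries (labels, annotations, resources)
# from `kubectl describe nodes` output read on stdin.
nvidia_lines() {
  grep -o 'nvidia\.com/.*'
}

# Usage: kubectl describe nodes | nvidia_lines
```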

Verification by Running a Sample GPU Application

We can test the setup by deploying the CUDA vectoradd application provided by NVIDIA on our cluster. This image is an NVIDIA CUDA sample that demonstrates vector addition, a basic GPU computation.

Under the resources → limits section of the manifest below, you’ll notice nvidia.com/gpu: 1. This requests one GPU, so Kubernetes schedules the Pod only on a node with an allocatable NVIDIA GPU.

    Shell
   
 

   cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
  

    Shell
   
   pod/cuda-vectoradd created

Now we can check the logs:

    Shell
   
   kubectl logs pod/cuda-vectoradd

Logs output:

    Plain Text
   
 

   [Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
  

The cluster is now ready to run GPU workloads.
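For scripted checks, the same verification can be reduced to an exit code, followed by a cleanup of the sample Pod. A minimal sketch, with illustrative function names:

```shell
# Succeed only if the sample's log contains the "Test PASSED" line.
vectoradd_passed() {
  kubectl logs pod/cuda-vectoradd | grep -q "Test PASSED"
}

# Remove the finished sample Pod so it does not linger in the namespace.
cleanup_vectoradd() {
  kubectl delete pod cuda-vectoradd
}

# Usage: vectoradd_passed && cleanup_vectoradd
```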

GPU Sharing/Maximizing GPU Utilization

GPUs are expensive, high-performance hardware, and leaving them idle is a waste of valuable resources. Once your GPUs are up and running in Kubernetes, the real challenge becomes efficient sharing. The goal is to extract maximum value from every single card.

The GPU Operator makes this easy by allowing you to configure advanced sharing strategies declaratively. For example:

MIG (Multi-Instance GPU): Physically partitions a single GPU into multiple, fully isolated instances, each with dedicated memory and compute.
MPS (Multi-Process Service): Lets multiple processes run on the same GPU concurrently instead of taking turns.
Time-slicing: Lets multiple Pods take turns on the same GPU, without memory or fault isolation between them. Ideal for development workloads that only need occasional GPU access.
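As an illustration of the declarative approach, here is a sketch of a time-slicing setup. It assumes the ConfigMap layout and the cluster-policy patch described in NVIDIA's time-slicing documentation; the ConfigMap name, the "any" key, and both function names are illustrative, so check them against the documentation for your Operator version.

```shell
# Print a time-slicing ConfigMap that advertises each physical GPU as
# <replicas> schedulable nvidia.com/gpu resources.
write_time_slicing_config() {
  replicas=$1
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: $replicas
EOF
}

# Apply the ConfigMap and point the device plugin at it (assumed patch).
enable_time_slicing() {
  write_time_slicing_config "$1" | kubectl apply -f -
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
}

# Usage: enable_time_slicing 4
```

With 4 replicas, a node with one physical GPU would be expected to advertise nvidia.com/gpu: 4, which means four Pods can share the card but none of them gets a guaranteed slice of memory or compute.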

GPU Sharing Strategy

The optimal GPU sharing strategy depends entirely on your specific workload requirements and operational goals. The choice involves balancing factors like performance isolation, dynamic flexibility, and raw utilization. A workload that demands predictable performance in a multi-tenant cluster has very different needs than an interactive development workload.

Optional GPU Operator Components: Streamlining Data Movement

The GPU Operator includes additional components that are not enabled by default, such as GPUDirect RDMA and GPUDirect Storage. These tools are designed to streamline data movement between GPUs and other system components, effectively bypassing traditional bottlenecks like the CPU and system memory.

GPUDirect RDMA (Remote Direct Memory Access)

GPUDirect RDMA enables direct memory access between GPUs and PCIe devices (such as NICs or storage adapters), without involving the CPU or system RAM. This is ideal for High-Performance Computing (HPC) and AI training, where latency is critical.

Figure: Direct Communication between NVIDIA GPUs

Benefits:

Lower latency: Data moves directly between the GPU and the device.
Reduced CPU load: Frees up CPU cycles for compute tasks.
Higher bandwidth: Enables faster data transfer for distributed workloads.

Use cases:

GPU-to-GPU communication across nodes
Real-time inference at the edge
High-speed networking in HPC clusters

GPUDirect Storage

GPUDirect Storage allows GPUs to read data directly from NVMe or other storage devices, again bypassing the CPU and system memory. This is essential for AI/ML workloads that need to access and process large datasets quickly.

Figure: A Direct Path Between Storage and GPU Memory

Benefits:

Faster data ingestion: Minimizes I/O bottlenecks during training or inference.
Efficient data pipeline: Direct flow from storage to GPU memory.
Simplified architecture: Eliminates unnecessary memory copies and CPU involvement.

Use cases:

Large-scale deep learning training
Data analytics pipelines
Scientific simulations with massive datasets

Both technologies are part of NVIDIA’s strategy to optimize data movement for GPU workloads. By enabling direct communication paths between GPUs and external devices, they unlock higher performance, lower latency, and better resource utilization in Kubernetes environments where scalability and efficiency are critical.
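Since these components are off by default, they have to be switched on at install time. The sketch below assumes the chart values driver.rdma.enabled and gds.enabled; treat both names as assumptions and confirm them with helm show values nvidia/gpu-operator for your chart version. Both features also depend on suitable hardware and host networking or storage drivers, which this sketch does not cover. The function name is illustrative.

```shell
# Install the GPU Operator with optional data-movement components enabled.
# Arguments: "rdma", "gds", or both. The --set value names are assumptions;
# confirm them with `helm show values nvidia/gpu-operator`.
install_gpu_operator_gpudirect() {
  flags=""
  for feature in "$@"; do
    case $feature in
      rdma) flags="$flags --set driver.rdma.enabled=true" ;;
      gds)  flags="$flags --set gds.enabled=true" ;;
      *)    echo "unknown feature: $feature" >&2; return 1 ;;
    esac
  done
  # $flags is intentionally unquoted so each flag becomes its own argument.
  helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.10.0 $flags
}

# Usage: install_gpu_operator_gpudirect rdma gds
```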

Summary

Integrating NVIDIA GPUs into Kubernetes typically involves a complex, three-layer manual setup: host drivers, the container toolkit, and the Kubernetes device plugin. This approach works for single machines but creates massive operational challenges like configuration drift and incompatible drivers at scale.

The NVIDIA GPU Operator is the solution. It uses the Operator pattern to automate the entire lifecycle, acting as a “robotic administrator” that discovers GPUs, installs the necessary software stack in the correct order, validates the setup, and streamlines maintenance.

The core benefit? Simplifying your infrastructure so you can focus on AI workloads, not operational headaches.

I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so feel free to start a conversation on Twitter or LinkedIn.

Published at DZone with permission of Sagar Parmar. See the original article here.