
How to Use GPU Nodes in Amazon EKS

In this article, we'll set up GPU nodes in Amazon EKS in seven simple steps using the NVIDIA GPU Operator, and cover basic ways to debug the setup after deployment.

By Alexander Sharov · Mar. 11, 25 · Tutorial

Running GPU workloads on Amazon EKS requires configuring GPU-enabled nodes, installing necessary drivers, and ensuring proper scheduling. Follow these steps to set up GPU nodes in your EKS cluster.

1. Create an Amazon EKS Cluster

First, create an EKS cluster without worker nodes (for simplicity, we use eksctl rather than Terraform/OpenTofu):

Shell
 
eksctl create cluster --name kvendingoldo-eks-gpu-demo --without-nodegroup


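Once the control plane is ready, point kubectl at the new cluster. eksctl usually updates your kubeconfig automatically; the explicit command below is a sketch in case it does not (the region flag is an assumption — use whatever region you created the cluster in):

```shell
# Update kubeconfig for the new cluster (adjust --region to match your setup)
aws eks update-kubeconfig --name kvendingoldo-eks-gpu-demo --region us-east-1

# Sanity check: the API server should respond; no nodes exist yet
kubectl get nodes
```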
2. Create a Default CPU Node Group

A separate CPU node group ensures that:

  • Kubernetes system components (kube-system pods) have a place to run.
  • The GPU Operator and its dependencies will be deployed successfully.
  • Non-GPU workloads don’t end up on GPU nodes.

Create at least one CPU node to maintain cluster stability:

Shell
 
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
 --name cpu-nodes \
 --node-type t3.medium \
 --nodes 1 \
 --nodes-min 1 \
 --nodes-max 3 \
 --managed


3. Create a GPU Node Group

GPU nodes should carry appropriate taints to keep non-GPU workloads off them. Use an NVIDIA GPU instance type for these nodes (you can compare options at instances.vantage.sh; typical choices are g4dn.xlarge or p3.2xlarge):

Shell
 
eksctl create nodegroup --cluster kvendingoldo-eks-gpu-demo \
 --name gpu-nodes \
 --node-type g4dn.xlarge \
 --nodes 1 \
 --node-taints only-gpu-workloads=true:NoSchedule \
 --managed


The custom taint only-gpu-workloads=true:NoSchedule guarantees that only pods carrying a matching toleration are scheduled on these nodes.

4. Install the NVIDIA GPU Operator

The NVIDIA GPU Operator installs the NVIDIA driver, the CUDA and container toolkits, and monitoring tools. To install it, use the following steps:

1. Create gpu-operator-values.yaml:

YAML
 
daemonsets:
  tolerations:
  - key: "only-gpu-workloads"
    value: "true"
    effect: "NoSchedule"


2. Deploy the gpu-operator via Helm:

Shell
 
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
 -n gpu-operator --create-namespace \
 -f gpu-operator-values.yaml


Pay attention to two things:

  1. The plain-YAML deployment of k8s-device-plugin shouldn’t be used for production; the GPU Operator is the supported path.
  2. The tolerations in gpu-operator-values.yaml let the operator’s DaemonSets run on the tainted GPU nodes; without them, the operator pods cannot be scheduled there and you won’t be able to run GPU workloads.

5. Verify GPU Availability

After deploying the GPU Operator, check whether NVIDIA devices are correctly detected on the GPU nodes with the following command:

Shell
 
kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia


Check GPU Status on the Node Using AWS SSM (In Case of Issues)

If you need to manually debug a GPU node, connect using AWS SSM (Systems Manager Session Manager) instead of SSH.

Step 1: Attach SSM IAM Policy

Ensure your EKS worker nodes have the AmazonSSMManagedInstanceCore policy:

Shell
 
aws iam attach-role-policy --role-name <NodeInstanceRole> \
 --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

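If you don’t know the node role offhand, you can look it up from the node group itself. A sketch, assuming the gpu-nodes group from step 3 (note the output is a role ARN; attach-role-policy wants just the role name, i.e. the last path segment):

```shell
# Look up the IAM role attached to the GPU node group
aws eks describe-nodegroup \
 --cluster-name kvendingoldo-eks-gpu-demo \
 --nodegroup-name gpu-nodes \
 --query "nodegroup.nodeRole" --output text
```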

Step 2: Start an SSM Session

Find the Instance ID of your GPU node:

Shell
 
aws ec2 describe-instances --filters "Name=tag:eks:nodegroup-name,Values=gpu-nodes" \
 --query "Reservations[].Instances[].InstanceId" --output text


Start AWS SSM session:

Shell
 
aws ssm start-session --target <Instance-ID>


Inside the node, check the GPU state:

  • lspci | grep -i nvidia to check if the GPU hardware is detected
  • nvidia-smi to verify the NVIDIA driver and GPU status
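Taken together, a quick in-node check might look like the sketch below (run inside the SSM session; exact log output varies by AMI and driver version):

```shell
# 1. Is the GPU visible on the PCI bus at all?
lspci | grep -i nvidia

# 2. Is the NVIDIA kernel module loaded?
lsmod | grep nvidia

# 3. Can the driver talk to the GPU?
nvidia-smi

# 4. If nvidia-smi fails, kernel logs often show why the driver didn't load
dmesg | grep -i nvidia | tail -n 20
```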

If nvidia-smi fails or the GPU is missing, it may indicate that:

  • The GPU Operator is not installed correctly.
  • The node does not have an NVIDIA GPU.
  • The NVIDIA driver failed to load.

Check the official NVIDIA documentation to resolve these issues.

6. Schedule a GPU Pod

Deploy a test pod to verify GPU scheduling. This pod:

  • Requests a GPU.
  • Uses tolerations to run on GPU nodes.
  • Runs nvidia-smi to confirm GPU access.
YAML
 
---
apiVersion: v1
kind: Pod
metadata:
  name: kvendingoldo-gpu-test
spec:
  tolerations:
  - key: "only-gpu-workloads"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1


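To run the test, apply the manifest and read the pod’s logs once the container has finished (the filename below is an assumption — use wherever you saved the manifest; the pod name matches the one above):

```shell
kubectl apply -f gpu-test.yaml

# The container runs nvidia-smi once and exits; its logs should show the GPU table
kubectl logs kvendingoldo-gpu-test
```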
7. Handling "Insufficient nvidia.com/gpu" Errors

A common failure mode is a pod stuck in Pending with an event like:

Shell
 
0/2 nodes are available: 1 Insufficient nvidia.com/gpu


This means that all GPUs are already allocated, or that Kubernetes does not recognize the available GPUs. The following fixes may help.

Check GPU Allocations

Shell
 
kubectl describe node <gpu-node-name> | grep "nvidia.com/gpu"


If you don’t see any nvidia.com entries on your GPU node, the operator isn’t working and needs debugging; the cause is typically mismatched taints and tolerations. Note that an nvidia-device-plugin pod should be running on every GPU node.
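One way to confirm the device plugin is present on each GPU node (assuming the operator runs in the gpu-operator namespace, as in step 7):

```shell
# List device-plugin pods and the nodes they are scheduled on
kubectl get pods -n gpu-operator -o wide | grep device-plugin
```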

Verify the GPU Operator

Check the status of operator pods:

Shell
 
kubectl get pods -n gpu-operator


If some pods are stuck in Pending or CrashLoopBackOff, restart the operator:

Shell
 
kubectl delete pod -n gpu-operator --all


Restart the Kubelet

Sometimes the kubelet gets stuck. In such cases, logging into the node and restarting the kubelet may help.
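On an Amazon Linux EKS node, this is a systemd restart (a sketch to run inside the SSM session from step 5; the unit name assumes the standard EKS AMI):

```shell
# Restart the kubelet and confirm it came back healthy
sudo systemctl restart kubelet
sudo systemctl status kubelet --no-pager
```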

Scale Up GPU Nodes

Increase GPU node count:

Shell
 
eksctl scale nodegroup --cluster=kvendingoldo-eks-gpu-demo --name=gpu-nodes --nodes=3


Conclusion

Congrats! Your EKS cluster is all set to tackle GPU workloads. Whether you’re running AI models, processing videos, or crunching data, you’re ready to go. Happy deploying! 


Published at DZone with permission of Alexander Sharov. See the original article here.
