Secure Multi-Tenant GPU-as-a-Service on Kubernetes: Architecture, Isolation, and Reliability at Scale
GPU-as-a-Service makes it easier to share accelerators, but it also raises concerns about isolation and security. This introduces a secure Kubernetes architecture.
Join the DZone community and get the full member experience.
Join For FreeGPUs are a core feature of modern cloud platforms, used to support a wide range of machine learning training, inference, analytics, and simulation workloads. To support this diverse demand, GPUs can no longer be dedicated to a single team or application. Dedicated GPU solutions have quickly become infeasible and very expensive.
To meet this demand, organizations are increasingly looking to shared platforms, where many teams can directly consume GPU resources from a shared Kubernetes cluster. GPU-as-a-Service (GPUaaS) platforms provide this capability.
GPUs, however, are not generic compute. GPUs expose device memory and depend on privileged drivers that can be misconfigured and introduce large blast radii. Running GPU workloads using traditional shared infrastructure practices introduces risk and likely results in security gaps, noisy-neighbor problems, and brittle operations.
In this article, we explore what it takes to design a secure, multi-tenant GPU-as-a-Service platform on Kubernetes. We will discuss several important architectural, isolation, and reliability considerations that are far beyond just tooling.
Why Multi-Tenant GPUs Are Hard
We know how to schedule multi-tenant CPUs. GPUs are another story.
The problem areas are:
- GPU sharing provides little hardware isolation by default
- Device drivers operate in privileged mode
- GPU workloads are long-lived and stateful
- Failures are amplified across pods
- Starvation and unfairness can occur
Due to these factors, GPU sharing needs to be addressed as a security and architecture concern, in addition to a scheduling issue.
Architecture Overview
A secure GPU-as-a-Service platform needs well-defined architectural layers, each with a clear responsibility. The layers are:
- Tenant Isolation Layer – to isolate teams and workloads
- GPU Control Layer – to control GPU allocation
- Security and Governance Layer – to enforce guardrails
- Infrastructure Layer – to contain failures, performance impact

A layered, security-first architecture for running multi-tenant GPU-as-a-Service on Kubernetes, showing tenant isolation, GPU control, governance, and infrastructure boundaries.
Diagram Description
Tenant Layer
- One namespace per tenant
- Clear ownership boundaries
- Per-tenant quotas and limits
GPU Control Plane
- Kubernetes scheduler
- GPU device plugin
- Controlled allocation of GPU devices
Security and Governance
- Node isolation
- Policy-as-Code enforcement
- Resource quotas
Infrastructure Layer
- Dedicated GPU node pools
- Isolated runtimes
- Physical GPU hardware
This layered approach to resource management also means that no single control point can be a single point of failure.
Design Principles
1. Node-Level Isolation Is Required
GPU nodes must NOT run:
- System workloads
- Control-plane components
- General-purpose application pods
Enforce this using taints.
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
GPU workloads must explicitly tolerate the taint:
tolerations:
- key: "gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
This ensures only validated workloads are scheduled on GPU nodes.
2. Controlled GPU Access With Device Plugins
Kubernetes does not automatically expose GPUs. Access to GPUs is granted through device plugins, a key component of the extended resource model.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-device-plugin
namespace: kube-system
spec:
template:
spec:
containers:
- name: device-plugin
image: gpu-device-plugin:latest
securityContext:
privileged: true
Workloads must explicitly request GPUs:
resources:
limits:
gpu.example.com/device: 1
Pods that do not request a GPU are not given access to one.
3. Apply Fairness With Quotas
GPU starvation is the natural state of affairs without quotas.
apiVersion: v1
kind: ResourceQuota
metadata:
name: gpu-quota
namespace: tenant-a
spec:
hard:
limits.gpu.example.com/device: "4"
Quotas turn GPU consumption into a bounded commitment, rather than a best-effort sharing.
4. Hardened Pod Security
GPU workloads must not be permitted to run as elevated or privileged.
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
This reduces the risk of:
- Driver abuse
- Container escape
- Cross-tenant interference
5. Policy-as-Code Guardrails
Policies can intercept unsafe workloads before they are scheduled.
package kubernetes.admission
deny[msg] {
input.request.kind.kind == "Pod"
input.request.object.spec.containers[_].securityContext.privileged == true
msg := "Privileged GPU workloads are not allowed"
}
This helps turn GPU security from reactive to preventive.
Reliability at Scale
Isolating Cross-Tenant Interference
Sharing GPUs among workloads without clear boundaries allows one workload to interfere with others. Maintaining predictable performance requires:
- Restrict the number of GPUs per pod
- Isolate high-throughput, long-running workloads
- Minimize GPU oversubscription without guardrails
Isolating Failures
GPU node failures must be localized and remediated fast to prevent cascading effects across the platform:
- Create dedicated GPU node pools to isolate GPU workloads
- Automatically cordon and drain failed nodes
- Fastly reschedule workloads onto healthy GPU nodes
Observability for the Platform
Enforcing fairness and reliability requires visibility into the platform. A multi-tenant GPU platform needs to be able to track and measure:
- GPU utilization by tenant
- GPU memory consumption
- Scheduling and queue latency
- Preemption and eviction events
GPU sharing is no longer guesswork but a controllable and reliable system through clear observability.
GPU-as-a-Service Anti-Patterns
- Trying to use GPUs as if they were CPUs
- Sharing GPU nodes without setting quotas
- Running GPU pods as privileged
- Trusting users instead of enforcing policy
- Running GPU and system workloads together
When to Use GPU-as-a-Service
GPU-as-a-Service is ideal for environments where:
- Teams share infrastructure
- Workloads are bursty in nature
- Platform teams own the governance
- Cost savings are a priority
GPU-as-a-Service is not well-suited to tightly coupled, single-tenant, ultra-low-latency systems.
Conclusion
Secure multi-tenant GPU-as-a-Service is not a scheduling problem; it is an architecture problem.
By integrating the following, platform teams can safely deliver GPU acceleration as a shared service without sacrificing security or reliability:
- Node isolation
- Controlled device access
- Tenant quotas
- Policy guardrails
- Strong observability
In modern cloud platforms, GPUs are infrastructure, and infrastructure demands careful design.
Opinions expressed by DZone contributors are their own.
Comments