DeepSeek on Kubernetes: AI-Powered Reasoning at Scale
Deploy DeepSeek-R1 on Kubernetes using Ollama for inference and Open WebUI for seamless interaction. The setup works on local clusters such as KIND as well as on cloud providers.
As artificial intelligence continues to evolve, deploying AI-powered applications efficiently and at scale has become critical. Kubernetes, the de facto orchestration platform, plays a crucial role in managing containerized AI workloads, ensuring scalability, resilience, and ease of management.
In this article, we explore DeepSeek on Kubernetes, a deployment that integrates DeepSeek-R1, a powerful reasoning AI model, with Open WebUI for seamless interaction.
Why Kubernetes for DeepSeek?
DeepSeek is an advanced reasoning model that naturally benefits from containerization and orchestration provided by Kubernetes. Kubernetes stands out from alternatives like Docker Swarm and Apache Mesos due to its mature ecosystem and extensive features tailored specifically for complex AI workloads. Here's why Kubernetes is ideal for deploying DeepSeek:
Scalability
Kubernetes simplifies scaling AI workloads with tools like Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler. Imagine a scenario where DeepSeek faces a sudden surge in inference requests — Kubernetes seamlessly scales the pods and nodes automatically, ensuring consistent performance without manual intervention.
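As a concrete sketch, a Horizontal Pod Autoscaler for the Ollama deployment defined later in this article could look like the following. The replica bounds and the 70% CPU target are illustrative assumptions, and CPU-based autoscaling also requires the metrics-server add-on and CPU requests on the pod, neither of which is shown in this article:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama              # targets the Ollama deployment defined below
  minReplicas: 1              # illustrative bounds
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target; tune for your workload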
Resilience
Kubernetes ensures high resilience through automated pod rescheduling and self-healing capabilities. If a DeepSeek pod encounters issues such as resource constraints or node failures, Kubernetes quickly detects and redeploys the affected pod to a healthy node, minimizing downtime and maintaining continuous availability.
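Self-healing works best when Kubernetes can detect a hung container, not just a crashed one, which is what health probes are for. A minimal liveness probe sketch for the Ollama container defined later (Ollama serves a plain HTTP response on its root path; the timing values are assumptions):

livenessProbe:
  httpGet:
    path: /                   # Ollama responds on its root path at the API port
    port: 11434
  initialDelaySeconds: 30     # assumed startup allowance for model loading
  periodSeconds: 15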
Service Discovery
Kubernetes provides built-in DNS-based service discovery and seamless management of microservices. DeepSeek’s inference services can effortlessly discover and connect to supporting microservices, like preprocessing modules or logging services, without the need for complex manual configuration, enhancing maintainability and flexibility.
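For example, cluster DNS resolves the service name ollama-service (defined later in this article) without any hard-coded IP addresses, which a throwaway test pod can verify:

kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup ollama-service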
Persistent Storage
Kubernetes PersistentVolumeClaims (PVCs) effectively handle AI model storage, training datasets, and checkpoints. This ensures critical data remains consistent and available even during updates, pod restarts, or node failures. For example, updating DeepSeek models or scaling inference pods becomes seamless and non-disruptive.
Load Balancing
Kubernetes offers intrinsic load-balancing capabilities, distributing workloads efficiently across multiple replicas. This capability is critical for DeepSeek to evenly distribute inference requests among multiple instances, optimizing resource utilization and significantly reducing response latency.
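Because the Service defined later in this article selects every pod carrying the app: ollama label, spreading load is as simple as adding replicas; note that each replica loads its own copy of the model into memory:

kubectl scale deploy/ollama --replicas=3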
While alternatives like Docker Swarm offer simplicity, Kubernetes uniquely delivers comprehensive features essential for managing sophisticated AI models like DeepSeek, ensuring scalability, robustness, and operational ease.
Deploying DeepSeek on Kubernetes
1. Kubernetes Cluster Setup
In our setup, we have a three-node Kubernetes cluster with the following nodes:
$ kubectl get nodes
NAME                     STATUS   ROLES           AGE    VERSION
deepseek-control-plane   Ready    control-plane   6d5h   v1.32.0
deepseek-worker          Ready    <none>          6d5h   v1.32.0
deepseek-worker2         Ready    <none>          6d5h   v1.32.0
Even if the Kubernetes nodes are not GPU-powered, DeepSeek-R1 will still function, although response times may be slower. GPU acceleration is recommended for optimal performance, especially for complex reasoning tasks.
Kubernetes clusters can be set up locally using tools like the following (a sample KIND configuration is sketched after this list):
- KIND (Kubernetes IN Docker)
- Minikube
- MicroK8s
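For instance, the three-node cluster shown earlier maps naturally onto a KIND configuration like this sketch (the cluster name deepseek is inferred from the node names in the kubectl output above):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: deepseek            # inferred from the node names above
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Saved as kind-config.yaml (an illustrative filename), it is applied with kind create cluster --config kind-config.yaml.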
If deployed on a cloud provider, the setup can be made securely accessible using an Ingress object to expose services through a web interface with proper authentication and TLS security.
2. Deploying DeepSeek-R1 With Ollama
DeepSeek-R1 is deployed within Kubernetes using Ollama, which handles AI model inference. Below is the Kubernetes manifest for the Ollama deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-storage
          env:
            - name: OLLAMA_MODEL
              value: deepseek-r1:1.5b
            - name: OLLAMA_KEEP_ALIVE
              value: "-1"
            - name: OLLAMA_NO_THINKING
              value: "true"
            - name: OLLAMA_SYSTEM_PROMPT
              value: "You are DeepSeek-R1, a reasoning model. Provide direct answers without detailed reasoning steps or <think> tags."
      volumes:
        - name: ollama-storage
          emptyDir: {}
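Assuming the manifest is saved as ollama-deployment.yaml (the filename is illustrative), it is applied and verified with:

kubectl apply -f ollama-deployment.yaml
kubectl rollout status deploy/ollama

Note that emptyDir storage is ephemeral: if the pod is rescheduled, the model must be pulled again. Swapping the volume for a PersistentVolumeClaim, as discussed in the storage section above, would preserve the model across restarts.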
3. Exposing Ollama as a Service
To allow other services to communicate with Ollama, we define a NodePort service:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: NodePort
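NodePort services are assigned a port in the 30000-32767 range by default; the assigned value can be read back with:

kubectl get svc ollama-service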
4. Deploying Open WebUI
For an interactive experience, we integrate Open WebUI, which connects to Ollama and provides a user-friendly interface. The deployment is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openweb-ui
  labels:
    app: openweb-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openweb-ui
  template:
    metadata:
      labels:
        app: openweb-ui
    spec:
      containers:
        - name: openweb-ui
          image: ghcr.io/open-webui/open-webui:main
          env:
            - name: WEBUI_NAME
              value: "DeepSeek India - Hardware Software Gheware"
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service:11434"
            - name: OLLAMA_DEFAULT_MODEL
              value: "deepseek-r1:1.5b"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: openweb-data
              mountPath: /app/backend/data
      volumes:
        - name: openweb-data
          persistentVolumeClaim:
            claimName: openweb-ui-pvc
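The deployment above references a PersistentVolumeClaim named openweb-ui-pvc that the article does not otherwise show. A minimal sketch, assuming the cluster has a default StorageClass and that 2Gi is enough for Open WebUI's chat history and settings (both assumptions):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openweb-ui-pvc    # matches the claimName in the deployment above
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi        # assumed size; adjust to your needs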
5. Running Inference on DeepSeek-R1
To test the deployment, we can execute a command within the Ollama container:
kubectl exec -it deploy/ollama -- bash
ollama run deepseek-r1:1.5b
This command starts an interactive session with the model, where queries can be entered directly.
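Alternatively, the Ollama HTTP API can be exercised from inside the cluster without an interactive shell. A quick check against the /api/generate endpoint, assuming the service and model names from the manifests above (the pod name curl-test is arbitrary):

kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://ollama-service:11434/api/generate \
  -d '{"model": "deepseek-r1:1.5b", "prompt": "Why is the sky blue?", "stream": false}'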
Accessing Open WebUI
After deployment, Open WebUI can be made accessible by creating an Ingress object that routes a hostname to the UI, for example:
http://deepseek.gheware.com/auth
This interface allows users to interact with DeepSeek-R1 through a chat-based environment.
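A sketch of such an Ingress follows, assuming an NGINX ingress controller and a ClusterIP Service named openweb-ui-service in front of the Open WebUI pods; the article does not define that Service, so its name is a hypothetical placeholder:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openweb-ui-ingress
spec:
  ingressClassName: nginx            # assumes the NGINX ingress controller
  rules:
    - host: deepseek.gheware.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: openweb-ui-service   # hypothetical Service, not defined in this article
                port:
                  number: 8080

For production use, TLS (for example via cert-manager) and authentication should be layered on top, as noted earlier.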
Conclusion
By deploying DeepSeek on Kubernetes, we achieve a scalable, resilient, and production-ready AI reasoning system. Kubernetes efficiently orchestrates DeepSeek-R1, ensuring smooth model execution and user interaction through Open WebUI. This architecture can be further extended by adding GPU acceleration, auto-scaling, and monitoring with Prometheus and Grafana.
For AI practitioners, Kubernetes offers an excellent foundation for deploying and managing reasoning models like DeepSeek-R1.