DeepSeek on Kubernetes: AI-Powered Reasoning at Scale
Deploy DeepSeek-R1 on Kubernetes using Ollama for inference and Open WebUI for seamless interaction. The setup works on local clusters such as KIND as well as on cloud providers.
As artificial intelligence continues to evolve, deploying AI-powered applications efficiently and at scale has become critical. Kubernetes, the de facto orchestration platform, plays a crucial role in managing containerized AI workloads, ensuring scalability, resilience, and ease of management.
In this article, we explore DeepSeek on Kubernetes, a deployment that integrates DeepSeek-R1, a powerful reasoning AI model, with Open WebUI for seamless interaction.
Why Kubernetes for DeepSeek?
DeepSeek is an advanced reasoning model that naturally benefits from containerization and orchestration provided by Kubernetes. Kubernetes stands out from alternatives like Docker Swarm and Apache Mesos due to its mature ecosystem and extensive features tailored specifically for complex AI workloads. Here's why Kubernetes is ideal for deploying DeepSeek:
Scalability
Kubernetes simplifies scaling AI workloads with tools like Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler. Imagine a scenario where DeepSeek faces a sudden surge in inference requests — Kubernetes seamlessly scales the pods and nodes automatically, ensuring consistent performance without manual intervention.
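As a concrete sketch, a Horizontal Pod Autoscaler for the Ollama deployment defined later in this article could look like the following. The replica bounds and the 70% CPU target are illustrative assumptions, and CPU-based autoscaling also requires the metrics-server add-on and CPU requests on the pod, neither of which is shown in this article:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama              # targets the Ollama deployment defined below
  minReplicas: 1              # illustrative bounds
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target; tune for your workload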
Resilience
Kubernetes ensures high resilience through automated pod rescheduling and self-healing capabilities. If a DeepSeek pod encounters issues such as resource constraints or node failures, Kubernetes quickly detects and redeploys the affected pod to a healthy node, minimizing downtime and maintaining continuous availability.
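Self-healing works best when Kubernetes can detect a hung container, not just a crashed one, which is what health probes are for. A minimal liveness probe sketch for the Ollama container defined later (Ollama serves a plain HTTP response on its root path; the timing values are assumptions):

livenessProbe:
  httpGet:
    path: /                   # Ollama responds on its root path at the API port
    port: 11434
  initialDelaySeconds: 30     # assumed startup allowance for model loading
  periodSeconds: 15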
Service Discovery
Kubernetes provides built-in DNS-based service discovery and seamless management of microservices. DeepSeek’s inference services can effortlessly discover and connect to supporting microservices, like preprocessing modules or logging services, without the need for complex manual configuration, enhancing maintainability and flexibility.
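For example, cluster DNS resolves the service name ollama-service (defined later in this article) without any hard-coded IP addresses, which a throwaway test pod can verify:

kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup ollama-service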
Persistent Storage
Kubernetes PersistentVolumeClaims (PVCs) effectively handle AI model storage, training datasets, and checkpoints. This ensures critical data remains consistent and available even during updates, pod restarts, or node failures. For example, updating DeepSeek models or scaling inference pods becomes seamless and non-disruptive.
Load Balancing
Kubernetes offers intrinsic load-balancing capabilities, distributing workloads efficiently across multiple replicas. This capability is critical for DeepSeek to evenly distribute inference requests among multiple instances, optimizing resource utilization and significantly reducing response latency.
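Because the Service defined later in this article selects every pod carrying the app: ollama label, spreading load is as simple as adding replicas; note that each replica loads its own copy of the model into memory:

kubectl scale deploy/ollama --replicas=3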
While alternatives like Docker Swarm offer simplicity, Kubernetes uniquely delivers comprehensive features essential for managing sophisticated AI models like DeepSeek, ensuring scalability, robustness, and operational ease.
Deploying DeepSeek on Kubernetes
1. Kubernetes Cluster Setup
In our setup, we have a three-node Kubernetes cluster with the following nodes:
$ kubectl get nodes
NAME                     STATUS   ROLES           AGE    VERSION
deepseek-control-plane   Ready    control-plane   6d5h   v1.32.0
deepseek-worker          Ready    <none>          6d5h   v1.32.0
deepseek-worker2         Ready    <none>          6d5h   v1.32.0
Even if the Kubernetes nodes are not GPU-powered, DeepSeek-R1 will still function, although response times may be slower. GPU acceleration is recommended for optimal performance, especially for complex reasoning tasks.
Kubernetes clusters can be set up locally using tools like the following (a sample KIND configuration is sketched after this list):
- KIND (Kubernetes IN Docker)
- Minikube
- MicroK8s
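For instance, the three-node cluster shown earlier maps naturally onto a KIND configuration like this sketch (the cluster name deepseek is inferred from the node names in the kubectl output above):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: deepseek            # inferred from the node names above
nodes:
  - role: control-plane
  - role: worker
  - role: worker

Saved as kind-config.yaml (an illustrative filename), it is applied with kind create cluster --config kind-config.yaml.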
If deployed on a cloud provider, the setup can be made securely accessible using an Ingress object to expose services through a web interface with proper authentication and TLS security.
2. Deploying DeepSeek-R1 With Ollama
DeepSeek-R1 is deployed within Kubernetes using Ollama, which handles AI model inference. Below is the Kubernetes manifest for the Ollama deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-storage
          env:
            - name: OLLAMA_MODEL
              value: deepseek-r1:1.5b
            - name: OLLAMA_KEEP_ALIVE
              value: "-1"
            - name: OLLAMA_NO_THINKING
              value: "true"
            - name: OLLAMA_SYSTEM_PROMPT
              value: "You are DeepSeek-R1, a reasoning model. Provide direct answers without detailed reasoning steps or <think> tags."
      volumes:
        - name: ollama-storage
          emptyDir: {}
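Assuming the manifest is saved as ollama-deployment.yaml (the filename is illustrative), it is applied and verified with:

kubectl apply -f ollama-deployment.yaml
kubectl rollout status deploy/ollama

Note that emptyDir storage is ephemeral: if the pod is rescheduled, the model must be pulled again. Swapping the volume for a PersistentVolumeClaim, as discussed in the storage section above, would preserve the model across restarts.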
3. Exposing Ollama as a Service
To allow other services to communicate with Ollama, we define a NodePort service:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: NodePort
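NodePort services are assigned a port in the 30000-32767 range by default; the assigned value can be read back with:

kubectl get svc ollama-service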
4. Deploying Open WebUI
For an interactive experience, we integrate Open WebUI, which connects to Ollama and provides a user-friendly interface. The deployment is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openweb-ui
  labels:
    app: openweb-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openweb-ui
  template:
    metadata:
      labels:
        app: openweb-ui
    spec:
      containers:
        - name: openweb-ui
          image: ghcr.io/open-webui/open-webui:main
          env:
            - name: WEBUI_NAME
              value: "DeepSeek India - Hardware Software Gheware"
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service:11434"
            - name: OLLAMA_DEFAULT_MODEL
              value: "deepseek-r1:1.5b"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: openweb-data
              mountPath: /app/backend/data
      volumes:
        - name: openweb-data
          persistentVolumeClaim:
            claimName: openweb-ui-pvc
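The deployment above references a PersistentVolumeClaim named openweb-ui-pvc that the article does not otherwise show. A minimal sketch, assuming the cluster has a default StorageClass and that 2Gi is enough for Open WebUI's chat history and settings (both assumptions):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openweb-ui-pvc    # matches the claimName in the deployment above
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi        # assumed size; adjust to your needs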
5. Running Inference on DeepSeek-R1
To test the deployment, we can execute a command within the Ollama container:
kubectl exec -it deploy/ollama -- bash
ollama run deepseek-r1:1.5b
This command starts an interactive session with the model, where queries can be entered directly.
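Alternatively, the Ollama HTTP API can be exercised from inside the cluster without an interactive shell. A quick check against the /api/generate endpoint, assuming the service and model names from the manifests above (the pod name curl-test is arbitrary):

kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://ollama-service:11434/api/generate \
  -d '{"model": "deepseek-r1:1.5b", "prompt": "Why is the sky blue?", "stream": false}'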
Accessing Open WebUI
After deployment, Open WebUI can be made accessible by creating an Ingress object that routes a hostname to the UI, for example:
http://deepseek.gheware.com/auth
This interface allows users to interact with DeepSeek-R1 through a chat-based environment.
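A sketch of such an Ingress follows, assuming an NGINX ingress controller and a ClusterIP Service named openweb-ui-service in front of the Open WebUI pods; the article does not define that Service, so its name is a hypothetical placeholder:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openweb-ui-ingress
spec:
  ingressClassName: nginx            # assumes the NGINX ingress controller
  rules:
    - host: deepseek.gheware.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: openweb-ui-service   # hypothetical Service, not defined in this article
                port:
                  number: 8080

For production use, TLS (for example via cert-manager) and authentication should be layered on top, as noted earlier.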
Conclusion
By deploying DeepSeek on Kubernetes, we achieve a scalable, resilient, and production-ready AI reasoning system. Kubernetes efficiently orchestrates DeepSeek-R1, ensuring smooth model execution and user interaction through Open WebUI. This architecture can be further extended by adding GPU acceleration, auto-scaling, and monitoring with Prometheus and Grafana.
For AI practitioners, Kubernetes offers an excellent foundation for deploying and managing reasoning models like DeepSeek-R1.