Containerized Intelligence: Running LLMs at Scale Using Docker and Kubernetes

Deploy powerful LLMs easily using Docker and Kubernetes. Scale with GPUs, fine-tune models, and automate inference workloads effortlessly.

Aug. 20, 25 · Tutorial


Large Language Models (LLMs) such as GPT, LLaMA, and Mistral have transformed the way applications interpret and generate natural language, driving innovation across a wide range of industries. Yet, operationalizing these models at scale introduces a host of technical challenges, including dependency management, GPU integration, orchestration, and auto-scaling.

The rapid evolution of LLMs presents immense opportunities for building intelligent, language-aware applications. However, deploying and managing these compute-intensive models in production environments requires a reliable and scalable infrastructure. This is where containerization with Docker and orchestration with Kubernetes come into play—offering a powerful combination to streamline LLM deployment, ensure reproducibility, and support horizontal scaling.

In this article, we explore how to use Docker and Kubernetes to containerize, deploy, and manage LLMs at scale. We examine real-world use cases, share best practices, and provide sample code that illustrates core deployment strategies and concepts.

The Challenges of Scaling LLMs

Before diving into the solutions, it's crucial to understand the hurdles involved in scaling LLMs:

Resource Intensity: LLMs demand substantial computational resources, including significant GPU power, large amounts of RAM, and fast storage.
Dependency Management: LLM inference often relies on specific software dependencies, libraries, and even hardware configurations (like CUDA versions). Managing these consistently across different environments can be error-prone.
Scalability and Availability: Handling fluctuating user demand requires the ability to scale the inference infrastructure up or down seamlessly while maintaining high availability and fault tolerance.
Deployment Complexity: Manually deploying and managing LLM inference services across multiple machines can be a complex and time-consuming process.
Reproducibility: Ensuring consistent performance and behavior across different deployments and updates is vital for reliable AI applications.
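
To put a rough number on the resource-intensity point, the sketch below estimates the memory needed just to hold a model's weights. It assumes memory is roughly parameter count times bytes per parameter; the function name and the 7-billion-parameter example are illustrative assumptions, and the KV cache, activations, and framework overhead come on top of this figure.

```python
# Rough estimate of the memory needed to hold model weights only.
# Assumption: memory = parameter count x bytes per parameter. The KV cache,
# activations, and framework overhead are NOT included.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Return the approximate weight footprint in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Hypothetical 7-billion-parameter model at three precisions
for dtype in ("fp32", "fp16", "int8"):
    print(f"7B parameters in {dtype}: {weight_memory_gib(7e9, dtype):.1f} GiB")
```

An estimate like this is a reasonable starting point for choosing GPU types and for setting the memory requests discussed later in the article.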

Docker: The Foundation for Reproducible LLM Environments

Docker provides a platform to package an LLM application and all its dependencies into a portable container image. This image encapsulates everything needed to run the application, including the operating system, libraries, model weights, and the inference code itself.

Benefits of Docker for LLMs

Isolation: Containers provide isolated environments, preventing dependency conflicts between different LLM deployments or other applications running on the same infrastructure.
Reproducibility: Docker images ensure consistent execution across different environments, from development to production.
Simplified Deployment: Deploying a containerized LLM application becomes as simple as running the Docker image.
Portability: Docker images can be easily shared and run on any Docker-compatible infrastructure.

Sample Dockerfile for an LLM Inference Service (Using a Hypothetical Python-Based Service)

Dockerfile

FROM python:3.9-slim-bullseye

WORKDIR /app

# Install system dependencies (libgl1-mesa-glx is only an example; for GPU
# inference, start from an nvidia/cuda base image instead)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the LLM model weights (assuming they are in a 'models' directory)
COPY models /app/models

# Copy the inference service code
COPY inference_service.py .

# Expose the inference service port
EXPOSE 8080

# Command to run the inference service
CMD ["python", "inference_service.py"]
  

The Dockerfile above does the following:

We start with a base Python image.
We set the working directory inside the container.
We install any necessary system-level dependencies.
We copy the requirements.txt file and install the Python dependencies.
We copy the pre-trained LLM model weights into the container.
We copy the Python script (inference_service.py) that handles the LLM inference.
We expose port 8080 for the inference service.
We define the command to run the inference service when the container starts.
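
The Dockerfile refers to an inference_service.py that the article does not show. As a minimal, standard-library-only sketch of what such a script could look like, the example below answers POST requests on /generate. The endpoint path, the JSON field names, and the run_model placeholder are all assumptions made for illustration; a real service would load the weights from /app/models and call the model.

```python
# inference_service.py (hypothetical): a stdlib-only stand-in for the service
# the Dockerfile starts. run_model() is a placeholder, not real inference.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(prompt: str) -> str:
    """Placeholder for real inference; simply echoes the prompt."""
    return f"echo: {prompt}"

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body must be JSON")
            return
        prompt = payload.get("prompt", "") if isinstance(payload, dict) else ""
        body = json.dumps({"response": run_model(prompt)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port: int = 8080) -> HTTPServer:
    """Bind to all interfaces so the port exposed by the container is reachable."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)

# To serve requests when run as a script, end the file with:
#     make_server().serve_forever()
```

Binding to 0.0.0.0 rather than localhost matters inside a container, because traffic arriving through the published port does not come from the container's own loopback interface.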

Kubernetes: Orchestrating LLM Deployments at Scale

Kubernetes (K8s) is a powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications. For LLM deployments, Kubernetes provides the necessary infrastructure to handle the resource demands and ensure high availability.

Key Kubernetes Concepts for LLM Scaling

Pods: The smallest deployable units in Kubernetes, typically containing one or more containers. An LLM inference service would run within a Pod.
Deployments: Provide declarative updates for Pods and ReplicaSets. They ensure that a specified number of Pod replicas are running at all times and facilitate rolling updates and rollbacks.
Services: Abstract away the underlying Pods, providing a stable network endpoint to access the LLM inference service. Different service types (e.g., ClusterIP, NodePort, LoadBalancer) cater to various access requirements.
Horizontal Pod Autoscaler (HPA): Automatically scales the number of Pod replicas based on observed CPU utilization, memory consumption, or custom metrics. This is crucial for handling fluctuating LLM inference requests.
Node Pools with GPU Support: Kubernetes allows you to create specific node pools equipped with GPUs. You can then configure your LLM Pods to be scheduled on these GPU-enabled nodes.
Resource Management (Requests and Limits): You can specify the resource requirements (CPU, memory, GPU) for your LLM Pods, ensuring that they are allocated sufficient resources and preventing them from consuming excessive resources on the nodes.
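
Because the scheduler places Pods according to their resource requests, a quick calculation shows how many replicas a single node can hold. The sketch below is a simplification under stated assumptions: the node sizes are hypothetical, and real clusters also reserve capacity for system components.

```python
# How many replicas fit on one node? The scheduler places Pods by their
# resource *requests*, so the tightest resource decides.
# The node size below is a hypothetical example.

def pods_per_node(allocatable: dict, requests: dict) -> int:
    """Return how many Pods with `requests` fit into `allocatable`."""
    return min(
        int(allocatable.get(name, 0) // amount)
        for name, amount in requests.items()
        if amount > 0
    )

pod_requests = {"cpu": 4, "memory_gib": 16, "nvidia.com/gpu": 1}
gpu_node = {"cpu": 32, "memory_gib": 120, "nvidia.com/gpu": 4}  # assumed node

print(pods_per_node(gpu_node, pod_requests))  # prints 4: the GPU count is the bottleneck
```

For GPU-backed LLM Pods the GPU count is usually the limiting resource, which is why node pools are often sized around GPUs per node rather than CPU or memory.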

Sample Kubernetes Deployment YAML for an LLM Inference Service

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-deployment
spec:
  replicas: 2 # Start with 2 replicas
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      # Specify a nodeSelector to target GPU nodes (if applicable)
      # nodeSelector:
      #   nvidia.com/gpu.present: "true"
      containers:
        - name: llm-inference-container
          image: your-dockerhub-username/llm-inference-image:latest # Replace with your Docker image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              # nvidia.com/gpu: "1" # Request 1 GPU (if applicable)
            limits:
              cpu: "8"
              memory: "32Gi"
              # nvidia.com/gpu: "1" # Limit to 1 GPU (if applicable)
  

Sample Kubernetes Service YAML for Exposing the LLM Inference Service

YAML

apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
spec:
  selector:
    app: llm-inference
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer # Use LoadBalancer for external access in cloud environments
  

Sample Kubernetes Horizontal Pod Autoscaler (HPA) YAML

YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-deployment
  minReplicas: 2
  maxReplicas: 10 # Scale up to 10 replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # Scale up when CPU utilization exceeds 70%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80 # Scale up when memory utilization exceeds 80%
  

Explanation:

The Deployment defines the desired state of the LLM inference application, including the number of replicas and the container image to use.
The Service provides a stable IP address and DNS name to access the LLM inference Pods. The LoadBalancer type exposes the service externally in cloud environments.
The HPA automatically adjusts the number of Pod replicas based on CPU and memory utilization, measured as a percentage of each Pod's resource requests, ensuring the service can handle varying loads.
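
To see how the HPA arrives at a replica count, the following sketch applies the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays within a tolerance (0.1 by default). It is a simplified model; the function name is an assumption, and stabilization windows and Pod readiness handling are deliberately left out.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale proportionally to the metric ratio."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# With several metrics, the HPA uses the largest proposed replica count.
by_cpu = desired_replicas(2, current_utilization=95, target_utilization=70)
by_memory = desired_replicas(2, current_utilization=60, target_utilization=80)
print(max(by_cpu, by_memory))  # prints 3
```

In this example the CPU metric drives the decision: 2 replicas at 95% against a 70% target gives ceil(2 × 1.36) = 3, while the memory metric alone would leave the count at 2.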

Why Containerize LLMs?

Large Language Models (LLMs) are powerful, but they are also complex, resource-intensive, and highly sensitive to their runtime environments. From managing large dependencies and specialized hardware requirements (like GPUs), to ensuring consistent behavior across dev, test, and prod — running LLMs in production isn't straightforward. This is where containerization becomes a game-changer.

Let’s explore the key advantages in detail:

Reproducibility

LLMs often require specific versions of libraries, frameworks (like PyTorch or TensorFlow), CUDA drivers, and other dependencies to function correctly.

Problem: A minor change in the environment (e.g., Python version mismatch) can break the model or degrade its performance.
Solution: Containerization allows you to package the model, its dependencies, configurations, and environment settings into a single, consistent unit.
Benefit: Guarantees that the model runs the same way on every system, eliminating “it works on my machine” issues.
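
One way to catch environment drift early is to compare the installed package versions against the pinned ones when the container starts. The sketch below does this with the standard library only; the pins shown are hypothetical examples and would normally mirror requirements.txt.

```python
from importlib import metadata

def find_mismatches(pins: dict) -> dict:
    """Map package name -> (expected, installed or None) for every pin that differs."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Hypothetical pins; in practice these would mirror requirements.txt.
PINS = {"torch": "2.1.2", "transformers": "4.36.2"}

for name, (expected, installed) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running such a check as part of the image build or as a startup step turns a silent version mismatch into an explicit, early failure signal.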

Portability

Once containerized, an LLM can run anywhere — on a local machine, in an on-premise data center, or across multiple cloud providers.

Problem: Without containers, migrating models between environments can require significant reconfiguration.
Solution: Containers encapsulate everything needed to run the application, including code, dependencies, and runtime.
Benefit: Enables teams to develop locally and deploy seamlessly to production, supporting multi-cloud and hybrid-cloud strategies.

Modularity

Modern AI applications are not just about the model — they involve APIs, UIs, databases, vector stores, retrievers, and more.

Problem: Managing complex, tightly coupled components can slow development and increase the risk of system failures.
Solution: With containers, each component (e.g., model server, vector DB, frontend) can be deployed independently and interact via well-defined APIs.
Benefit: Encourages microservice architecture, simplifies maintenance, and allows independent scaling or updates of specific services.

Orchestration (With Kubernetes)

Containers by themselves are powerful, but managing them at scale requires orchestration tools like Kubernetes.

Problem: LLMs demand dynamic scaling, GPU scheduling, and robust failover/restart capabilities.
Solution: Kubernetes provides automated deployment, scaling, load balancing, and lifecycle management for containers.
Benefit: Ensures high availability, fault tolerance, and efficient resource utilization, especially critical when deploying multiple LLM instances across nodes.

Use Cases for Containerized LLM Inference

The combination of Docker and Kubernetes enables various compelling use cases for deploying LLMs at scale:

Real-time Conversational AI: Powering chatbots and virtual assistants that require low-latency responses. Kubernetes can automatically scale the inference service based on the number of concurrent users.
Large-Scale Content Generation: Generating articles, marketing copy, or code snippets in parallel. Kubernetes can manage a fleet of LLM inference Pods to handle high throughput.
Personalized Recommendations: Providing tailored recommendations based on user behavior and preferences. Containerization ensures that every replica serving these requests runs in the same environment.
Sentiment Analysis at Scale: Processing large volumes of text data for sentiment analysis in real-time. Kubernetes can dynamically scale the processing capacity based on the data volume.
Code Completion and Generation in IDEs: Integrating LLMs into development environments for intelligent code suggestions. Docker ensures consistent dependencies across developer machines and production environments.
Multimodal AI Applications: Deploying LLMs that process both text and images or other modalities. Kubernetes can manage the diverse resource requirements of such applications.

Step-by-Step: Deploying LLMs Using Docker + Kubernetes

Let’s deploy a Hugging Face Transformers model using FastAPI, Docker, and Kubernetes.

1. FastAPI App (llm_app.py)

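Python

# llm_app.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Load the model once at startup; the weights are downloaded on first use.
generator = pipeline("text-generation", model="gpt2")

class GenerateRequest(BaseModel):
    prompt: str = ""

# A plain `def` endpoint runs in FastAPI's thread pool, so the blocking
# model call does not stall the event loop.
@app.post("/generate")
def generate_text(request: GenerateRequest):
    output = generator(request.prompt, max_length=50, num_return_sequences=1)
    return {"response": output[0]["generated_text"]}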

  

2. Dockerfile

Dockerfile

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install dependencies first so this layer stays cached when only the code changes
RUN pip install --no-cache-dir fastapi uvicorn transformers torch

COPY llm_app.py .

EXPOSE 8000

CMD ["uvicorn", "llm_app:app", "--host", "0.0.0.0", "--port", "8000"]

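Build and tag the image, then push it to a registry that your cluster can pull from (a locally built image is not visible to the cluster nodes by default):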

Shell

docker build -t llm-api:latest .

3. Kubernetes Deployment

deployment.yaml

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
      - name: llm-api
        image: llm-api:latest # Replace with the full registry path of the image you pushed
        ports:
        - containerPort: 8000
        resources:
          limits:
            memory: "4Gi"
            cpu: "2"

  

service.yaml

YAML

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer

  

Apply with:

Shell

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
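
Once the Service has been assigned an external address, the /generate endpoint can be called from any HTTP client. The following is a minimal standard-library client sketch; the helper names are assumptions, and EXTERNAL_IP is a placeholder for the address reported by `kubectl get service llm-service`.

```python
import json
import urllib.request

def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request expected by the /generate endpoint of llm_app.py."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(base_url: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the generated text from the JSON response."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (EXTERNAL_IP is a placeholder, not a real address):
#     print(generate("http://EXTERNAL_IP", "Kubernetes is"))
```

The generous timeout is deliberate: text generation on CPU can take several seconds per request, and the first request after a Pod starts may be slower still.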

Monitoring & Scaling

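Use Prometheus and Grafana for monitoring and a HorizontalPodAutoscaler (HPA) for dynamic scaling. CPU-based autoscaling relies on the Kubernetes Metrics Server (or another resource-metrics provider) being installed in the cluster: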

YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

  

Containerizing LLMs isn't just a best practice: it's a necessity for robust, scalable, and maintainable AI infrastructure. Let's look at a few use cases.

Use Case 1: Self-Hosted LLM API Using `text-generation-webui`

Let’s create a containerized API for LLaMA 2 using text-generation-webui, a popular UI and REST server.

Step 1: Dockerfile Example

Dockerfile

FROM nvidia/cuda:12.2.0-cudnn8-devel-ubuntu22.04

# Python & dependencies
RUN apt-get update && apt-get install -y git python3-pip
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip3 install transformers accelerate

# Clone and set up web UI
RUN git clone https://github.com/oobabooga/text-generation-webui /app
WORKDIR /app
RUN pip3 install -r requirements.txt

EXPOSE 7860
# --listen binds to 0.0.0.0 so the UI is reachable from outside the container.
# The model files are not part of this image; they must already exist in the
# web UI's models directory (e.g., mounted as a volume).
CMD ["python3", "server.py", "--listen", "--model", "llama-2-7b-chat"]

  

Step 2: Build and Run Docker Locally

Shell

docker build -t llama-api .
docker run --gpus all -p 7860:7860 llama-api

You can now open http://localhost:7860 for the UI. To use the API as well, start the server with the --api flag and publish the API port in addition to 7860.

Use Case 2: Scaling LLM Inference with Kubernetes

Once the container is tested, the next step is running it in a GPU-enabled Kubernetes cluster using NVIDIA’s device plugin.

Prerequisite

Install NVIDIA drivers and enable the NVIDIA Device Plugin.

Shell

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Step 1: Kubernetes Deployment and Service YAML (llama-deployment.yaml)

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama-api
  template:
    metadata:
      labels:
        app: llama-api
    spec:
      containers:
      - name: llama
        image: llama-api:latest # Replace with the full registry path of the image you pushed
        ports:
        - containerPort: 7860
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  selector:
    app: llama-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 7860
  type: LoadBalancer

  

Deploy with:

Shell

kubectl apply -f llama-deployment.yaml

This creates 2 replicas behind a LoadBalancer, each using a GPU for inference.
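
Each replica has to load the model weights before it can answer, so the service may not respond right after the manifests are applied. The sketch below polls an HTTP endpoint until it answers or a timeout expires. The function name and default timings are assumptions, and the URL would be the external address of llama-service.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers without an HTTP error or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not reachable yet (no external IP, Pod still starting, ...)
        time.sleep(interval_s)
    return False

# Example (EXTERNAL_IP is a placeholder, not a real address):
#     wait_until_ready("http://EXTERNAL_IP/")
```

Inside the cluster, the same idea is usually expressed declaratively with a readiness probe on the container, so that the Service only routes traffic to Pods that have finished loading.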

Use Case 3: Fine-Tuning LLMs with Hugging Face + Docker

Want to fine-tune LLaMA on your own dataset? Containerize training scripts using Hugging Face's Transformers.

Dockerfile

FROM huggingface/transformers-pytorch-gpu

COPY . /trainer
WORKDIR /trainer

CMD ["python", "finetune.py"]

Example finetune.py:

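Python

from transformers import (DataCollatorForLanguageModeling, LlamaForCausalLM,
                          LlamaTokenizer, Trainer, TrainingArguments)
from datasets import load_dataset

# The Llama 2 checkpoints are gated: accept the license on Hugging Face and authenticate first.
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a padding token
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

dataset = load_dataset("csv", data_files="your-data.csv")  # expects a "text" column

tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),  # 512 is an example cap
    batched=True, remove_columns=dataset["train"].column_names)

# Pads each batch dynamically and copies input_ids into labels for causal-LM training
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized["train"], data_collator=data_collator)
trainer.train()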

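Run it on Kubernetes or locally with a GPU. Keep in mind that fully fine-tuning a 7B-parameter model in 32-bit precision needs on the order of 100 GB of accelerator memory for weights, gradients, and optimizer state, so a single GPU typically calls for a smaller model or a parameter-efficient method such as LoRA.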
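
The training script reads a CSV file and tokenizes its text column, so the data has to be in that shape before the container starts. As a small illustration, the sketch below writes such a file with the standard library; the helper name and the sample rows are placeholders, not real training data.

```python
import csv

def write_training_csv(path: str, texts: list) -> int:
    """Write one example per row under a single `text` header; return the row count."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        for text in texts:
            writer.writerow({"text": text})
    return len(texts)

# Placeholder examples; replace with your own data.
samples = [
    "Question: What is a Pod?\nAnswer: The smallest deployable unit in Kubernetes.",
    "Question: What does an HPA do?\nAnswer: It adjusts the replica count of a workload.",
]

# Usage: write_training_csv("your-data.csv", samples)
```

The csv module quotes fields that contain line breaks, so multi-line examples like the ones above stay within a single row.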

Conclusion

Containerization with Docker and orchestration with Kubernetes offer a robust and flexible foundation for deploying and scaling Large Language Models (LLMs) in production environments. These technologies address key operational challenges—such as resource management, dependency consistency, scalability, and deployment complexity—enabling organizations to fully harness the potential of LLMs.

By encapsulating models and their environments into containers, teams gain reproducibility, portability, and modularity, while Kubernetes facilitates automated deployment, dynamic scaling, and resilient orchestration across infrastructure. This combination empowers teams to deliver intelligent, AI-driven applications that are production-ready and capable of meeting real-world demands.

The included sample code snippets provide a practical starting point for implementing containerized LLM inference services on Kubernetes. As the field of LLMs continues to evolve rapidly, adopting containerization and orchestration will be essential for maintaining efficiency, scalability, and agility in AI development and deployment workflows.

Deploying LLMs at scale is inherently complex—but with containers and Kubernetes, much of that complexity is abstracted away. Whether you’re a researcher exploring cutting-edge models, an MLOps engineer automating deployments, or an enterprise architect designing robust AI infrastructure, containerized LLMs represent the path forward for scalable, sustainable AI systems.
