Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack
GPUStack is an open-source tool that turns a bunch of scattered GPU machines into one managed cluster for deploying AI models behind an OpenAI-compatible API.
The Problem Nobody Warned You About
You bought the GPUs. Maybe you've got a couple of NVIDIA A100s in a rack, some RTX 4090s under desks, or a Kubernetes cluster with mixed hardware. You've got the compute. Congratulations!
Now what?
Here's the part that catches most teams off guard: having GPUs is the easy part. Managing them is where things go sideways. You need to figure out which models fit on which cards, how to balance load across machines, how to handle a node going down at 2 AM, and how to expose all of this as a clean API your application team can actually call.
Most teams end up building a brittle collection of Python scripts and crontab entries that haven't been updated since 2022. It works until it doesn't, and then someone's paging you on a Saturday.
This is the problem GPUStack was built to solve.
What Is GPUStack, Exactly?
GPUStack is an open-source tool for managing GPU clusters. Think of it as Kubernetes for your inference workloads, except you don't need to spend three days debugging a whitespace error in a Helm chart.
At its core, GPUStack does three things well:
It aggregates your GPUs. Whether your hardware is spread across bare-metal servers, Kubernetes pods, or cloud instances, GPUStack sees them all as a single pool of compute. One dashboard, full visibility.
It orchestrates inference engines. GPUStack doesn't try to reinvent the inference wheel. It plugs into engines like vLLM, SGLang, and TensorRT-LLM, picks the right one for the job, configures it, and manages the lifecycle so you don't have to.
It serves models through an OpenAI-compatible API. Once a model is deployed, your application team gets a familiar REST endpoint. No custom client libraries. No new protocols to learn. Swap out the base URL, and you're talking to your own infrastructure.
Getting Started in Under 5 Minutes
I'm not exaggerating on the timeline. Here's how you go from zero to a running GPUStack server.
Step 1: Fire Up the Server
You need one machine to act as your control plane. It doesn't even need a GPU. A basic CPU-only box works fine for the server role.
sudo docker run -d --name gpustack \
    --restart unless-stopped \
    -p 80:80 \
    --volume gpustack-data:/var/lib/gpustack \
    gpustack/gpustack
That's it. Open your browser, navigate to http://<your-server-ip>, and you'll see the GPUStack dashboard. The first time you log in, you'll set up your admin credentials.
Step 2: Add Your GPU Workers
Now for the fun part. On each worker node, make sure you have the NVIDIA driver and NVIDIA Container Toolkit installed, then run:
sudo docker run -d --name gpustack-worker \
    --restart unless-stopped \
    --gpus all \
    -e GPUSTACK_SERVER_URL=http://<your-server-ip> \
    -e GPUSTACK_TOKEN=<your-token> \
    gpustack/gpustack
Replace the server URL and token (grab the token from the GPUStack dashboard). Within seconds, your worker appears in the cluster view with GPU model info, VRAM capacity, and health status.
Rinse and repeat for every GPU machine you want to add. Got 3 machines? Three commands. Got 30? Thirty commands, or one Ansible playbook if you're smart about it.
Running the worker command is actually the easiest part. The real final boss of GPU clusters is usually getting the drivers and toolkit installed correctly on the host.
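If you'd rather script the rollout than paste the command thirty times, here is a minimal sketch that builds the same worker command for a list of hosts and wraps it in ssh. The host names and the helper names (build_worker_command, build_ssh_commands) are hypothetical, and it assumes passwordless ssh and sudo on each node. It only prints the commands; it doesn't run anything.

```python
import shlex


def build_worker_command(server_url: str, token: str,
                         image: str = "gpustack/gpustack") -> list[str]:
    """Return the worker `docker run` command from Step 2 as an argument list."""
    return [
        "sudo", "docker", "run", "-d",
        "--name", "gpustack-worker",
        "--restart", "unless-stopped",
        "--gpus", "all",
        "-e", f"GPUSTACK_SERVER_URL={server_url}",
        "-e", f"GPUSTACK_TOKEN={token}",
        image,
    ]


def build_ssh_commands(hosts: list[str], server_url: str,
                       token: str) -> dict[str, list[str]]:
    """Map each host to an ssh invocation that would start a worker there."""
    # shlex.join quotes the remote command so it survives the remote shell.
    remote = shlex.join(build_worker_command(server_url, token))
    return {host: ["ssh", host, remote] for host in hosts}


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own machines and credentials.
    hosts = ["gpu-node-01", "gpu-node-02", "gpu-node-03"]
    plans = build_ssh_commands(hosts, "http://<your-server-ip>", "<your-token>")
    for host, command in plans.items():
        print(shlex.join(command))
```

Review the printed commands first, then hand each list to subprocess.run or paste them into a shell. Keeping the token out of shell history is left to you.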
Step 3: Deploy a Model
Head over to the model catalog in the web UI. GPUStack supports pulling models from Hugging Face and the Ollama Library. Pick a model and click deploy.
Here's where the scheduler really excels. It reads the model's metadata, estimates how much VRAM, compute, and system memory the model needs, then works out which workers can handle it. If the model is too big for a single GPU, it can shard it across multiple cards.
You don't have to manually calculate whether a 70B parameter model fits on your hardware. GPUStack does the math for you.
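To get a feel for the math being automated, here is a rough back-of-the-envelope sketch. This is not GPUStack's scheduler, just the common rule of thumb that weights take parameter count times bytes per parameter. The 20% overhead for KV cache and activations is an assumption, and the function names are hypothetical.

```python
import math


def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param is 2.0 for 16-bit weights and about 0.5 for 4-bit.
    """
    return params_billion * bytes_per_param


def min_gpus(params_billion: float, gpu_vram_gb: float,
             bytes_per_param: float = 2.0, overhead: float = 0.2) -> int:
    """Lower bound on identical GPUs needed to hold the model.

    overhead is an assumed fudge factor for KV cache and activations;
    real usage depends on context length and batch size.
    """
    needed = weights_gb(params_billion, bytes_per_param) * (1 + overhead)
    return math.ceil(needed / gpu_vram_gb)


if __name__ == "__main__":
    print("70B, 16-bit, 80 GB cards:", min_gpus(70, 80))
    print("70B, 4-bit, 24 GB cards:", min_gpus(70, 24, bytes_per_param=0.5))
```

A 70B model at 16-bit precision is about 140 GB of weights before any overhead, so it can't fit on a single 80 GB card. Treat the result as a floor: inference engines often need a GPU count that divides the model evenly, which can push the real number higher.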
Step 4: Call the API
Once the model is running, you get an OpenAI-compatible endpoint. Grab an API key from the dashboard and test it:
curl http://<your-server-ip>/v1/chat/completions \
    -H "Authorization: Bearer <your-api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3",
        "messages": [
            {"role": "user", "content": "Explain GPU cluster management in one paragraph."}
        ]
    }'
If you're already using the OpenAI Python SDK, switching to your GPUStack endpoint only takes a new base URL and API key:
from openai import OpenAI

# Point the standard OpenAI client at your GPUStack server.
client = OpenAI(
    base_url="http://<your-server-ip>/v1",
    api_key="<your-api-key>"
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello from my own GPU cluster!"}]
)
print(response.choices[0].message.content)
Your application code stays the same. Your infrastructure is now fully under your control.
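For scripts where installing the SDK isn't an option, the same request can be built with nothing but the standard library. A minimal sketch, assuming the endpoint path and payload shape from the curl example above. The helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str,
                       model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(base_url: str, api_key: str, model: str, prompt: str,
         timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, api_key, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling chat("http://<your-server-ip>", "<your-api-key>", "llama3", "Hello!") should return the same text the SDK example prints, assuming a model named llama3 is deployed.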
Why This Actually Matters
Let me break down the features that make GPUStack more than a nice-looking dashboard.
Multi-Backend Flexibility
GPUStack supports vLLM, SGLang, and TensorRT-LLM out of the box. This matters because no single engine is best for every workload. vLLM is great at high-throughput batch processing. TensorRT-LLM squeezes out every last drop of performance on NVIDIA hardware. SGLang shines with structured generation. GPUStack lets you pick the right tool for each deployment, or lets the scheduler pick for you.
Built-In Monitoring
GPUStack integrates with Grafana and Prometheus, giving you real-time dashboards for GPU utilization, VRAM usage, token throughput, and API request rates. No need to bolt on a separate monitoring stack (which usually ends up being three half-finished Grafana dashboards anyway). When something breaks at 2 AM, you'll know exactly which GPU on which machine is the problem.
Automated Failure Recovery
We’ve all been there: a node drops off the map because of a weird PCIe bus error or a driver mismatch that only appears under heavy load. Normally, that means your inference API just returns 500s until you manually intervene. GPUStack handles the panic phase for you.
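Server-side recovery still leaves a short window where a request can fail, so it's worth having the client retry. The sketch below is a generic client-side pattern, not something GPUStack ships or requires. The name call_with_retry, the default delays, and the choice of ConnectionError as the retryable error are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(send: Callable[[], T], attempts: int = 4,
                    base_delay: float = 0.5,
                    retry_on: tuple = (ConnectionError,),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call send() until it succeeds, backing off exponentially between tries.

    Only exceptions listed in retry_on are retried; anything else propagates
    immediately. The last retryable error is re-raised once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Wrap whatever function sends your request, and add your HTTP client's error types to retry_on so that 5xx responses are retried too.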
When Should You Use GPUStack?
GPUStack isn't the right fit for every scenario. Here's a quick way to think about it:
Use GPUStack if:
You have 2+ GPU machines and want to serve LLMs or other AI models behind a unified API. Especially if your team doesn't want to become full-time infrastructure engineers just to keep models running.
You want to run inference on your own hardware instead of paying per-token to a cloud provider. The cost savings at scale are real, and GPUStack removes the operational overhead that usually makes self-hosting painful.
Maybe skip GPUStack if:
You have a single GPU and just want to run a model locally for personal use. Tools like Ollama are simpler for that use case.
You're already deep into a custom Kubernetes-based ML platform with Kubeflow or similar. GPUStack can work alongside Kubernetes, but if you've already invested heavily in that ecosystem, the overlap might not be worth it.
The Bigger Picture
The AI infrastructure landscape is shifting. A year ago, most teams defaulted to API providers for inference. Today, with open-weight models getting better every month and GPU costs coming down, self-hosted inference is becoming a real option. Not just for Big Tech, but for startups and mid-size companies too.
The bottleneck isn't hardware anymore. It's operations. It's the glue code between "we have GPUs" and "our application can reliably call a model." GPUStack is a serious attempt at closing that gap, and it's open source under the Apache 2.0 license, so you can inspect, modify, and deploy it without vendor lock-in.
If you’re sitting on a pile of hardware that’s currently just acting as expensive space heaters, or if you’re tired of seeing cloud inference bills that look like mortgage payments, give this a shot. You might find that self-hosting is actually viable again!