From 0.68 to 10 Requests/Second: Optimizing LLM Serving With vLLM

This article demonstrates how vLLM is a game-changer for efficient GPU memory utilization and what makes it a high-throughput serving and inference engine.

By Rajashree Mandaogane · Sep. 30, 25 · Analysis
GPUs are essential for running large language models (LLMs). But when companies deploy LLMs to production, having powerful GPUs alone isn't enough.

What becomes equally important is handling available GPUs efficiently, so that multiple concurrent user requests can be served within sub-second response times. This requires a software layer above GPUs that can provide request batching, memory optimization, and dynamic resource management. That's exactly what vLLM strives to provide. In this article, we will explore how vLLM achieves this and examine the performance improvements it delivers.

Before diving into vLLM, let's first understand why efficient utilization of GPUs is crucial.

LLM Inference Costs

Modern GPUs for LLM inference are expensive. High-end GPUs used by large enterprises for LLM inference, such as the NVIDIA H100, can cost $30,000 to $40,000 each, and large companies deploy fleets of them. With inefficient memory management, such a powerful and expensive fleet might serve only around 5 concurrent users where it could serve 50.

Let's look into why and how memory management is crucial for high-throughput serving.

Memory Layout

Model weights are expected to take up around 70% of the memory of the GPU.

A small percentage (5-10%) of memory is allocated for temporary activations, which are intermediate computations done for the prompts.

The remaining approximately 20-25% is available for the KV cache, which is a key-value cache for tokens, allowing the model to 'remember' previous tokens without recomputing them. 
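To make this split concrete, here is a rough back-of-the-envelope sketch. The 24 GB GPU, the exact fractions, and the model shape (32 layers, hidden size 4096, fp16, standard multi-head attention) are illustrative assumptions, not measurements from this article's setup.

```python
# Rough GPU memory budget using the approximate split described above.
# All concrete numbers here are illustrative assumptions.

def kv_cache_budget_gb(total_gb, weights_frac=0.70, activations_frac=0.08):
    """Memory (in GB) left for the KV cache after weights and activations."""
    return total_gb * (1.0 - weights_frac - activations_frac)

def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    """Each token stores one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value

budget_gb = kv_cache_budget_gb(total_gb=24)
per_token = kv_bytes_per_token(num_layers=32, hidden_size=4096)
max_cached_tokens = int(budget_gb * 1024**3 // per_token)

print(f"KV cache budget: {budget_gb:.2f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in the cache: {max_cached_tokens}")
```

The last number is the total token capacity shared by all concurrent requests, which is why the way this region is carved up matters so much for throughput.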

Inference

LLM inference is a sequential process in which a model generates tokens one at a time, based on the input prompt and the previously generated output tokens. This repeats until the model outputs an end-of-sequence token. Every step reads from and writes to the KV cache introduced in the section above, which means serving throughput depends on how efficiently this critical segment of memory is managed.

Traditional serving systems allocate a contiguous chunk of KV cache memory for each request. Because output length varies from request to request and is not known in advance, these systems pre-allocate a block large enough for each request's maximum possible sequence length. For example, if a system reserves memory for 2048 tokens per request but most conversations only use 200-500 tokens, roughly 75-90% of the reserved memory sits unused. This waste severely limits how many requests can be batched together, which in turn caps throughput.
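The arithmetic behind that example can be checked with a few lines of Python. The 2048-token reservation and the 200-500 token usage come from the paragraph above; the 20,480-token cache capacity and the 350-token average are hypothetical values chosen only to show the effect on concurrency.

```python
def unused_fraction(reserved_tokens, used_tokens):
    """Share of a per-request reservation that is never written to."""
    return 1.0 - used_tokens / reserved_tokens

def max_concurrent(cache_capacity_tokens, tokens_per_request):
    """How many requests fit if each one holds this many token slots."""
    return cache_capacity_tokens // tokens_per_request

for used in (200, 500):
    idle = unused_fraction(reserved_tokens=2048, used_tokens=used)
    print(f"{used} tokens used: {idle:.0%} of the reservation is idle")

capacity = 20_480  # hypothetical KV cache capacity, in tokens
print("Pre-allocating 2048 tokens each:", max_concurrent(capacity, 2048))
print("Allocating ~350 tokens on demand:", max_concurrent(capacity, 350))
```

Under these assumptions, the same cache holds several times more requests when memory is handed out as it is needed instead of being reserved for the worst case.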

PagedAttention

To address this KV cache management problem, UC Berkeley researchers introduced an algorithm called PagedAttention, which is based on a core operating system memory management technique: paging. The main idea behind OS paging is to divide a program's memory into fixed-size blocks called pages. Pages do not have to be contiguous in physical memory and, more importantly, only the pages that are currently needed have to be loaded. This way, multiple programs can reside in memory and be processed at the same time.

PagedAttention applies the same principles to KV cache, and this is the core foundation of vLLM.
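The following is a toy sketch of the paging idea applied to a KV cache. It is not vLLM's implementation: the class name `PagedKVCache`, the block size of 16, and the method names are hypothetical, and it only tracks which blocks belong to which sequence rather than storing any tensors.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):          # sequence "a" grows to 20 tokens -> 2 blocks
    cache.append_token("a")
for _ in range(5):           # sequence "b" grows to 5 tokens -> 1 block
    cache.append_token("b")
print("Block tables:", cache.block_tables)

cache.free("a")              # "a" finishes; its blocks are reusable immediately
print("Free blocks after releasing 'a':", len(cache.free_blocks))
```

Each sequence wastes at most one partially filled block, and blocks freed by a finished request can be reused by any other request, regardless of where they sit in memory.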

vLLM

vLLM is a high-throughput serving and inference engine. It is fast and efficient due to:

  1. PagedAttention for efficient KV cache management
  2. Prefix caching support to avoid recomputations
  3. Continuous batching of incoming requests for higher throughput
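To illustrate the third point, here is a simplified simulation that counts decode steps under static batching versus continuous batching. It is a sketch under stated assumptions: every request needs at least one output token, one step produces one token for every running request, and prompt processing is ignored. The function names are hypothetical and are not part of vLLM.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))

def continuous_batching_steps(output_lens, batch_size):
    """A finished request's slot is refilled from the queue at the next step."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step: one token per running request
        running = [left - 1 for left in running if left > 1]
    return steps

lens = [10, 2, 2, 2]  # tokens each request will generate
print("Static batching steps:    ", static_batching_steps(lens, batch_size=2))
print("Continuous batching steps:", continuous_batching_steps(lens, batch_size=2))
```

With static batching, the short requests in the first batch wait for the long one before the next batch can start. With continuous batching, the short requests flow through the second slot while the long request keeps running.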

It is an easy-to-use and complete inference stack that includes:

  1. An HTTP server compatible with OpenAI APIs
  2. Integration with popular Hugging Face models
  3. Support for a wide range of commonly used compute infrastructure (such as NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, and TPUs)

Now that we understand the advantages of vLLM, it is time to see it in action!

Benchmarking With vLLM

Here's the configuration and setup I used for the testing:

Configuration

This is the server configuration:

Plain Text
 
Tesla V100 Tensor Core GPU (2) 
10 core CPU
45 GB RAM
32 GB GPU VRAM


The model I chose for this test is "gpt2-large." Its size is a good fit for this server, which matters because the model has to be downloaded to the server and loaded onto its GPUs for inference.

I downloaded and used the following packages and any required dependencies for my test:

pip install transformers flask accelerate requests vllm

vLLM is an inference stack complete with its HTTP server. To demonstrate vLLM's advantages, I created a baseline comparison using a standard approach: a Flask HTTP server that accepts requests and processes them through the LLM individually. This represents how most developers would initially implement LLM serving, without any model-level optimization.

Flask can accept multiple connections at once when threading is enabled, but each request still needs its own individual forward pass through the model. In this test the server runs with threaded=False, so requests are processed strictly one at a time. Either way, the bottleneck sits at the LLM inference layer, not at the HTTP layer.

This is the simple Flask-based HTTP server that accepts requests on an endpoint called `/generate` on port 6000 and uses those as prompts for the model:

Python
 
# custom_server.py
import torch
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM

app = Flask(__name__)

# Load the tokenizer and model once at startup, not per request
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto"           # let accelerate place the model on available devices
)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json()
    prompt = data['prompt']

    # Tokenize the prompt and run generation for this single request
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return jsonify({'response': response})

if __name__ == '__main__':
    # threaded=False: requests are handled strictly one at a time
    app.run(host='0.0.0.0', port=6000, threaded=False)


Note: The endpoint name and port can be changed; this is just what I chose.

I then started this sequential server in one terminal tab by running python3 custom_server.py.


On the second tab of the terminal, I started the vLLM server by running:

Shell
 
python3 -m vllm.entrypoints.openai.api_server \
--model gpt2-large \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.7 \
--port 8000


For the benchmark tests, I wrote a small Python script that sends concurrent requests to both the sequential server and the vLLM server:

Python
 
import threading
import time

import requests

PROMPT = "Tell me about efficient GPU utilization"

def send_request(url, results):
    """Send one request and record its HTTP status code."""
    if "6000" in url:  # Custom server
        response = requests.post(f"{url}/generate",
                                 json={"prompt": PROMPT},
                                 timeout=300)
    else:  # vLLM server (OpenAI-compatible completions endpoint)
        response = requests.post(f"{url}/v1/completions",
                                 json={"model": "gpt2-large",
                                       "prompt": PROMPT,
                                       "max_tokens": 50},
                                 timeout=300)
    results.append(response.status_code)

def test_server(url, name, num_requests=20):
    print(f"\nTesting {name} with {num_requests} requests...")

    results = []  # status codes collected from all threads
    start_time = time.perf_counter()

    # Fire all requests at once, one thread per request
    threads = []
    for _ in range(num_requests):
        thread = threading.Thread(target=send_request, args=(url, results))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    total_time = time.perf_counter() - start_time
    succeeded = sum(1 for code in results if code == 200)
    print(f"{name}: {succeeded}/{num_requests} requests succeeded in {total_time:.2f} seconds")
    print(f"Throughput: {num_requests/total_time:.2f} requests/second")

# Run tests
test_server("http://localhost:6000", "Custom server (No vLLM)", 50)
test_server("http://localhost:8000", "vLLM", 50)


The results speak for themselves!


The 50 requests to the custom sequential server took close to 74 seconds, whereas vLLM processed them in about 5 seconds. That is a roughly 14-15x improvement in request processing speed and throughput.

From my testing, I've seen this improvement range anywhere from 7-15x. 
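As a quick check of the headline numbers, the throughput and speedup can be recomputed from the timings quoted above. The 74 and 5 second figures are the rounded values from the text, so the result is approximate.

```python
def throughput(num_requests, total_seconds):
    """Requests completed per second."""
    return num_requests / total_seconds

baseline = throughput(50, 74)  # sequential Flask server, ~74 s
vllm = throughput(50, 5)       # vLLM server, ~5 s
speedup = vllm / baseline

print(f"{baseline:.2f} req/s -> {vllm:.2f} req/s ({speedup:.1f}x)")
```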

Conclusion

vLLM's combination of PagedAttention, prefix caching, and continuous batching makes it a highly efficient LLM serving and inference engine, delivering a roughly 14x throughput improvement in my testing: from 0.68 to 10 requests/second on the same hardware. While this GPU configuration was modest compared to enterprise production deployments, the results point to vLLM's potential impact at scale.

For enterprises running fleets of expensive H100 GPUs, this efficiency translates directly to ROI: instead of serving a handful of concurrent users per GPU, vLLM can serve dozens of users on the same infrastructure. In production environments where GPU costs can run into millions of dollars annually, vLLM's memory optimization can be a game-changer.
