
Build a DIY AI Model Hosting Platform With vLLM

vLLM is an open-source AI inference engine that makes hosting AI models efficient. It can be used to power LLMs in real products and services.

By Bhala Ranganathan · Mar. 12, 25 · Tutorial

One of the biggest challenges that developers and researchers face is deploying models for AI inference at scale. Traditionally, this means relying on cloud services or complex server setups that can be expensive and resource intensive. With innovations like the vLLM AI inference engine, however, do-it-yourself (DIY) model hosting is becoming more accessible and efficient, making it possible to build cost-effective model-serving solutions for a wide range of machine learning needs.

vLLM

vLLM is an AI inference engine designed to serve large language models (LLMs) efficiently at scale. It is a robust, high-performance engine that provides a streamlined approach to serving AI models, and it stands out for its ability to optimize resources while maintaining low latency and high throughput even with large-scale models. vLLM allows for faster inference times, improved memory management, and optimized execution, all of which are crucial for hosting models effectively in a DIY setup.
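For reference, on a machine that already has a supported GPU, the quickest path is usually to install vLLM from PyPI and start its OpenAI-compatible server directly. The sketch below assumes a recent vLLM release (the exact CLI may differ between versions); the rest of this article instead walks through a CPU-only Docker setup.

Shell

# Install vLLM from PyPI (the default wheels target GPU machines)
pip install vllm

# Start the OpenAI-compatible server with a small instruct model
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000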

DIY Model Hosting

DIY model hosting provides several advantages, mainly:

  • Cost savings. By setting up one's own infrastructure, one can eliminate the recurring costs of cloud service subscriptions. It gives the flexibility to scale the system as needed and only pay for the hardware in use. 
  • Data privacy and security. For sensitive applications, hosting models in-house ensures that private data doesn’t need to leave the premises. This can be a major selling point for organizations handling private or proprietary information.
  • Control and customization. Self-hosting models allows complete control over the environment in which the models run. This includes custom configurations for optimal performance and integration with existing systems.

Benefits of DIY Model Hosting With vLLM

  • Low latency and high throughput. One of the most compelling reasons to choose vLLM is its ability to deliver low latency and high throughput, even for large models. In this context, latency is the time it takes for an AI model to process input and return a result, while throughput is the number of queries or operations that can be processed in a given time frame.
  • Scalability. A key aspect of AI hosting is ensuring that the infrastructure can scale according to demand. vLLM allows a deployment to scale up or down based on traffic: whether it is running a single small model or several large language models, vLLM can efficiently distribute the load across multiple machines, ensuring consistent performance (a minimal scale-out sketch follows this list).
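As a rough illustration of scaling out on a single host, one can run several independent server instances on different ports and place any ordinary HTTP load balancer (nginx, HAProxy, etc.) in front of them. The sketch below is hypothetical: it reuses the vllm-cpu-env image built in the Steps section later in this article and relies on the server's --port flag.

Shell

# Hypothetical scale-out sketch: two CPU server instances on different ports,
# fronted by any HTTP load balancer of your choice
docker run -d --rm --network=host vllm-cpu-env --model Qwen/Qwen2.5-1.5B-Instruct --port 8000
docker run -d --rm --network=host vllm-cpu-env --model Qwen/Qwen2.5-1.5B-Instruct --port 8001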

Getting Started With vLLM

vLLM is open source, and installation is relatively simple. The sections below cover installing and running it on Ubuntu without a GPU; other detailed installation guides are available in vLLM's GitHub repository. Once the model is up and running, settings can be tweaked to optimize performance and scale based on demand, and vLLM makes it possible to track performance, adjust resources, and update models as needed.

Prerequisites

  • Ubuntu (no GPU needed) and Docker
  • Conda (a minimal setup sketch for both follows this list)
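If Conda and Docker are not already installed, the following is one possible setup sketch for a stock Ubuntu machine; the official installation guides for both tools cover other options.

Shell

# One way to install Miniconda (provides the conda command)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# One way to install Docker on Ubuntu
sudo apt-get update && sudo apt-get install -y docker.io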

Steps

1. Set up a Conda virtual environment.

Shell
 
conda create -n vllm python=3.12 -y
conda activate vllm


2. Clone the vLLM repository.

Shell
 
git clone https://github.com/vllm-project/vllm.git
cd vllm


3. Build and run a vLLM server for CPU.

Shell
 
# Build a CPU-only vLLM image and start the OpenAI-compatible server with a small Qwen instruct model
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
docker run --rm -it --network=host vllm-cpu-env --model Qwen/Qwen2.5-1.5B-Instruct

...
INFO 02-17 06:10:39 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
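The container forwards any additional flags to the vLLM server, which is one way to tune its behavior. As a hedged example (check your vLLM version's --help output for the exact flags), the run below caps the context length and requires an API key on every request:

Shell

# Example: cap the maximum sequence length and require a bearer token on requests
docker run --rm -it --network=host vllm-cpu-env \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --max-model-len 2048 \
    --api-key my-secret-token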


4. Run a simple curl request against the vLLM server and check the response:

Input: "What is your name?"
Output: "I am a computer program, so I don't have a physical name. Is there anything specific you would like to know or discuss? I'm here to help with any questions you might have! How can I assist you today?"

Shell
 
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "What is your name ?",
        "max_tokens": 50,
        "temperature":0
    }'
    
{
    "id": "cmpl-d04ccc0a035c4e27a263b42217524299",
    "object": "text_completion",
    "created": 1739774404,
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "choices": [
        {
            "index": 0,
            "text": " I am a computer program, so I don't have a physical name. Is there anything specific you would like to know or discuss? I'm here to help with any questions you might have! How can I assist you today?",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 151643,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 5,
        "total_tokens": 52,
        "completion_tokens": 47,
        "prompt_tokens_details": null
    }
}
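Because the server is OpenAI-compatible, the same model can also be queried through the chat completions endpoint, which applies the model's chat template. A minimal sketch (response omitted):

Shell

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "What is your name?"}],
        "max_tokens": 50,
        "temperature": 0
    }'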


5. The server-side logs for this request provide a lot of information about throughput, cache usage, and so on.

Shell
 
INFO 02-17 06:40:04 logger.py:39] Received request cmpl-d04ccc0a035c4e27a263b42217524299-0: prompt: 'What is your name ?', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=50, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [3838, 374, 697, 829, 937], lora_request: None, prompt_adapter_request: None.
INFO 02-17 06:40:04 engine.py:275] Added request cmpl-d04ccc0a035c4e27a263b42217524299-0.
INFO 02-17 06:40:04 metrics.py:455] Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.


6. Additional important metrics are available at http://localhost:8000/metrics. The server exposes the metrics below, which can be used for performance optimization. My previous article also covers more information about inference-time compute.

  • time_to_first_token_seconds: The time it takes for the first output token to be produced
  • time_per_output_token_seconds: The time it takes for each output token to be produced
  • e2e_request_latency_seconds: The overall time taken for a request
  • request_queue_time_seconds: The time a request waits in the queue
  • request_inference_time_seconds: The time it takes for inference
  • cache_config_info: Cache configuration information
  • num_requests_running: Total count of requests that are running
  • num_requests_waiting: Total count of requests that are waiting
  • gpu_cache_usage_perc: GPU KV-cache usage
  • cpu_cache_usage_perc: CPU KV-cache usage
  • cpu_prefix_cache_hit_rate: Hit rate of the CPU prefix cache
  • gpu_prefix_cache_hit_rate: Hit rate of the GPU prefix cache
  • request_success_total: Total count of successful requests


Shell
 
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_sum{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.5297608375549316
vllm:time_to_first_token_seconds_count{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 5.0

# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_sum{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 8.178969383239746
vllm:time_per_output_token_seconds_count{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 196.0

# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_sum{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 8.710693597793579
vllm:e2e_request_latency_seconds_count{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 5.0

# HELP vllm:request_queue_time_seconds Histogram of time spent in WAITING phase for request.
# TYPE vllm:request_queue_time_seconds histogram
vllm:request_queue_time_seconds_sum{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.005610227584838867
vllm:request_queue_time_seconds_count{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 5.0

# HELP vllm:request_inference_time_seconds Histogram of time spent in RUNNING phase for request.
# TYPE vllm:request_inference_time_seconds histogram
vllm:request_inference_time_seconds_sum{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 8.70508337020874
vllm:request_inference_time_seconds_count{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 5.0

# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_kvcache_space_bytes="4294967296",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="0",num_gpu_blocks="9362",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0

# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.0

# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.0

# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.0

# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="Qwen/Qwen2.5-1.5B-Instruct"} 0.0

# HELP vllm:cpu_prefix_cache_hit_rate CPU prefix cache block hit rate.
# TYPE vllm:cpu_prefix_cache_hit_rate gauge
vllm:cpu_prefix_cache_hit_rate{model_name="Qwen/Qwen2.5-1.5B-Instruct"} -1.0

# HELP vllm:gpu_prefix_cache_hit_rate GPU prefix cache block hit rate.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
vllm:gpu_prefix_cache_hit_rate{model_name="Qwen/Qwen2.5-1.5B-Instruct"} -1.0

# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="Qwen/Qwen2.5-1.5B-Instruct"} 2.0
vllm:request_success_total{finished_reason="stop",model_name="Qwen/Qwen2.5-1.5B-Instruct"} 3.0


7. Grafana dashboards can also be configured to consume the above metrics; the Grafana UI is typically available at http://localhost:3000. The picture below shows such a dashboard plotting vllm:e2e_request_latency_seconds_sum.

[Image: a Grafana dashboard showing vllm:e2e_request_latency_seconds_sum]
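This assumes a Prometheus and Grafana stack is already scraping the server. One possible way to stand such a stack up locally is sketched below; the scrape target, ports, and container images are conventional defaults rather than anything specific to vLLM.

Shell

# Minimal Prometheus config that scrapes the vLLM metrics endpoint
cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['localhost:8000']
EOF

# Run Prometheus and Grafana; Grafana then uses Prometheus as a data source
docker run -d --network=host -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
docker run -d --network=host grafana/grafana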


Conclusion

The power, flexibility, and scalability offered by vLLM make it an invaluable tool for anyone looking to take their AI projects to the next level without the financial strain of relying on expensive cloud services. vLLM supports several types of hardware accelerators and provides an OpenAI-compatible API server that simplifies deployment and integration. Because it is fast, efficient, and simple to operate, it opens up new possibilities for using LLMs in all sorts of applications and makes the power of large AI models accessible to more people.

References

  1. Welcome to vLLM
  2. Understanding Inference Time Compute
