Beyond Traditional Load Balancers: The Role of Inference Routers in AI Systems
Inference routing is the process of routing AI inference requests to the most suitable model based on cost, latency, quality, etc.
Inference routing directs each AI inference request to the most suitable model based on criteria such as cost, latency, and quality. Unlike the simple round-robin routing found in traditional load balancers, an inference router weighs factors such as request complexity, cost constraints, and GPU resource availability. It acts as a decision-making layer that ensures each request is served by the optimal model, improving efficiency and performance in multi-model environments. Examples of inference routers include the vLLM router, the Azure inference router, and OpenRouter.
Selecting the Correct Model for the Current Use Case
Selecting the correct model for a use case involves benchmarking and evaluating models against well-defined criteria, as illustrated in Azure AI Foundry’s model benchmarks approach. This process starts by identifying the request type, such as text generation, summarization, or reasoning, and then comparing candidate models on metrics like accuracy, latency, throughput, and cost. Benchmarks provide standardized tests that simulate real-world use cases, enabling developers to assess trade-offs between performance and efficiency.
For example, a model with high accuracy but slower response time may suit complex reasoning requests, while a faster, lower-cost model might be ideal for simple queries. Incorporating these benchmark insights ensures that model selection aligns with business priorities, delivering reliable and optimized outcomes for the current use case. Below is a model leaderboard example as shown in the Azure AI Foundry portal, based on their scoring methodology, intended to guide model selection.
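To make the trade-off concrete, here is a minimal Python sketch of benchmark-driven selection: it filters candidates by a minimum accuracy and a maximum latency, then picks the cheapest one that remains. The model names and all numbers are hypothetical placeholders, not real benchmark results, and the selection rule is only one reasonable choice among many.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    model: str
    accuracy: float            # fraction of benchmark tasks solved, 0.0-1.0
    latency_ms: float          # typical response latency
    cost_per_1k_tokens: float  # price in USD


# Hypothetical numbers for illustration only; not real benchmark data.
CANDIDATES = [
    BenchmarkResult("large-reasoning-model", 0.92, 1800.0, 0.0300),
    BenchmarkResult("mid-size-model", 0.85, 600.0, 0.0040),
    BenchmarkResult("small-fast-model", 0.71, 150.0, 0.0005),
]


def select_model(
    candidates: List[BenchmarkResult],
    min_accuracy: float,
    max_latency_ms: float,
) -> Optional[BenchmarkResult]:
    """Return the cheapest candidate that meets both constraints, or None."""
    eligible = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_1k_tokens)
```

With these placeholder figures, a strict accuracy requirement selects the large model, while a tight latency budget with a relaxed accuracy requirement selects the small one.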

Importance of Inference Routing
Given how involved model selection is, inference routing is critical: modern AI systems often involve multiple models with varying capabilities and costs. Without routing, every request might go to a single large model, incurring unnecessary cost and latency.
Intelligent inference routing addresses this by sending simple requests to smaller, cost-effective models, reducing latency through cache-aware and utilization-aware strategies, and distributing requests across cloud and edge resources for scalability. This dynamic approach promotes high availability, cost efficiency, and low latency, making it well suited to real-time AI applications.
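As a rough illustration of sending simple requests to a smaller model, the sketch below scores a prompt with a crude heuristic and routes on a threshold. The keyword list, the scoring formula, the threshold, and the model names `small-model` and `large-model` are all assumptions made for this example; production routers typically rely on a trained classifier rather than keyword matching.

```python
# Hypothetical phrases that hint at a multi-step reasoning request.
REASONING_HINTS = ("prove", "derive", "step by step", "explain why", "analyze")


def estimate_complexity(prompt: str) -> float:
    """Crude score in [0, 1]: longer prompts and reasoning hints score higher."""
    words = prompt.lower().split()
    length_score = min(len(words) / 200.0, 1.0)
    text = " ".join(words)
    hint_score = 0.5 if any(hint in text for hint in REASONING_HINTS) else 0.0
    return min(length_score + hint_score, 1.0)


def route(prompt: str, threshold: float = 0.5) -> str:
    """Send complex prompts to the large model and everything else to the small one."""
    if estimate_complexity(prompt) >= threshold:
        return "large-model"
    return "small-model"
```

A short factual question stays on the small model, while a prompt that asks for a step-by-step proof crosses the threshold and goes to the large one.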
Comparison With Traditional Load Balancers
Inference routing differs from typical load balancing because it makes AI-aware decisions driven by request context rather than simply distributing traffic based on request volume and network load. While a load balancer focuses on balancing server load using algorithms like round robin, an inference router evaluates request semantics, model capabilities, latency requirements, cost constraints, and even cache state to select the most suitable model for each request.
| | Inference routers | Load balancers |
|---|---|---|
| Traffic routing logic | May use metrics specific to inference requests like complexity, model capability, cost, latency, response quality, etc. | May use traffic and network-level metrics like pending requests, CPU load, connection count, etc. |
| Optimization goal | Accuracy, cost efficiency, quality, GPU resource utilization, etc. | General service goals like throughput, availability, latency, etc. |
| Request context awareness | May be aware of things like the type of request (reasoning vs non-reasoning request), GPU usage, CPU usage, etc., specific to AI inferencing. | Not typically aware of request context beyond high-level request-related metrics. |
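The difference can be sketched in a few lines of Python. The round-robin function ignores the request entirely, while the inference-aware function first filters backends by capability and then prefers the least-utilized GPU. The `Backend` fields, the backend names, and the utilization figures are hypothetical and chosen only to show the contrast.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass(frozen=True)
class Backend:
    name: str
    supports_reasoning: bool
    gpu_utilization: float  # 0.0 (idle) to 1.0 (saturated)


# Hypothetical backends for illustration.
BACKENDS = [
    Backend("small-a", supports_reasoning=False, gpu_utilization=0.20),
    Backend("large-a", supports_reasoning=True, gpu_utilization=0.85),
    Backend("large-b", supports_reasoning=True, gpu_utilization=0.40),
]


def round_robin(backends: List[Backend]) -> Iterator[Backend]:
    """Traditional balancing: cycle through backends, ignoring the request."""
    return cycle(backends)


def inference_aware(backends: List[Backend], needs_reasoning: bool) -> Backend:
    """Keep only capable backends, then pick the one with the lowest GPU load."""
    capable = [b for b in backends if b.supports_reasoning or not needs_reasoning]
    if not capable:
        raise LookupError("no backend can serve this request")
    return min(capable, key=lambda b: b.gpu_utilization)
```

Here round robin would eventually hand a reasoning request to `small-a`, which cannot serve it well, whereas the inference-aware version sends it to `large-b`.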
Third-Party vs. Per-Cloud Inference Routers
Third-party inference routers and cloud provider native routers differ significantly in four key areas, namely model diversity, cost optimization, enterprise support, and service ecosystem. Cloud provider routers integrate seamlessly with their own platforms, offering native monitoring, auto scaling, and compliance, while third-party routers require custom integration but can operate across multiple cloud environments. Cloud providers typically offer robust SLAs and enterprise-grade support, whereas third-party routers vary in support quality and responsiveness.
| | Third-party inference routers | Cloud provider native inference routers |
|---|---|---|
| Example | vLLM router, OpenRouter | Azure inference router |
| Model diversity | Supports multiple cloud providers and open source models | Competing models may not be supported by the provider |
| Cost optimization | Can route across providers for best pricing | May be limited to optimizations within the provider’s pricing tiers |
| Enterprise support | Depends and may vary | May be better compared to open source community support |
| Services ecosystem | May require custom integration with cloud services and APIs like storage, monitoring, etc. | May provide native integration tools |
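The cost optimization row can be illustrated with a small Python sketch that picks the cheapest provider offering a given model. Restricting `allowed_providers` to a single entry mimics a router confined to one provider's pricing. The provider names, model names, and prices are invented for this example and do not reflect any real price list.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical prices in USD per 1M input tokens, keyed by (provider, model).
PRICES: Dict[Tuple[str, str], float] = {
    ("provider-a", "open-model-7b"): 0.20,
    ("provider-b", "open-model-7b"): 0.15,
    ("provider-b", "open-model-70b"): 0.90,
    ("provider-c", "open-model-70b"): 0.80,
}


def cheapest_provider(
    model: str,
    prices: Dict[Tuple[str, str], float] = PRICES,
    allowed_providers: Optional[Set[str]] = None,
) -> Optional[Tuple[str, float]]:
    """Return (provider, price) for the lowest-priced offer, or None if no offer exists."""
    offers = [
        (price, provider)
        for (provider, offered_model), price in prices.items()
        if offered_model == model
        and (allowed_providers is None or provider in allowed_providers)
    ]
    if not offers:
        return None
    price, provider = min(offers)
    return provider, price
```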
Example Usage for Better Understanding
The example below is based on the vllm-semantic-router installation steps.
Step 1: Start vLLM server for TinyLlama/TinyLlama-1.1B-Chat-v1.0.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 11434 --served-model-name TinyLlama/TinyLlama-1.1B-Chat-v1.0
INFO 10-10 22:34:02 api_server.py:937] Starting vLLM API server on http://0.0.0.0:11434
Step 2: Modify the sections below in config/config.yaml to point to the model in use. The endpoint address and port must match the vLLM server started in Step 1.
vllm_endpoints:
  - name: "endpoint1"
    address: "127.0.0.1"  # IPv4 address - REQUIRED format
    port: 8000
    models:
      - "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    weight: 1
model_config:
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0":
    reasoning_family: "gpt-oss"  # This model uses GPT-OSS reasoning syntax
    preferred_endpoints: ["endpoint1"]
    pii_policy:
      allow_by_default: true
categories:
  - name: math
    system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
    model_scores:
      - model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
        score: 1.0
        use_reasoning: true  # Enable reasoning for complex math
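To show how the `categories` and `model_scores` entries could drive a decision, the following Python sketch picks the highest-scored model for a category. It mirrors the field names from the configuration above, but the selection function is an illustrative assumption about score-based routing, not the router's actual implementation.

```python
from typing import Any, Dict, Optional, Tuple

# The categories section of the configuration, expressed as a Python dict.
CONFIG: Dict[str, Any] = {
    "categories": [
        {
            "name": "math",
            "model_scores": [
                {
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "score": 1.0,
                    "use_reasoning": True,
                },
            ],
        },
    ],
}


def select_for_category(
    config: Dict[str, Any], category: str
) -> Optional[Tuple[str, float, bool]]:
    """Return (model, score, use_reasoning) for the best-scored model, or None."""
    for entry in config["categories"]:
        if entry["name"] != category:
            continue
        scores = entry.get("model_scores", [])
        if not scores:
            return None
        best = max(scores, key=lambda s: s["score"])
        return best["model"], best["score"], best.get("use_reasoning", False)
    return None
```

With a single model configured, every math request resolves to TinyLlama with reasoning enabled. Adding more `model_scores` entries to a category would let the highest score win.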
Step 3: Start vllm-semantic-router.
make run-envoy
[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:932] runtime: {}
[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:765] Starting admin HTTP server at 127.0.0.1:19000
make run-router
{"level":"info","ts":"2025-10-10T21:37:26.501979783Z","caller":"observability/logging.go:140","msg":"Starting vLLM Semantic Router ExtProc with config: config/config.yaml"}
{"level":"info","ts":"2025-10-10T21:37:26.502091557Z","caller":"observability/logging.go:140","msg":"Starting insecure LLM Router ExtProc server on port 50051..."}
{"level":"info","ts":"2025-10-10T21:37:26.502190752Z","caller":"observability/logging.go:140","msg":"Starting Classification API server on port 8080"}
{"level":"info","ts":"2025-10-10T21:37:26.50408922Z","caller":"observability/logging.go:140","msg":"Found global classification service on attempt 1/5"}
{"level":"info","ts":"2025-10-10T21:37:26.504332195Z","caller":"observability/logging.go:140","msg":"System prompt configuration endpoints enabled"}
{"level":"info","ts":"2025-10-10T21:37:26.504360331Z","caller":"observability/logging.go:140","msg":"Classification API server listening on port 8080"}
Step 4: Issue a simple request against the vLLM semantic router.
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of x^2?"}
    ]
  }'
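The same request can be issued from Python using only the standard library. This sketch reuses the URL and payload from the curl command; the helper names `build_request` and `send` are introduced here for illustration, and the assumption that the response body is JSON follows from the OpenAI-style `/v1/chat/completions` path rather than from anything shown above.

```python
import json
import urllib.request
from typing import Any, Dict

# Same URL as the curl example.
ROUTER_URL = "http://localhost:8801/v1/chat/completions"


def build_request(prompt: str, url: str = ROUTER_URL) -> urllib.request.Request:
    """Build a POST request that asks the router to choose the model ("auto")."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Envoy and the router running, `send(build_request("What is the derivative of x^2?"))` issues the same call as the curl command.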
The logs below, emitted by the vLLM Semantic Router, show how a request moves from receipt through classification to the endpoint that hosts the selected model.
{"level":"info","ts":"2025-10-10T22:49:30.84348937Z","caller":"observability/logging.go:140","msg":"Started processing a new request"}
{"level":"info","ts":"2025-10-10T22:49:30.844178952Z","caller":"observability/logging.go:140","msg":"Received request headers"}
{"level":"info","ts":"2025-10-10T22:49:30.844804801Z","caller":"observability/logging.go:140","msg":"Received request body {\n \"model\": \"auto\",\n \"messages\": [\n {\"role\": \"user\", \"content\": \"What is the derivative of x^2?\"}\n ]\n }"}
...
{"level":"info","ts":"2025-10-10T22:49:31.801283439Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.801302404Z","caller":"observability/logging.go:140","msg":"Selected model TinyLlama/TinyLlama-1.1B-Chat-v1.0 for category math with score 1.0000"}
{"level":"info","ts":"2025-10-10T22:49:31.971759021Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004"}
{"level":"info","ts":"2025-10-10T22:49:31.971831455Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.971877997Z","caller":"observability/logging.go:140","msg":"Routing to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.115533183Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004, entropy_available=true"}
{"level":"info","ts":"2025-10-10T22:49:32.115686518Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math), reasoning_decision: use=true, confidence=0.630, reason=low_uncertainty_trust_classification"}
{"level":"info","ts":"2025-10-10T22:49:32.115719018Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision: category='math', confidence=0.700, use_reasoning=true, reason=low_uncertainty_trust_classification, strategy=trust_top_category"}
{"level":"info","ts":"2025-10-10T22:49:32.115776879Z","caller":"observability/logging.go:140","msg":"Top predicted categories: [{math 0.7003756} {physics 0.10189054} {chemistry 0.09165735}]"}
{"level":"info","ts":"2025-10-10T22:49:32.115788663Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision for this query: true on [TinyLlama/TinyLlama-1.1B-Chat-v1.0] model (confidence: 0.630, reason: low_uncertainty_trust_classification)"}
{"level":"info","ts":"2025-10-10T22:49:32.115832364Z","caller":"observability/logging.go:140","msg":"Selected endpoint address: 127.0.0.1:8000 for model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116305725Z","caller":"observability/logging.go:140","msg":"Applied reasoning mode (enabled: true) with effort (high) to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116349513Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt to the beginning of messages (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116364889Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt for category: math (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116376391Z","caller":"observability/logging.go:140","msg":"System prompt injection completed for category: math, body size: 320 bytes"}
{"level":"info","ts":"2025-10-10T22:49:32.116383792Z","caller":"observability/logging.go:140","msg":"Use new model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116406184Z","caller":"observability/logging.go:136","msg":"routing_decision","event":"routing_decision","reason_code":"auto_routing","request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827","reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000","routing_latency_ms":1271,"original_model":"auto","selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","category":"math","reasoning_effort":"high"}
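Because the router emits one JSON object per log line, the final `routing_decision` event is easy to extract programmatically. The sketch below parses that last log line with Python's standard `json` module. The helper name `routing_decisions` is introduced here for illustration, and it assumes every relevant line is a standalone JSON object, as in the output above.

```python
import json
from typing import Any, Dict, Iterable, Iterator

# The routing_decision line from the log output above, copied verbatim.
LOG_LINE = '{"level":"info","ts":"2025-10-10T22:49:32.116406184Z","caller":"observability/logging.go:136","msg":"routing_decision","event":"routing_decision","reason_code":"auto_routing","request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827","reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000","routing_latency_ms":1271,"original_model":"auto","selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","category":"math","reasoning_effort":"high"}'


def routing_decisions(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield parsed routing_decision events, skipping other or malformed lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line, for example Envoy's text output
        if isinstance(record, dict) and record.get("event") == "routing_decision":
            yield record
```

Fields such as `selected_model`, `category`, and `routing_latency_ms` can then feed dashboards or alerts on how requests are being routed.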
Conclusion
In conclusion, inference routing is a critical component of modern AI systems, enabling intelligent routing of requests to the most suitable models. Unlike traditional load balancers, it can optimize for model cost, latency, and accuracy. Enterprises can choose cloud-native routers for seamless integration or third-party routers for multi-cloud flexibility and model diversity. As they adopt multi-model architectures, inference routing will continue to play a pivotal role in maximizing resource utilization and ensuring high-quality user experiences.
References
- https://github.com/LLM-inference-router/vllm-router
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-kubernetes-inference-routing-azureml-fe?view=azureml-api-2
- https://openrouter.ai/
- https://ai.azure.com/doc/azure/ai-foundry/concepts/model-benchmarks?tid=e74feb03-7237-44ac-88d8-fe3d3b8e4d9b
- https://blog.vllm.ai/2025/09/11/semantic-router.html