Beyond Traditional Load Balancers: The Role of Inference Routers in AI Systems

Inference routing is the process of routing AI inference requests to the most suitable model based on cost, latency, quality, etc.

By Bhala Ranganathan · DZone Core · Oct. 13, 25 · Analysis

Inference routing is the process of routing AI inference requests to the most suitable model based on criteria such as cost, latency, and quality. Unlike the simple round-robin routing found in traditional load balancers, an inference router weighs factors such as request complexity, cost constraints, and GPU resource availability when deciding where to send each request. It acts as a decision-making layer that aims to serve each request with the most suitable model, improving efficiency and performance in multi-model environments. Examples of inference routers include vLLM router, Azure inference router, and OpenRouter.

Selecting the Correct Model for the Current Use Case

Selecting the correct model for a use case involves benchmarking and evaluating models against well-defined criteria, as illustrated in Azure AI Foundry’s model benchmarks approach. This process starts by identifying the request type, such as text generation, summarization, or reasoning, and then comparing candidate models on metrics like accuracy, latency, throughput, and cost. Benchmarks provide standardized tests that simulate real-world use cases, enabling developers to assess trade-offs between performance and efficiency. 

For example, a model with high accuracy but slower response time may suit complex reasoning requests, while a faster, lower-cost model might be ideal for simple queries. Incorporating these benchmark insights ensures that model selection aligns with business priorities, delivering reliable and optimized outcomes for the current use case. Below is a model leaderboard example as shown in the Azure AI Foundry portal, based on their scoring methodology, intended to guide model selection.

[Figure: model leaderboard example. Source: Azure AI Foundry model catalog model leaderboards]
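To make the trade-off between accuracy, latency, and cost concrete, here is a minimal Python sketch of benchmark-driven selection. The model names, metric values, and weights are hypothetical placeholders, not figures from the Azure AI Foundry leaderboard, and the weighted-sum scoring is only one plausible way to combine metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    accuracy: float     # fraction of benchmark tasks answered correctly (0..1)
    latency_s: float    # mean seconds per response
    cost_per_1k: float  # USD per 1K tokens


def normalize(values, higher_is_better):
    """Min-max scale to 0..1 so that 1.0 is always the best candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank(candidates, weights):
    """Return model names ordered from best to worst weighted score."""
    acc = normalize([c.accuracy for c in candidates], True)
    lat = normalize([c.latency_s for c in candidates], False)
    cost = normalize([c.cost_per_1k for c in candidates], False)
    scored = [
        (weights["accuracy"] * a + weights["latency"] * l + weights["cost"] * c,
         cand.name)
        for cand, a, l, c in zip(candidates, acc, lat, cost)
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical benchmark results for three candidate models.
CANDIDATES = [
    Benchmark("large-reasoning-model", 0.90, 4.0, 0.030),
    Benchmark("mid-size-model", 0.80, 1.5, 0.010),
    Benchmark("small-fast-model", 0.65, 0.4, 0.002),
]

# Hypothetical priorities: reasoning favors accuracy, simple queries favor speed and cost.
REASONING_WEIGHTS = {"accuracy": 0.8, "latency": 0.1, "cost": 0.1}
SIMPLE_QUERY_WEIGHTS = {"accuracy": 0.2, "latency": 0.4, "cost": 0.4}
```

With these assumed numbers, `rank(CANDIDATES, REASONING_WEIGHTS)` puts the large model first, while `rank(CANDIDATES, SIMPLE_QUERY_WEIGHTS)` puts the small model first, which mirrors the trade-off described above.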


Importance of Inference Routing

Because model selection is this involved, inference routing is critical: modern AI systems often involve multiple models with varying capabilities and costs. Without routing, all requests might go to a single large model, leading to unnecessary cost and latency.

Intelligent inference routing addresses this by routing simple requests to smaller, cost-effective models, reducing latency through cache-aware and utilization-aware strategies, and distributing requests across diverse cloud and edge resources for scalability. This dynamic approach targets high availability, cost efficiency, and low latency, which makes it well suited to real-time AI applications.
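As a rough illustration of routing simple requests to a smaller model, the sketch below uses a crude keyword and length heuristic to estimate complexity. Everything here is an assumption made for illustration: the model names, the hint list, the threshold, and the exact-match response cache, which merely stands in for the cache-aware strategies mentioned above. Real routers typically rely on trained classifiers rather than keyword matching.

```python
# Hypothetical phrases that suggest a request needs multi-step reasoning.
REASONING_HINTS = ("prove", "derive", "step by step", "explain why", "analyze")


def estimate_complexity(prompt):
    """Return a score in 0..1; long prompts or reasoning hints score higher."""
    text = prompt.lower()
    length_score = min(len(text.split()) / 200.0, 1.0)
    hint_score = 1.0 if any(hint in text for hint in REASONING_HINTS) else 0.0
    return max(length_score, hint_score)


class TieredRouter:
    """Send simple prompts to a small model and complex ones to a large model."""

    def __init__(self, small_model, large_model, threshold=0.5):
        self.small_model = small_model
        self.large_model = large_model
        self.threshold = threshold
        self._cache = {}  # normalized prompt -> previously returned response

    @staticmethod
    def _key(prompt):
        return prompt.strip().lower()

    def route(self, prompt):
        """Return ("cache", response) on a cache hit, else ("model", model_name)."""
        key = self._key(prompt)
        if key in self._cache:
            return ("cache", self._cache[key])
        if estimate_complexity(prompt) >= self.threshold:
            return ("model", self.large_model)
        return ("model", self.small_model)

    def store(self, prompt, response):
        """Remember a response so an identical prompt can skip inference."""
        self._cache[self._key(prompt)] = response
```

A caller would invoke `route()` first, run inference only when the result is tagged `"model"`, and then call `store()` with the generated response.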

Comparison With Traditional Load Balancers

Inference routing differs from typical load balancing because it makes AI-aware decisions driven by request context rather than simply distributing traffic based on request and network load. While a load balancer focuses on balancing server load using algorithms like round robin, an inference router evaluates request semantics, model capabilities, latency requirements, cost constraints, and even cache state to select the most suitable model for each request.


| | Inference routers | Load balancers |
|---|---|---|
| Traffic routing logic | May use metrics specific to inference requests like complexity, model capability, cost, latency, response quality, etc. | May use traffic and network-level metrics like pending requests, CPU load, connection count, etc. |
| Optimization goal | Accuracy, cost efficiency, quality, GPU resource utilization, etc. | General service goals like throughput, availability, latency, etc. |
| Request context awareness | May be aware of things like the type of request (reasoning vs non-reasoning request), GPU usage, CPU usage, etc., specific to AI inferencing. | Not typically aware of request context beyond high-level request-related metrics. |
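The contrast in the comparison above can be sketched in a few lines of Python. The backend names, model names, capability flags, and relative costs are all hypothetical; the point is only that round robin ignores the request, while an inference-aware policy inspects it before choosing.

```python
from itertools import cycle

BACKENDS = ["backend-a", "backend-b", "backend-c"]  # hypothetical servers


def make_round_robin(backends):
    """Classic load balancing: rotate through backends, ignoring the request."""
    pool = cycle(backends)
    return lambda _request: next(pool)


# Hypothetical model catalog with a capability flag and a relative cost.
MODEL_CAPABILITIES = {
    "small-chat-model": {"reasoning": False, "cost": 1},
    "large-reasoning-model": {"reasoning": True, "cost": 10},
}


def inference_route(request, models=MODEL_CAPABILITIES):
    """Pick the cheapest model that can handle the request's needs."""
    eligible = [
        name for name, caps in models.items()
        if caps["reasoning"] or not request["needs_reasoning"]
    ]
    return min(eligible, key=lambda name: models[name]["cost"])
```

Here `make_round_robin(BACKENDS)` returns the same rotation regardless of what is asked, whereas `inference_route({"needs_reasoning": True})` selects the reasoning-capable model and `inference_route({"needs_reasoning": False})` falls back to the cheaper one.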

 

Third-Party vs. Per-Cloud Inference Routers

Third-party inference routers and cloud provider native routers differ significantly in four key areas, namely model diversity, cost optimization, enterprise support, and service ecosystem. Cloud provider routers integrate seamlessly with their own platforms, offering native monitoring, auto scaling, and compliance, while third-party routers require custom integration but can operate across multiple cloud environments. Cloud providers typically offer robust SLAs and enterprise-grade support, whereas third-party routers vary in support quality and responsiveness.


| | Third-party inference routers | Cloud provider native inference routers |
|---|---|---|
| Example | vLLM router, OpenRouter | Azure inference router |
| Model diversity | Supports multiple cloud providers and open source models | Competing models may not be supported by the provider |
| Cost optimization | Can route across providers for best pricing | May be limited to optimizations within the provider’s pricing tiers |
| Enterprise support | Depends on the router and may vary | May be better compared to open source community support |
| Services ecosystem | May require custom integration with cloud services and APIs like storage, monitoring, etc. | May provide native integration tools |
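The cost optimization row can be illustrated with a small sketch of cross-provider price selection. The provider names, model names, and prices are invented for the example and do not reflect any real price list; restricting `allowed_providers` to a single entry approximates the provider-native case, where routing can only optimize within one provider.

```python
# Hypothetical prices in USD per 1M tokens, keyed by (provider, model).
PRICE_TABLE = {
    ("provider-a", "open-model-7b"): 0.20,
    ("provider-b", "open-model-7b"): 0.15,
    ("provider-b", "proprietary-model"): 2.00,
}


def cheapest_offer(model, prices, allowed_providers=None):
    """Return (provider, price) for the lowest-priced offer of a model.

    allowed_providers limits the search, e.g. to a single cloud provider.
    """
    offers = [
        (price, provider)
        for (provider, offered_model), price in prices.items()
        if offered_model == model
        and (allowed_providers is None or provider in allowed_providers)
    ]
    if not offers:
        raise LookupError(f"no provider offers {model!r}")
    price, provider = min(offers)
    return provider, price
```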


Example Usage for Better Understanding

The example below is based on the vllm-semantic-router installation steps.

Step 1: Start vLLM server for TinyLlama/TinyLlama-1.1B-Chat-v1.0.

Shell

# Install vLLM from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Serve the model on port 11434
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 11434 --served-model-name TinyLlama/TinyLlama-1.1B-Chat-v1.0

Shell

INFO 10-10 22:34:02 api_server.py:937] Starting vLLM API server on http://0.0.0.0:11434


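Step 2: Modify the following sections of config/config.yaml to point to the model in use. The endpoint port must match the port the vLLM server listens on: Step 1 starts the server with --port 11434, while this sample config and the router logs further down use 8000, so align the two values before running.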

YAML
 
vllm_endpoints:
  - name: "endpoint1"
    address: "127.0.0.1"  # IPv4 address - REQUIRED format
    port: 8000
    models:
      - "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    weight: 1

model_config:
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0":
    reasoning_family: "gpt-oss"  # This model uses GPT-OSS reasoning syntax
    preferred_endpoints: ["endpoint1"]
    pii_policy:
      allow_by_default: true

categories:    
  - name: math
    system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
    model_scores:
      - model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
        score: 1.0
        use_reasoning: true  # Enable reasoning for complex math
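To show how these config sections relate to each other, the Python sketch below mirrors the relevant keys as a dictionary and resolves a category to a model and then to an endpoint. This is an assumption about how such a config could be interpreted (highest score wins, then the first preferred endpoint that hosts the model); it is not the router's actual implementation, and the helper names are hypothetical.

```python
TINYLLAMA = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# The relevant parts of config/config.yaml, mirrored as a Python dict.
CONFIG = {
    "vllm_endpoints": [
        {"name": "endpoint1", "address": "127.0.0.1", "port": 8000,
         "models": [TINYLLAMA], "weight": 1},
    ],
    "model_config": {
        TINYLLAMA: {"preferred_endpoints": ["endpoint1"]},
    },
    "categories": [
        {"name": "math",
         "model_scores": [
             {"model": TINYLLAMA, "score": 1.0, "use_reasoning": True},
         ]},
    ],
}


def select_model(config, category):
    """Return (model, use_reasoning) for the highest-scored model in a category."""
    for entry in config["categories"]:
        if entry["name"] == category:
            best = max(entry["model_scores"], key=lambda item: item["score"])
            return best["model"], best["use_reasoning"]
    raise KeyError(f"unknown category: {category}")


def resolve_endpoint(config, model):
    """Return "address:port" of a preferred endpoint that hosts the model."""
    preferred = config["model_config"][model]["preferred_endpoints"]
    for endpoint in config["vllm_endpoints"]:
        if endpoint["name"] in preferred and model in endpoint["models"]:
            return f'{endpoint["address"]}:{endpoint["port"]}'
    raise LookupError(f"no endpoint hosts {model}")
```

With only one model listed per category the choice is trivial, but adding further `model_scores` entries would let different categories prefer different models.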


Step 3: Start vllm-semantic-router.

Shell
 
make run-envoy

[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:932] runtime: {}
[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:765] Starting admin HTTP server at 127.0.0.1:19000

Shell
 
make run-router

{"level":"info","ts":"2025-10-10T21:37:26.501979783Z","caller":"observability/logging.go:140","msg":"Starting vLLM Semantic Router ExtProc with config: config/config.yaml"}
{"level":"info","ts":"2025-10-10T21:37:26.502091557Z","caller":"observability/logging.go:140","msg":"Starting insecure LLM Router ExtProc server on port 50051..."}
{"level":"info","ts":"2025-10-10T21:37:26.502190752Z","caller":"observability/logging.go:140","msg":"Starting Classification API server on port 8080"}
{"level":"info","ts":"2025-10-10T21:37:26.50408922Z","caller":"observability/logging.go:140","msg":"Found global classification service on attempt 1/5"}
{"level":"info","ts":"2025-10-10T21:37:26.504332195Z","caller":"observability/logging.go:140","msg":"System prompt configuration endpoints enabled"}
{"level":"info","ts":"2025-10-10T21:37:26.504360331Z","caller":"observability/logging.go:140","msg":"Classification API server listening on port 8080"}


Step 4: Issue a simple request against the vLLM semantic router.

Shell
 
curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of x^2?"}
    ]
  }'
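The same request can be issued from Python using only the standard library. This is a sketch that assumes the router is reachable at the address used in the curl example and that the response follows the OpenAI-style chat completions shape; the function names are hypothetical.

```python
import json
import urllib.request

# Same address as the curl example above.
ROUTER_URL = "http://localhost:8801/v1/chat/completions"


def build_chat_request(prompt, url=ROUTER_URL):
    """Build a POST request that lets the router choose the model ("auto")."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body with choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the router running, `send(build_chat_request("What is the derivative of x^2?"))` would return the model's answer as a string.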


The section below shows the logs emitted by the vLLM Semantic Router, illustrating how the logic works from receiving a request to routing it to the correct endpoint that hosts the model.

Shell
 
{"level":"info","ts":"2025-10-10T22:49:30.84348937Z","caller":"observability/logging.go:140","msg":"Started processing a new request"}
{"level":"info","ts":"2025-10-10T22:49:30.844178952Z","caller":"observability/logging.go:140","msg":"Received request headers"}
{"level":"info","ts":"2025-10-10T22:49:30.844804801Z","caller":"observability/logging.go:140","msg":"Received request body {\n    \"model\": \"auto\",\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"What is the derivative of x^2?\"}\n    ]\n  }"}
...
{"level":"info","ts":"2025-10-10T22:49:31.801283439Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.801302404Z","caller":"observability/logging.go:140","msg":"Selected model TinyLlama/TinyLlama-1.1B-Chat-v1.0 for category math with score 1.0000"}
{"level":"info","ts":"2025-10-10T22:49:31.971759021Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004"}
{"level":"info","ts":"2025-10-10T22:49:31.971831455Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.971877997Z","caller":"observability/logging.go:140","msg":"Routing to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.115533183Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004, entropy_available=true"}
{"level":"info","ts":"2025-10-10T22:49:32.115686518Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math), reasoning_decision: use=true, confidence=0.630, reason=low_uncertainty_trust_classification"}
{"level":"info","ts":"2025-10-10T22:49:32.115719018Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision: category='math', confidence=0.700, use_reasoning=true, reason=low_uncertainty_trust_classification, strategy=trust_top_category"}
{"level":"info","ts":"2025-10-10T22:49:32.115776879Z","caller":"observability/logging.go:140","msg":"Top predicted categories: [{math 0.7003756} {physics 0.10189054} {chemistry 0.09165735}]"}
{"level":"info","ts":"2025-10-10T22:49:32.115788663Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision for this query: true on [TinyLlama/TinyLlama-1.1B-Chat-v1.0] model (confidence: 0.630, reason: low_uncertainty_trust_classification)"}
{"level":"info","ts":"2025-10-10T22:49:32.115832364Z","caller":"observability/logging.go:140","msg":"Selected endpoint address: 127.0.0.1:8000 for model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116305725Z","caller":"observability/logging.go:140","msg":"Applied reasoning mode (enabled: true) with effort (high) to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116349513Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt to the beginning of messages (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116364889Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt for category: math (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116376391Z","caller":"observability/logging.go:140","msg":"System prompt injection completed for category: math, body size: 320 bytes"}
{"level":"info","ts":"2025-10-10T22:49:32.116383792Z","caller":"observability/logging.go:140","msg":"Use new model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116406184Z","caller":"observability/logging.go:136","msg":"routing_decision","event":"routing_decision","reason_code":"auto_routing","request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827","reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000","routing_latency_ms":1271,"original_model":"auto","selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","category":"math","reasoning_effort":"high"}
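Because the router emits structured JSON logs, the final routing_decision event is easy to extract programmatically. The sketch below filters log lines for that event and keeps a handful of fields; the field names come from the log line shown above, while the helper name and the choice of fields are assumptions made for illustration.

```python
import json

# Fields of interest, as they appear in the routing_decision log line.
FIELDS = (
    "request_id",
    "category",
    "selected_model",
    "selected_endpoint",
    "reasoning_enabled",
    "reasoning_effort",
    "routing_latency_ms",
)


def routing_decisions(lines):
    """Yield a summary dict for each routing_decision event in the log lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip elided or non-JSON lines such as "..."
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if record.get("event") == "routing_decision":
            yield {field: record.get(field) for field in FIELDS}


# The routing_decision line from the log output above.
SAMPLE = '{"level":"info","ts":"2025-10-10T22:49:32.116406184Z","caller":"observability/logging.go:136","msg":"routing_decision","event":"routing_decision","reason_code":"auto_routing","request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827","reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000","routing_latency_ms":1271,"original_model":"auto","selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","category":"math","reasoning_effort":"high"}'
```

Feeding the router's log output through `routing_decisions()` gives one record per routed request, which is a convenient starting point for tracking routing latency or the distribution of categories over time.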


Conclusion

In conclusion, inference routing is a critical component of modern AI systems, enabling intelligent routing of requests to the most suitable models. Unlike traditional load balancers, it can optimize for model cost, latency, and accuracy. Cloud-native routers offer seamless platform integration, while third-party routers offer multi-cloud flexibility and model diversity. As enterprises adopt multi-model architectures, inference routing will continue to play a pivotal role in maximizing resource utilization and delivering high-quality user experiences.

References

  1. https://github.com/LLM-inference-router/vllm-router
  2. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-kubernetes-inference-routing-azureml-fe?view=azureml-api-2
  3. https://openrouter.ai/
  4. https://ai.azure.com/doc/azure/ai-foundry/concepts/model-benchmarks?tid=e74feb03-7237-44ac-88d8-fe3d3b8e4d9b
  5. https://blog.vllm.ai/2025/09/11/semantic-router.html