Beyond Traditional Load Balancers: The Role of Inference Routers in AI Systems

Inference routing is the process of routing AI inference requests to the most suitable model based on cost, latency, quality, etc.

Bhala Ranganathan

Oct. 13, 2025 · Analysis

Inference routing is the process of routing AI inference requests to the most suitable model based on cost, latency, quality, and similar criteria. Unlike the simple round-robin routing found in traditional load balancers, an inference router's decision-making layer considers factors such as request complexity, cost constraints, and GPU resource availability. It aims to have each request served by the best-fitting model, improving efficiency and performance in multi-model environments. Examples of inference routers include the vLLM router, the Azure inference router, and OpenRouter.

Selecting the Correct Model for the Current Use Case

Selecting the correct model for a use case involves benchmarking and evaluating models against well-defined criteria, as illustrated in Azure AI Foundry’s model benchmarks approach. This process starts by identifying the request type, such as text generation, summarization, or reasoning, and then comparing candidate models on metrics like accuracy, latency, throughput, and cost. Benchmarks provide standardized tests that simulate real-world use cases, enabling developers to assess trade-offs between performance and efficiency.

For example, a model with high accuracy but slower response time may suit complex reasoning requests, while a faster, lower-cost model might be ideal for simple queries. Incorporating these benchmark insights ensures that model selection aligns with business priorities, delivering reliable and optimized outcomes for the current use case. Below is a model leaderboard example as shown in the Azure AI Foundry portal, based on their scoring methodology, intended to guide model selection.

Source: Azure AI Foundry model catalog model leaderboards
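
The trade-off described above can be sketched as a weighted score over benchmark metrics. This is a minimal illustration, not Azure AI Foundry's scoring methodology: the model names, metric values, and weights are all hypothetical, and the min-max normalization is just one simple way to put metrics on a common scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical benchmark results for one candidate model."""
    name: str
    accuracy: float      # fraction of benchmark items answered correctly, 0..1
    latency_s: float     # mean seconds per response (lower is better)
    cost_per_1k: float   # USD per 1,000 tokens (lower is better)


def _normalize(values, higher_is_better):
    """Scale values to 0..1 so that 1 is always the best candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_models(candidates, weights):
    """Return (name, score) pairs sorted from best to worst weighted score."""
    acc = _normalize([c.accuracy for c in candidates], True)
    lat = _normalize([c.latency_s for c in candidates], False)
    cost = _normalize([c.cost_per_1k for c in candidates], False)
    scored = [
        (c.name, weights["accuracy"] * a + weights["latency"] * l + weights["cost"] * k)
        for c, a, l, k in zip(candidates, acc, lat, cost)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates: one accurate but slow, one fast and cheap.
candidates = [
    Benchmark("large-reasoner", accuracy=0.90, latency_s=4.0, cost_per_1k=0.030),
    Benchmark("small-fast", accuracy=0.70, latency_s=0.5, cost_per_1k=0.002),
]

# Weights encode business priorities per request type (assumed values).
reasoning_weights = {"accuracy": 0.8, "latency": 0.1, "cost": 0.1}
simple_query_weights = {"accuracy": 0.2, "latency": 0.4, "cost": 0.4}

best_for_reasoning = rank_models(candidates, reasoning_weights)[0][0]
best_for_simple = rank_models(candidates, simple_query_weights)[0][0]
```

With these assumed weights, the accuracy-heavy profile ranks the larger model first and the latency- and cost-heavy profile ranks the smaller one first, which mirrors the reasoning-versus-simple-query example in the text.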

Importance of Inference Routing

Because model selection is this involved, inference routing becomes critical: modern AI systems often involve multiple models with varying capabilities and costs. Without routing, all requests might go to a single large model, leading to unnecessary cost and latency.

Intelligent inference routing optimizes cost and performance by sending simple requests to smaller, cost-effective models, reducing latency through cache-aware and utilization-aware strategies, and distributing requests across diverse cloud and edge resources for scalability. This dynamic approach supports high availability, cost efficiency, and low latency, which makes it well suited to real-time AI applications.
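
To make "cache-aware and utilization-aware" more concrete, here is a minimal sketch of replica selection. It assumes a hypothetical `Replica` record that tracks which prompt prefixes are already cached, and it matches prefixes character by character for simplicity; production routers typically work on token blocks and richer load signals.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical model replica with a list of cached prompt prefixes."""
    name: str
    utilization: float                       # 0.0 (idle) .. 1.0 (saturated)
    cached_prefixes: list = field(default_factory=list)


def cached_prefix_len(prompt, replica):
    """Length of the longest cached prefix that the prompt starts with."""
    return max(
        (len(p) for p in replica.cached_prefixes if prompt.startswith(p)),
        default=0,
    )


def pick_replica(prompt, replicas, max_utilization=0.9):
    """Prefer the longest cache hit; break ties with the lowest utilization."""
    eligible = [r for r in replicas if r.utilization <= max_utilization]
    if not eligible:
        eligible = replicas  # every replica is busy: fall back to all of them
    return max(
        eligible,
        key=lambda r: (cached_prefix_len(prompt, r), -r.utilization),
    )


replicas = [
    Replica("gpu-a", 0.30, ["You are a helpful assistant."]),
    Replica("gpu-b", 0.10),
    Replica("gpu-c", 0.95, ["You are a helpful assistant. Answer briefly."]),
]

# gpu-c holds the longest matching prefix but is over the utilization cap,
# so the request goes to gpu-a, which still has a shorter cache hit.
cache_hit = pick_replica(
    "You are a helpful assistant. Answer briefly. What is 2 + 2?", replicas
)

# No replica has a matching prefix, so the least-utilized one wins.
cache_miss = pick_replica("Translate this sentence.", replicas)
```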

Comparison With Traditional Load Balancers

Inference routing differs from typical load balancing because it makes AI-aware decisions driven by request context rather than simply distributing traffic based on request and network load. While a load balancer focuses on balancing server load using algorithms like round robin, an inference router evaluates request semantics, model capabilities, latency requirements, cost constraints, and even cache state to select the most suitable model for each request.

	Inference routers	Load balancers
Traffic routing logic	May use metrics specific to inference requests like complexity, model capability, cost, latency, response quality, etc.	May use traffic and network-level metrics like pending requests, CPU load, connection count, etc.
Optimization goal	Accuracy, cost efficiency, quality, GPU resource utilization, etc.	General service goals like throughput, availability, latency, etc.
Request context awareness	May be aware of things like the type of request (reasoning vs non-reasoning request), GPU usage, CPU usage, etc., specific to AI inferencing.	Not typically aware of request context beyond high-level request-related metrics.
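
The difference in the table can be illustrated with a toy comparison. The backend names and the keyword list are hypothetical, and the keyword check stands in for the classifier a real inference router would use.

```python
from itertools import cycle

BACKENDS = ["small-model", "large-model"]  # hypothetical model names

_round_robin = cycle(BACKENDS)


def load_balancer_pick(_request_text):
    """Round robin: ignores the request and alternates between backends."""
    return next(_round_robin)


# Toy stand-in for a request classifier (assumed keywords).
REASONING_HINTS = ("prove", "derive", "step by step")


def inference_router_pick(request_text):
    """Content-aware rule: reasoning-style prompts go to the larger model."""
    text = request_text.lower()
    needs_reasoning = any(hint in text for hint in REASONING_HINTS)
    return "large-model" if needs_reasoning else "small-model"


requests = [
    "What is the capital of France?",
    "Say hello in Spanish.",
    "Prove that sqrt(2) is irrational.",
]

lb_choices = [load_balancer_pick(r) for r in requests]
router_choices = [inference_router_pick(r) for r in requests]
```

In this sketch the round-robin picker sends the second simple question to the large model and the proof request to the small one, purely because of arrival order, while the content-aware picker reserves the large model for the proof.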

Third-Party vs. Per-Cloud Inference Routers

Third-party inference routers and cloud provider native routers differ significantly in four key areas, namely model diversity, cost optimization, enterprise support, and service ecosystem. Cloud provider routers integrate seamlessly with their own platforms, offering native monitoring, auto scaling, and compliance, while third-party routers require custom integration but can operate across multiple cloud environments. Cloud providers typically offer robust SLAs and enterprise-grade support, whereas third-party routers vary in support quality and responsiveness.

	Third-party inference routers	Cloud provider native inference routers
Example	vLLM router, OpenRouter	Azure inference router
Model diversity	Supports multiple cloud providers and open source models	Competing models may not be supported by the provider
Cost optimization	Can route across providers for best pricing	May be limited to optimizations within the provider’s pricing tiers
Enterprise support	Varies in quality and responsiveness	May be stronger than open source community support, often backed by provider SLAs
Services ecosystem	May require custom integration with cloud services and APIs such as storage and monitoring	May provide native integration tools
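
The "route across providers for best pricing" row can be sketched as a simple cost lookup. The provider names, model names, and per-token prices below are invented for illustration and do not reflect any real price list.

```python
def cheapest_provider(model, prompt_tokens, completion_tokens, price_table):
    """Return (provider, estimated_cost_usd) for the cheapest provider serving `model`."""
    offers = []
    for provider, models in price_table.items():
        if model not in models:
            continue  # this provider does not host the requested model
        price_in, price_out = models[model]  # USD per 1M input / output tokens
        cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000
        offers.append((provider, cost))
    if not offers:
        raise LookupError(f"no provider serves {model!r}")
    return min(offers, key=lambda offer: offer[1])


# Hypothetical price table: (input price, output price) in USD per 1M tokens.
PRICES = {
    "provider-a": {"open-model-8b": (0.20, 0.60)},
    "provider-b": {"open-model-8b": (0.15, 0.80), "open-model-70b": (0.90, 2.70)},
}

# Short completion: provider-b's cheaper input price wins.
short_provider, short_cost = cheapest_provider("open-model-8b", 1000, 200, PRICES)

# Long completion: provider-a's cheaper output price wins.
long_provider, long_cost = cheapest_provider("open-model-8b", 1000, 1000, PRICES)
```

The cheapest provider depends on the shape of the request, which is one reason a router that sees each request can do better than a fixed provider choice.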

Example Usage for Better Understanding

The example below is based on the vllm-semantic-router installation steps.

Step 1: Start vLLM server for TinyLlama/TinyLlama-1.1B-Chat-v1.0.

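    Shell

# Install vLLM from source (editable install)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Serve the model through vLLM's OpenAI-compatible API server
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --port 11434 --served-model-name TinyLlama/TinyLlama-1.1B-Chat-v1.0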

    Shell

INFO 10-10 22:34:02 api_server.py:937] Starting vLLM API server on http://0.0.0.0:11434

Step 2: Modify the highlighted sections below in config/config.yaml to point to the model in use.

    YAML

vllm_endpoints:
  - name: "endpoint1"
    address: "127.0.0.1"  # IPv4 address - REQUIRED format
    port: 8000  # must match the --port the vLLM server listens on (Step 1)
    models:
      - "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    weight: 1

model_config:
  "TinyLlama/TinyLlama-1.1B-Chat-v1.0":
    reasoning_family: "gpt-oss"  # This model uses GPT-OSS reasoning syntax
    preferred_endpoints: ["endpoint1"]
    pii_policy:
      allow_by_default: true

categories:    
  - name: math
    system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
    model_scores:
      - model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
        score: 1.0
        use_reasoning: true  # Enable reasoning for complex math
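
How the `model_scores` entries might drive model choice can be sketched in a few lines of Python. This is an assumption based on the "Selected model ... with score" log line later in this article, not the router's actual implementation; the second model entry is hypothetical and only there to make the choice non-trivial.

```python
# Mirrors the `categories` section of config/config.yaml as a plain Python structure.
# "example/other-model" is a hypothetical second candidate, not part of the config.
CATEGORIES = [
    {
        "name": "math",
        "model_scores": [
            {"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "score": 1.0, "use_reasoning": True},
            {"model": "example/other-model", "score": 0.4, "use_reasoning": False},
        ],
    },
]


def select_model(category_name, categories, default_model=None):
    """Return (model, use_reasoning) for the best-scored model of a category."""
    for category in categories:
        if category["name"] == category_name:
            best = max(category["model_scores"], key=lambda entry: entry["score"])
            return best["model"], best.get("use_reasoning", False)
    # Unknown category: fall back to the default model without reasoning.
    return default_model, False


math_choice = select_model("math", CATEGORIES)
unknown_choice = select_model("history", CATEGORIES, default_model="fallback-model")
```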
  

Step 3: Start vllm-semantic-router.

    Shell

make run-envoy

[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:932] runtime: {}
[2025-10-10 21:36:51.564][1673981][info][main] [source/server/server.cc:765] Starting admin HTTP server at 127.0.0.1:19000

    Shell

make run-router

{"level":"info","ts":"2025-10-10T21:37:26.501979783Z","caller":"observability/logging.go:140","msg":"Starting vLLM Semantic Router ExtProc with config: config/config.yaml"}
{"level":"info","ts":"2025-10-10T21:37:26.502091557Z","caller":"observability/logging.go:140","msg":"Starting insecure LLM Router ExtProc server on port 50051..."}
{"level":"info","ts":"2025-10-10T21:37:26.502190752Z","caller":"observability/logging.go:140","msg":"Starting Classification API server on port 8080"}
{"level":"info","ts":"2025-10-10T21:37:26.50408922Z","caller":"observability/logging.go:140","msg":"Found global classification service on attempt 1/5"}
{"level":"info","ts":"2025-10-10T21:37:26.504332195Z","caller":"observability/logging.go:140","msg":"System prompt configuration endpoints enabled"}
{"level":"info","ts":"2025-10-10T21:37:26.504360331Z","caller":"observability/logging.go:140","msg":"Classification API server listening on port 8080"}
  

Step 4: Issue a simple request against the vLLM semantic router.

    Shell

curl -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of x^2?"}
    ]
  }'
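
The same request can be sent from Python using only the standard library. This is a minimal sketch that reuses the URL and JSON body of the curl command; the helper names `build_request` and `ask` are illustrative, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request

# Same endpoint as the curl call above.
ROUTER_URL = "http://localhost:8801/v1/chat/completions"


def build_request(prompt, url=ROUTER_URL):
    """Build the POST request; model "auto" leaves the model choice to the router."""
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as response:
        return json.load(response)


# Usage (requires Envoy and the router from Step 3 to be running):
#   print(json.dumps(ask("What is the derivative of x^2?"), indent=2))
```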
  

The section below shows the logs emitted by the vLLM Semantic Router, illustrating how the logic works from receiving a request to routing it to the correct endpoint that hosts the model.

    Shell

{"level":"info","ts":"2025-10-10T22:49:30.84348937Z","caller":"observability/logging.go:140","msg":"Started processing a new request"}
{"level":"info","ts":"2025-10-10T22:49:30.844178952Z","caller":"observability/logging.go:140","msg":"Received request headers"}
{"level":"info","ts":"2025-10-10T22:49:30.844804801Z","caller":"observability/logging.go:140","msg":"Received request body {\n    \"model\": \"auto\",\n    \"messages\": [\n      {\"role\": \"user\", \"content\": \"What is the derivative of x^2?\"}\n    ]\n  }"}
...
{"level":"info","ts":"2025-10-10T22:49:31.801283439Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.801302404Z","caller":"observability/logging.go:140","msg":"Selected model TinyLlama/TinyLlama-1.1B-Chat-v1.0 for category math with score 1.0000"}
{"level":"info","ts":"2025-10-10T22:49:31.971759021Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004"}
{"level":"info","ts":"2025-10-10T22:49:31.971831455Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math)"}
{"level":"info","ts":"2025-10-10T22:49:31.971877997Z","caller":"observability/logging.go:140","msg":"Routing to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.115533183Z","caller":"observability/logging.go:140","msg":"Classification result: class=9, confidence=0.7004, entropy_available=true"}
{"level":"info","ts":"2025-10-10T22:49:32.115686518Z","caller":"observability/logging.go:140","msg":"Classified as category: math (mmlu=math), reasoning_decision: use=true, confidence=0.630, reason=low_uncertainty_trust_classification"}
{"level":"info","ts":"2025-10-10T22:49:32.115719018Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision: category='math', confidence=0.700, use_reasoning=true, reason=low_uncertainty_trust_classification, strategy=trust_top_category"}
{"level":"info","ts":"2025-10-10T22:49:32.115776879Z","caller":"observability/logging.go:140","msg":"Top predicted categories: [{math 0.7003756} {physics 0.10189054} {chemistry 0.09165735}]"}
{"level":"info","ts":"2025-10-10T22:49:32.115788663Z","caller":"observability/logging.go:140","msg":"Entropy-based reasoning decision for this query: true on [TinyLlama/TinyLlama-1.1B-Chat-v1.0] model (confidence: 0.630, reason: low_uncertainty_trust_classification)"}
{"level":"info","ts":"2025-10-10T22:49:32.115832364Z","caller":"observability/logging.go:140","msg":"Selected endpoint address: 127.0.0.1:8000 for model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116305725Z","caller":"observability/logging.go:140","msg":"Applied reasoning mode (enabled: true) with effort (high) to model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116349513Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt to the beginning of messages (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116364889Z","caller":"observability/logging.go:140","msg":"Added category-specific system prompt for category: math (mode: replace)"}
{"level":"info","ts":"2025-10-10T22:49:32.116376391Z","caller":"observability/logging.go:140","msg":"System prompt injection completed for category: math, body size: 320 bytes"}
{"level":"info","ts":"2025-10-10T22:49:32.116383792Z","caller":"observability/logging.go:140","msg":"Use new model: TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
{"level":"info","ts":"2025-10-10T22:49:32.116406184Z","caller":"observability/logging.go:136","msg":"routing_decision","event":"routing_decision","reason_code":"auto_routing","request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827","reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000","routing_latency_ms":1271,"original_model":"auto","selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","category":"math","reasoning_effort":"high"}
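
Because the router emits structured JSON logs, the final `routing_decision` event is easy to extract programmatically. The sketch below parses that entry, copied from the log above, into a compact summary; the helper name `routing_decisions` and the choice of summary fields are illustrative.

```python
import json

# The routing_decision entry from the log above, split across lines for readability.
LOG_LINE = (
    '{"level":"info","ts":"2025-10-10T22:49:32.116406184Z",'
    '"caller":"observability/logging.go:136","msg":"routing_decision",'
    '"event":"routing_decision","reason_code":"auto_routing",'
    '"request_id":"a7d7c8c7-bba4-4374-b997-d9b6c6d1a827",'
    '"reasoning_enabled":true,"selected_endpoint":"127.0.0.1:8000",'
    '"routing_latency_ms":1271,"original_model":"auto",'
    '"selected_model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0",'
    '"category":"math","reasoning_effort":"high"}'
)


def routing_decisions(lines):
    """Yield a compact summary for every routing_decision event in JSON log lines."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as "..." or Envoy output
        if not isinstance(entry, dict) or entry.get("event") != "routing_decision":
            continue
        yield {
            "category": entry.get("category"),
            "model": entry.get("selected_model"),
            "endpoint": entry.get("selected_endpoint"),
            "reasoning": entry.get("reasoning_enabled"),
            "latency_ms": entry.get("routing_latency_ms"),
        }


summary = list(routing_decisions(["...", LOG_LINE]))
```

For this entry, the summary shows that the request was classified as math, routed to TinyLlama at 127.0.0.1:8000 with reasoning enabled, and that the routing decision took 1271 ms.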
  

Conclusion

In conclusion, inference routing is a critical component of modern AI systems, enabling intelligent routing of requests to the most suitable models. Unlike traditional load balancers, it can optimize for the cost, latency, and accuracy of models. Cloud-native solutions offer seamless integration, while third-party routers offer multi-cloud flexibility and model diversity. Either way, as enterprises adopt multi-model architectures, inference routing will continue to play a pivotal role in maximizing resource utilization and delivering high-quality user experiences.

Opinions expressed by DZone contributors are their own.