Cost-Aware GenAI Architecture: Caching, Model Routing, and Token Budgets That Don’t Explode

Keep GenAI cheap and fast: cache aggressively, route models by confidence, cap tokens and tools, compress context, and monitor cost per successful outcome.

By Mohan Sankaran · Jan. 27, 26 · Analysis

Shipping GenAI is easy. Shipping it without a surprise bill, latency spikes, and “why did it call the big model for that?” incidents is the hard part.

This article presents a practical architecture pattern that treats cost control as a first-class system requirement, built around three levers:

  1. Caching (don’t pay twice)
  2. Model routing (use the cheapest model that meets quality)
  3. Token budgets (keep context and outputs bounded, always)

The Core Problem: “GenAI Cost” Is Mostly Architecture, Not Pricing

Teams often treat token costs as an API detail. In production, cost comes from:

  • Re-sending context (same policy text, same user history, same retrieved chunks)
  • Overusing high-tier models (because routing doesn’t exist)
  • Unbounded retrieval (stuffing prompts with irrelevant chunks)
  • Retries and tool loops (agents calling tools repeatedly)
  • Verbose outputs (“explain everything” becomes 2k tokens)

If your architecture doesn’t bound these, cost will keep drifting upward as usage grows.

Reference Architecture (High Level)

Request Flow

  1. Policy Gate: validate feature, user tier, risk level
  2. Budget Manager: allocates token and cost budgets
  3. Cache Layer: response, semantic, and retrieval caches
  4. Context Builder: composes minimal, budgeted context
  5. Model Router: selects model tier and parameters
  6. LLM Call: streaming and stop rules
  7. Post-Processor: validate output schema, store summaries
  8. Telemetry: cost, quality, cache, and fallback reasons

Think of it as “FinOps meets SRE meets prompt engineering.”


1) Caching: Stop Paying for the Same Intelligence Twice

Cache #1: Response Cache (Exact Match)

Best for

  • Deterministic prompts
  • Stable system instructions
  • Frequent repeat questions (“How do I reset my password?”)

Key = hash of

  • System prompt version
  • User prompt (normalized)
  • Feature flag set
  • Locale and app version (optional)

TTL: minutes to hours (depends on freshness needs)

Rule: The more stable your prompts are, the higher your cache hit rate will be.
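
As a minimal sketch of the key construction described above, the snippet below hashes the listed fields with SHA-256. The names `CacheKeyInput`, `normalizePrompt`, and `responseCacheKey` are hypothetical, and the normalization shown (trim, lowercase, collapse whitespace) is an assumption; it would be wrong for prompts where case matters.

```kotlin
import java.security.MessageDigest

// Hypothetical key inputs; field names are illustrative, not from a real library.
data class CacheKeyInput(
  val systemPromptVersion: String,
  val userPrompt: String,
  val featureFlags: Set<String>,
  val locale: String? = null,      // optional
  val appVersion: String? = null   // optional
)

// Assumed normalization: trim, lowercase, and collapse runs of whitespace
// so trivially different prompts map to the same key.
fun normalizePrompt(raw: String): String =
  raw.trim().lowercase().replace(Regex("\\s+"), " ")

fun responseCacheKey(input: CacheKeyInput): String {
  val canonical = listOf(
    input.systemPromptVersion,
    normalizePrompt(input.userPrompt),
    input.featureFlags.sorted().joinToString(","),  // sorted so flag order never matters
    input.locale.orEmpty(),
    input.appVersion.orEmpty()
  ).joinToString("\u0000")  // NUL separator keeps adjacent fields from running together
  val digest = MessageDigest.getInstance("SHA-256")
    .digest(canonical.toByteArray(Charsets.UTF_8))
  return digest.joinToString("") { "%02x".format(it) }  // 64-character hex string
}
```

Because the system prompt version is part of the key, bumping that version invalidates old entries without an explicit purge.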

Cache #2: Semantic Cache (Near Match)

Exact-match misses are common because users rephrase. Semantic caching uses embeddings:

Key = embedding(user intent + feature + locale)

Look up nearest neighbors; if similarity > threshold, reuse or adjust the response.

When it’s safe

  • Non-sensitive content
  • Non-transactional responses
  • “Advice” or “explanation” output

When it’s risky

  • Anything that must be factually grounded in fresh data
  • Personalized or regulated content unless you segment by user and policy

Practical approach

  • Use the semantic cache as a candidate response
  • Run a cheap verifier (small model) to confirm it still applies
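
The near-match lookup can be sketched as a cosine-similarity scan over stored embeddings. This is an illustration under assumptions: `SemanticEntry` and `lookupCandidate` are hypothetical names, the linear scan stands in for a real vector index, and the threshold has to be tuned per feature. The result is only a candidate; the verifier step is not shown.

```kotlin
import kotlin.math.sqrt

// Hypothetical cache entry: an embedding plus the response it produced.
class SemanticEntry(val embedding: FloatArray, val response: String)

fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
  require(a.size == b.size) { "embeddings must have the same dimension" }
  var dot = 0.0
  var normA = 0.0
  var normB = 0.0
  for (i in a.indices) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA == 0.0 || normB == 0.0) return 0.0  // zero vectors match nothing
  return dot / (sqrt(normA) * sqrt(normB))
}

// Linear scan for the nearest neighbor; returns it only if similarity > threshold.
fun lookupCandidate(
  query: FloatArray,
  entries: List<SemanticEntry>,
  threshold: Double
): SemanticEntry? {
  val best = entries.maxByOrNull { cosineSimilarity(query, it.embedding) } ?: return null
  return if (cosineSimilarity(query, best.embedding) > threshold) best else null
}
```

For personalized or regulated content, one option is to keep a separate entry list per user and policy segment, so a lookup never crosses segments.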

Cache #3: Retrieval Cache (RAG)

Embedding and retrieval costs hide in the shadows. Cache:

  • Query → top-k document IDs
  • Document ID → chunk text and metadata
  • Chunk → embedding

This reduces repeated embedding calls and repeated vector DB hits.
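
One way to hold these mappings in memory is a small LRU cache. The sketch below builds on `java.util.LinkedHashMap` in access-order mode; the `LruCache` class here is a hypothetical helper written for this example (not the Android class of the same name), it is not thread-safe, and it has no TTL.

```kotlin
import java.util.LinkedHashMap

// Small LRU cache: the least recently accessed entry is evicted once capacity is exceeded.
// Not thread-safe; wrap calls in a lock if shared across threads.
class LruCache<K, V>(private val capacity: Int) {
  private val map = object : LinkedHashMap<K, V>(16, 0.75f, true) {  // true = access order
    override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>?): Boolean =
      size > capacity
  }

  fun get(key: K): V? = map[key]

  fun put(key: K, value: V) { map[key] = value }

  // Returns the cached value, or runs the loader (for example a vector DB query) and stores the result.
  fun getOrLoad(key: K, loader: (K) -> V): V =
    map[key] ?: loader(key).also { map[key] = it }
}
```

In use, you would keep one instance per mapping, for example an `LruCache<String, List<String>>` for query to top-k document IDs and another for document ID to chunk text.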

Android note: A local vector cache (Room + approximate search or lightweight index) can store:

  • Frequent policy chunks
  • Product FAQs
  • Recent user-visible help content

2) Model Routing: The Cheapest Model That Meets Quality

The Routing Principle

Default to small. Escalate only when signals justify it.

If you don’t encode this rule in a router, your system will drift toward “always large.”

A Simple 3-Tier Routing Policy

  • Small model: classification, rewriting, summarization, extraction, tool-argument shaping
  • Medium model: normal Q&A and grounded responses with limited context
  • Large model: complex reasoning, multi-step synthesis, ambiguous tasks, high-stakes language

Signals That Should Trigger Escalation

  • Low confidence from the small model (self-rated or classifier-based)
  • High ambiguity (multiple intents detected)
  • High complexity (multi-step reasoning or long synthesis)
  • User-visible failure (retry after dissatisfaction)
  • Safety-sensitive content (sometimes you want better, sometimes stricter guardrails — decide explicitly)

Signals That Should Prevent Escalation

  • Feature is in cost-saving mode
  • User exceeded daily budget
  • Request is non-critical
  • Retrieval grounding already provides a direct answer
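
The two signal lists can be combined into one decision function. The sketch below assumes prevention signals always win over escalation signals, which is a policy choice rather than something the lists above dictate. `Tier`, `RoutingSignals`, and `chooseTier` are hypothetical names, and the scoring rule is illustrative.

```kotlin
enum class Tier { SMALL, MEDIUM, LARGE }

// Hypothetical typed signals; the fields mirror a subset of the bullets above.
data class RoutingSignals(
  val confidence: Double,                    // 0.0..1.0, self-rated or classifier-based
  val intentCount: Int,                      // more than one intent means ambiguity
  val isRetryAfterDissatisfaction: Boolean,
  val costSavingMode: Boolean,
  val userOverDailyBudget: Boolean,
  val nonCritical: Boolean,
  val groundedDirectAnswer: Boolean
)

fun chooseTier(s: RoutingSignals): Tier {
  // Assumption: any prevention signal blocks escalation outright.
  val escalationBlocked =
    s.costSavingMode || s.userOverDailyBudget || s.nonCritical || s.groundedDirectAnswer
  if (escalationBlocked) return Tier.SMALL

  // Count how many escalation signals fired; the 0.45 cutoff is illustrative.
  val escalationScore = listOf(
    s.confidence < 0.45,
    s.intentCount > 1,
    s.isRetryAfterDissatisfaction
  ).count { it }

  return when {
    escalationScore >= 2 -> Tier.LARGE
    escalationScore == 1 -> Tier.MEDIUM
    else -> Tier.SMALL
  }
}
```

Keeping the signals in a typed class rather than a free-form map makes each rule easy to unit test in isolation.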

Router Output Should Be Typed, Not “Prompt-y”

Example routing decision object:

JSON
 
{
  "model_tier": "large",
  "max_output_tokens": 350,
  "temperature": 0.2,
  "reason": "low_confidence_small + multi_doc_synthesis",
  "fallback_plan": ["medium", "small_template"]
}


This makes routing auditable and testable.

3) Token Budgets: Treat Tokens Like CPU and Memory

Budgeting Is Not Just “Set max_tokens”

You need three budgets:

  1. Context budget (input tokens)
  2. Output budget (output tokens)
  3. Tool budget (calls/iterations)

And they should exist at multiple scopes:

  • Per request
  • Per session
  • Per user per day
  • Per feature per day
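
A rough sketch of enforcing one token budget at several scopes at once is shown below. `Scope` and `TokenLedger` are hypothetical names, the ledger is in-memory only, and a production version would need shared storage and daily resets. A spend is accepted only if every scope can absorb it.

```kotlin
enum class Scope { REQUEST, SESSION, USER_DAY, FEATURE_DAY }

// Tracks tokens spent per (scope, key) and rejects a spend that would break any limit.
class TokenLedger(private val limits: Map<Scope, Int>) {
  private val spent = mutableMapOf<Pair<Scope, String>, Int>()

  // keys maps each scope to its identifier, e.g. USER_DAY -> "user42:2026-01-27".
  fun trySpend(keys: Map<Scope, String>, tokens: Int): Boolean {
    val wouldExceed = keys.any { (scope, key) ->
      val limit = limits[scope] ?: return@any false  // no limit configured for this scope
      (spent[scope to key] ?: 0) + tokens > limit
    }
    if (wouldExceed) return false
    // Charge every scope only after all of them have passed the check.
    keys.forEach { (scope, key) ->
      spent[scope to key] = (spent[scope to key] ?: 0) + tokens
    }
    return true
  }
}
```

Checking all scopes before charging any of them avoids partially recorded spends when one scope rejects the request.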

The Practical Formula

Define a request budget based on feature tier:

  • B_in = max input tokens
  • B_out = max output tokens
  • B_tools = max tool calls
  • B_cost = max $ cost per request

Then make the Context Builder “spend” B_in intentionally.
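
To relate B_in and B_out to B_cost, a worst-case estimate can be computed before the call. This is a sketch: `Price` and `worstCaseCostMicros` are hypothetical names, costs are assumed to be in micro-dollars, and the prices in the usage note are placeholders, not real vendor prices.

```kotlin
// Hypothetical per-tier prices in micro-dollars per 1,000 tokens.
data class Price(val inputMicrosPer1k: Long, val outputMicrosPer1k: Long)

// Upper bound on the cost of one call if both token budgets are fully spent.
fun worstCaseCostMicros(maxInputTokens: Int, maxOutputTokens: Int, price: Price): Long =
  (maxInputTokens * price.inputMicrosPer1k + maxOutputTokens * price.outputMicrosPer1k) / 1000

// True when the worst case still fits the per-request cost budget.
fun fitsCostBudget(
  maxInputTokens: Int,
  maxOutputTokens: Int,
  price: Price,
  maxCostMicros: Long
): Boolean = worstCaseCostMicros(maxInputTokens, maxOutputTokens, price) <= maxCostMicros
```

With placeholder prices of `Price(500, 1500)`, a budget of 1,800 input and 300 output tokens has a worst case of 1,350 micros, which fits under a 2,000-micro cap. If a tier does not fit, the router can drop to a cheaper tier or shrink B_in.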

Context Building That Stays Inside Budget

Step 1: Start with a “context envelope”

  • System instructions: fixed
  • Policy snippets: fixed minimal set
  • User state summary: short
  • Retrieval: top-k, capped
  • Conversation history: summarized, capped

Step 2: Prefer summaries over raw logs

  • session_summary (updated each turn)
  • user_profile_summary (opt-in, policy-controlled)
  • tool_result_summary (store the gist, not raw data)

Step 3: Retrieval must be budget-aware

If your budget allows 1,200 tokens of retrieval, don’t retrieve 4,000.

Technique:

  • Retrieve more candidates (IDs), but only materialize chunks until the budget is hit.
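
The technique above can be sketched as a loop over ranked candidate IDs that loads chunk text only while the budget allows. `Chunk`, `materializeWithinBudget`, and the `loadChunk` callback are hypothetical, and `tokenCount` is assumed to come from whatever tokenizer you already use.

```kotlin
// Hypothetical chunk record; tokenCount would be precomputed with your tokenizer.
data class Chunk(val id: String, val text: String, val tokenCount: Int)

// Walks ranked candidate IDs and materializes chunks until the token budget
// or the chunk cap is reached.
fun materializeWithinBudget(
  rankedIds: List<String>,
  loadChunk: (String) -> Chunk?,   // returns null when the chunk is missing
  maxTokens: Int,
  maxChunks: Int
): List<Chunk> {
  val selected = mutableListOf<Chunk>()
  var used = 0
  for (id in rankedIds) {
    if (selected.size >= maxChunks) break
    val chunk = loadChunk(id) ?: continue
    if (used + chunk.tokenCount > maxTokens) break  // stop at the first chunk that does not fit
    selected += chunk
    used += chunk.tokenCount
  }
  return selected
}
```

Stopping at the first chunk that does not fit keeps the selection in rank order. Replacing that `break` with `continue` would pack smaller, lower-ranked chunks into the remaining budget instead, which trades relevance for fill.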

Step 4: Enforce hard stop rules

  • Cap number of chunks
  • Cap total retrieved tokens
  • Cap history tokens
  • Cap output tokens with a hard max-output-token limit, backed by “be concise” instructions and stop sequences

Kotlin-ish Skeleton: Budget Manager + Router

Kotlin
 
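data class Budgets(
  val maxInputTokens: Int,   // context budget (B_in)
  val maxOutputTokens: Int,  // output budget (B_out)
  val maxToolCalls: Int,     // tool budget (B_tools)
  val maxCostMicros: Long    // cost budget (B_cost); "micros" presumably means millionths of a dollar
)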

enum class ModelTier { SMALL, MEDIUM, LARGE }

data class RouteDecision(
  val tier: ModelTier,
  val maxOutputTokens: Int,
  val temperature: Double,
  val reason: String,
  val fallback: List<ModelTier>
)

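class BudgetManager {
  // Arguments are Budgets(maxInputTokens, maxOutputTokens, maxToolCalls, maxCostMicros).
  // userPlan is unused in this skeleton; a fuller version could scale budgets by plan.
  fun allocate(feature: String, userPlan: String): Budgets {
    return when (feature) {
      "help_qna" -> Budgets(1800, 300, 0, 2_000)      // cheap
      "rag_answer" -> Budgets(2400, 450, 0, 6_000)    // moderate
      "agent_task" -> Budgets(2600, 500, 4, 12_000)   // pricey
      else -> Budgets(1600, 250, 0, 2_000)            // conservative default for unknown features
    }
  }
}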

class ModelRouter {
  fun route(signals: Map<String, Any>, budgets: Budgets): RouteDecision {
    val complexity = signals["complexity"] as? Int ?: 1
    val confidence = signals["confidence"] as? Double ?: 0.7

    return when {
      confidence < 0.45 || complexity >= 4 ->
        RouteDecision(ModelTier.LARGE, budgets.maxOutputTokens, 0.2,
          reason = "low_confidence_or_high_complexity", fallback = listOf(ModelTier.MEDIUM, ModelTier.SMALL))

      complexity >= 2 ->
        RouteDecision(ModelTier.MEDIUM, budgets.maxOutputTokens, 0.2,
          reason = "moderate_complexity", fallback = listOf(ModelTier.SMALL))

      else ->
        RouteDecision(ModelTier.SMALL, minOf(250, budgets.maxOutputTokens), 0.1,
          reason = "default_small", fallback = emptyList())
    }
  }
}


The “Prompt Budget” Patterns That Actually Work

Pattern A: Answer within constraints

Add explicit constraints that reduce verbosity:

  • “Use at most 6 bullets.”
  • “No preamble.”
  • “If unsure, ask exactly one question.”
  • “Cite only the top 2 reasons.”

Pattern B: Two-pass with a cheap first pass

  • Small model: classify intent, decide what to retrieve, outline response
  • Medium model: generate final answer using only the outline and retrieval

This reduces large-model calls and reduces thrash.

Pattern C: Compress, then reason

When context is big:

  • Compress retrieval (small-model summarizer)
  • Feed summary to medium model

You pay tokens once to compress, then keep future turns cheap.

Guardrails: Prevent “Infinite Tool Loops” and “Retry Storms”

Cost spikes often come from tool retries:

  • Transient network errors
  • Ambiguous tool arguments
  • Agent planning loops

Controls

  • Hard cap on maxToolCalls
  • Tool-level idempotency keys
  • Exponential backoff
  • “If tool fails twice, degrade gracefully”
  • Cache safe tool results
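
A few of these controls can be sketched together: a hard cap on tool calls, exponential backoff, and a degraded result after repeated failure. `ToolOutcome`, `ToolBudget`, and `callWithBackoff` are hypothetical names, the defaults are illustrative, and idempotency keys and result caching are left out.

```kotlin
// Outcome of a guarded tool call: a value, or a degraded result after giving up.
sealed class ToolOutcome {
  data class Success(val value: String) : ToolOutcome()
  data class Degraded(val reason: String) : ToolOutcome()
}

// Hard cap on tool calls for one request; not thread-safe.
class ToolBudget(private val maxToolCalls: Int) {
  private var used = 0
  fun tryAcquire(): Boolean = if (used < maxToolCalls) { used++; true } else false
}

// Calls `tool` up to maxAttempts times, doubling the delay after each failure.
// `sleep` is injectable so tests do not have to wait.
fun callWithBackoff(
  maxAttempts: Int = 2,
  baseDelayMs: Long = 200,
  sleep: (Long) -> Unit = { Thread.sleep(it) },
  tool: () -> String
): ToolOutcome {
  var delayMs = baseDelayMs
  for (attempt in 1..maxAttempts) {
    try {
      return ToolOutcome.Success(tool())
    } catch (e: Exception) {
      if (attempt == maxAttempts) {
        return ToolOutcome.Degraded("tool failed $maxAttempts times: ${e.message}")
      }
      sleep(delayMs)
      delayMs *= 2
    }
  }
  return ToolOutcome.Degraded("no attempts allowed")  // only reached when maxAttempts < 1
}
```

An agent loop would call `tryAcquire()` before each tool invocation and treat a `Degraded` outcome as a signal to answer with what it already has rather than plan another call.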

Telemetry You Must Log (Or Cost Stays Uncontrolled)

At minimum per request:

  • model_tier
  • input_tokens, output_tokens
  • cache_hit_type (none/response/semantic/retrieval)
  • retrieval_tokens, retrieval_chunks
  • tool_calls
  • latency_ms
  • fallback_reason
  • cost_estimate_micros

Then build two dashboards:

  1. Cost per feature per day
  2. Cost per successful outcome (not per request)

Cost per success is the one that changes behavior.
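
Cost per successful outcome can be derived from the per-request records. The sketch below uses a subset of the fields listed above; `RequestLog` and `costPerSuccessMicros` are hypothetical names, and the `succeeded` flag is an added assumption, since what counts as success has to be defined per feature.

```kotlin
// One record per request. Field names follow the telemetry list;
// `succeeded` is an assumption added for this example.
data class RequestLog(
  val feature: String,
  val modelTier: String,
  val inputTokens: Int,
  val outputTokens: Int,
  val cacheHitType: String,       // none/response/semantic/retrieval
  val toolCalls: Int,
  val latencyMs: Long,
  val costEstimateMicros: Long,
  val succeeded: Boolean
)

// Total cost divided by the number of successful outcomes, per feature.
// Failed requests still add to the numerator. Null means no successes yet.
fun costPerSuccessMicros(logs: List<RequestLog>): Map<String, Double?> =
  logs.groupBy { it.feature }.mapValues { (_, rows) ->
    val successes = rows.count { it.succeeded }
    if (successes == 0) null
    else rows.sumOf { it.costEstimateMicros }.toDouble() / successes
  }
```

Because failures count toward cost but not toward outcomes, a feature that retries heavily or often fails looks expensive on this metric even when its per-request cost is low.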

Failure Modes (and What to Do)

  • Cache hit-rate is low → prompts aren’t stable; normalize inputs; version templates
  • Large model dominates → router missing signals; add small-first + confidence gating
  • RAG prompts are huge → retrieval not budget-aware; cap chunk tokens; compress
  • Outputs are verbose → output constraints + schema-first responses
  • Retry storms → idempotency, max retries, degrade mode, circuit breaker

Closing Thought

Most teams try to “optimize prompts” when their real problem is missing system controls.

If you implement:

  • Budget manager
  • Caching layer
  • Router with typed decisions
  • Token-aware context builder
  • Cost telemetry

…you’ll get stable spend, predictable latency, and a system you can actually operate.
