The Art of Token Frugality in Generative AI Applications
Trim your prompts, cache aggressively, route simple queries to smaller models, constrain your outputs, and swap JSON for TOON to lower token bills.
Join the DZone community and get the full member experience.
Join For FreeThere was a time when token costs felt like rounding errors. A prototype making a few hundred calls a day, with a few cents here and there. That changes fast. When a generative AI (GenAI) application scales to thousands of users making multiple requests daily, token costs stop being a footnote and start being a line item that competes with infrastructure. The question is not whether to manage token consumption. It is whether you do so deliberately or by accident.
This article organizes some of the methods for reducing token consumption in production GenAI and agentic AI applications. Though not an exhaustive list, it is an actionable set of principles to apply directly and generative enough to spark further ideas. After all, frugality is the mother of invention, and in the age of AI transformation, thinking carefully about where tokens go is not an optimization. It is a discipline.
I. Prompt Engineering for Leanness
The prompt is the most immediate cost lever available. Most production system prompts carry significant dead weight: verbose framing, restating the model's context, and an instructional preamble that adds tokens without improving precision. System prompts are sent on every call. A 2,000-token system prompt across 100,000 daily calls consumes 200 million tokens per month on instructions alone. Some of the following methods can be applied to make your prompts lean.
Compressing System Prompts
Remove preamble, collapse prose instructions into declarative directives, and delete anything that restates model defaults. A well-audited system prompt often compresses to 20 to 30 percent of its original size without any change to compliance behavior.
Practical Test
Remove a sentence, run the eval, and observe whether output quality changes. If it does not, the sentence does not belong.
Bullet-Form Compression
Replace paragraph-form instructions with short declarative bullets. A 400-word prose instruction block describing tone, format, and constraints can typically compress to 8 to 12 directive bullets at under 100 tokens. The compression ratio is high because prose carries connective tissue that the model does not need.
Output Format Instructions Reduce Output Tokens
Output tokens are priced at a higher rate than input tokens across most major providers. Asking for a detailed response invites verbosity that costs disproportionately. Specifying a concrete output contract instead. For example, a JSON object with three named fields or a response in under 100 words reduces output volume and improves downstream parseability simultaneously.
Accuracy-Per-Token Trade-Off
Each few-shot example costs 100–400 tokens, depending on its complexity. For tasks with well-established model priors, such as sentiment classification or JSON extraction, zero-shot prompting frequently matches few-shot accuracy at no example cost. When examples are necessary, one well-chosen example consistently outperforms three mediocre ones.
II. Context Window Management
Every token in the context window is a paid token. For context management, the goal is to maximize the information density of what you send, not the completeness of what you have. Most applications leak tokens through three sources: over-retrieved RAG chunks, unbounded conversation history, and noisy document content.
Precision Retrieval in RAG Pipelines
A naive top-k retrieval strategy retrieves ten chunks and injects all of them. A two-stage approach retrieves broadly and then reranks to the two or three most relevant passages. The token reduction is proportional. Injecting two passages instead of ten reduces context by roughly 80 percent on that component. Teams should benchmark quality at each retrieval depth on representative queries before choosing a threshold.
Conversation History Pruning
The appropriate strategy depends on how much conversational context the task genuinely requires. In multi-turn applications, history grows linearly with each turn. A 20-turn conversation accumulates several thousand tokens of prior context before any new message is processed. Rolling window truncation keeps the last N turns. Summarisation compresses older turns into a compact memory block. Hybrid architectures combine a short-term buffer with a compressed long-term summary.
Document Pre-Processing
Raw documents injected into prompts carry significant noise. Repeated headers, blank lines, navigation artifacts from PDF extraction, and formatting structure that carries no semantic content. Stripping this noise before injection is a low-effort and high-yield operation. The gains depend on document quality but are often substantial in enterprise pipelines processing PDFs and web content.
III. Model Selection and Routing
Not every query requires a frontier model. The cost difference between a large frontier model and a smaller model in the same provider family can exceed one order of magnitude per token. Routing queries to the appropriate model tier based on complexity is one of the highest-leverage optimizations available in production systems.
Complexity-Based Routing
A lightweight classifier scores incoming queries on a complexity dimension. Simple queries, such as FAQ lookups, field extraction, and straightforward classification, route to smaller and cheaper models. Complex queries involving multi-step reasoning, synthesis across sources, or ambiguous intent route to the frontier model. The classifier itself may be a small fine-tuned model, a rule-based system, or a lightweight prompt-based scorer.
In typical enterprise workloads, a substantial fraction of queries are straightforward and do not warrant frontier model compute. The exact proportion depends on the application. Specific percentage savings from model routing are highly application-dependent. Teams should profile their actual query distribution before projecting cost reduction. While the principle of model routing is desirable to reduce token cost, the magnitude must be measured empirically to assess the impact.
Fine-Tuned Models for Narrow Tasks
For high-volume and narrowly defined tasks, a fine-tuned model on a smaller base frequently outperforms a general frontier model at a fraction of the cost. The upfront investment in fine-tuning amortizes quickly once daily call volumes are significant. Evaluation against quality baselines is essential before declaring a fine-tuned model production-ready.
IV. Caching Strategies
The cheapest API call is the one that never happens. Caching operates at multiple levels, and each level addresses a distinct pattern of token waste.
Semantic Caching
Exact-match caching is limited. Paraphrased queries miss the cache even when the answer is identical. Semantic caching uses embedding similarity to match incoming queries against cached query-response pairs. If an incoming query is semantically equivalent to a prior query above a similarity threshold, the cached response is returned without an API call. Suitable hit rates depend on application query diversity and cannot be stated generally.
Prompt Prefix Caching
Prompt prefix caching, offered by several providers, including Anthropic, reuses the key-value cache for the static prefix of a prompt across multiple calls. Where the system prompt and any shared context are constant, only the variable portion is processed from scratch. This is particularly effective for applications with long, stable system prompts or shared document contexts.
Deterministic Result Caching
For tasks that produce stable outputs for a given input, such as document summarisation or entity extraction from fixed content, caching the result and serving it to all subsequent queries for the same input eliminates redundant processing entirely. Content-addressed storage keyed on a hash of the input is a practical implementation pattern.
VI. Architectural Patterns
Individual prompt optimizations are necessary but not sufficient. Application architecture determines the structural constraints on token consumption. Several patterns merit attention in production GenAI systems. Agentic loops that delegate all decisions to an LLM, including routing and trivial judgments that could be handled deterministically, accumulate tool-call overhead quickly. Imposing a hard step budget per session and replacing LLM-driven routing with deterministic rule engines for well-understood intents removes a large category of unnecessary calls. Hybrid architectures reserve LLM calls for tasks that genuinely require language understanding and route everything else to cheaper compute.
VII. TOON: Token-Oriented Object Notation
There is a category of token waste in prompt engineering that is not typically addressed: the overhead of the data serialization format itself. JSON is the default format for structured data in LLM prompts. It was designed for machine-to-machine data exchange and not for LLM input. Its grammar is verbose. Every object repeats every field name. Every string value is double-quoted. Structural delimiters, including braces, brackets, colons, and commas, each consume tokens. In a prompt, injecting a list of 50 uniform records, JSON repeats the field names 50 times. Token-Oriented Object Notation (TOON) addresses this directly.
TOON is a compact, human-readable serialization format designed for LLM input. It was released in late 2025 and is a lossless, drop-in replacement for JSON. Its design merges YAML's indentation-based structure for nested objects with a CSV-style tabular row layout for uniform arrays. Field names are declared once in a header. Data rows follow without repeating keys. The format is available at toonformat.dev (Schopplich, 2025), where the specification, benchmarks, and multi-language implementations are documented.
Syntax Overview
TOON uses three primary structural elements. A scalar object uses YAML-style key-value pairs. A uniform array uses a single header declaration of the form arrayName[N]{field1,field2,...} followed by CSV-style data rows. Nested structures combine both, with child objects indented under parent records. The array declaration encodes both the length and the field schema that provides the model with a structural contract it can use to validate completeness.
-- Simple object --
JSON: { "name": "Alice", "age": 30, "city": "Hyderabad" }
TOON:
name: Alice
age: 30
city: Hyderabad
-- Uniform array (TOON primary use case) --
JSON (repeats field names on every row):
[ { "id": 1, "name": "Alice", "role": "admin", "dept": "Eng" },
{ "id": 2, "name": "Bob", "role": "viewer", "dept": "Ops" } ]
TOON (fields declared once, rows are data only):
users[2]{id,name,role,dept}:
1, Alice, admin, Eng
2, Bob, viewer, Ops
-- Nested object with embedded array --
order:
id: ORD-9821
customer: Sibanjan Das
items[2]{sku,qty,price}:
SKU-01, 2, 499.00
SKU-07, 1, 1299.00
total: 2297.00
Benchmark Results
The TOON project publishes a benchmark suite at toonformat.dev/guide/benchmarks.html that measures retrieval accuracy and token efficiency across 209 data retrieval questions on four LLMs: Claude Haiku, Gemini 3 Flash, GPT-5 Nano, and Grok 4.1 Fast. The benchmarks test 11 datasets covering uniform, semi-uniform, nested, and deeply nested data structures (Schopplich, 2025).
The key efficiency finding is that TOON achieves 76.4 percent retrieval accuracy against JSON's 75.0 percent while consuming 39.9 percent fewer tokens in the mixed-structure track. On a per-efficiency-unit basis, defined as accuracy percentage per 1,000 tokens, TOON scores 27.7 against JSON's 16.4. These figures are from the official benchmark suite and apply to the specific datasets and question types tested.
Source: toonformat.dev/guide/benchmarks.html (accessed April 2026)
|
76.4% TOON retrieval accuracy vs JSON 75.0% |
−39.9% Token reduction vs pretty-printed JSON |
27.7 Acc % per 1K tokens vs JSON's 16.4 |
Format Comparison by Data Shape
TOON's advantage concentrates in uniform arrays of objects. For deeply nested or irregular data, JSON or YAML may be more efficient because TOON's header declarations add overhead to small arrays. This trade-off is documented in the official specification.
|
Data Shape |
Recommended Format |
TOON vs. JSON (tokens) |
Note |
|
Uniform arrays (catalogs, rosters) |
TOON |
Up to −39.9% (benchmark) |
TOON primary use case |
|
Semi-uniform arrays |
TOON |
Moderate reduction |
Depends on field overlap |
|
Simple flat objects (single records) |
TOON or YAML |
Moderate reduction |
YAML also competitive |
|
Deeply nested irregular structures |
JSON or YAML |
+5–10% overhead |
Header cost exceeds savings |
|
Pure flat tables (no nesting) |
CSV |
TOON slightly larger than CSV |
CSV has no structural overhead |
Format guidance based on toonformat.dev specification and benchmark documentation (Schopplich, 2025).
The Boundary Conversion Pattern
The recommended implementation pattern is to keep JSON as the internal data format for application code, API calls, and database operations. Convert to TOON only at the LLM boundary, immediately before constructing the prompt. Parse the LLM's TOON output back to JSON immediately after the response is received. This isolates the format change to a thin middleware layer and requires no changes to the application data model.
# Boundary conversion pattern
from toon import to_toon, from_toon
def call_llm(system_prompt: str, data: list[dict]) -> dict:
toon_data = to_toon(data) # convert at boundary only
prompt = system_prompt + '\n\nData:\n' + toon_data
raw = llm.call(prompt, output_format='toon')
return from_toon(raw) # convert back i
TOON is most effective in three scenarios:
- RAG pipelines injecting large structured reference datasets
- Agentic tool responses returning uniform record sets
- Bulk classification or extraction pipelines where the same schema appears across many rows
TOON is less suitable for deeply nested irregular JSON and should be benchmarked on latency-critical local or quantized model deployments before adoption. The official documentation at toonformat.dev provides a playground for testing token counts on specific payloads.
Token frugality is a mark of engineering maturity. The discipline of deciding what to send, when to cache, which model to invoke, and how to serialize structured data is exactly the discipline that separates production-grade AI systems from well-intentioned prototypes. Start with the highest-yield changes such as prompt compression, explicit output constraints, and format conversion to TOON for uniform structured data. Build observability before tuning further. Measure quality at every step. The methods described here are applicable today. The tooling, particularly for semantic caching, model routing, and TOON adoption, is maturing rapidly. We are in the early phases of production GenAI cost engineering, and the practice will only become more systematic as the field develops.
Opinions expressed by DZone contributors are their own.
Comments