Cost Control in AI Systems Is an Architectural Problem

In AI systems, rising costs are often caused by architecture, not pricing. Retries, latency, and duplicate work multiply usage. Idempotency and well-placed boundaries keep cost under control.

Aditya Gupta

Mar. 12, 26 · Analysis


For any working system, AI or not, operating costs play a significant role throughout the product lifecycle. In AI systems, these costs are typically estimated from projected usage and concurrency, and the estimates usually determine the pricing that end users pay. The critical problem arises when the estimates turn out to be far from the actuals: operating costs rise, margins shrink, and the business suffers.

When this happens, people tend to blame an expensive model, but that is often not the cause. These systems become expensive because their architectures multiply costs.

At the beginning of most AI projects, cost estimation appears simple. People estimate tokens per request, then multiply by the model’s pricing and the projected monthly traffic. The numbers feel manageable, predictable, and controlled. Then development ends and the system moves to production, where the actual problem begins.

A system forecast to cost $3,000 per month suddenly costs $9,000 per month. When the team investigates, every parameter looks normal: traffic stayed within range, and the model version and pricing did not change, yet billing tripled. The culprit is the architecture.

Why AI Systems Become Expensive in Production

To better understand this, let's walk through a real-world scenario.

Imagine an AI product enrichment tool for a marketplace. Each product requires roughly 1,200 input tokens and generates about 700 output tokens. At $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens, each enrichment costs approximately $0.033. Processing 100,000 products per month should cost around $3,300.
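The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name `cost_per_request` and its parameters are illustrative, and the prices are the ones quoted in this scenario, not any vendor's actual rates.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the model cost in dollars for one request."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost

# Scenario figures: 1,200 input tokens, 700 output tokens, 100,000 products.
per_item = cost_per_request(1200, 700, 0.01, 0.03)
monthly = per_item * 100_000
print(f"per item: ${per_item:.3f}, monthly: ${monthly:,.0f}")
```

Note that this estimate assumes exactly one model call per product, which is the assumption the rest of the article challenges.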

Architecturally, it’s a straightforward pipeline: an API receives product data, invokes a function, calls the model, stores the result, and displays it to the user.

But in production, the system rarely follows this clean path.

Common Architectural Cost Multipliers

The first cost multiplier is retries without idempotency. Most teams add retries early in development, correctly anticipating network delays and timeouts. A typical implementation looks like this:

    Python

import time

import requests

MODEL_URL = "https://api.modelvendor.com/model_v3"

def call_model(payload: dict) -> dict:
    # Each call that reaches the model is one billable inference.
    response = requests.post(MODEL_URL, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()

def generate_description(payload: dict) -> dict:
    # Up to three attempts. An attempt is billed even if its response
    # is lost on the way back and the client only sees a timeout.
    for attempt in range(3):
        try:
            return call_model(payload)
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s
    raise RuntimeError("Model failed after retries")

This code improves the system's reliability. It can also double or triple spending on every request that hits the retry path.

Consider a transient network failure. The model successfully processes the request, but the response is lost to a dropped connection. The client times out, and the system sends the request to the LLM again. From the system’s perspective, this is resilience. From the billing perspective, it is two full inferences.

If even 15% of requests retry once, the additional cost is not small. At 100,000 requests, that is 15,000 extra model calls. At $0.033 each, that is nearly $500 in unplanned spend. If some requests retry twice, the multiplier grows. During peak latency spikes, it compounds further.
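The retry overhead in this paragraph can be estimated the same way. The sketch below is illustrative only: `retry_overhead` and its parameters are hypothetical names, and it assumes every retried request costs the same as the original call.

```python
def retry_overhead(requests_per_month: int, cost_per_call: float,
                   retry_rate: float, extra_calls_per_retry: int = 1) -> float:
    """Estimate extra monthly spend caused by retried model calls."""
    extra_calls = requests_per_month * retry_rate * extra_calls_per_retry
    return extra_calls * cost_per_call

# 15% of 100,000 requests retry once at $0.033 per call.
single_retry = retry_overhead(100_000, 0.033, 0.15)
# The same share of requests retrying twice doubles the overhead.
double_retry = retry_overhead(100_000, 0.033, 0.15, extra_calls_per_retry=2)
print(f"${single_retry:,.0f} vs ${double_retry:,.0f}")
```

Running the numbers for a few retry rates makes it easier to see how quickly the forecast drifts once latency degrades.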

The correction is architectural, not financial. Identical work should not be processed twice, and that requires idempotency. A deterministic idempotency key ensures that the same logical request maps to the same stored result.

    Python

import hashlib
import json

def idempotency_key(payload: dict) -> str:
    # sort_keys makes the key independent of dict key order.
    normalized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()
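A quick check shows what "deterministic" means for this key. The payload fields below (`sku`, `title`) are made up for illustration; the function is repeated so the snippet runs on its own.

```python
import hashlib
import json

def idempotency_key(payload: dict) -> str:
    normalized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(normalized.encode()).hexdigest()

a = {"sku": "A-100", "title": "Desk lamp"}
b = {"title": "Desk lamp", "sku": "A-100"}   # same data, different key order
c = {"sku": "A-100", "title": "Desk lamp "}  # trailing space in the title

same_key = idempotency_key(a) == idempotency_key(b)       # True
different_key = idempotency_key(a) != idempotency_key(c)  # True
```

Key order does not matter, but any byte-level difference in the values does. If cosmetic differences such as whitespace should count as the same logical request, the payload likely needs to be normalized before hashing.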

In this case, the system checks whether work has been completed before invoking the model:

    Python

class ResultStore:
    # In-memory stand-in for illustration.
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value

def generate_description(payload: dict, store: ResultStore):
    key = idempotency_key(payload)

    # Compare against None so an empty but valid result still counts as a hit.
    cached = store.get(key)
    if cached is not None:
        return cached

    result = call_model(payload)
    store.put(key, result)
    return result

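This change prevents duplicate spend whenever the result was stored before the retry arrives, for example when the response is lost between the service and its caller. It does not help if the model call itself times out before anything is stored, so the store should sit as close to the model call as possible and be shared across instances.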

The second cost multiplier is synchronous latency coupling. Many modern applications adopt serverless architectures to keep compute cost low. In these architectures, the API call usually blocks until the model returns. If the inference takes 6 seconds, the compute is occupied for 6 seconds and billed for 6 seconds. If latency increases under load, concurrent executions increase. If timeouts trigger retries, concurrency increases further, and model cost and compute cost amplify together. These scenarios rarely show up during development; they surface when traffic spikes.
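The link between latency and concurrency follows from Little's Law: the average number of in-flight requests equals the arrival rate multiplied by the time each request spends in the system. The sketch below uses a hypothetical rate of 20 requests per second; the function name is illustrative.

```python
def concurrent_executions(requests_per_second: float, latency_seconds: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x time in system."""
    return requests_per_second * latency_seconds

normal = concurrent_executions(20, 2)    # 40 executions in flight
degraded = concurrent_executions(20, 6)  # 120 executions in flight
print(normal, degraded)
```

Traffic is unchanged in both cases, yet tripling latency triples concurrency. In a synchronous design, every one of those in-flight requests holds billed compute while it waits for the model.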

The alternative to this is architectural decoupling. Instead of performing the inference inline, the system pushes a job to the queue and returns immediately. Workers process the job asynchronously.

Example:

    Python

import json
import uuid

def enqueue_job(sqs_client, queue_url, payload):
    # Return a job ID right away instead of waiting for the model.
    job_id = str(uuid.uuid4())
    message = {
        "job_id": job_id,
        "payload": payload
    }

    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(message)
    )

    return job_id

With this design, a worker processes the job independently, isolating model latency from the user-facing request. The number of in-flight model calls, retries included, is bounded by the number of workers. Compute billing is no longer directly coupled to user latency.
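The worker side is not shown above, so here is one possible shape for it. This is a sketch under assumptions: `process_batch` is a hypothetical name, `store` is any object with the `get`/`put` interface of the `ResultStore` shown earlier, and `call_model` is passed in as a callable. The `receive_message` and `delete_message` calls use the keyword arguments of the boto3 SQS client.

```python
import json

def process_batch(sqs_client, queue_url, store, call_model, max_messages=10):
    """Pull one batch of jobs, run each job at most once, delete handled messages."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=10,                # long polling to avoid busy loops
    )
    processed = []
    # The "Messages" key is absent when the queue is empty.
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        job_id = job["job_id"]
        # Standard queues deliver at least once, so a job can arrive twice.
        # Only call the model if no result has been stored for this job.
        if store.get(job_id) is None:
            store.put(job_id, call_model(job["payload"]))
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        processed.append(job_id)
    return processed
```

Because the worker writes the result before deleting the message, a redelivered message finds the stored result and skips the model call. The number of workers running this loop caps how many model calls can be in flight at once.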

The third, and most common, cost multiplier is embedding regeneration. Embedding generation often accounts for a large share of the cost of an AI system because it runs at scale across entire catalogs. Suppose embedding one product in the product enrichment system costs $0.002. For 10,000 products, a full re-embedding cycle costs $20. If this runs nightly, that becomes $600 per month. Across multiple tenants, it can quietly reach thousands.

This mistake is architectural. For a fixed embedding model, identical content produces effectively identical vectors, so regenerating an embedding when the content has not changed yields the same result and is wasted spend. A simple fingerprint avoids it.

    Python

import hashlib

def content_hash(text: str) -> str:
    # Fingerprint of the exact text that was embedded.
    return hashlib.sha256(text.encode()).hexdigest()

Storing the hash alongside the embedding and regenerating the embedding only when the hash changes can significantly reduce regeneration cost. Embedding cost then scales with the content change rate, not the catalog size.
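A refresh job built on this fingerprint might look like the sketch below. The names are assumptions made for illustration: `refresh_embeddings` is hypothetical, `index` is a plain dict standing in for wherever vectors are stored, and `embed` is whatever callable produces a vector for a piece of text.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def refresh_embeddings(products: dict, index: dict, embed) -> int:
    """Re-embed only products whose text changed; return the number of embed calls.

    products maps product_id -> text.
    index maps product_id -> {"hash": str, "vector": list}.
    """
    calls = 0
    for product_id, text in products.items():
        digest = content_hash(text)
        entry = index.get(product_id)
        if entry is not None and entry["hash"] == digest:
            continue  # content unchanged, keep the stored vector
        index[product_id] = {"hash": digest, "vector": embed(text)}
        calls += 1
    return calls
```

On the first run every product is embedded; on later runs only new or edited products are. If the embedding model itself changes, the stored vectors are no longer comparable, so one option is to include a model identifier in the hashed text to force a full refresh.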

The final multiplier appears during partial outages. Suppose average latency increases from 2 seconds to 7 seconds while the client timeout remains at 5 seconds. Nearly every request times out and enters the retry loop. If the system is configured for three attempts, as in the retry loop shown earlier, traffic to the model triples, and model usage and cost rise with it. The retry logic was written as a reliability measure, but in this scenario it acts as a cost amplifier.

A proper failure classification reduces such risks.

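    Python

class RetryableError(Exception):
    pass

class NonRetryableError(Exception):
    pass

def safe_model_call(payload):
    try:
        return call_model(payload)
    except requests.exceptions.Timeout as e:
        # Timeouts may succeed on a later attempt.
        raise RetryableError() from e
    except requests.exceptions.HTTPError as e:
        # 4xx: the request itself is rejected, so retrying cannot help.
        # Note that this also treats 429 (rate limiting) as non-retryable.
        if 400 <= e.response.status_code < 500:
            raise NonRetryableError() from e
        # 5xx: upstream trouble that may clear.
        raise RetryableError() from e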

This code classifies failures so that the caller retries only those likely to recover, such as timeouts or temporary upstream issues. Validation errors, authentication problems, and bad requests are not retried, because retrying them will not change the outcome. Retries should also be bounded, with controlled backoff, so the system does not enter an uncontrolled loop. When failure handling is disciplined and classification is explicit, the system remains predictable under load.
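The classification only pays off if the retry loop respects it. One way to combine the two is sketched below; `call_with_retries` and its default values are illustrative assumptions, and the exception classes are repeated so the snippet runs on its own.

```python
import random
import time

class RetryableError(Exception):
    pass

class NonRetryableError(Exception):
    pass

def call_with_retries(operation, max_attempts=3, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Run operation(), retrying only RetryableError, at most max_attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except NonRetryableError:
            raise  # retrying would only repeat the failure and the spend
        except RetryableError:
            if attempt == max_attempts:
                raise  # attempt budget exhausted, surface the failure
            # Exponential backoff with jitter, capped at max_delay seconds.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

A caller would wrap the classified call, for example `call_with_retries(lambda: safe_model_call(payload))`. The attempt cap puts a hard ceiling on how far an outage can multiply model traffic, and the jitter keeps many clients from retrying at the same instant.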

How to Design Cost-Efficient AI Systems

Mature AI systems treat idempotency as mandatory. They place asynchronous boundaries deliberately. They fingerprint deterministic operations. They classify failures explicitly. They observe cost per job, not cost per service. They are designed to be cost-efficient from the beginning so that they behave predictably.
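Observing cost per job can start with something very small. The sketch below is one hypothetical way to do it: `JobCostLedger` is an invented helper, kept in memory for illustration, and the prices are the ones used in the enrichment scenario.

```python
from collections import defaultdict

class JobCostLedger:
    """Accumulates model spend per job ID (hypothetical helper)."""

    def __init__(self, input_price_per_1k: float, output_price_per_1k: float):
        self.input_price_per_1k = input_price_per_1k
        self.output_price_per_1k = output_price_per_1k
        self._costs = defaultdict(list)  # job_id -> cost of each model call

    def record_call(self, job_id: str, input_tokens: int, output_tokens: int) -> None:
        cost = (input_tokens / 1000 * self.input_price_per_1k
                + output_tokens / 1000 * self.output_price_per_1k)
        self._costs[job_id].append(cost)

    def calls(self, job_id: str) -> int:
        return len(self._costs.get(job_id, []))

    def cost(self, job_id: str) -> float:
        return sum(self._costs.get(job_id, []))

    def duplicated_jobs(self) -> list:
        """Job IDs that triggered more than one model call."""
        return sorted(j for j, c in self._costs.items() if len(c) > 1)

ledger = JobCostLedger(0.01, 0.03)
ledger.record_call("job-1", 1200, 700)
ledger.record_call("job-1", 1200, 700)  # a retry of the same job
ledger.record_call("job-2", 1200, 700)
print(ledger.duplicated_jobs(), round(ledger.cost("job-1"), 3))
```

A per-service total would show three calls and hide the fact that one job was paid for twice. Grouping by job makes the duplicate visible.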


Opinions expressed by DZone contributors are their own.
