Idempotency in AI Tools: The Most Expensive Thing Teams Forget

An analysis of how retries cause duplicate inference in AI tools and how idempotency keeps production systems predictable and cost-controlled.

Aditya Gupta

Mar. 02, 26 · Analysis

When AI tools move from a test environment to real-world use, the first “surprise” a developer encounters is rarely about accuracy. It’s usually something more problematic: the system behaves inconsistently, costs climb faster than expected, and the same job seems to run multiple times.

That’s not an AI problem. That’s a distributed systems problem. In AI systems, this particular failure is especially costly because every duplicate run has a direct dollar cost. Idempotency is the fix. Not the only fix, but often the most impactful one.

What Does Idempotency Mean?

An operation is idempotent if processing the same request or job more than once produces the same end result as processing it once. This matters because in a production environment, duplicate processing is normal. It happens mostly due to:

Network timeouts
User/client retries
Queue failure and redelivery
Mid-flight worker failures
Timeouts due to long-running tasks
Race conditions between parallel users
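To make the definition concrete, here is a minimal sketch contrasting a non-idempotent operation with an idempotent one. The function names and the in-memory containers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def record_append(results: List[str], job_id: str) -> None:
    # Non-idempotent: every delivery of the same job adds another entry.
    results.append(job_id)


def record_keyed(results: Dict[str, str], job_id: str) -> None:
    # Idempotent: results are keyed by job_id, so a repeat delivery
    # rewrites the same entry and leaves the end state unchanged.
    results[job_id] = "done"


appended: List[str] = []
keyed: Dict[str, str] = {}

# Simulate the same job being delivered twice, for example after a timeout.
for _ in range(2):
    record_append(appended, "job-1")
    record_keyed(keyed, "job-1")

print(len(appended))  # 2: the duplicate delivery changed the outcome
print(len(keyed))     # 1: same state as a single delivery
```

The second function tolerates any number of redeliveries, which is the property the rest of this article tries to give to inference jobs.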

Impact of Non-Idempotency

In many non-AI systems, duplicate processing is problematic but not catastrophic. In AI tools, it is far more damaging: each duplicate directly erodes profitability, and it often goes unnoticed because nothing visibly fails.

A duplicate inference call costs:

Extra tokens
Extra latency
Extra downstream load

A Day-to-Day Example to Explain the Impact

Think of a warehouse fulfilling orders.

A customer places an order.
The system prints a packing slip.
The packer ships the item.

Now imagine the print request times out. The system retries. Two slips are printed, and two items get shipped. Nothing “crashed,” but the cost just doubled, and someone now has to untangle the duplicate shipment.

AI tools behave the same way when a job is retried and inference runs twice.

Reasons for Duplicates

1. Client or API Retries

Sometimes a client sends a "start enrichment" request to an LLM and waits for a reply. If it doesn't hear back in time, it tries again, even though the first request may still be running or may already have finished. The result is multiple requests for the same thing.
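One common way to make client retries safe is for the client to generate an idempotency key once per logical request and reuse it on every retry. The sketch below assumes a hypothetical `FakeEnrichmentServer` that deduplicates by that key; the class, method, and parameter names are illustrative and not taken from any real API.

```python
import uuid
from typing import Dict


class FakeEnrichmentServer:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.inference_calls = 0
        self._responses: Dict[str, dict] = {}

    def start_enrichment(self, idempotency_key: str, payload: dict) -> dict:
        # A repeated key returns the stored response instead of re-running inference.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.inference_calls += 1  # stands in for the paid LLM call
        response = {"title": payload["title"].strip().title()}
        self._responses[idempotency_key] = response
        return response


def call_with_retries(server: FakeEnrichmentServer, payload: dict, max_attempts: int = 3) -> dict:
    # The key is generated once, outside the loop, so every retry of this
    # logical request carries the same key.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        response = server.start_enrichment(key, payload)
        if attempt == 1:
            continue  # simulate the first reply being lost to a timeout
        return response
    raise TimeoutError("no reply after retries")


server = FakeEnrichmentServer()
reply = call_with_retries(server, {"title": "  wireless mouse "})
print(reply, server.inference_calls)
```

The server receives two requests but runs inference once. Generating a fresh key inside the loop would defeat the purpose, because each retry would then look like a new job.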

2. Queue Redelivery

Message queues such as SQS standard queues guarantee at-least-once delivery, not exactly-once delivery. If a consumer does not delete a message before its visibility timeout expires, or something else goes wrong, the same message can be delivered and processed again.
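The effect of redelivery can be simulated without a real broker. The sketch below is a toy model, not SQS itself: `drain`, `lose_ack_for`, and `dedupe` are hypothetical names, and the in-memory `processed_ids` set stands in for a durable record of handled messages.

```python
from collections import deque
from typing import Deque, Dict, Set, Tuple

Message = Tuple[str, str]  # (message_id, body)


def drain(queue: Deque[Message], lose_ack_for: Set[str], dedupe: bool) -> Dict[str, int]:
    """Consume every message and count how often each one triggers real work."""
    processed_ids: Set[str] = set()       # stand-in for a durable "seen" table
    work_counts: Dict[str, int] = {}
    pending_ack_loss = set(lose_ack_for)  # copy so the caller's set is untouched
    while queue:
        message_id, body = queue.popleft()
        if not dedupe or message_id not in processed_ids:
            # This is where the expensive inference call would happen.
            work_counts[message_id] = work_counts.get(message_id, 0) + 1
            processed_ids.add(message_id)
        if message_id in pending_ack_loss:
            # The acknowledgement is lost once, so the broker redelivers.
            pending_ack_loss.discard(message_id)
            queue.append((message_id, body))
    return work_counts


messages = [("m1", "enrich item 1"), ("m2", "enrich item 2")]
without_dedupe = drain(deque(messages), {"m1"}, dedupe=False)
with_dedupe = drain(deque(messages), {"m1"}, dedupe=True)
print(without_dedupe)  # {'m1': 2, 'm2': 1}
print(with_dedupe)     # {'m1': 1, 'm2': 1}
```

With deduplication switched off, the single lost acknowledgement doubles the work for `m1`. With it on, the redelivered message is recognized and skipped.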

3. Worker Fails Mid-Task

If a worker fails mid-task, the system may retry the job from the beginning, regardless of how far the failed attempt got. If inference had already run, it runs again.

Basic Implementation and Failure

Here’s a pattern that early AI services often adopt:

    Python

# llm_enrich and db are assumed to be defined elsewhere in the service.
def process_job(job_id: str, payload: dict) -> dict:
    # 1) Make a call to the LLM (the expensive step)
    result = llm_enrich(payload)

    # 2) Save the LLM output
    db.save_result(job_id, result)

    return result

This works perfectly until something goes wrong between step 1 and step 2. If the worker crashes after inference but before saving, the system will retry the job and call the LLM again.
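The failure is easy to reproduce in a small simulation. The sketch below uses a hypothetical `fake_llm_enrich` that only counts calls, and a `crash_before_save` flag that stands in for a worker dying between the two steps.

```python
from typing import Dict

llm_calls = 0
saved: Dict[str, dict] = {}


def fake_llm_enrich(payload: dict) -> dict:
    # Stand-in for the paid inference call; it only counts invocations.
    global llm_calls
    llm_calls += 1
    return {"title": payload["title"].title()}


def naive_process_job(job_id: str, payload: dict, crash_before_save: bool) -> dict:
    result = fake_llm_enrich(payload)  # step 1: inference
    if crash_before_save:
        raise RuntimeError("worker crashed before saving")
    saved[job_id] = result             # step 2: persist
    return result


# The first delivery crashes between the two steps, then the job is retried.
try:
    naive_process_job("job-1", {"title": "usb cable"}, crash_before_save=True)
except RuntimeError:
    pass
naive_process_job("job-1", {"title": "usb cable"}, crash_before_save=False)

print(llm_calls)  # 2 inference calls for 1 logical job
```

Nothing in the naive version records that inference was attempted, so the retry has no way to know the first call ever happened.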

Improved approach:

A way to make AI jobs idempotent is:

Use a stable idempotency key (job_id).
Store state before inference.
If the job is already completed, return the stored result.
Ensure only one worker “owns” the right to run inference at a time.

Below is a Python example that demonstrates the pattern.

Data Model

A table keyed by job_id:

status: PENDING | IN_PROGRESS | COMPLETED | FAILED
result: optional
attempts: count
updated_at: timestamp
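As one possible concrete form of this data model, the sketch below creates the table in SQLite using Python's standard `sqlite3` module. The table name `jobs`, the column types, and the helper `create_job_if_absent` are assumptions made for illustration; a production system would more likely use DynamoDB, Postgres, or similar.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        job_id     TEXT PRIMARY KEY,
        status     TEXT NOT NULL
                   CHECK (status IN ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED')),
        result     TEXT,                       -- optional, e.g. JSON-encoded output
        attempts   INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL
    )
    """
)


def create_job_if_absent(job_id: str) -> bool:
    """Insert a PENDING row; return False if the job already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, status, updated_at) VALUES (?, 'PENDING', ?)",
        (job_id, time.time()),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the key was already present


first = create_job_if_absent("job-1")
second = create_job_if_absent("job-1")
print(first, second)  # True False
```

Making `job_id` the primary key is what turns record creation into an idempotent operation: a second attempt to create the same job changes nothing.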

Code: Idempotent job execution

    Python

import time
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class JobRecord:
    job_id: str
    status: str  # PENDING | IN_PROGRESS | COMPLETED | FAILED
    result: Optional[Dict[str, Any]] = None
    attempts: int = 0
    updated_at: float = 0.0


class InMemoryJobStore:
    """
    This is a stand-in for DynamoDB/Postgres/etc.
    The focus is on the logic, not the storage engine.
    """
    def __init__(self):
        self._store: Dict[str, JobRecord] = {}

    def get(self, job_id: str) -> Optional[JobRecord]:
        return self._store.get(job_id)

    def put_if_absent(self, job: JobRecord) -> bool:
        if job.job_id in self._store:
            return False
        self._store[job.job_id] = job
        return True

    def try_mark_in_progress(self, job_id: str) -> bool:
        """
        In real systems this should be atomic (conditional update).
        Here we simulate it.
        """
        rec = self._store.get(job_id)
        if rec is None:
            return False
        if rec.status in ("IN_PROGRESS", "COMPLETED"):
            return False
        rec.status = "IN_PROGRESS"
        rec.attempts += 1
        rec.updated_at = time.time()
        return True

    def mark_completed(self, job_id: str, result: Dict[str, Any]) -> None:
        rec = self._store[job_id]
        rec.status = "COMPLETED"
        rec.result = result
        rec.updated_at = time.time()

    def mark_failed(self, job_id: str, error: str) -> None:
        rec = self._store[job_id]
        rec.status = "FAILED"
        rec.result = {"error": error}
        rec.updated_at = time.time()


def llm_enrich(payload: Dict[str, Any]) -> Dict[str, Any]:
    """
    Replace this with your real LLM/API call.
    """
    time.sleep(0.2)
    return {
        "normalized_title": payload.get("title", "").strip().title(),
        "category": payload.get("category", "unknown"),
    }


def process_job_idempotent(job_store: InMemoryJobStore, job_id: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    # 1) If already completed, return cached result
    existing = job_store.get(job_id)
    if existing and existing.status == "COMPLETED":
        return existing.result or {}

    # 2) Ensure the job record exists (idempotent create)
    job_store.put_if_absent(JobRecord(job_id=job_id, status="PENDING", updated_at=time.time()))

    # 3) Acquire the right to process (in real systems: conditional update)
    if not job_store.try_mark_in_progress(job_id):
        # Someone else is processing, or it already completed
        existing = job_store.get(job_id)
        if existing and existing.status == "COMPLETED":
            return existing.result or {}
        return {"status": "IN_PROGRESS", "job_id": job_id}

    # 4) Run inference once
    try:
        result = llm_enrich(payload)
        job_store.mark_completed(job_id, result)
        return result
    except Exception as e:
        job_store.mark_failed(job_id, str(e))
        raise

  

This code achieves:

If the same job is sent twice, it doesn’t automatically trigger two AI calls.
If a retry happens after completion, it returns the stored output.
If a job is already in progress, it avoids duplicate work and returns “in progress.”
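One gap is worth noting. If a worker crashes after claiming a job, the record stays IN_PROGRESS and later retries keep receiving “in progress.” A common remedy is a lease: an IN_PROGRESS claim that has not been updated for some time is treated as abandoned and may be reclaimed. The sketch below is an assumption layered on top of the example, not part of it; `LEASE_SECONDS` and `can_claim` are hypothetical names.

```python
import time
from typing import Optional

# Hypothetical value; it should comfortably exceed the longest normal inference time.
LEASE_SECONDS = 300.0


def can_claim(status: str, updated_at: float, now: Optional[float] = None) -> bool:
    """Decide whether a worker may claim (or reclaim) a job."""
    now = time.time() if now is None else now
    if status == "COMPLETED":
        return False  # never redo finished work
    if status == "IN_PROGRESS":
        # Reclaim only if the previous owner's lease has expired.
        return now - updated_at > LEASE_SECONDS
    return True  # PENDING or FAILED


print(can_claim("IN_PROGRESS", updated_at=1000.0, now=1100.0))  # False: lease still valid
print(can_claim("IN_PROGRESS", updated_at=1000.0, now=1400.0))  # True: lease expired
```

The trade-off is that a lease that is too short lets a slow but healthy worker be preempted, which reintroduces duplicate inference. A lease that is too long delays recovery after a crash.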

Comparison With vs. Without Idempotency

Without idempotency	With idempotency
A retry can trigger a full inference run again	Retries resolve to state checks instead of inference calls
AI calls scale with delivery attempts, not logical jobs	AI calls map one-to-one with logical jobs
The same job may produce multiple outputs	Each job produces a single authoritative output
Logs contain repeated executions for the same job ID	Logs reflect a clear job lifecycle
Cost and behavior vary based on timing and failures	Cost and behavior remain predictable under retries

A common mistake developers make is to only check for COMPLETED. That helps, but it doesn’t stop two workers from both starting the job at the same time.

The system needs two safeguards:

Result caching (return if completed)
A lock/claim mechanism (only one worker can execute)

If the second safeguard is skipped, the system will still run duplicate inference during concurrency bursts.
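In a real database, the claim is usually a single conditional update, so that the check and the state change happen atomically. The sketch below shows one way to do this with Python's standard `sqlite3` module. The simplified `jobs` table and the `try_claim` helper are assumptions for illustration; other stores offer the same idea under different names, such as conditional writes.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "attempts INTEGER NOT NULL DEFAULT 0, updated_at REAL NOT NULL)"
)
conn.execute("INSERT INTO jobs VALUES ('job-1', 'PENDING', 0, ?)", (time.time(),))
conn.commit()


def try_claim(job_id: str) -> bool:
    """Move PENDING/FAILED to IN_PROGRESS in one statement; True only for the winner."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'IN_PROGRESS', attempts = attempts + 1, updated_at = ? "
        "WHERE job_id = ? AND status IN ('PENDING', 'FAILED')",
        (time.time(), job_id),
    )
    conn.commit()
    # The WHERE clause is the lock: if another worker already changed the
    # status, no row matches and rowcount is 0.
    return cur.rowcount == 1


winner = try_claim("job-1")  # first worker claims the job
loser = try_claim("job-1")   # second worker is refused
print(winner, loser)  # True False
```

Because the status check lives inside the UPDATE itself, there is no window between reading the status and writing it in which a second worker could slip through.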

Closing Thoughts

AI systems are expensive in ways traditional systems aren’t. When duplicates happen in a normal tool, we might waste compute. When duplicates happen in an AI tool, we often waste real money.

Idempotency won’t make the model smarter, but it will make the system survivable, cost-effective, and scalable.
