Idempotency in AI Tools: The Most Expensive Thing Teams Forget
An analysis of how retries cause duplicate inference in AI tools and how idempotency keeps production systems predictable and cost-controlled.
When AI tools move from a test environment to real-world use, the first “surprise” a developer encounters is rarely about accuracy. It’s usually something more troublesome: the system behaves inconsistently, costs climb faster than expected, and the same job seems to run multiple times.
That’s not an AI problem. That’s a distributed systems problem. And in AI systems, this particular failure is especially costly because every duplicate run has a direct dollar cost. Idempotency is the fix. Not the only fix, but often the most impactful one.
What Does Idempotency Mean?
Idempotency means that processing the same request or job more than once produces the same end result as processing it exactly once. This matters because duplicate processing is normal in a production environment. The most common causes are:
- Network timeouts
- User/client retries
- Queue failure and redelivery
- Mid-flight worker failures
- Timeouts due to long-running tasks
- Race conditions between parallel users
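To make the definition concrete, here is a minimal sketch that contrasts a non-idempotent handler with an idempotent one. The function names, the `tokens_billed` counter, and the in-memory set of request IDs are illustrative assumptions, not part of any real billing API.

```python
from typing import Dict, Set

# Hypothetical counters: how many tokens each style of handler ends up billing.
tokens_billed: Dict[str, int] = {"naive": 0, "idempotent": 0}
seen_request_ids: Set[str] = set()


def run_naive(request_id: str, tokens: int) -> None:
    """Non-idempotent: every delivery of the request is billed again."""
    tokens_billed["naive"] += tokens


def run_idempotent(request_id: str, tokens: int) -> None:
    """Idempotent: a request ID that was already handled is a no-op."""
    if request_id in seen_request_ids:
        return
    seen_request_ids.add(request_id)
    tokens_billed["idempotent"] += tokens


# The same logical request ("req-1") is delivered three times.
for _ in range(3):
    run_naive("req-1", tokens=500)
    run_idempotent("req-1", tokens=500)
```

After three deliveries, the naive handler has billed the request three times while the idempotent handler has billed it once.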
Impact of Non-Idempotency
In many non-AI systems, duplicate processing is a nuisance but rarely catastrophic. In AI tools, it can be far more damaging because every duplicate directly erodes profitability, and the waste is often silent.
A duplicate inference call costs:
- Extra tokens
- Extra latency
- Extra downstream load
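A rough way to reason about the token cost is to multiply by delivery attempts rather than logical jobs. The sketch below does that arithmetic; every number in it (job count, tokens per call, price per 1,000 tokens, retry volume) is a made-up illustrative assumption, not a measurement or a real price.

```python
def inference_cost(calls: int, tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Cost in USD of a number of inference calls at a flat per-token price."""
    return calls * tokens_per_call / 1000 * usd_per_1k_tokens


# Hypothetical workload: 10,000 logical jobs, 2,000 tokens each, $0.01 per 1K tokens.
LOGICAL_JOBS = 10_000
TOKENS_PER_CALL = 2_000
USD_PER_1K_TOKENS = 0.01

# Assume retries and redeliveries push total deliveries to 12,000.
DELIVERY_ATTEMPTS = 12_000

baseline = inference_cost(LOGICAL_JOBS, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
with_duplicates = inference_cost(DELIVERY_ATTEMPTS, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
wasted = with_duplicates - baseline
```

Under these assumed numbers the duplicates add about 20% to the bill without producing any additional useful output.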
A Day-to-Day Example to Explain the Impact
Think of a warehouse fulfilling orders.
- A customer places an order.
- The system prints a packing slip.
- The packer ships the item.
Now imagine the print request times out and the system retries. Two slips are printed, and two items get shipped. Nothing “crashed,” but the cost just doubled and a chain of follow-on problems begins.
AI tools behave the same way when a job is retried and inference runs twice.
Reason for Duplicates
1. Client or API Retries
Sometimes a client sends a "start enrichment" request to an LLM-backed service and waits for a reply. If it doesn't hear back in time, it tries again, even though the first request may still be running or may already have finished. The result is multiple requests for the same work.
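One common client-side mitigation is to generate a single idempotency key per logical request and reuse it on every retry, so the server can recognize repeats. The sketch below assumes a hypothetical `send` callable and an in-process `fake_server` that deduplicates on the key; neither is a real SDK.

```python
import uuid
from typing import Callable, Dict


def call_with_retries(send: Callable[[str, Dict], Dict], payload: Dict,
                      max_attempts: int = 3) -> Dict:
    """Retry a request while reusing ONE idempotency key for every attempt."""
    idempotency_key = str(uuid.uuid4())  # generated once, outside the retry loop
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(idempotency_key, payload)
        except TimeoutError as exc:
            last_error = exc  # the reply was lost; retry with the same key
    raise RuntimeError("all attempts timed out") from last_error


# Hypothetical in-process "server" that deduplicates on the key.
results_by_key: Dict[str, Dict] = {}
inference_calls = 0
deliveries = 0


def fake_server(key: str, payload: Dict) -> Dict:
    global inference_calls, deliveries
    deliveries += 1
    if key not in results_by_key:
        inference_calls += 1  # stand-in for the expensive LLM call
        results_by_key[key] = {"title": payload["title"].title()}
    if deliveries == 1:
        # Simulate the reply being lost after the work was already done.
        raise TimeoutError("reply lost")
    return results_by_key[key]


response = call_with_retries(fake_server, {"title": "red shoes"})
```

If the key were generated inside the loop instead, each retry would look like a brand-new request and the server would have no way to deduplicate it.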
2. Queue Redelivery
Message queues like SQS are designed for reliability: standard queues guarantee at-least-once delivery, not exactly-once delivery. If a consumer fails to acknowledge a message in time, that same message can be delivered and processed again.
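A consumer can defend against redelivery by remembering which message IDs it has already handled. The following is a simplified simulation, not real SQS client code: the `message_id` field, the in-memory `processed_ids` set, and the delivery list are assumptions for illustration. A real consumer would need a durable, shared store instead of a process-local set.

```python
from typing import Dict, List, Set

processed_ids: Set[str] = set()   # stand-in for a durable deduplication store
inference_calls: List[str] = []   # records which messages triggered "inference"


def handle_message(message: Dict[str, str]) -> str:
    """Process a message unless its ID has been handled before."""
    message_id = message["message_id"]
    if message_id in processed_ids:
        return "skipped"
    inference_calls.append(message_id)  # stand-in for the LLM call
    processed_ids.add(message_id)
    return "processed"


# Simulated at-least-once delivery: message m-1 arrives twice.
deliveries = [
    {"message_id": "m-1", "body": "enrich product 1"},
    {"message_id": "m-2", "body": "enrich product 2"},
    {"message_id": "m-1", "body": "enrich product 1"},  # redelivery
]
outcomes = [handle_message(m) for m in deliveries]
```

Note that this sketch records the ID only after the work is done, so it does not by itself cover a crash in the middle of processing or two consumers receiving the same message at the same moment.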
3. Worker Fails Mid-Task
If a worker crashes partway through a job, the system may retry the job without knowing how far the failed attempt got, including whether inference had already run.
Basic Implementation and Failure
Here’s a pattern that many early AI services adopt:
```python
def process_job(job_id: str, payload: dict) -> dict:
    # 1) Make a call to the LLM
    result = llm_enrich(payload)
    # 2) Save the LLM output
    db.save_result(job_id, result)
    return result
```
This works perfectly until something goes wrong between step 1 and step 2. If the worker crashes after inference but before saving, the system will retry the job and call the LLM again.
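The failure window can be reproduced with a small simulation. Everything here is a stand-in: `fake_llm_enrich` counts calls instead of contacting a model, `save_result` is rigged to fail on its first call to mimic a crash before the write lands, and the `for` loop plays the role of the queue retrying the job.

```python
from typing import Dict

llm_calls = 0
saved_results: Dict[str, Dict] = {}
crash_on_first_save = True


def fake_llm_enrich(payload: Dict) -> Dict:
    """Stand-in for the paid LLM call; counts how often it runs."""
    global llm_calls
    llm_calls += 1
    return {"normalized_title": payload["title"].strip().title()}


def save_result(job_id: str, result: Dict) -> None:
    """Stand-in for the database write; fails once to mimic a crash."""
    global crash_on_first_save
    if crash_on_first_save:
        crash_on_first_save = False
        raise ConnectionError("worker died before the write landed")
    saved_results[job_id] = result


def process_job_naive(job_id: str, payload: Dict) -> Dict:
    result = fake_llm_enrich(payload)   # step 1: inference
    save_result(job_id, result)         # step 2: persistence
    return result


# The retry loop stands in for the queue redelivering the job after the failure.
for attempt in range(2):
    try:
        process_job_naive("job-1", {"title": " red shoes "})
        break
    except ConnectionError:
        continue
```

One logical job ends with one saved result but two inference calls, and nothing in the saved data reveals that the second call ever happened.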
Improved Approach
One way to make AI jobs idempotent is:
- Use a stable idempotency key (job_id).
- Store state before inference.
- If the job is already completed, return the stored result.
- Ensure only one worker “owns” the right to run inference at a time.
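When the caller does not supply a job_id, one possible way to obtain a stable key is to derive it from the request content. The sketch below hashes a canonical JSON form of the payload; the `derive_job_id` helper and the `operation` prefix are hypothetical names, and content-derived keys are only appropriate when identical payloads really should share one result.

```python
import hashlib
import json
from typing import Any, Dict


def derive_job_id(operation: str, payload: Dict[str, Any]) -> str:
    """Build a deterministic key from the operation name and payload content."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{operation}:{canonical}".encode("utf-8")).hexdigest()
    return f"{operation}-{digest[:16]}"


key_a = derive_job_id("enrich", {"title": "red shoes", "category": "footwear"})
key_b = derive_job_id("enrich", {"category": "footwear", "title": "red shoes"})
key_c = derive_job_id("enrich", {"title": "blue shoes", "category": "footwear"})
```

Here `key_a` and `key_b` are equal because only the key order differs, while `key_c` differs because the content changed.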
Below is a Python example that demonstrates the pattern.
Data Model
A table keyed by job_id:
- status: PENDING | IN_PROGRESS | COMPLETED | FAILED
- result: optional
- attempts: count
- updated_at: timestamp
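As one possible concrete form of this table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The column types, the `CHECK` constraint, and the `create_job_if_absent` helper are assumptions chosen for illustration; a production schema would depend on the actual storage engine.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        job_id     TEXT PRIMARY KEY,
        status     TEXT NOT NULL CHECK (status IN
                       ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED')),
        result     TEXT,                       -- optional, e.g. serialized JSON
        attempts   INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL
    )
    """
)


def create_job_if_absent(job_id: str) -> bool:
    """Insert a PENDING row; return False if the job already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (job_id, status, updated_at) "
        "VALUES (?, 'PENDING', ?)",
        (job_id, time.time()),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the insert was ignored


first = create_job_if_absent("job-1")
second = create_job_if_absent("job-1")  # duplicate create is a no-op
row_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
```

The primary key on job_id is what makes the create step idempotent: a second insert for the same job leaves the existing row untouched.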
Code: Idempotent job execution
```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class JobRecord:
    job_id: str
    status: str  # PENDING | IN_PROGRESS | COMPLETED | FAILED
    result: Optional[Dict[str, Any]] = None
    attempts: int = 0
    updated_at: float = 0.0


class InMemoryJobStore:
    """
    This is a stand-in for DynamoDB/Postgres/etc.
    The focus is on the logic, not the storage engine.
    """

    def __init__(self):
        self._store: Dict[str, JobRecord] = {}

    def get(self, job_id: str) -> Optional[JobRecord]:
        return self._store.get(job_id)

    def put_if_absent(self, job: JobRecord) -> bool:
        if job.job_id in self._store:
            return False
        self._store[job.job_id] = job
        return True

    def try_mark_in_progress(self, job_id: str) -> bool:
        """
        In real systems this should be atomic (conditional update).
        Here we simulate it.
        """
        rec = self._store.get(job_id)
        if rec is None:
            return False
        if rec.status in ("IN_PROGRESS", "COMPLETED"):
            return False
        rec.status = "IN_PROGRESS"
        rec.attempts += 1
        rec.updated_at = time.time()
        return True

    def mark_completed(self, job_id: str, result: Dict[str, Any]) -> None:
        rec = self._store[job_id]
        rec.status = "COMPLETED"
        rec.result = result
        rec.updated_at = time.time()

    def mark_failed(self, job_id: str, error: str) -> None:
        rec = self._store[job_id]
        rec.status = "FAILED"
        rec.result = {"error": error}
        rec.updated_at = time.time()


def llm_enrich(payload: Dict[str, Any]) -> Dict[str, Any]:
    """
    Replace this with your real LLM/API call.
    """
    time.sleep(0.2)
    return {
        "normalized_title": payload.get("title", "").strip().title(),
        "category": payload.get("category", "unknown"),
    }


def process_job_idempotent(
    job_store: InMemoryJobStore, job_id: str, payload: Dict[str, Any]
) -> Dict[str, Any]:
    # 1) If already completed, return cached result
    existing = job_store.get(job_id)
    if existing and existing.status == "COMPLETED":
        return existing.result or {}

    # 2) Ensure the job record exists (idempotent create)
    job_store.put_if_absent(
        JobRecord(job_id=job_id, status="PENDING", updated_at=time.time())
    )

    # 3) Acquire the right to process (in real systems: conditional update)
    if not job_store.try_mark_in_progress(job_id):
        # Someone else is processing, or it already completed
        existing = job_store.get(job_id)
        if existing and existing.status == "COMPLETED":
            return existing.result or {}
        return {"status": "IN_PROGRESS", "job_id": job_id}

    # 4) Run inference once
    try:
        result = llm_enrich(payload)
        job_store.mark_completed(job_id, result)
        return result
    except Exception as e:
        job_store.mark_failed(job_id, str(e))
        raise
```
This code achieves the following:
- If the same job is sent twice, it doesn’t automatically trigger two AI calls.
- If a retry happens after completion, it returns the stored output.
- If a job is already in progress, it avoids duplicate work and returns “in progress.”
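One gap worth noting: if a worker crashes while a job is IN_PROGRESS, the record can stay in that state indefinitely and every retry will be told the job is still running. A possible extension, not part of the pattern described above, is a time-limited lease. In the sketch below, `LEASE_SECONDS`, the `lease_expires_at` field, and the `try_claim` helper are all hypothetical, and the check is not atomic as written; a real store would need a conditional update.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

LEASE_SECONDS = 300.0  # hypothetical lease length; tune to expected job duration


@dataclass
class Job:
    status: str = "PENDING"
    lease_expires_at: float = 0.0


jobs: Dict[str, Job] = {}


def try_claim(job_id: str, now: Optional[float] = None) -> bool:
    """Claim a job unless it is completed or held under a live lease."""
    now = time.time() if now is None else now
    job = jobs.setdefault(job_id, Job())
    if job.status == "COMPLETED":
        return False
    if job.status == "IN_PROGRESS" and now < job.lease_expires_at:
        return False  # another worker still holds a valid lease
    job.status = "IN_PROGRESS"
    job.lease_expires_at = now + LEASE_SECONDS
    return True


# Explicit timestamps make the scenario deterministic.
first_claim = try_claim("job-1", now=1_000.0)   # worker A claims, then crashes
second_claim = try_claim("job-1", now=1_100.0)  # lease still live: rejected
takeover = try_claim("job-1", now=1_400.0)      # lease expired: worker B takes over
```

The trade-off is that a lease that is too short can let a second worker start while a slow first worker is still running, so the lease length should comfortably exceed the expected inference time.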
Comparison With vs. Without Idempotency
| Without idempotency | With idempotency |
|---|---|
| A retry can trigger a full inference run again | Retries resolve to state checks instead of inference calls |
| AI calls scale with delivery attempts, not logical jobs | AI calls map one-to-one with logical jobs |
| The same job may produce multiple outputs | Each job produces a single authoritative output |
| Logs contain repeated executions for the same job ID | Logs reflect a clear job lifecycle |
| Cost and behavior vary based on timing and failures | Cost and behavior remain predictable under retries |
A common mistake developers make is to check only for COMPLETED. That helps, but it doesn’t stop two workers from both starting the job at the same time.
The system needs two safeguards:
- Result caching (return if completed)
- A lock/claim mechanism (only one worker can execute)
If the second safeguard is skipped, the system will still run duplicate inference during concurrency bursts.
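In a SQL-backed store, the claim safeguard is often expressed as a conditional update whose affected-row count tells the worker whether it won. The following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the table layout and the `try_claim` helper are assumptions for illustration, and other databases offer their own conditional-write mechanisms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "job_id TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO jobs (job_id, status) VALUES ('job-1', 'PENDING')")
conn.commit()


def try_claim(job_id: str) -> bool:
    """Move the job to IN_PROGRESS only if it is currently claimable."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'IN_PROGRESS', attempts = attempts + 1 "
        "WHERE job_id = ? AND status IN ('PENDING', 'FAILED')",
        (job_id,),
    )
    conn.commit()
    # The check and the write happen in one statement, so exactly one
    # caller sees rowcount == 1 for a given claimable job.
    return cur.rowcount == 1


worker_a = try_claim("job-1")  # wins the claim
worker_b = try_claim("job-1")  # status is no longer claimable
attempts = conn.execute(
    "SELECT attempts FROM jobs WHERE job_id = 'job-1'"
).fetchone()[0]
```

Because the status check lives in the `WHERE` clause rather than in a separate read, there is no gap between checking and claiming for a second worker to slip through.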
Closing Thoughts
AI systems are expensive in ways traditional systems aren’t. When duplicates happen in a normal tool, we might waste compute. When duplicates happen in an AI tool, we often waste real money.
Idempotency won’t make the model smarter, but it will make the system survivable, cost-effective, and scalable.