If You Can Survive a Toddler, You Can Ship LLMs in Production

A silent provider update once invalidated months of LLM scores in a pipeline I owned. Here is what I changed after, and how parenting taught me the same lesson twice.

By Scarlett Attensil · Jun. 11, 26 · Opinion

A few years back, I was running a time-series pipeline that scored incoming product reviews on a 1-10 scale. The scorer was an LLM. Reviews rolled in continuously, and ratings flowed into a dashboard the product team checked every Monday morning. Everything ran clean for months. Then one Monday, the chart had a step in it.

Reviews from the prior week averaged 6.4. The current week averaged 7.6. Same product. Same customers. The reviews themselves, when I went back to read them, looked indistinguishable from what we had been getting all year.

The model had changed. The provider had pushed a quiet update to the weights, and the LLM that gave us 6.4-equivalent scores last week was now giving 7.6-equivalent scores for the same content. Every historical comparison in that dashboard was silently invalid. The cleanup took a week. The harder conversation was about how much of our reporting had been real in the first place.

That kind of failure is the default behavior of LLMs in production. Trying to engineer it away with tighter parameters or pinned versions is a losing fight. The job is to design for it. I learned the lesson twice. Once from the reviews pipeline. Once from raising two kids.

What Parents of Small Children Already Know About Non-Determinism

If you have lived through the toddler years, you have run this experiment a few hundred times without calling it one. The lunch you packed all last week, the one that came home empty every day, suddenly gets pushed off the table on Tuesday with full commitment. The bedtime story that worked for six straight nights stops working on the seventh. The nap routine the babysitter swore was solid breaks the moment you start calling it a "rule."

Experienced parents eventually stop trying to force determinism on the kid. Patterns and trends still matter. But you stop expecting any individual input to produce any individual output, and you build a system that absorbs the variance instead of fighting it. This is the same shift production AI engineers make, usually after their first calibration regression.

The LLM-as-Judge Can Drift, Too

The reviews pipeline taught me that the judge can be the most fragile thing in the system. The model being evaluated can drift. The model doing the evaluation can drift too. Without something stable to anchor against, you cannot tell which one moved.

The pattern that works is a small held-out set of inputs with known, human-validated scores, and the habit of re-running it on a regular cadence. Call it the calibration set. 20 to 50 examples is plenty. When a chart moves, you re-score the calibration set before anything else. If its average jumps from 6.4 to 7.6 with no other changes, you know the judge moved, not the data. Without that anchor, the same diagnosis takes weeks of reading individual reviews and arguing about what changed.
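A minimal sketch of that re-scoring check, under a few assumptions of mine: the calibration set is a list of (text, human score) pairs, the judge is any callable that returns a number, and a mean shift of more than 0.5 points counts as drift. The function name, the threshold, and the three sample reviews are made up for illustration.

```python
from statistics import mean
from typing import Callable, Sequence


def judge_drift(
    calibration: Sequence[tuple[str, float]],
    judge: Callable[[str], float],
    tolerance: float = 0.5,  # assumed threshold, tune for your scale
) -> tuple[float, bool]:
    """Return (mean shift, drifted?) for a judge on a human-scored set."""
    baseline = mean(score for _, score in calibration)
    current = mean(judge(text) for text, _ in calibration)
    shift = current - baseline
    return shift, abs(shift) > tolerance


# Illustrative data only: three reviews with human-validated scores.
calibration_set = [
    ("Arrived broken, refund took weeks.", 2.0),
    ("Does the job, nothing special.", 6.0),
    ("Best purchase this year.", 9.0),
]
human_scores = dict(calibration_set)


def shifted_judge(text: str) -> float:
    """Stand-in for a judge that now scores everything 1.2 points higher."""
    return human_scores[text] + 1.2


shift, drifted = judge_drift(calibration_set, shifted_judge)
```

Because the inputs and the human scores are fixed, any shift that shows up here belongs to the judge and not to the incoming reviews.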

This is where offline evaluations in AgentControl earn their keep. You upload your calibration set as a dataset, point a judge at it, and re-run on a cadence or before any variation change. The discipline I had to learn the hard way (keep the judge anchored, keep its inputs comparable, watch the distribution rather than any single response) becomes a property of the configuration instead of a script someone has to remember to run.

The parenting version is the pencil marks on a doorframe. The doorframe does not move. Every few months, you put the kid against it, shoes off and back to the wall. If the line jumps three inches and you realize the kid is wearing sneakers, you take the shoes off and measure again before believing any of it. The doorframe is your held-out set. The shoes-off rule is the discipline that keeps re-runs comparable.

Temperature Zero Is a Comfort Blanket

Setting the temperature to zero feels like it should make a model deterministic. Within a single model version, it mostly does. The catch is that determinism within a version buys you nothing across versions.

My reviews pipeline had been running at temperature zero the whole time. The provider's swap underneath did not care. The judge changed, and greedy sampling kept producing the new, shifted scores with the same false confidence as before. Temperature zero compresses the variance you can see during testing, which makes you feel safer. It does nothing about the variance that actually breaks production. Design as if the model can produce a different valid output every time, because eventually it will.
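One way to act on that is to score the same input several times and look at the spread instead of a single number. The sketch below assumes the judge is a plain callable. The function name, the run count, and the stub judge are hypothetical.

```python
from itertools import cycle
from statistics import mean, pstdev
from typing import Callable


def score_distribution(
    text: str, judge: Callable[[str], float], runs: int = 5
) -> dict[str, float]:
    """Score one input several times and summarize the results."""
    scores = [judge(text) for _ in range(runs)]
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # population standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Stub judge that cycles through a fixed set of scores, for illustration only.
_canned = cycle([6.0, 7.0, 8.0, 7.0, 7.0])


def stub_judge(text: str) -> float:
    return next(_canned)


summary = score_distribution("Does the job, nothing special.", stub_judge)
```

A judge that really is stable shows a spread near zero, and tracking that number over time makes a change visible.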

Build the Fallback Before the Happy Path

The last move gets cut for time most often, which is why it matters. Before you ship the model that does the new thing well, ship the path that runs when the model does the new thing badly, slowly, or not at all.

For an LLM endpoint that usually looks like a cached response for known-bad inputs, a secondary model behind the primary that can take traffic when the primary fails an inline check, a circuit breaker that routes around the model entirely if error rates cross a threshold, and a logging path that captures failure cases instead of returning a stack trace to the user. None of this is exotic. All of it assumes the model will misbehave at some point and defines what good behavior looks like when it does.
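Here is a rough sketch of three of those pieces together: the inline check, the secondary model, and the circuit breaker, with failures logged instead of raised. It assumes both models are plain callables returning a score on the 1-10 scale. The class and function names, the window size, and the error-rate threshold are my own placeholders. It leaves out the response cache and any recovery logic, so once the breaker opens it stays open until you replace it.

```python
import logging
from collections import deque
from typing import Callable, Deque, Tuple

logger = logging.getLogger("llm.fallback")


class CircuitBreaker:
    """Opens once the failure rate over a full window exceeds a threshold."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5) -> None:
        self.window = window
        self.max_error_rate = max_error_rate
        self.failures: Deque[bool] = deque(maxlen=window)  # True = failed call

    def record(self, failed: bool) -> None:
        self.failures.append(failed)

    @property
    def is_open(self) -> bool:
        if len(self.failures) < self.window:
            return False  # not enough history to judge yet
        return sum(self.failures) / self.window > self.max_error_rate


def score_with_fallback(
    text: str,
    primary: Callable[[str], float],
    secondary: Callable[[str], float],
    breaker: CircuitBreaker,
) -> Tuple[float, str]:
    """Return (score, source), preferring the primary while the breaker is closed."""
    if not breaker.is_open:
        try:
            score = primary(text)
            if 1 <= score <= 10:  # inline check against the 1-10 scale
                breaker.record(False)
                return score, "primary"
            logger.warning("primary returned out-of-range score %r", score)
        except Exception:
            # Capture the failure case instead of surfacing a stack trace.
            logger.exception("primary failed for input of length %d", len(text))
        breaker.record(True)
    return secondary(text), "secondary"
```

Returning the source alongside the score lets the dashboard separate primary and secondary results instead of blending them into one average.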

The stronger version of the same idea is making the fallback adaptive instead of static. A static fallback still needs a human to notice something is wrong and pull the lever. An adaptive system watches the production signal itself and switches over without anyone in the loop. This is what configuration-driven LLM tooling is built for. With AgentControl by LaunchDarkly, model variations live as configuration rather than code, traffic shifts between them without a deploy, and a guarded rollout can tie an online evaluation score, or any AgentControl metric you care about, directly to whether a variation advances, pauses, or reverts. When the judge sees scores regress past a threshold, the rollout reverses itself. The fallback stops being a piece of code someone wrote in case of trouble. It becomes the architecture, watching itself.
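The decision at the heart of a guarded rollout can be sketched in a few lines. This is a generic illustration, not AgentControl's API. The function name, the minimum sample count, and the regression threshold are all assumptions of mine.

```python
from statistics import mean
from typing import Sequence


def rollout_decision(
    baseline: float,
    recent_scores: Sequence[float],
    min_samples: int = 30,        # assumed: wait for enough evidence
    max_regression: float = 0.5,  # assumed: tolerated drop in mean score
) -> str:
    """Return 'advance', 'pause', or 'revert' for a new model variation."""
    if len(recent_scores) < min_samples:
        return "pause"  # too few scores to trust either direction
    delta = mean(recent_scores) - baseline
    if delta < -max_regression:
        return "revert"  # scores regressed past the threshold
    return "advance"
```

The decision reads the mean over a window of scores, so one bad response cannot trigger a revert on its own.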

Parents already operate this way. The grocery store meltdown will happen. The school will call at 11 a.m. about a "low-grade fever." You have a snack pre-stuffed in your bag and a backup babysitter on text. The fallback is the architecture. The happy path is the bonus.

What Changed for Me

After the reviews pipeline incident, my work changed. I started logging the model version returned in every API response. I built a calibration set. I stopped trusting any single eval run as a verdict. The boring fallback path now ships before the impressive demo path.
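Logging the model version can be as small as the sketch below. It assumes the response is a mapping with a "model" key that holds the provider's reported identifier. The field name varies by provider, and the class name is hypothetical. A provider that changes weights without changing the identifier will slip past this, which is why the calibration set still matters.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger("llm.version")


class ModelVersionTracker:
    """Logs the model id reported by each response and flags changes."""

    def __init__(self) -> None:
        self.last_seen: Optional[str] = None

    def observe(self, response: Mapping[str, Any]) -> bool:
        """Return True when the reported model differs from the previous one."""
        reported = str(response.get("model", "unknown"))  # assumed field name
        changed = self.last_seen is not None and reported != self.last_seen
        if changed:
            logger.warning("model changed: %s -> %s", self.last_seen, reported)
        else:
            logger.info("model: %s", reported)
        self.last_seen = reported
        return changed
```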

None of this is harder than the alternative. It is mostly a matter of accepting at the architecture stage what every parent of a small kid already knows. The interesting unit of measurement is the distribution, not the sample.

Published at DZone with permission of Scarlett Attensil. See the original article here.

Opinions expressed by DZone contributors are their own.
