Stop Trusting Your RAG Pipeline: 5 Guardrails I Learned the Hard Way

RAG alone doesn’t stop hallucinations. I use five guardrails: relevance scoring, forced citations, NLI checks, staleness detection, and confidence scoring.

Mayur Vekariya

Mar. 20, 26 · Analysis

A few months back, one of our internal QA engineers asked the AI assistant a straightforward question about overtime pay calculations for a specific state. The system retrieved the right document, generated a confident answer, and the answer was wrong. Not slightly wrong. It cited a tax withholding table that had been updated two quarters earlier, but our vector store was still serving the old version. Nobody noticed for three days.

That incident changed how I think about retrieval-augmented generation (RAG) systems. I’d been building RAG pipelines for enterprise applications for a while at that point, and I thought retrieval grounding was enough. It’s not. RAG reduces hallucinations, sure. But “reduces” is doing a lot of heavy lifting in that sentence when you’re processing payroll for millions of people.

What follows are five guardrail patterns I now treat as non-negotiable in any production RAG deployment. These aren’t theoretical. Every single one exists because something went wrong without it.

Your Top-K Results Are Lying to You

Here’s something nobody talks about in RAG tutorials: vector similarity search returns the closest documents, not the correct documents. If your knowledge base doesn’t contain the answer, the vector DB will still happily return its best guesses. And those guesses get fed straight into the LLM prompt, where they become the foundation for a hallucinated answer.

The fix is embarrassingly simple. Re-score every retrieved chunk against the original query with a cross-encoder or even just cosine similarity, and drop anything below a threshold.

Python

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def filter_relevant_chunks(query, chunks, threshold=0.72):
    """Keep chunks whose cosine similarity to the query meets the threshold."""
    query_emb = model.encode(query)
    kept = []

    for chunk in chunks:
        chunk_emb = model.encode(chunk["text"])
        # Cosine similarity between the query and chunk embeddings
        sim = np.dot(query_emb, chunk_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
        )
        if sim >= threshold:
            kept.append({**chunk, "relevance_score": float(sim)})

    # Highest-scoring chunks first
    kept.sort(key=lambda x: x["relevance_score"], reverse=True)
    return kept

That 0.72 threshold isn’t magic. I arrived at it by running ~500 labeled query-document pairs through the scorer and finding the cutoff where precision hit 0.90. Your number will be different. The point is: measure it, don’t guess.
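A minimal sketch of that measurement, assuming you already have similarity scores paired with human relevance labels. The function name `pick_threshold` and the toy data are illustrative assumptions, not the scorer or the labeled set described above.

```python
def pick_threshold(scored_pairs, target_precision=0.90):
    """Return the lowest score cutoff whose precision meets the target.

    scored_pairs: list of (similarity_score, is_relevant) tuples from a
    hand-labeled evaluation set. Returns None if no cutoff qualifies.
    """
    candidates = sorted({score for score, _ in scored_pairs})
    for cutoff in candidates:
        # Labels of every pair that would survive this cutoff
        kept = [relevant for score, relevant in scored_pairs if score >= cutoff]
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return cutoff
    return None


# Toy labeled data (made up): (score, did a human judge the chunk relevant?)
labeled = [
    (0.91, True), (0.88, True), (0.80, True), (0.75, True),
    (0.70, False), (0.66, True), (0.60, False), (0.52, False),
]
threshold = pick_threshold(labeled)  # 0.75 for this toy set
```

Taking the lowest qualifying cutoff keeps as many relevant chunks as possible while still meeting the precision target. Re-run the sweep whenever the embedding model or the document mix changes.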

The important part is what happens when nothing passes the threshold:

Python

def build_rag_response(query, chunks):
    relevant = filter_relevant_chunks(query, chunks)

    if not relevant:
        return {
            "answer": None,
            "status": "insufficient_context",
            "message": "I couldn't find relevant information for this question."
        }

    return generate_with_context(query, relevant)
  

I know returning “I don’t know” feels like a failure. It’s not. It’s the system working correctly. The alternative is letting the LLM freestyle, and that’s how you end up explaining to a compliance team why the AI told someone the wrong withholding rate.

Force the LLM to Show Its Work

Prompt engineering gets a bad rap because people treat it like incantations. But one pattern I’ve found genuinely effective is forcing citation. Tell the model it must cite [Doc X] for every factual claim, and that if it can’t cite something, it should say so.
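For the citation rule to be checkable, each retrieved chunk needs a stable `[Doc N]` label before it goes into the prompt. A rough sketch of that step follows. The helper name `format_context`, the shortened template, and the sample chunk text are assumptions made for illustration.

```python
def format_context(chunks):
    """Number each chunk as [Doc N] so citations can be validated later."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[Doc {i}] {chunk['text'].strip()}")
    return "\n\n".join(lines)


# Shortened stand-in for a full system prompt with {context} and {query} slots
PROMPT_TEMPLATE = "Context Documents:\n{context}\n\nQuestion: {query}\n"

# Made-up chunk text, used only to show the numbering
chunks = [
    {"text": "Overtime is paid at 1.5x the regular rate."},
    {"text": "Withholding tables are updated each quarter. "},
]
prompt = PROMPT_TEMPLATE.format(
    context=format_context(chunks),
    query="How is overtime paid?",
)
```

Numbering from 1 in retrieval order means the count of chunks passed in is also the highest valid citation number, which keeps the later range check trivial.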

Python

SYSTEM_PROMPT = """Answer using ONLY the provided context documents.

Rules:
1. Every factual claim MUST cite its source as [Doc X].
2. If the context doesn't contain the answer, say so. Don't guess.
3. If documents contradict each other, state both with their citations.
4. Do not add information beyond what's in the documents.

Context Documents:
{context}

Question: {query}
"""
  

But here’s what I learned after shipping this: the LLM will sometimes fabricate citations. It’ll write [Doc 7] when you only gave it 4 documents. Or it’ll cite the right document but misstate what the document says. The prompt gets you maybe 80% of the way there. You need a validator to catch the rest.

Python

import re

# Sentences containing these phrases are treated as framing, not factual claims
SKIP_PHRASES = ["i don't have", "based on the", "according to", "the context"]

def validate_citations(response, num_docs):
    citations = re.findall(r'\[Doc\s+(\d+)\]', response)

    # Citations that point at documents the model was never given
    bad_refs = [c for c in citations if int(c) < 1 or int(c) > num_docs]

    # Find sentences that make factual claims but cite nothing.
    # Split only on punctuation followed by whitespace so decimals like "0.72" stay intact.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', response) if s.strip()]
    uncited = []
    for sent in sentences:
        if any(p in sent.lower() for p in SKIP_PHRASES):
            continue
        if not re.search(r'\[Doc\s+\d+\]', sent) and len(sent.split()) > 5:
            uncited.append(sent)

    return {
        "valid": len(bad_refs) == 0 and len(uncited) == 0,
        "bad_refs": bad_refs,
        "uncited_claims": uncited
    }

When validation fails, we re-prompt once with stricter instructions. If it fails again, we skip the generative summary entirely and just return the raw retrieved passages. Users get less polish, but there is no generated text left to hallucinate. That tradeoff is worth it every time in an enterprise context.
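The retry-then-fall-back flow can be sketched as follows. This is an illustration under assumptions, not the production code: `generate(query, chunks, strict)` is a hypothetical callable wrapping the LLM call, and `has_valid_citations` is a simplified stand-in for a full citation validator.

```python
import re


def has_valid_citations(response, num_docs):
    """Simplified stand-in check: at least one citation, all within range."""
    refs = [int(n) for n in re.findall(r"\[Doc\s+(\d+)\]", response)]
    return bool(refs) and all(1 <= n <= num_docs for n in refs)


def answer_with_fallback(query, chunks, generate, max_retries=1):
    """Generate, re-prompt in strict mode on failure, then return raw passages."""
    for attempt in range(max_retries + 1):
        # The first attempt uses the normal prompt; retries use stricter instructions
        response = generate(query, chunks, strict=attempt > 0)
        if has_valid_citations(response, len(chunks)):
            return {"status": "generated", "answer": response}
    # Every attempt failed validation: skip the summary, hand back the sources
    return {"status": "raw_passages", "passages": [c["text"] for c in chunks]}


def fake_generate(query, chunks, strict):
    # Test double: pretends the model only cites correctly when re-prompted
    return "Overtime is 1.5x [Doc 1]." if strict else "Overtime is 1.5x [Doc 9]."


chunks = [{"text": "Overtime is paid at 1.5x the regular rate."}]
result = answer_with_fallback("How is overtime paid?", chunks, fake_generate)
```

Capping retries at one keeps worst-case latency at two LLM calls, and the fallback path never depends on the model behaving.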

Check What the LLM Actually Said Against the Source

This one caught me off guard. We had the citation enforcement running, the LLM was citing real documents, and it was still getting things wrong. It would cite [Doc 2] and then subtly rephrase the content in a way that changed the meaning. Not maliciously, obviously. But the effect was the same as a hallucination.

The solution is running an NLI (Natural Language Inference) model as a post-check. You take each claim the LLM made, pair it with the passage it cited, and ask a smaller model: Does the source actually support this claim?

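Python

from transformers import pipeline

# Load once at import time; building the pipeline on every call is slow
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def verify_consistency(claims, sources, threshold=0.5):
    # Pass premise and hypothesis as a text pair so the tokenizer inserts the
    # separator itself, rather than embedding a literal "[SEP]" in one string
    pairs = [{"text": source, "text_pair": claim} for claim, source in zip(claims, sources)]
    results = nli(pairs) if pairs else []

    problems = []
    for (claim, source), result in zip(zip(claims, sources), results):
        # Label casing depends on the model config, so compare case-insensitively
        if result["label"].lower() == "contradiction" and result["score"] > threshold:
            problems.append({
                "claim": claim,
                "source": source,
                "confidence": result["score"]
            })

    return problems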

This is the most expensive guardrail. DeBERTa inference adds 300-500ms per response, depending on how many claim-source pairs you’re checking. For our latency-sensitive paths, I run it async. The user gets the response immediately, and if the NLI check flags a contradiction, a correction fires within a few seconds. For compliance-critical paths (anything touching payroll calculations, tax rates, regulatory guidance), it runs synchronously. The extra half-second is worth it.
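The NLI check needs claims already paired with the passages they cite. One way to derive those pairs from a cited response is sketched below. The helper name `pair_claims_with_sources` and the sample text are assumptions, and sentence splitting on punctuation is a deliberately simple heuristic.

```python
import re


def pair_claims_with_sources(response, docs):
    """Split a cited response into parallel lists of claims and source passages.

    docs is the list of passages given to the model, in [Doc 1], [Doc 2], ...
    order. Citations outside that range are skipped here, on the assumption
    that a citation validator has already flagged them.
    """
    claims, sources = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        cited = [int(n) for n in re.findall(r"\[Doc\s+(\d+)\]", sentence)]
        # Strip the citation markers so the NLI model sees only the claim text
        claim = re.sub(r"\s*\[Doc\s+\d+\]", "", sentence).strip()
        for n in cited:
            if 1 <= n <= len(docs):
                claims.append(claim)
                sources.append(docs[n - 1])
    return claims, sources


# Made-up passages and response, for illustration only
docs = [
    "Overtime is paid at 1.5x the regular rate.",
    "Withholding tables are updated each quarter.",
]
response = (
    "Overtime pay is 1.5x the regular rate [Doc 1]. "
    "The tables change every quarter [Doc 2]. "
    "I have no data on bonuses."
)
claims, sources = pair_claims_with_sources(response, docs)
```

A sentence that cites two documents yields two pairs, so each source is checked on its own. Uncited sentences produce no pairs.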

Stale Documents Are Silent Killers

This is the one that bit us. I mentioned the tax table incident at the top. The root cause was simple: we re-index documents on a schedule, and the tax table update landed between cycles. The vector store was serving a document that was technically correct as of last quarter and completely wrong now.

The fix is metadata-level. Every document chunk carries a last_verified timestamp, and we check it before sending anything to the LLM.

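Python

from datetime import datetime, timedelta, timezone

def find_stale_chunks(chunks, max_age=timedelta(days=90)):
    # Timezone-aware "now"; datetime.utcnow() is deprecated as of Python 3.12
    now = datetime.now(timezone.utc)
    stale = []

    for chunk in chunks:
        verified = datetime.fromisoformat(chunk["metadata"]["last_verified"])
        # Treat timestamps stored without an offset as UTC so the subtraction
        # below never mixes naive and aware datetimes
        if verified.tzinfo is None:
            verified = verified.replace(tzinfo=timezone.utc)
        age = now - verified
        if age > max_age:
            stale.append({
                "chunk_id": chunk["id"],
                "source": chunk["metadata"]["source"],
                "age_days": age.days
            })

    return stale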

When stale chunks are the only context available, we append a disclaimer to the response. Something like: “This information was last verified on [date]. Please confirm with your compliance team for the most current guidance.” Not glamorous. But it’s the difference between a user making an informed decision and a user trusting outdated information because the AI presented it with full confidence.

The staleness window depends on your domain. For us, 90 days works for most HR policy documents. For tax tables and regulatory content, it’s 30 days. For some compliance content during legislative sessions, it’s as low as 7 days. You know your data better than any default I could give you.
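A sketch of how per-domain windows and the disclaimer could fit together. The day counts come from the figures above, but the `category` metadata field, its values, and the function name are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Windows from the text above; the category names are illustrative
MAX_AGE_BY_CATEGORY = {
    "hr_policy": timedelta(days=90),
    "tax_table": timedelta(days=30),
    "legislative": timedelta(days=7),
}
DEFAULT_MAX_AGE = timedelta(days=90)


def staleness_disclaimer(chunk, now=None):
    """Return a disclaimer if the chunk is past its category window, else None."""
    now = now or datetime.now(timezone.utc)
    meta = chunk["metadata"]
    verified = datetime.fromisoformat(meta["last_verified"])
    if verified.tzinfo is None:
        # Assumption: timestamps stored without an offset are UTC
        verified = verified.replace(tzinfo=timezone.utc)
    max_age = MAX_AGE_BY_CATEGORY.get(meta.get("category"), DEFAULT_MAX_AGE)
    if now - verified <= max_age:
        return None
    return (
        f"This information was last verified on {verified.date().isoformat()}. "
        "Please confirm with your compliance team for the most current guidance."
    )


# Fixed "now" so the example is reproducible
now = datetime(2026, 3, 20, tzinfo=timezone.utc)
tax_chunk = {"metadata": {"last_verified": "2026-02-01T00:00:00", "category": "tax_table"}}
policy_chunk = {"metadata": {"last_verified": "2026-02-01T00:00:00", "category": "hr_policy"}}
tax_note = staleness_disclaimer(tax_chunk, now=now)        # 47 days old, over the 30-day window
policy_note = staleness_disclaimer(policy_chunk, now=now)  # within the 90-day window, so None
```

Keeping the windows in one lookup table makes it easy to tighten a category temporarily, for example during a legislative session, without touching the check itself.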

Teach the System to Say “I’m Not Sure”

LLMs don’t have calibrated uncertainty. A response that’s 95% likely to be correct looks identical to one that’s 40% likely to be correct. Same tone, same confidence, same fluency. This is the most dangerous property of language models in enterprise settings.

My approach is to aggregate signals from all the previous guardrails into a single confidence score and then use that score to decide what to do with the response.

Python

def score_confidence(relevance_scores, citation_check, nli_flags, stale_chunks, total_chunks):
    avg_relevance = sum(relevance_scores) / len(relevance_scores) if relevance_scores else 0
    # Penalize fabricated references as well as uncited claims
    citation_issues = len(citation_check["bad_refs"]) + len(citation_check["uncited_claims"])
    cite_score = 1.0 if citation_check["valid"] else max(0, 1 - citation_issues * 0.2)
    consistency = max(0, 1.0 - len(nli_flags) * 0.3)
    freshness = 1.0 - (len(stale_chunks) / total_chunks) if total_chunks else 0

    # Weighted sum of the four guardrail signals
    score = 0.30 * avg_relevance + 0.25 * cite_score + 0.25 * consistency + 0.20 * freshness

    if score >= 0.85:
        return {"confidence": score, "action": "serve"}
    elif score >= 0.60:
        return {"confidence": score, "action": "serve_with_disclaimer"}
    else:
        return {"confidence": score, "action": "abstain"}

Three tiers. Above 0.85, serve the answer. Between 0.60 and 0.85, serve it with a disclaimer that hedges the confidence. Below 0.60, don’t serve a generated answer at all. Return the raw retrieved documents and let the user read them directly.
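To make the arithmetic concrete, here is a worked example with made-up signal values. The helper names `weighted_score` and `tier` are illustrative; the weights and cutoffs are the ones quoted in this section.

```python
WEIGHTS = {"relevance": 0.30, "citations": 0.25, "consistency": 0.25, "freshness": 0.20}


def weighted_score(signals):
    """Combine per-guardrail signals (each in [0, 1]) into one score."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def tier(score):
    """Map a confidence score to one of the three actions."""
    if score >= 0.85:
        return "serve"
    if score >= 0.60:
        return "serve_with_disclaimer"
    return "abstain"


# Hypothetical response: decent retrieval, clean citations,
# one NLI contradiction flag (1 - 0.3), half of the chunks stale
signals = {"relevance": 0.80, "citations": 1.0, "consistency": 0.70, "freshness": 0.50}
score = weighted_score(signals)  # 0.24 + 0.25 + 0.175 + 0.10, about 0.765
action = tier(score)             # "serve_with_disclaimer"
```

Even with perfect citations, a single contradiction flag plus stale context is enough to push this response out of the top tier.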

The weights (0.30, 0.25, 0.25, 0.20) aren’t sacred. I started with equal weights and adjusted after analyzing a few hundred production responses where we had ground truth. Relevance ended up mattering the most, which makes sense. If you retrieved the wrong documents, nothing downstream can save you.

How These Fit Together

In production, the guardrails form a pipeline:

Query
  → Vector search
  → Relevance filter (drop low-scoring chunks)
  → Staleness check (flag old documents)
  → LLM generation with citation prompt
  → Citation validation (verify references are real)
  → NLI consistency check (verify claims match sources)
  → Confidence scoring (decide: serve / hedge / abstain)
  → Response
  

Pre-generation guardrails (relevance, staleness) clean up the input. The prompt handles generation-time grounding. Post-generation guardrails (citation validation, NLI, confidence) catch what slipped through. No single layer is sufficient. I’ve seen failures at every stage, which is why you need all of them.
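The ordering can be sketched as a single function that takes each stage as a callable. Every parameter name and signature here is an assumption made for the sketch, so swap in your own implementations.

```python
def run_pipeline(query, search, relevance_filter, find_stale, generate,
                 validate, check_consistency, score):
    """Wire the guardrails in pipeline order; each stage is caller-supplied."""
    chunks = relevance_filter(query, search(query))
    if not chunks:
        # Nothing relevant survived: abstain before any generation happens
        return {"action": "abstain", "reason": "insufficient_context"}

    stale = find_stale(chunks)                      # pre-generation
    answer = generate(query, chunks)                # generation with citation prompt
    citation_check = validate(answer, len(chunks))  # post-generation checks
    nli_flags = check_consistency(answer, chunks)
    decision = score(chunks, citation_check, nli_flags, stale)

    if decision["action"] == "abstain":
        # Low confidence: return the sources instead of the generated text
        return {**decision, "passages": [c["text"] for c in chunks]}
    return {**decision, "answer": answer}
```

Passing the stages in as arguments keeps each guardrail testable on its own, and lets a test replace the LLM call with a stub.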

Total latency overhead for everything except the NLI check: under 200ms. The NLI check adds 300-500ms on the synchronous path. For most use cases, that’s fine. For real-time conversational interfaces where every millisecond counts, run the NLI check async and correct after the fact.
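A minimal `asyncio` sketch of the respond-now, correct-later pattern. `run_nli_check` (an async callable returning a list of flagged claims) and `send` (an async callable that delivers a message to the user) are hypothetical. How corrections actually reach the user depends on your transport.

```python
import asyncio


async def respond_then_verify(answer, run_nli_check, send):
    """Send the answer right away, then follow up if the slow check flags it."""
    await send({"type": "answer", "text": answer})

    async def follow_up():
        problems = await run_nli_check(answer)
        if problems:
            await send({"type": "correction", "flagged_claims": problems})

    # Schedule the check without blocking the response path. The caller should
    # keep a reference to the task so it is not garbage-collected mid-flight.
    return asyncio.create_task(follow_up())
```

On a synchronous, compliance-critical path the same check would simply be awaited before `send` is called, trading latency for certainty.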

What I’d Tell Someone Starting Out

If you’re building a RAG system and thinking “I’ll add guardrails later,” don’t. Build the relevance filter and the “I don’t know” path on day one. Those two things alone prevent the worst failures. Add citation enforcement next. Staleness detection and NLI checks can come as your system matures and you see where the remaining failures cluster.

The most counterintuitive lesson I’ve learned is that the systems users trust most are the ones that occasionally say, “I’m not confident enough to answer this.” In an enterprise setting, a wrong answer from an AI system creates more damage to user trust than a hundred correct ones can rebuild. Your guardrails aren’t just preventing hallucinations. They’re protecting the credibility of the entire system.

Opinions expressed by DZone contributors are their own.
