Designing a Production-Grade Multi-Agent LLM Architecture for Structured Data Extraction

Find out how a multi-agent LLM system overcomes the unreliability of single-LLM extraction when processing large volumes of documents for structured data.

Haricharan Shivram Suresh Chandra Kumar

May. 01, 26 · Analysis

Likes (0)

Comment

Save

2.8K Views

Problem statement:

Many enterprise systems rely on large volumes of documents that are similar in purpose but inconsistent in structure. For example, in the field of medicare insurance, different carriers, vendors, or partners publish documents describing comparable offerings, but each uses its own format, terminology, layouts, unstructured conditional clauses, etc.

Many of these documents also contain tables, but are of different structures, sometimes within the same page. Another problem to call out is year-over-year and location-to-location variation, such as by state, county, and ZIP code.

As a result, critical data is trapped in these documents, which require extensive manual review. Traditional rule-based parsers break with eventual formatting shifts and need extensive code deployments, tests, and releases. Regex-based approaches fail under real-world conditions and need constant maintenance. In fact, when scaling, even a single prompt (per document) LLM extraction fails and would work only for proof of concepts or demos.

This is where a multi-agent LLM architecture becomes necessary.

Why Single Agent LLM Extraction Fails

A single agent would probably work for 5 documents, but it won’t work for 5,000. Some of the common failures which I’ve observed:

Hallucinated values for missing fields
Context window limits causing incorrect outcomes
No reliable confidence signal
Automatic assumptions in outcomes from previous memory

In high-impact production environments, silent errors can be worse than explicit failures. With the probabilistic outcomes, there is also a need for deterministic guardrails.

The solution is not just refined prompt engineering, but it is profound architectural decomposition, a mix of core software engineering tightly coupled with AI engineering.

System Architecture Overview

The system decomposes responsibilities into independent agents.

PDF → Preprocessing → Extraction Agent → Validation Layer → Judge Agent → Structured Output

Design Principles

Separation of concerns
Strict schema contracts
Deterministic QA before acceptance
Confidence scoring and Judging
Human in the loop feedback for low confidence judging
Observability metrics

Each agent has a detailed and defined scope, thus increasing reliability

Deployment Context and Infrastructure

The architecture is deployed as a set of lightweight services rather than a single monolithic script. Each stage in the pipeline runs independently and communicates through structured JSON messages. This allows the system to scale horizontally when document volume increases.

In a production environment, the pipeline would run on:

A containerized backend (Docker-based deployment)
A queue-based processing system (For processing asynchronously)
A storage layer for processing and versioning
A structured output store — database

This setup allows multiple documents to be processed simultaneously without blocking the system. If the extraction agent fails on one document, it does not interrupt the processing of others.

Extraction Agent

The extraction agent’s sole responsibility is to convert document chunks into structured JSON that adheres to the predefined schema.

Key design decisions include:

Low temperature
Explicit JSON schema enforcement
Chunk-level semantic segmentation
Carrier/partner agnostic prompt design

Chunking is important here, as a fixed-length token system breaks logical sections. Instead, variable chunking via semantic segmentation improves accuracy. The final RAG-based system is designed to be dynamic, allowing the extractor to look at the top few chunks as needed.

    Python
   
 

   class ExtractionAgent:

    def __init__(self, llm_client, prompt_template, schema: dict):
        self.llm = llm_client
        self.prompts = prompt_template
        self.schema = schema

    def run(self, chunks: list[str]) -> dict:
        prompt = self.prompts.build_extraction_prompt(chunks)

        response = self.llm.generate(
            prompt=prompt,
            temperature=0.0,
            response_schema=self.schema
        )

        return response
  

The output contract is strict and is sent for further validation. No validation is performed by the Extraction Agent.

Validation Agent

We divided validation into two parts, a hybrid approach.

Deterministic validation: This enforces JSON Schema integrity, required vs. optional fields, basic QA checks such as data types, range checks, NULLs, etc. All of these are to ensure structural correctness, which is most often tightly coupled with end-user UI.
Contextual LLM validation: A second LLM pass compares the extracted output with the original documented text. Its role is primarily to detect mismatches between extracted values and the source. It would identify, flag, and correct hallucinated and missing entries.

    Python
   
 

   class ValidationAgent:

    def __init__(self, llm_client, prompt_template):
        self.llm = llm_client
        self.prompts = prompt_template

    def validate(self, extracted: dict, source_chunks: list[str]) -> dict:

        deterministic = self._deterministic_checks(extracted)

        contextual_prompt = self.prompts.build_validation_prompt(
            extracted,
            source_chunks
        )

        contextual = self.llm.generate(
            prompt=contextual_prompt,
            temperature=0.0
        )

        return {
            "deterministic": deterministic,
            "contextual": contextual
        }

    def _deterministic_checks(self, extracted: dict) -> dict:
        errors = []

        if "plan_name" not in extracted:
            errors.append("Missing required field: plan_name")

        for item in extracted.get("benefits", []):
            if isinstance(item.get("copay"), (int, float)) and item["copay"] < 0:
                errors.append("Invalid negative copay detected")

        return {
            "valid": len(errors) == 0,
            "errors": errors
        }
  

Judge Agent

Even after validation, the system needs a decision layer. Here comes the Judge Agent, which receives the extracted output, validation results, and findings, and produces a confidence score, classifies the error, and performs a final decision.

During judging, confidence thresholds are bucketed and calibrated against historical datasets that are already processed. This helps transform the output into outcomes that can be tracked and improved operationally.

An additional context here — it is important to build the extraction agent and validation agent with two different state-of-the-art LLMs.

The judging LLM would also be a different model from the ones used for Extraction and Validation.

    Python
   
 

   class JudgeAgent:

    def __init__(self, llm_client, prompt_template, threshold: float = 0.85):
        self.llm = llm_client
        self.prompts = prompt_template
        self.threshold = threshold

    def evaluate(self, validation_result: dict) -> dict:

        prompt = self.prompts.build_judge_prompt(validation_result)

        judgment = self.llm.generate(
            prompt=prompt,
            temperature=0.0
        )

        confidence = judgment.get("confidence_score", 0.0)

        if confidence >= self.threshold:
            status = "PASS"
        elif confidence >= 0.6:
            status = "REVIEW"
        else:
            status = "FAIL"

        return {
            "confidence": confidence,
            "status": status,
            "details": judgment
        }


  

Prompt Engineering and Variability Handling

In a production-level GenAI system, prompt engineering must be treated like software development, with prompts version-controlled, reusable, and benchmarked against golden historical datasets. As document formats evolve, there can be a degradation in the accuracy of prompts unless they are continuously evaluated. Build a strong, generic base prompt and include custom prompts for specific carriers derived from historical datasets.

    Python
   
 

   class PromptTemplate:

    def __init__(self, version: str):
        self.version = version

    def build_extraction_prompt(self, chunks: list[str]) -> str:
        return f"""
        You are an expert structured data extraction system.

        TASK:
        Extract all relevant fields according to the JSON schema.
        Do NOT infer missing values.
        Preserve numeric fidelity.
        Include conditional clauses exactly as written.

        DOCUMENT CONTENT:
        {self._format_chunks(chunks)}

        Return ONLY valid JSON.
        Prompt-Version: {self.version}
        """

    def build_validation_prompt(self, extracted: dict, source: list[str]) -> str:
        return f"""
        Compare the extracted structured output with the source document.

        Identify:
        - Hallucinated values
        - Missing fields
        - Numeric mismatches
        - Logical inconsistencies

        Extracted:
        {extracted}

        Source:
        {self._format_chunks(source)}

        Return validation findings in structured JSON.
        """

    def build_judge_prompt(self, validation_result: dict) -> str:
        return f"""
        Based on the validation findings, assign:

        - Confidence score (0.0 - 1.0)
        - Error category (none/minor/major)
        - Final decision (PASS/REVIEW/FAIL)

        Validation Result:
        {validation_result}

        Return structured JSON only.
        """

    def _format_chunks(self, chunks: list[str]) -> str:
        return "\n\n".join(chunks)
  

Because document variability is unavoidable, architectures must assume entropy rather than optimizing for ideal inputs. Adaptive chunking, partner-aware prompt conditioning, and nested logic all play a profound role.

Plan Benefit Example: Inpatient Services

Input excerpt from plan document: A section of an insurance document stating: "For Inpatient Hospital Services, a 20% Coinsurance applies after the annual deductible of $500 is met. This Coinsurance is capped at a maximum of $5,000 out-of-pocket per calendar year."

Extraction Agent Output (Excerpt):

    JSON
   
 

   {   "deductible_in_network": 500,   
 "inpatient_coinsurance_rate": 0.20,   
 "inpatient_service_type": "Hospital Services",   
 "coinsurance_condition": "annual deductible of $500 is met",   
 "max_out_of_pocket_inpatient": 5000 }
  

Validation Agent result:

Deterministic check: PASS (All required fields present. inpatient_coinsurance_rate is a float between 0.0 and 1.0. max_out_of_pocket_inpatient is a positive integer).
Contextual check: PASS (The LLM confirms the $5,000 cap is correctly associated with the inpatient coinsurance, and the conditional trigger for the coinsurance is accurately captured).

Judge Agent decision: Confidence: 0.95, Status: PASS.

Observability and cost in production:

Key metrics for the system include, but are not limited to:

Extraction success rate
Validation failure rate
Average confidence score (over time)
Token usage per document and document type

Monitoring the confidence distribution using any industry-standard open-source Python module is the secret sauce that helps indicate prompt drift/regression, completely new and unexpected document structures, and errors.

Human-in-the-loop feedback needs to be accounted for, and building a simple UI that can perform actions like ignore or fix goes a long way toward usability.

For a cost optimization strategy in production, some foundational practices must be implemented.

Batch processing with token observability
Single-pass LLM for consistently defined structures
Parallelization (which is easily achievable with reusable prompts and LLM REST APIs)

Overall Architecture

Conclusion

LLMs are powerful extraction tools, but without structure, they can lead to unstable or unexpected outcomes.

By decomposing responsibilities into extraction, validation, and judgment agents, and combining them with traditional approaches such as enforcing schema contracts, confidence scoring, etc., it becomes possible to transform ‘similar but also varying’ semi-structured documents from multiple inconsistent sources into reliable, structured data at scale.

The difference between a proof of concept and a production AI-based system is not the model but the architecture around it.

Data extraction JSON large language model

Opinions expressed by DZone contributors are their own.

Related

Trending