DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • When Perfect Data Breaks: The Journey from Data Quality to Data Observability
  • Microsoft Fabric AI Functions: A Practical Overview for Data Engineers
  • DataFlow — An Open-Source Data Preparation System Accelerating LLM Training
  • Automatic Data Correlation: Why Modern Observability Tools Fail and Cost Engineers Time

Trending

  • Product-Led Software Delivery: Intelligent Platforms for DevOps at Scale
  • Zero-Downtime Deployments for Java Apps on Kubernetes
  • Docker Hardened Images Are Free Now — Here's What You Still Need to Build
  • The Developer's Guide to Context-Aware AI: When Your Code Documentation Becomes Intelligent
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Monitoring and Observability
  4. Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines

Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines

Glue failures scatter evidence across logs, metadata, and table state. A triage layer pulls it together and flags whether a rerun is safe.

By 
Vivek Venkatesan user avatar
Vivek Venkatesan
·
Jun. 02, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
85 Views

Join the DZone community and get the full member experience.

Join For Free

The Pipeline Did Not Fail Cleanly

Most pipeline failures don't look like "the job failed."

Consider a common scenario. A Glue job reads overnight event files, applies business rules, and writes to an Iceberg curated table. The job runs at its scheduled time and errors out partway through. The control table shows SUCCESS for the previous batch and FAILED for the current one, which is what you'd expect. The problem is what happened between those two states: the job wrote nine of the day's twelve partitions to the staging table before failing. A downstream report ran on its own schedule, picked up the partial data, and the discrepancy didn't surface until a downstream consumer noticed records were missing.

By the time someone looks at the failure, the question is no longer "Why did the job fail?" It's "Is it safe to rerun, and what's already corrupted downstream?"

That's where debugging gets messy. CloudWatch logs, Glue run metadata, the source S3 path, record counts, data quality results, target table state, and Iceberg snapshots. An experienced engineer can connect those signals, but it takes time, and a less experienced engineer often misses one. In a busy production environment that delay leads to blind reruns, duplicate records, overwritten partitions, or worse.

The frustrating part is that the evidence existed. The pipeline just had no structured way to explain itself.

That's the gap a triage layer can fill. Not by fixing the pipeline. Not by changing schemas. Not by restarting jobs. By observing the evidence already produced, classifying the failure, explaining what likely happened, and recommending what to do next.

What Agentic Observability Means

The word "agentic" gets misused a lot right now, especially in data engineering. It's worth being precise.

An agentic observability layer is not an LLM with permission to control production. It's a controlled workflow that collects pipeline evidence, builds incident context, classifies the failure against known categories, and produces a structured recommendation. The loop is observe, classify, explain, recommend, and that's where it stops. Everything past "recommend" stays with engineers, deterministic rules, or approval workflows.

The difference from normal alerting is the depth of the output. A normal alert says "Glue job daily_customer_interactions failed." An agentic observability layer should produce something closer to: "The job failed because the input contains a new column not present in the curated schema. The staging write started before the failure, so a blind retry will create duplicate records. Quarantine the batch, review the schema contract, and rerun with the same batch_id after validation."

That difference is what saves time during an incident. The goal isn't replacing engineers. It's reducing the manual triage work needed before someone can make a real decision.

Reference Architecture

This does not need to start as a new platform. The triage layer can sit beside existing Glue pipelines and consume signals that already exist.

Figure 1. Agentic observability flow for AWS Glue pipelines. Pipeline evidence is collected, converted into structured context, analyzed by an LLM triage layer, and returned as a structured incident output.

Figure 1. Agentic observability flow for AWS Glue pipelines. Pipeline evidence is collected, converted into structured context, analyzed by an LLM triage layer, and returned as a structured incident output.

The component that matters most here is the incident context builder. The LLM should never receive a raw dump of ten thousand log lines. That produces noisy, low-confidence output and burns tokens. The collector should pull a curated set of signals: Glue job name and run ID, status and duration, batch ID, source path, target table, the last fifty error log lines, data quality results, record counts, attempt count, recent deployment version, table snapshot or commit ID, and control table status.

That's enough context to analyze the failure without guessing from disconnected log lines.

Where This Fits

Before going further, one thing worth being honest about: this pattern depends on the platform already having its house in order. The agent can only work with the observability that the platform already has. It is not a substitute for basic pipeline hygiene.

It works when the platform tracks batch IDs, clear source paths, data quality results, structured logs, table commits, deployment versions, and ownership mapping. Without those signals, the agent has very little to reason over. If a pipeline doesn't track batch IDs, the agent can't reliably tell whether a run is a retry or a new batch. If quality results aren't stored, it can't reason about input validity. If table commits aren't tracked, it can't tell whether the failure happened before or after a write.

LLMs don't create observability. They summarize and reason over the observability that already exists. The teams that get the most out of this pattern are the ones with disciplined data engineering underneath.

Failure Categories

Manual debugging takes time, partly because every failure looks unique at first glance. Most don't stay unique once you classify them. A small fixed set of categories makes the output easier to review, compare, and route.

Failure category Common signals Recommended action
Schema drift New column, missing column, cast failure, contract mismatch Quarantine the batch and review the schema contract
Data skew Long-running tasks, shuffle spill, uneven partitions Repartition or isolate skewed keys
Small file pressure High file count, slow planning, frequent commits Compact affected partitions
Source delay Missing input path, low record count, late file arrival Wait, retry later, or mark the batch delayed
Code regression Recent deployment plus transformation error Roll back or compare with the previous run
Permission issue Access denied, catalog failure, IAM or Lake Formation error Fix access policy before retrying
Partial write risk Failure after write started Check staging and control tables before rerun
Unknown Weak or conflicting evidence Escalate to an engineer with summarized context


The category list isn't only documentation. It's part of the system contract. The agent picks from this list rather than inventing categories on each run, which makes downstream routing tractable. Schema drift can go to the data contract owner. Permission issues route to the platform team. Source delays go to the ingestion owner. Partial write risk triggers a manual review workflow rather than auto-retry.

This is what makes the system more useful than a chatbot that summarizes logs.

Structured Incident Output

The output should also be structured. Free-form summaries help humans skim, but they're hard to store, compare, or evaluate over time. JSON works better because it can be written to an incident table and consumed by Slack, Teams, Jira, or ServiceNow without parsing prose.

JSON
 
{
  "pipeline_name": "daily_customer_interactions",
  "job_run_id": "jr_2026_05_02_001",
  "status": "FAILED",
  "failure_category": "SCHEMA_DRIFT",
  "likely_root_cause": "Input file contains a new column named device_type that is not defined in the curated table schema.",
  "affected_source_path": "s3://raw/events/date=2026-05-02/",
  "affected_table": "curated.customer_interactions",
  "safe_to_retry": false,
  "recommended_action": "Quarantine the batch, update the schema contract, and rerun with the same batch_id after validation.",
  "confidence": 0.87
}


A structured output gives engineers a quick summary, and it gives downstream tools something reliable to use. If safe_to_retry is false, the orchestrator blocks automatic retry. If failure_category is PERMISSION_ERROR, the issue routes to the platform queue. If confidence is low, the system asks for human review. If the same failure category recurs across runs, dashboards can track it over time.

One important framing point: the LLM is not the system of record. The control table, logs, table metadata, and quality checks remain the source of truth. The agent is a reasoning layer that produces structured evidence on top of that.

Implementation Sketch

A simple implementation starts with assembling the incident context. The example below is intentionally simplified. In production, the LLM call should use structured outputs or schema-validated responses rather than free-form text parsing.

Python
 
def build_incident_context(job_run, control_record, dq_results, recent_logs):
    return {
        "job_name": job_run["JobName"],
        "job_run_id": job_run["Id"],
        "status": job_run["JobRunState"],
        "started_on": str(job_run["StartedOn"]),
        "completed_on": str(job_run.get("CompletedOn")),
        "batch_id": control_record.get("batch_id"),
        "source_path": control_record.get("source_path"),
        "target_table": control_record.get("target_table"),
        "attempt_count": control_record.get("attempt_count"),
        "control_status": control_record.get("status"),
        "data_quality_results": dq_results,
        "recent_error_logs": recent_logs[-50:]
    }


The classifier receives a fixed category list and explicit rules about what it shouldn't recommend.

Python
 
def classify_failure(llm_client, incident_context):
    prompt = f"""
    You are analyzing a failed data pipeline run.

    Classify the failure into one of these categories:
    SCHEMA_DRIFT, DATA_SKEW, SOURCE_DELAY, PERMISSION_ERROR,
    CODE_REGRESSION, PARTIAL_WRITE_RISK, SMALL_FILE_PRESSURE, UNKNOWN.

    Return only valid JSON with:
    failure_category, likely_root_cause, safe_to_retry,
    recommended_action, confidence.

    Rules:
    - Do not recommend a retry if there is partial write risk.
    - Do not recommend schema changes without human review.
    - Do not recommend permission changes without platform approval.
    - Use UNKNOWN when evidence is weak or conflicting.

    Incident context:
    {incident_context}
    """
    return llm_client.invoke(prompt)


In a real implementation, this prompt should be paired with a strict response schema (failure_category as an enum, likely_root_cause as a string, safe_to_retry as a boolean, recommended_action as a string, confidence as a float between 0 and 1), and the system should reject any output that doesn't match. In production, structured outputs are the better choice when the API supports them. The free-form prompt above is illustrative.

The result gets stored, not acted on:

Python
 
def store_incident_summary(summary, incident_table):
    incident_table.put_item(
        Item={
            "pipeline_name": summary["pipeline_name"],
            "job_run_id": summary["job_run_id"],
            "failure_category": summary["failure_category"],
            "safe_to_retry": summary["safe_to_retry"],
            "recommended_action": summary["recommended_action"],
            "confidence": summary["confidence"],
            "created_at": current_timestamp()
        }
    )


The agent writes an explanation. Other systems decide what to do with it.

What the Agent Should Never Decide

This boundary is the most important design choice in the whole pattern, and it's worth being explicit about.

An observability agent helps engineers understand a failure. It does not control production data systems. Even at high confidence, certain actions stay out of scope:

  • Changing table schemas
  • Granting IAM or Lake Formation permissions
  • Deleting data
  • Marking a partially written batch as successful
  • Overriding data quality failures
  • Promoting quarantined data
  • Rewriting production tables
  • Triggering cross-pipeline backfills
  • Compacting or expiring table snapshots without approval

These actions move from observability into production control, and that line should stay clear. In regulated or business-critical environments, the safest design lets the agent produce structured evidence and recommendations while deterministic rules, approval workflows, or engineers decide whether anything actually executes.

An agent saying "this looks like schema drift, the batch is not safe to retry" is useful. The same agent updating the curated table schema on its own is not. It's a future incident waiting to happen. Same with permissions: the agent flagging an IAM issue is useful; the agent granting itself access is a security violation.

The trade-off here is real. Letting the agent take action would reduce the mean time to recovery. But the cost of a confident wrong action (silently corrupted data, an unauthorized permission grant, a dropped partition) is much higher than the cost of a few extra minutes of human review. In a regulated data environment, that trade-off is usually easy to justify.

This matters as teams move toward self-healing pipelines. Before a pipeline can safely fix itself, it has to first explain itself reliably, at scale, with measurable accuracy. That bar isn't met yet in most production environments.

Evaluating the Triage Layer

A triage layer should be evaluated like any other production component. "The summary looks good" is not an evaluation.

To check whether the pattern behaves reasonably, a small synthetic evaluation can be assembled across common Glue failure modes. Each scenario includes a short set of log lines, control-table state, data quality results, and table metadata, and the agent is scored on two things: whether it picks the correct failure category, and whether the safe_to_retry decision is appropriate.

This is a starter evaluation, not a benchmark. Ten synthetic scenarios are enough to sanity-check the design. A real production rollout needs hundreds of labeled historical incidents, edge cases, and human-reviewed outcomes. Anything less should be treated as an early prototype, not production validation.

Scenario Expected category Agent category Safe-to-retry decision
Missing source path SOURCE_DELAY SOURCE_DELAY Correct
New column in input SCHEMA_DRIFT SCHEMA_DRIFT Correct
Access denied on catalog table PERMISSION_ERROR PERMISSION_ERROR Correct
Shuffle spill and one long task DATA_SKEW DATA_SKEW Correct
Failure after staging write PARTIAL_WRITE_RISK PARTIAL_WRITE_RISK Correct
Too many small files SMALL_FILE_PRESSURE SMALL_FILE_PRESSURE Correct
Recent code deployment plus null pointer CODE_REGRESSION CODE_REGRESSION Correct
Low record count, no hard error SOURCE_DELAY UNKNOWN Conservative escalation
Cast failure due to bad input value SCHEMA_DRIFT SCHEMA_DRIFT Wrong, recommended retry
Conflicting log signals UNKNOWN UNKNOWN Correct escalation


 In a small evaluation like this one, a well-designed classifier should pick the expected category in most scenarios and, more importantly, get the safe-to-retry decision right in nearly all of them. The illustrative results above show eight correct retry decisions, one conservative escalation (the agent returns UNKNOWN rather than guessing), and one wrong call.

That wrong call is the most instructive. On the cast failure, the agent classifies the issue correctly as schema drift but recommends cleanup-and-retry instead of quarantine-and-contract-review. A wrong root cause is inconvenient. A wrong retry recommendation can corrupt data. Safe-retry precision should be weighted higher than classification accuracy when evaluating this kind of system, and that weighting should be reflected in the prompt rules and in the validation rubric.

The metrics worth tracking in production:

Metric Why it matters
Classification accuracy Whether the agent identifies the right failure type
Safe-retry precision Whether retry recommendations are actually safe
False confidence rate Confident-but-wrong recommendations
Mean triage time Reduction in manual debugging time
Human override rate How often engineers reject the recommendation
Cost per incident LLM and log-processing cost per failed run


False confidence rate deserves attention. A low-confidence wrong answer is manageable because engineers know to scrutinize it. A high-confidence wrong answer is dangerous because teams stop scrutinizing. Confidence belongs in the output, but it should never be treated as truth. It's one signal among several in the routing decision. 

Closing

Glue job failures aren't hard because the logs are long. They're hard because the evidence is scattered across logs, run metadata, data quality results, control tables, and table commits, and an engineer has to assemble it before deciding what to do next.

An agentic observability layer turns that scattered evidence into a structured incident summary. The safest version of this pattern is controlled triage, not autonomous repair: observe, classify, explain, recommend, and stop there. Deterministic rules, approval workflows, and engineers decide what happens next.

Before pipelines can fix themselves, they need to explain themselves. That's the work worth doing first.

Observability Data (computing) GLUE (uncertainty assessment) large language model

Opinions expressed by DZone contributors are their own.

Related

  • When Perfect Data Breaks: The Journey from Data Quality to Data Observability
  • Microsoft Fabric AI Functions: A Practical Overview for Data Engineers
  • DataFlow — An Open-Source Data Preparation System Accelerating LLM Training
  • Automatic Data Correlation: Why Modern Observability Tools Fail and Cost Engineers Time

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook