Building a RAG-Powered Bug Triage Agent With AWS Bedrock and OpenSearch k-NN

Learn how a RAG-powered bug triage agent uses AWS Bedrock, OpenSearch, and dynamic scoring to automate crash analysis and routing.

Rajasekhar sunkara

Jun. 09, 26 · Analysis

Likes (0)

Comment

Save

1.9K Views

Bug triage on a graphics engineering team is one of those tasks nobody really wants to own. A new crash report comes in, and somebody has to work out whether it looks like a known issue, what the stack trace points at, which subsystem the affected code lives in, and which sub-team should pick it up. The answers exist in the issue tracker, the source repo, and the architecture docs, but pulling them together by hand takes time. And the engineers best at it are the ones you least want spending hours on it.

On our team, the archive of resolved bugs had grown to over 1,100 issues. That is a real corpus. It contains the answer to a lot of incoming questions, but only if you can find the right three or four entries quickly. The agent described here does that lookup automatically, combines it with crash log parsing and source code search, and produces a root cause analysis with a confidence score. Triage that used to take hours now takes minutes.

This article is about the architecture choices: why AWS Bedrock with Claude, why OpenSearch with HNSW indexing, why DynamoDB for workflow state, and why ECS Fargate. None of these choices is unique. The reasoning behind them is what's portable.

What the Agent Actually Has to Do

Before the architecture, it's worth being concrete about the work. When a bug report arrives, the agent produces an analysis built on five signals:

Historical pattern match against the knowledge base of resolved issues.
Source code match against the repositories the trace points into.
Crash stack analysis on the trace itself.
Log evidence from whatever logs were attached or linkable.
Fix ownership, derived from who has historically fixed bugs in the affected components.

Each signal contributes to a final confidence score. The combination matters because no single signal is reliable on its own. A stack trace can match a bug that was fixed three releases ago, a source-code hit can be unrelated, and ownership data can be stale. A useful triage answer leans on multiple signals together.

That is the work. The architecture exists to support it reliably, repeatedly, and without baking in assumptions that will hurt later.

Why RAG, and Why These Pieces

The obvious wrong move is to skip retrieval and pass the whole corpus to the model. Context windows aren't the bottleneck people think they are. Even when they're large, signal-to-noise gets bad fast, and cost and latency scale with input size. For any given bug, the relevant slice is small: a few prior tickets, a couple of source files, maybe one architecture doc. Retrieval-augmented generation (RAG) is the right shape because the retrieval layer's job is precisely to find that slice.

OpenSearch With HNSW Indexing

The knowledge base lives in OpenSearch with vector search over a k-NN HNSW index. HNSW (Hierarchical Navigable Small World) suits corpora in the low thousands to low millions of documents. Query time stays low, and recall stays high without the tuning effort IVF-based indexes demand at smaller scales.

OpenSearch was chosen over a dedicated vector database for operational reasons. It runs in the same AWS environment as the rest of the stack, supports keyword and vector search in the same index when you need hybrid retrieval, and doesn't add a new vendor to the diagram. For a team-internal tool, the integration cost of a separate vector DB outweighs the marginal performance gain.

Titan Embeddings

Embeddings are generated with Amazon Titan. The main reason: the data (bug reports, stack traces, code snippets) never has to leave AWS. That removes a class of compliance questions that come up the moment you start sending source code or internal tickets to an external embedding API. Titan handles technical text well enough for this corpus, and it shares IAM, quotas, and billing with everything else.

Claude on Bedrock as the Reasoning Model

The reasoning step takes the retrieved context and the parsed crash log and produces the actual analysis. It runs on Claude through Bedrock. Two properties matter here. First, Claude handles long, messy, structured input well: stack traces aren't clean prose, and the surrounding context is a mix of code, logs, and ticket descriptions. Second, it expresses uncertainty rather than picking a confident-sounding wrong answer. For a system whose output a human engineer is going to read and either trust or push back on, that calibration matters more than fluency.

The Five-Signal Confidence Score

The most consequential part of the system isn't the model call. It's the scoring layer that wraps it. The agent doesn't just say "this looks like a duplicate of bug X." It produces a confidence score, and that score is what triagers use to decide whether to accept the suggestion or dig in themselves.

The score is a weighted combination of the five signals listed earlier. Each contributes a sub-score; the weights reflect how predictive each signal has been, in this team's experience, of a correct triage outcome. The interesting design choice is that the weights are not static.

Real bug reports don't always include all five signals. Some arrive without attached logs. Some point at code with no clear ownership history. With static weights, missing signals would drag the final score down even when the available signals were strongly aligned. The agent redistributes the weight of any unavailable signal across the available ones, normalized to sum to one. The conceptual shape:

    Python
   
 

   # Conceptual sketch of dynamic weight adjustment
BASE_WEIGHTS = {
     "historical_match": w1,
     "source_code_match": w2,
     "crash_stack": w3,
     "log_evidence": w4,
     "fix_ownership": w5,
}

def adjusted_weights(available_signals):
     active = {k: v for k, v in BASE_WEIGHTS.items()
              if k in available_signals}
     total = sum(active.values())
     return {k: v / total for k, v in active.items()}
  

This is a small piece of code that does a disproportionate amount of the work of making the agent's output trustworthy. A given confidence score should mean roughly the same thing whether the bug arrived with logs or without.

DynamoDB for Workflow State

A triage run is not a single API call. The agent parses the report, retrieves embeddings, runs vector search, fetches matched documents, pulls source code context, calls the reasoning model, computes the score, and writes results back. Each step can fail or be slow independently.

Workflow state for each in-flight triage lives in DynamoDB. The schema is intentionally simple: a triage ID as the partition key, a status field, and the accumulated context. Two reasons it's external rather than in-process memory.

First, recovery. If the model call fails or times out, the workflow should resume without redoing the embedding and retrieval work. Token costs add up otherwise.

Second, observability. The Flask dashboard the team uses to monitor triage operations reads from this same DynamoDB table. That includes real-time status, filterable history, analytics, and the routing view for issues that don't belong to this team. There is no separate event log to maintain. Workflow state is the source of truth, and the dashboard is a view onto it.

ECS Fargate for Orchestration

The triage workflow runs on ECS Fargate. The choice is shaped by what the workflow looks like: a sequence of calls to external services (Bedrock, OpenSearch, the issue tracker), with the long pole being model latency. Not CPU-heavy, not bursty. Incoming bugs arrive at a steady rate.

Fargate handles this shape cleanly. No cold start, no execution time limit, and the operational model is straightforward: container in, container out, IAM and networking inherited from the cluster. The Flask dashboard runs in the same Fargate cluster, sharing the same VPC and observability tooling.

The general pattern: short, stateless, bursty work fits Lambda. Orchestrated workflows with slower external calls and a need for predictable behavior fit Fargate. For a team-internal agent that runs continuously, Fargate's properties matter more than its slightly higher baseline cost.

Keeping the Knowledge Base Current

None of this works if the corpus goes stale. The ingestion pipeline syncs three sources continuously: the issue tracker, where newly resolved bugs become new entries; the documentation repo; and the source code repositories, which provide both file content and ownership signal.

The pipeline is fully automated. New content is chunked, embedded with Titan, and indexed in OpenSearch without manual intervention. Ingestion is decoupled from query. They share the index but nothing else, so a slow ingestion run never affects live triage latency, and a problematic batch can be rolled back without touching the query path.

What's Worth Taking From This

The model layer (Bedrock, Claude, Titan) is interchangeable. Swap them for OpenAI plus their embeddings, or for a self-hosted setup, and the architecture still works. What is not interchangeable, or not easily, is the shape of the rest:

Retrieval before reasoning. Don't ask the model to do retrieval against a large corpus. Get the relevant slice with a dedicated retrieval layer, then hand it over with a tight prompt.
Multiple signals with dynamic weights. Single-signal confidence scores break under real-world data. Multiple signals with weight redistribution handle the cases where inputs are incomplete.
Persist workflow state externally. Even for short workflows, having state in a queryable store pays off in failure recovery and gives the dashboard a single source of truth.
Decouple ingestion from query. They have different reliability requirements and should be able to fail independently.
Match compute to workload shape. Fargate for orchestrated, latency-tolerant workflows. The wrong choice here shows up later as cold starts, timeouts, or surprise bills.

The agent has been doing useful work since it shipped. The thing that took the longest to get right wasn't any single component. It was the scoring layer and the decision to make state external. Those are the parts that determine whether a system like this is something the team relies on or something the team works around.

AWS RAG

Opinions expressed by DZone contributors are their own.

Related

Trending