DZone
Culture and Methodologies

In our Culture and Methodologies category, dive into Agile, career development, team management, and methodologies such as Waterfall, Lean, and Kanban. Whether you're looking for tips on how to integrate Scrum theory into your team's Agile practices or you need help prepping for your next interview, our resources can help set you up for success.

Functions of Culture and Methodologies

Agile


The Agile methodology is a project management approach that breaks larger projects into several phases. It is a process of planning, executing, and evaluating with stakeholders. Our resources provide information on processes and tools, documentation, customer collaboration, and adjustments to make when planning meetings.

Career Development


There are several paths to starting a career in software development, including the more non-traditional routes that are now more accessible than ever. Whether you're interested in front-end, back-end, or full-stack development, we offer more than 10,000 resources that can help you grow your current career or develop a new one.

Methodologies


Agile, Waterfall, and Lean are just a few of the project-centric methodologies for software development that you'll find in this Zone. Whether your team is focused on goals like achieving greater speed, having well-defined project scopes, or using fewer resources, the approach you adopt will offer clear guidelines to help structure your team's work. In this Zone, you'll find resources on user stories, implementation examples, and more to help you decide which methodology is the best fit and apply it in your development practices.

Team Management


Development team management involves a combination of technical leadership, project management, and the ability to grow and nurture a team. These skills have never been more important, especially with the rise of remote work both across industries and around the world. The ability to delegate decision-making is key to team engagement. Review our inventory of tutorials, interviews, and first-hand accounts of improving the team dynamic.

Latest Premium Content
  • Trend Report: Platform Engineering and DevOps
  • Trend Report: Developer Experience
  • Refcard #399: Platform Engineering Essentials
  • Refcard #008: Design Patterns

DZone's Featured Culture and Methodologies Resources

Cutting Data Pipeline Costs and Data Freshness Issues With Netflix Maestro and Apache Iceberg: A Practical Tutorial


By Intiaz Shaik
Analytics pipelines tend to get both more expensive and more stale as they grow: costs rise with data volume, while freshness drops as batch jobs run longer. The common approach, scaling out the cluster, addresses the symptom rather than the architectural issue.

In this tutorial, we will look at an alternative that addresses both problems at their root: Netflix Maestro, a horizontally scalable workflow orchestrator open-sourced by Netflix in July 2024, combined with Apache Iceberg, a standard table format for analytics on object storage. The former replaces time-based scheduling with event-driven triggering, whereas the latter removes the file-listing overhead that slows down queries on large datasets and increases their costs.

We will cover all aspects of creating a full-fledged pipeline, including code examples, explanations of why each component reduces costs, and realistic estimates of what results to expect.

What You'll Need

| Component | Purpose | Notes |
|---|---|---|
| Apache Iceberg + a catalog | Table format and metadata management | REST catalog (Polaris, Nessie, Lakekeeper, Unity Catalog) recommended for new deployments; Glue/Hive also fine |
| A compute engine | Reads and writes Iceberg tables | Spark 3.5+, Flink, Trino, or DuckDB via PyIceberg |
| Netflix Maestro | Workflow orchestration | Requires Java 21, Docker, and Postgres or CockroachDB for state |
| Cloud object storage | Data files and metadata | S3, GCS, ADLS, or S3-compatible (MinIO works for local dev) |
| Python 3.10+ | Lightweight tasks and ingestion | PyIceberg 0.11+, PyArrow |

Terminology note: there are several products named "Maestro" in the data space. This guide is about Netflix's Maestro, which is different from Maestro by Conductor, AWS Maestro, etc. Netflix's Maestro executes hundreds of thousands of workflows and up to 2 million jobs per day inside Netflix, so the scalability claim is valid, although some practitioners consider Maestro overengineered for small teams, so keep that in mind.
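The shift from time-based to event-driven scheduling described in the introduction comes down to simple arithmetic, which the sketch below illustrates. It is not Maestro code: `cron_lag`, the 4-hour interval, and the 30-second `ASSUMED_SIGNAL_LATENCY` are hypothetical values chosen only for illustration.

```python
from datetime import datetime, timedelta

# Assumption for illustration only: how long a signal-triggered run takes to start.
ASSUMED_SIGNAL_LATENCY = timedelta(seconds=30)


def cron_lag(arrival: datetime, interval: timedelta, anchor: datetime) -> timedelta:
    """Time the data waits for the next scheduled tick at or after `arrival`.

    `anchor` is any moment at which the schedule fires (for example midnight).
    """
    elapsed = arrival - anchor
    ticks = -(-elapsed // interval)  # ceiling division on timedeltas
    return anchor + ticks * interval - arrival


if __name__ == "__main__":
    midnight = datetime(2026, 5, 18)
    every_4h = timedelta(hours=4)
    for arrival in (
        datetime(2026, 5, 18, 0, 5),
        datetime(2026, 5, 18, 4, 0),
        datetime(2026, 5, 18, 7, 59),
    ):
        print(
            arrival.time(),
            "cron wait:", cron_lag(arrival, every_4h, midnight),
            "| signal wait (assumed):", ASSUMED_SIGNAL_LATENCY,
        )
```

With a fixed schedule the wait depends on where in the window the data happens to land, anywhere from zero to almost the full interval. With a signal the wait is roughly constant.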
The Problem Statement

The standard stack of Hive tables stored in S3 has three structural inefficiencies:

1. File listing dominates query planning. Listing operations on S3 are slow and rate-limited. For a query on a partitioned Hive table, listing can take longer than reading the data itself.
2. Small-file proliferation. Continuous or micro-batch writing produces thousands of Parquet files. Each query pays open-file overhead, and each list operation returns more results.
3. Time-based scheduling wastes compute. Jobs are triggered on a fixed schedule, not on data availability. If upstream data is late, the job processes stale inputs. If the data is early, it sits unprocessed until the next scheduled run.

Iceberg solves (1) and (2) in the storage tier. Maestro solves (3) in the orchestration tier. Let's see how.

Why Iceberg Shifts the Cost Model

Iceberg takes the table metadata out of the filesystem and puts it into a metadata tree. To answer "what files are part of this table?", the engine reads the current metadata file, follows it to the manifest list and manifests, and gets back an exact list of data files, along with file-level statistics such as min/max values, null counts, and row counts. File discovery turns from an O(n) directory listing into a metadata lookup whose cost depends on the manifests read rather than on listing the object store. This sets off a chain reaction:

  • Hidden partitioning. Declare a table PARTITIONED BY days(event_time), and queries filter on event_time directly. The partition transform is applied automatically. No more WHERE year=2026 AND month=05 AND day=18, and no risk of analysts forgetting it.
  • Partition evolution. You can change the partitioning of the table from monthly to daily without rewriting old data. The metadata keeps track of it, and the engine routes queries correctly.
  • Time travel and rollback. Writes produce immutable snapshots. If a bad load happens, you don't need to restore from backups; you roll the catalog pointer back to the previous snapshot. This matters operationally: recovery time goes from hours to seconds.
  • Snapshot isolation and ACID. Writers operate concurrently; readers always see a consistent state, never a partial commit.

The cost angle: manifest statistics can prune scans by an order of magnitude in time-filtered queries. With S3 list operations removed from query planning and less data scanned, query costs go down proportionally on services that charge per byte scanned, such as Athena or BigQuery; self-hosted engines like Trino see the saving as shorter runtimes instead.

Why Maestro Helps With Freshness and Costs

The killer feature of Maestro for our use case is the signal service, an event-driven trigger mechanism. Instead of scheduling "run this job at 02:00 every day", you tell Maestro to execute the job "when the user_events_raw table receives a new snapshot". The trigger may originate from another Maestro workflow, an S3 event, a database table modification, or any external system capable of sending a request to the signal API endpoint. The gap between data arrival and data availability closes from hours (the worst-case batch window) to seconds or minutes.

Other notable features of Maestro:

  • Support for both DAGs and cyclic workflows. Unlike Airflow, Maestro allows loops and re-execution, which is useful for retry-with-backoff and convergence scenarios.
  • ForEach loops and subworkflows as native concepts. This reduces the YAML sprawl common in large Airflow setups.
  • At-least-once triggering with built-in deduplication, which leads to effectively exactly-once execution.
  • Mixed task types. A single workflow can combine Python, Spark, SQL (Trino/Presto), bash, notebook, Docker container, and Kubernetes jobs.
  • A 100x engine speedup, announced in September 2025, that cut step transition time from seconds to milliseconds, which is important for workflows with hundreds of steps.

Step 1: Create the Iceberg Table With Sensible Defaults

Begin with a table definition that gets partitioning right from the start.
By far the most frequent problem when adopting Iceberg is overlooking partitioning.

SQL
    CREATE TABLE analytics.user_events (
      user_id BIGINT,
      event_type STRING,
      event_time TIMESTAMP,
      session_id STRING,
      properties MAP<STRING, STRING>
    )
    USING iceberg
    PARTITIONED BY (days(event_time), bucket(16, user_id))
    TBLPROPERTIES (
      'format-version' = '2',
      'write.target-file-size-bytes' = '134217728',        -- 128 MB target
      'write.parquet.compression-codec' = 'zstd',
      'write.metadata.delete-after-commit.enabled' = 'true',
      'write.metadata.previous-versions-max' = '20',
      'history.expire.max-snapshot-age-ms' = '604800000',  -- 7 days
      'history.expire.min-snapshots-to-keep' = '10'
    )
    LOCATION 's3://your-bucket/iceberg-tables/user_events';

Some of these choices deserve an explanation:

  • days(event_time) is a partitioning transform. Queries filtering by event_time get automatic partition pruning.
  • bucket(16, user_id) is a bucket transform that hashes user_id into 16 buckets within each day partition. It gives writes more parallelism and helps joins on user_id. Note that all rows for one user hash to the same bucket, so a single very heavy user still lands in one bucket.
  • format-version = '2' allows row-level deletes through delete files. V3 is a more recent version that adds many features, including deletion vectors, but make sure your engine supports it first.
  • zstd typically compresses 10-20% better than snappy with comparable read performance.
  • The snapshot expiration properties help avoid metadata explosion, which is one of the most frequent causes of costs silently accumulating in an Iceberg environment. They set the defaults that the expire_snapshots maintenance procedure uses; they do not expire anything on their own, so the procedure still has to run. Without expiration, every write's snapshot would be retained indefinitely.

Step 2: Ingest Data

There are two reasonable options for ingesting data from Python into Iceberg: Spark (if you already have a Spark cluster and need its scale) and PyIceberg (low overhead, no JVM required).
Python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, col

    spark = (
        SparkSession.builder
        .appName("IcebergIngestion")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "rest")
        .config("spark.sql.catalog.my_catalog.uri", "https://your-rest-catalog/api/v1")
        .config("spark.sql.catalog.my_catalog.warehouse", "s3://your-bucket/iceberg-tables/")
        .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .getOrCreate()
    )

    raw = spark.read.json("s3://your-bucket/raw/events/2026-05-18/")

    events = (
        raw
        .withColumn("event_time", to_timestamp(col("event_time")))
        .select("user_id", "event_type", "event_time", "session_id", "properties")
    )

    # MERGE INTO supports idempotent ingestion, which is important for replay safety
    events.createOrReplaceTempView("staging_events")
    spark.sql("""
        MERGE INTO my_catalog.analytics.user_events t
        USING staging_events s
        ON t.user_id = s.user_id
           AND t.event_time = s.event_time
           AND t.event_type = s.event_type
        WHEN NOT MATCHED THEN INSERT *
    """)

Two points are worth noting. First, the REST catalog should be used for any new deployment, as it allows accessing the same table via Spark, Trino, Flink, Snowflake, BigQuery, and PyIceberg without catalog configurations drifting per engine. Second, using MERGE INTO instead of INSERT makes the ingestion idempotent, which matters when the step fails and Maestro retries it.

PyIceberg Ingestion (Lightweight Path)

For lighter loads or ingestion processes executed as part of an orchestrator step, PyIceberg is quicker to initialize and has no dependency on the JVM.
Currently, the library writes PyArrow tables, not pandas DataFrames:

Python
    from datetime import datetime

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "my_catalog",
        type="rest",
        uri="https://your-rest-catalog/api/v1",
        warehouse="s3://your-bucket/iceberg-tables/",
    )
    table = catalog.load_table("analytics.user_events")

    new_rows = pa.table({
        "user_id": pa.array([3, 4], type=pa.int64()),
        "event_type": ["purchase", "click"],
        # Pass datetime objects rather than ISO strings for a timestamp column.
        # If the table column is timestamptz (Spark's TIMESTAMP maps to it),
        # use pa.timestamp("us", tz="UTC") instead.
        "event_time": pa.array(
            [datetime(2026, 5, 18, 12, 10), datetime(2026, 5, 18, 12, 15)],
            type=pa.timestamp("us"),
        ),
        "session_id": ["sess-001", "sess-002"],
        # Build map<string, string> explicitly to match the MAP column;
        # plain dicts would be inferred as a struct.
        "properties": pa.array(
            [[("sku", "A123")], [("page", "/home")]],
            type=pa.map_(pa.string(), pa.string()),
        ),
    })
    table.append(new_rows)

By default, PyIceberg uses the "fast append" strategy, which reduces per-commit metadata work but creates more manifest files than a merging append. This is good for frequent micro-batch processing as long as you perform regular compaction (covered in Step 4).

Step 3: Define the Maestro Workflow

Maestro workflows can be defined using either JSON or YAML format. The following example defines a workflow that loads raw events, applies a transformation, performs data quality checks, and updates the aggregate. Steps are connected by signals so that processing starts as soon as their dependencies are available.
YAML
    name: user-events-pipeline
    description: Ingest, transform, validate, and aggregate user events

    trigger:
      signal:
        name: raw_events_landed
        match:
          bucket: your-raw-bucket
          prefix: events/

    nodes:
      - name: ingest-events
        task:
          type: python
          script: ingest.py
          params:
            partition_date: ${execution_date}
        retry:
          max_attempts: 3
          backoff_seconds: 60

      - name: validate-schema
        dependencies: [ingest-events]
        task:
          type: python
          script: validate.py

      - name: transform-events
        dependencies: [validate-schema]
        task:
          type: spark
          class: com.yourorg.transforms.SessionizeEvents
          params:
            input_table: analytics.user_events
            output_table: analytics.user_sessions
            partition_date: ${execution_date}

      - name: dq-checks
        dependencies: [transform-events]
        task:
          type: trino
          query_file: dq_checks.sql
          fail_on: any_row_returned

      - name: refresh-daily-aggregate
        dependencies: [dq-checks]
        task:
          type: trino
          query: |
            INSERT INTO analytics.daily_user_metrics
            SELECT
              CAST(event_time AS DATE) AS event_date,
              event_type,
              COUNT(*) AS event_count,
              APPROX_DISTINCT(user_id) AS unique_users
            FROM analytics.user_events
            WHERE event_time >= DATE '${execution_date}'
              AND event_time < DATE '${execution_date}' + INTERVAL '1' DAY
            GROUP BY 1, 2

      - name: emit-completion-signal
        dependencies: [refresh-daily-aggregate]
        task:
          type: signal
          emit:
            name: daily_metrics_ready
            params:
              date: ${execution_date}

The last step, emitting a completion signal, makes pipelines composable.
A downstream pipeline, such as a feature engineering task for ML, subscribes to the daily_metrics_ready signal and kicks off as soon as this one completes, without polling or any delay period.

Ingestion Script

Python
    # ingest.py
    import os

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyiceberg.catalog import load_catalog

    PARTITION_DATE = os.environ["partition_date"]

    catalog = load_catalog("my_catalog")
    table = catalog.load_table("analytics.user_events")

    raw_path = f"s3://your-raw-bucket/events/{PARTITION_DATE}/"
    arrow_table = pq.read_table(raw_path)

    # Schema enforcement before write: fail loudly on drift
    expected = table.schema().as_arrow()
    arrow_table = arrow_table.select(expected.names).cast(expected)

    table.append(arrow_table)
    print(f"Appended {arrow_table.num_rows} rows for {PARTITION_DATE}")

The cast is intentional. Schema drift, where an upstream system silently adds or modifies a column, is one of the most frequent causes of pipeline failures. Detecting it early through an error at ingestion is far less expensive than debugging further down the line.

Step 4: Make Queries Cheap

Three main optimizations account for the majority of savings. Each one is worth understanding rather than blindly copying.

Compaction: The Single Most Important Maintenance Activity

Real-time or micro-batch ingestion produces lots of small files. Small files lead to larger metadata, inefficient query planning, and unnecessary storage spent on Parquet footers and row-group overhead. Compaction periodically merges them into files of the desired size (128 MB for our table definition above).
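To make the idea of bin-packing compaction concrete, here is a toy planner that groups small files into outputs of at most the target size. It is a simplified illustration, not Iceberg's actual planner: `plan_compaction` and its first-fit strategy are assumptions for this sketch, and `min_input_files` only loosely mirrors the option of the same name used with rewrite_data_files.

```python
from typing import List

MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # same value as write.target-file-size-bytes in the table definition


def plan_compaction(
    file_sizes: List[int],
    target: int = TARGET_BYTES,
    min_input_files: int = 5,
) -> List[List[int]]:
    """Group files smaller than `target` into bins of at most `target` bytes.

    Uses first-fit on sizes sorted largest-first. Groups with fewer than
    `min_input_files` members are dropped, since rewriting them saves little.
    """
    small = sorted((size for size in file_sizes if size < target), reverse=True)
    bins: List[List[int]] = []
    for size in small:
        for group in bins:
            if sum(group) + size <= target:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
    return [group for group in bins if len(group) >= min_input_files]


if __name__ == "__main__":
    # Forty 8 MB micro-batch files collapse into three output files.
    plan = plan_compaction([8 * MB] * 40)
    print([len(group) for group in plan])  # [16, 16, 8]
```

The point of the sketch is the ratio: the same bytes end up in far fewer files, which shrinks manifests and per-file open overhead.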
With Spark:

SQL
    -- Rewrite small files using bin-packing
    CALL my_catalog.system.rewrite_data_files(
      table => 'analytics.user_events',
      options => map(
        'min-input-files', '5',
        'target-file-size-bytes', '134217728'
      )
    );

    -- Rewrite manifests so a query reads fewer manifest files
    CALL my_catalog.system.rewrite_manifests('analytics.user_events');

    -- Expire old snapshots beyond the retention configured in TBLPROPERTIES
    CALL my_catalog.system.expire_snapshots(
      table => 'analytics.user_events',
      older_than => TIMESTAMP '2026-05-11 00:00:00',
      retain_last => 10
    );

    -- Remove orphan files (files in storage not referenced by any snapshot)
    CALL my_catalog.system.remove_orphan_files(table => 'analytics.user_events');

Schedule these as part of a Maestro workflow that runs either daily or weekly. The remove_orphan_files procedure is particularly crucial: without it, failed writes leave untracked files in S3 that you continue to pay for.

Sorting Within Partitions for Skipping Efficiency

If you know that your analysts always filter by event_type and user_id, sort your files so that Iceberg's per-file statistics can skip entire files:

SQL
    CALL my_catalog.system.rewrite_data_files(
      table => 'analytics.user_events',
      strategy => 'sort',
      sort_order => 'event_type ASC, user_id ASC'
    );

For higher-dimensional access patterns, use Z-order:

SQL
    CALL my_catalog.system.rewrite_data_files(
      table => 'analytics.user_events',
      strategy => 'sort',
      sort_order => 'zorder(event_type, user_id, session_id)'
    );

Let Hidden Partitioning Do Its Job

The query below requires no partition predicate. Iceberg derives the partition filter from event_time:

SQL
    SELECT user_id, COUNT(*) AS event_count
    FROM analytics.user_events
    WHERE event_time >= TIMESTAMP '2026-05-17 00:00:00'
      AND event_time < TIMESTAMP '2026-05-18 00:00:00'
      AND event_type = 'purchase'
    GROUP BY user_id;

In Hive, we would have to add AND year=2026 AND month=5 AND day=17 to enable pruning.
In Iceberg, the days(event_time) transform is applied automatically, and the extra predicate on event_type enables more pruning based on min/max statistics at the file level; files whose event_type range does not cover 'purchase' are never opened.

Step 5: Execute the Pipeline

Execute the pipeline from the Maestro command-line interface:

Shell
    # Trigger a manual run with parameters
    maestro start user-events-pipeline \
      --param partition_date=2026-05-18

    # Check workflow status and last N runs
    maestro status user-events-pipeline --last 10

    # Inspect a specific run
    maestro instance describe user-events-pipeline <run_id>

    # Replay a failed run from a specific step
    maestro instance restart user-events-pipeline <run_id> \
      --from-step transform-events

Maestro exports metrics on queue depth, step latency, and failure rates via /metrics. Use these together with engine metrics (Spark UI, Trino query stats) to correlate orchestration delays with query performance.

What Kind of Savings Should You Really Be Expecting?

The often-repeated claim of 90 percent savings from such migrations should be taken with a grain of salt. The real number is highly dependent on your starting point.
| Scenario | Realistic savings | Source of savings |
|---|---|---|
| Hive tables on S3 → Iceberg, same engine | 20–50% on query costs | Eliminated S3 listing, file pruning via stats, fewer small files |
| Cron-scheduled batch → Maestro signals | Variable on compute, large on freshness | Compute drops only if jobs were over-running their window; freshness improves from hours to minutes |
| Proprietary warehouse → Iceberg + open engines | 40–80% on storage and license | Storage decoupled from compute; engine competition on the same data |
| Streaming with no compaction → Iceberg + scheduled maintenance | 30–60% on query costs | Compaction collapses small-file overhead |

The 90% number is realistic only if the starting point is truly pathological, say a highly partitioned Hive table on S3 with no file size management, queried by an engine that bills per byte scanned. Most organizations should budget for 30%-60% improvements and view anything higher as upside.

Freshness improvements, by contrast, are reliably dramatic. Upgrading from a 4-hour cron job to an event-driven pipeline that fires within seconds of its upstream completing is a structural win, not an incremental one.

Comparing Maestro to Other Options

Maestro is not the only option. The lay of the land as of 2026:

  • Airflow has the broadest deployment and the most extensive provider ecosystem. Its strength is DAG construction; its weakness is high-frequency triggering. Airflow's scheduler has traditionally been the bottleneck when operating at very high workflow volumes.
  • Dagster has better data-aware abstractions (assets, partitions, software-defined assets) and integrates well with dbt and modern data tooling. Its scale ceiling is lower than Maestro's.
  • Prefect is native-Python and developer-friendly, offering good dynamic workflow capabilities. It is still immature for very large scale.
  • Temporal is the best general-purpose orchestrator for application workflows, but less specialized for data pipelines.
  • Maestro beats competitors on scale and on the signal/cyclic workflow paradigm. The costs: a smaller community, steeper operational overhead, and fewer out-of-the-box integrations.

If you are already using Airflow and run fewer than a few thousand workflows per day, the cost of migrating to Maestro probably isn't justified by orchestration improvements alone, and Iceberg adoption can be decoupled from it. However, if you are hitting Airflow scheduler limitations or have highly interdependent workflows across teams, Maestro's signal paradigm deserves a serious look.

Common Mistakes

Some recurring pitfalls in production:

  • Deferring catalog selection. Setting up Iceberg with a Hadoop or filesystem catalog "as a temporary solution" creates a future migration burden. Choose a REST catalog (Polaris, Nessie, Lakekeeper, or vendor-managed) from the start.
  • No snapshot expiration policy. Snapshots persist indefinitely by default. High-volume tables generate gigabytes of metadata each month. Set expiration policies in table properties and run expire_snapshots periodically.
  • No orphan file removal. Failed writes leave behind Parquet files not referenced by any snapshot. Remove orphan files weekly.
  • Over-partitioning. Partitioning by the hour on a low-volume table can produce partitions that hold only a handful of rows each. Partition at the resolution of your query filters and target file sizes, not finer.
  • Using signals as a free pass on idempotency. Workflow executions triggered by signals can be replayed or backfilled. Make every step idempotent: use MERGE INTO for writes, de-dupe on natural keys, and never assume "this only runs once."
  • Skipping compaction. Streaming pipelines without compaction gradually degrade query performance until someone notices that the queries are 10x slower than at launch time.

Conclusion

Iceberg and Maestro solve two aspects of the same problem. Iceberg makes the data layer cheap to query by converting filesystem state into metadata state. Maestro makes the orchestration layer responsive by substituting signals for clocks.
Adopting either technology creates tangible value, while adopting both yields a pipeline that is inherently cheaper to operate and inherently fresher than a cron-based Hive setup. If your current challenge is query cost or small-file issues, start with Iceberg. If you are plagued by data staleness or unreliable scheduling, start with Maestro (or any other modern orchestrator). But eventually aim to adopt both if your goal is a data platform that scales without scaling your cloud bill.

Where to learn more:

  • Netflix Maestro: github.com/Netflix/maestro
  • Apache Iceberg: iceberg.apache.org
  • PyIceberg: py.iceberg.apache.org
  • Apache Polaris (Iceberg REST catalog): polaris.apache.org
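The idempotency advice from the Common Mistakes list (de-dupe on natural keys so that a retried or replayed step changes nothing) can be sketched without any engine. The example below mimics the effect of MERGE ... WHEN NOT MATCHED THEN INSERT on an in-memory dictionary; `merge_not_matched` and the dictionary standing in for the table are hypothetical, chosen only to show the behaviour.

```python
from typing import Dict, Iterable, Tuple

# Natural key used in the article's MERGE condition.
Key = Tuple[int, str, str]  # (user_id, event_time, event_type)


def merge_not_matched(target: Dict[Key, dict], batch: Iterable[dict]) -> int:
    """Insert only rows whose natural key is absent; return the number inserted."""
    inserted = 0
    for row in batch:
        key = (row["user_id"], row["event_time"], row["event_type"])
        if key not in target:
            target[key] = row
            inserted += 1
    return inserted


if __name__ == "__main__":
    table: Dict[Key, dict] = {}
    batch = [
        {"user_id": 3, "event_time": "2026-05-18T12:10:00", "event_type": "purchase"},
        {"user_id": 4, "event_time": "2026-05-18T12:15:00", "event_type": "click"},
    ]
    print(merge_not_matched(table, batch))  # 2 on the first attempt
    print(merge_not_matched(table, batch))  # 0 on a retry of the same batch
    print(len(table))                       # 2, the retry added no duplicates
```

A plain append would have left four rows after the retry, which is the failure mode the MERGE-based ingestion avoids.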
Workflows vs AI Agents vs Multi-Agent Systems: A Practical Guide for Developers


By Raju Dandigam
When I first started building AI applications, I kept hearing the same words everywhere: workflows, agents, and multi-agent systems. At first, they all sounded like different labels for the same thing. After all, in every case, you are still calling an LLM, sending some context, and getting something back.

That assumption turns out to be one of the easiest ways to design the wrong system. Once you start building real projects, the difference becomes very obvious. Some systems need strict control. Some need flexibility. Some need multiple specialized roles. If you choose the wrong model, you usually pay for it in cost, reliability, debugging pain, or unnecessary complexity.

This is the explanation I wish I had when I started. I want to keep it beginner-friendly, but also useful enough that you can apply it in real projects without walking away with the usual “everything is an agent” confusion.

Workflow vs Agent vs Multi-Agent System

The simplest way to understand the whole topic is this:

  • A workflow is when you decide the steps in advance.
  • An agent is a model that decides what to do next.
  • A multi-agent system is one in which multiple agents, usually with different roles, coordinate to solve a larger problem.

That core distinction matches how the terms are commonly defined: workflows follow predefined code paths, while agents dynamically direct their own tool usage and execution flow.

That sounds simple, but it becomes much clearer with a relatable example. Imagine you are ordering pizza. In a workflow, the restaurant follows a script. They ask for size, toppings, crust, and address in a fixed sequence. It is fast, reliable, and predictable. In an agent-style system, you might say, “I’m hungry, and I want something good for movie night,” and the system figures out whether you usually order vegetarian, whether you want something quick, whether it should ask a follow-up question, and what option best fits your past behavior.
In a multi-agent setup, one specialist handles the order, another checks ingredient availability, and another optimizes delivery timing. Each one does a narrower job, but together they solve a broader problem.

That is the real difference. The question is not whether all three use AI. The question is who is controlling the process.

What a Workflow Really Is

A workflow is the most structured option. You define the steps, the order, and often the failure points. The model may still do useful work inside the system, but the system itself is not making open-ended decisions about how to proceed.

Think of it like a recipe. Step one happens first. Step two happens second. If something goes wrong, you usually know where it happened. A simple example is a blog post generator that deliberately separates outline generation, introduction writing, body drafting, and final assembly.

TypeScript
    import Anthropic from '@anthropic-ai/sdk';

    const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

    // Send one prompt and return the reply text. Content blocks are a union
    // type, so only blocks of type 'text' are read.
    async function ask(prompt: string, maxTokens = 1024): Promise<string> {
      const response = await client.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: maxTokens,
        messages: [{ role: 'user', content: prompt }]
      });
      return response.content
        .map((block) => (block.type === 'text' ? block.text : ''))
        .join('');
    }

    async function generateBlogPost(topic: string) {
      // Step 1: outline
      const outline = await ask(`Create a blog post outline about: ${topic}`);
      console.log('Step 1: Outline created');

      // Step 2: introduction, based on the outline
      const intro = await ask(
        `Based on this outline, write an introduction:\n\n${outline}`
      );
      console.log('Step 2: Introduction written');

      // Step 3: body, based on the same outline
      const body = await ask(
        `Based on this outline, write the body:\n\n${outline}`,
        2048
      );
      console.log('Step 3: Body written');

      // Step 4: final assembly
      return `${intro}\n\n${body}`;
    }

The reason workflows dominate production is not that teams lack ambition. It is that predefined orchestration is easier to reason about. Predictable systems are easier to test, monitor, certify, and price. That is exactly why guidance around production AI systems keeps steering builders toward workflows first, especially for reliability-critical environments.

The same guidance also repeatedly points out that workflows are the better fit when requirements are stable, boundaries are clear, and reliability matters more than open-ended autonomy. That makes workflows a very strong fit for document processing, onboarding, report generation, fixed moderation pipelines, approval chains, and regulated systems.

What an Agent Really Is

An agent changes one important thing. Instead of hardcoding the order of operations, you give the model a goal, a set of tools, and enough context to decide what should happen next.

That is where the flexibility comes from. The model can inspect the task, choose a tool, look at the result, decide whether another tool is needed, and continue until it reaches a stopping point. That pattern is what makes an agent feel more like a smart assistant than a pipeline. The external guides describe this clearly as dynamic decision-making, autonomous tool selection, reasoning, and self-directed task execution.

A simple research assistant is a good example for beginners.
TypeScript
    import Anthropic from '@anthropic-ai/sdk';

    const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

    const tools = [
      {
        name: 'search_web',
        description: 'Search the web for information about a topic',
        input_schema: {
          type: 'object',
          properties: { query: { type: 'string' } },
          required: ['query']
        }
      },
      {
        name: 'save_notes',
        description: 'Save research notes to a file',
        input_schema: {
          type: 'object',
          properties: { notes: { type: 'string' } },
          required: ['notes']
        }
      }
    ];

    // Stub tools: replace with real search and storage.
    async function searchWeb(query: string): Promise<string> {
      return `Results for ${query}`;
    }

    async function saveNotes(notes: string): Promise<void> {
      console.log(`Saved notes: ${notes.slice(0, 80)}...`);
    }

    async function researchAgent(topic: string, maxTurns = 10) {
      const messages: any[] = [
        { role: 'user', content: `Research ${topic} and save comprehensive notes.` }
      ];

      // Cap the number of turns so the loop cannot run forever.
      for (let turn = 0; turn < maxTurns; turn++) {
        const response = await client.messages.create({
          model: 'claude-3-5-sonnet-20241022',
          max_tokens: 4096,
          tools,
          messages
        });

        // No tool requested: the model considers the task finished.
        if (response.stop_reason !== 'tool_use') return;

        // Echo the assistant turn once, then answer every tool_use block in it.
        messages.push({ role: 'assistant', content: response.content });
        const results: any[] = [];
        let saved = false;

        for (const block of response.content as any[]) {
          if (block.type !== 'tool_use') continue;
          let output: string;
          if (block.name === 'search_web') {
            output = await searchWeb(block.input.query);
          } else if (block.name === 'save_notes') {
            await saveNotes(block.input.notes);
            output = 'Notes saved.';
            saved = true;
          } else {
            output = `Unknown tool: ${block.name}`;
          }
          results.push({ type: 'tool_result', tool_use_id: block.id, content: output });
        }

        if (saved) return;
        messages.push({ role: 'user', content: results });
      }
    }

What matters here is not the SDK syntax. What matters is that you did not hardcode “search first, summarize second, save last.” The agent decides that. It may search once. It may search five times. It may decide it has enough information early. That is precisely why agents are useful for research, support, exploratory planning, and other tasks where you cannot fully predict the required path ahead of time.
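The question of who controls the process can also be shown without any SDK or network call. The sketch below is written in Python purely as a dependency-free illustration and is not the article's TypeScript code: `run_workflow`, `run_agent`, and `scripted_policy` are hypothetical names, and the policy function is a deterministic stand-in for the model's decisions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]
# A policy reads the history and returns (tool_name, argument), or None to stop.
Policy = Callable[[List[str]], Optional[Tuple[str, str]]]


def search_web(query: str) -> str:
    return f"results for {query}"


def save_notes(notes: str) -> str:
    return f"saved: {notes}"


TOOLS: Dict[str, Tool] = {"search_web": search_web, "save_notes": save_notes}


def run_workflow(topic: str, tools: Dict[str, Tool]) -> List[str]:
    """Fixed path: the code decides the order, always search then save."""
    found = tools["search_web"](topic)
    return [found, tools["save_notes"](found)]


def run_agent(topic: str, tools: Dict[str, Tool], policy: Policy,
              max_turns: int = 10) -> List[str]:
    """Dynamic path: the policy picks each step, bounded by max_turns."""
    history: List[str] = [f"task: {topic}"]
    for _ in range(max_turns):
        decision = policy(history)
        if decision is None:
            break
        name, arg = decision
        history.append(tools[name](arg))
    return history[1:]  # tool outputs only


def scripted_policy(history: List[str]) -> Optional[Tuple[str, str]]:
    """Stand-in for a model: search twice, save once, then stop."""
    calls_so_far = len(history) - 1
    if calls_so_far < 2:
        return ("search_web", f"query {calls_so_far + 1}")
    if calls_so_far == 2:
        return ("save_notes", history[-1])
    return None


if __name__ == "__main__":
    print(run_workflow("agents", TOOLS))
    print(run_agent("agents", TOOLS, scripted_policy))
```

The workflow always makes two calls. The agent makes as many as the policy asks for, which is why the turn cap exists.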
The trade-off is that you lose some of the certainty that workflows give you. The number of tool calls can vary. The runtime can vary. The cost can vary. If something behaves strangely, you often need stronger logs and better observability to understand why. Seeing the Difference Side by Side A side-by-side review analysis example shows the difference without abstract theory. Suppose the task is to analyze a customer review and generate a response. The workflow version might look like this. TypeScript async function analyzeReviewWorkflow(review: string) { const sentiment = await callLLM( `Analyze sentiment of this review as positive, negative, or neutral: ${review}` ); const topics = await callLLM( `Extract the main topics from this review: ${review}` ); const response = await callLLM( `Generate a customer support response for a ${sentiment} review about ${topics}` ); return { sentiment, topics, response }; } This is clean and efficient. It makes the same three calls every time. The cost is predictable. The behavior is stable. It is also rigid. A weird review gets handled through the same path as a normal one. Now compare that with an agent version. TypeScript async function analyzeReviewAgent(review: string) { return await runAgent({ task: `Analyze this review and generate a support response: ${review}`, tools: [ 'check_sentiment', 'extract_topics', 'search_knowledge_base', 'generate_response' ] }); } Now the system can decide whether a highly emotional complaint requires a knowledge base lookup before responding, while a simple positive review may only require sentiment classification and a thank-you response. That flexibility is exactly what makes agents attractive. It is also what makes them less predictable. This is one of the most important beginner lessons in the whole topic. A workflow handles every case with the same planned path.
An agent adapts its path to the case. When Workflows Are the Better Choice This is where most of the production reality sits. If you know the exact steps, a workflow is almost always the first thing you should build. If predictability matters, a workflow is usually safer. If cost matters, workflows are easier to manage because you know roughly how many model calls happen per run. For debugging, workflows are easier because every state transition is explicit. That is also why modern workflow-oriented systems emphasize type safety, checkpointing, durable execution, human-approval steps, and clear routing. Those capabilities are not flashy, but they are exactly what real teams need when a system runs in production for weeks or months. A customer onboarding pipeline is a simple example. TypeScript async function onboardCustomer(email: string) { await sendWelcomeEmail(email); await createAccount(email); await setupDefaultPreferences(email); await sendTutorial(email); } A document processing pipeline is another. TypeScript async function processDocument(pdfPath: string) { const text = await extractText(pdfPath); const summary = await summarize(text); const keywords = await extractKeywords(text); await saveToDatabase({ text, summary, keywords }); await notifyUser(); } A content moderation flow is another good fit. TypeScript async function moderatePost(post: string) { const isSpam = await checkSpam(post); const isToxic = await checkToxicity(post); return isSpam || isToxic ? 'reject' : 'approve'; } None of these tasks benefits much from letting the model invent the control flow on the fly. They benefit from clean orchestration. When Agents Are the Better Choice Agents make more sense when the task is open-ended, when the path cannot be fully predefined, or when adaptability matters more than deterministic execution. Customer support is a classic example because every issue arrives in a different way. 
Research is another example because you do not know in advance which leads will be useful. Trip planning is another because different users, constraints, budgets, dates, and preferences change the best route through the task. A travel helper captures this nicely. TypeScript async function travelAgent(request: string) { return await runAgent({ task: `Help the user with this travel request: ${request}`, tools: [ 'search_flights', 'search_hotels', 'get_weather', 'suggest_itinerary', 'ask_followup_question' ] }); } The system may begin by asking a clarifying question. It may check the weather before hotels. It may avoid hotel search entirely if the user says they are staying with friends. This is exactly the sort of context-dependent behavior that agents are designed for. The guides also specifically call out use cases like deep research, agentic RAG, customer support, virtual assistants, and coding assistants as agent-friendly territory. What Multi-Agent Systems Add Multi-agent systems take the idea one step further. Instead of having one agent handle everything, you split the work among multiple specialists. This matters when specialization actually improves the result. One agent might research. Another might write. Another might review or validate. The Inkeep article makes an important distinction: true multi-agent systems are not just a sequential workflow with different names for each step. The key idea is autonomous coordination between specialized agents, often through direct communication or delegated responsibilities. A simple content team example makes this concrete. TypeScript async function researchAgent(topic: string) { return callLLM(`Research ${topic}. 
Return key facts, trends, and context.`); } async function writerAgent(research: string, topic: string) { return callLLM(`Using this research, write an article about ${topic}:\n${research}`); } async function editorAgent(article: string) { return callLLM(`Edit this article for clarity, accuracy, and flow:\n${article}`); } async function contentCreationTeam(topic: string) { const research = await researchAgent(topic); const draft = await writerAgent(research, topic); const final = await editorAgent(draft); return final; } This is still a simple coordinator-led version, but it shows the value of specialization. A more advanced system might allow the editor to request a revision from the writer, or the writer to request more supporting evidence from the researcher. That is where multi-agent systems start to feel like collaborative problem-solving rather than a chain of prompts. The caution here is important. Multi-agent systems are not “the next level” you should jump to just because they sound advanced. They introduce more moving parts, more coordination overhead, more debugging complexity, and higher cost. They are useful when the problem actually needs multiple kinds of expertise, not when you are just trying to make a simple app look more impressive. The Practical Decision Model A good beginner question is not “which one is the smartest?” It is “how much uncertainty does this task have, and who should own the decision-making?” If the task is well-defined and stable, start with a workflow. If the task is open-ended and the system needs to choose how to proceed, consider an agent. If the task genuinely benefits from multiple specialists with separate responsibilities, consider multiple agents. That decision model lines up closely with the source material as well. Use workflows when requirements are clear, control is important, cost matters, and debugging stays simple. 
Use agents when tasks are exploratory, human-like reasoning is valuable, and adaptability matters more than fixed control flow. Use multi-agent systems when a single reasoning unit is no longer sufficient to capture the problem's diversity. The Beginner Mistakes That Cost Time and Money The first mistake is using agents for simple tasks that should be handled by normal code or a fixed workflow. If you want to add two numbers, do not build an agent. If you want to categorize simple support tickets with a stable schema, start with a workflow. Not every AI problem needs autonomy. TypeScript function addNumbers(a: number, b: number) { return a + b; } The second mistake is forcing a workflow onto a task that clearly needs adaptation. Creative writing, research, and support escalation often branch in ways that are hard to encode cleanly in advance. If you keep adding if-statements and exception paths to rescue a rigid workflow, that is often a sign the task wants agent behavior. The third mistake is building multi-agent systems too early. Three agents for a simple email writer is usually just an expensive ceremony. You should earn that complexity by hitting a real need first. These mistakes sound obvious when written down, but they are very common because the AI space rewards novelty in demos more than maintainability in products. The Cost Conversation Matters More Than People Admit A workflow-based newsletter creator might always make three model calls, one for the intro, one for the main copy, and one for the closing section. That means the cost per run is fairly easy to estimate. TypeScript async function createNewsletter(topics: string[]) { const intro = await generateIntro(topics); const articles = await generateArticles(topics); const outro = await generateOutro(); return { intro, articles, outro }; } An agent-based newsletter creator might decide it needs extra research, then rewrite one section twice, then call another tool to validate tone. 
Sometimes that flexibility is useful, but it also means cost and latency can move around more than you expect. TypeScript async function newsletterAgent(topics: string[]) { return runAgent({ task: `Create a newsletter about these topics: ${topics.join(', ')}`, tools: ['research_topic', 'draft_section', 'revise_section', 'validate_tone'] }); } That does not automatically make agents bad. It just means the operational model is different. The broader production guidance on workflows versus agents keeps coming back to exactly this point: deterministic systems are easier to budget for, observe, and control. The Hybrid Model Is Usually the Best Answer This is probably the most useful real-world takeaway in the entire topic. You do not have to choose one pattern forever. Many successful systems use workflows to structure the outer system and agents only where flexibility is genuinely needed. The Prompt Engineering Guide explicitly recommends hybrid approaches, such as using workflows for structure and agents for open-ended subtasks. That pattern looks like this. TypeScript async function smartCustomerSupport(message: string) { const category = await categorize(message); if (category === 'simple_faq') { return faqWorkflow(message); } if (category === 'complex_issue') { return supportAgent(message); } return escalateToHuman(message); } This is a very practical architecture. The workflow gives you control, routing, and predictability. The agent only appears where variability is too high for rigid orchestration. That means you keep the system understandable while still benefiting from adaptive behavior. If you are building beginner-to-intermediate AI products, this is one of the best mental models to adopt early. A Cleaner Way to Think About Real Projects A document processor usually wants a workflow because the same stages repeat every time. A support assistant may want an agent because issues differ, and tool selection depends on context. 
A software delivery assistant might eventually become a multi-agent system if planning, implementation, testing, and review are separate responsibilities that benefit from specialization. Here is a simplified example of that last case. TypeScript async function developFeature(requirement: string) { const specs = await productManagerAgent(requirement); const code = await developerAgent(specs); const testResults = await qaAgent(code); if (!testResults.passed) { return developerAgent(`Fix these issues:\n${testResults.issues}`); } return code; } This kind of setup can make sense, but only if the complexity is real. It should come from the nature of the work, not from the desire to use more agents. Conclusion If you are just starting, build a workflow first. That advice is not anti-agent. It is pro-clarity. Workflows teach you how to decompose tasks, define boundaries, measure outcomes, and understand where AI actually adds value. Once you understand the stable parts of your system, it becomes much easier to identify the unstable parts that may benefit from an agent. Once you understand where one agent becomes overloaded, it becomes much easier to justify multiple specialized agents. That progression is healthier than starting with maximum autonomy and then trying to reverse-engineer stability later. So my practical rule is simple. If the task can be described as a sequence of reliable steps, use a workflow. If the system needs to decide the steps as it goes, use an agent. If the problem truly needs multiple specialized minds working together, then and only then reach for a multi-agent design. The best AI systems are not the ones with the most autonomy. They are the ones that stay understandable when something goes wrong. More
From "Vibe Coding" to Production: Setting Up an Evals Loop for Claude Agents
By Nikita Kothari
A Deep Dive into Tracing Agentic Workflows (Part 2)
By VIVEK KATARYA
Orchestrating Zero-Downtime Deployments With Temporal
By Akhil Madineni
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers

AWS has been building agentic infrastructure for some time now (Bedrock, AgentCore, Strands), mostly aimed at engineers who want to build their own agent systems from scratch. Amazon Quick is a different layer of the same bet: a ready-to-use agentic workspace that targets teams directly, without requiring custom orchestration code. This article walks through what Quick is, how its components fit together technically, how the MCP integration model works with real code, and where it sits relative to the rest of AWS's agent stack.

What Amazon Quick Is

Amazon Quick is an AI assistant for work that connects to your existing tools (Slack, Microsoft Teams, Outlook, CRMs, databases, and local files) and gives a unified layer for querying, automating, and acting across them. It launched in preview at AWS's "What's Next with AWS" event on April 28, 2026. The product is aimed at teams, not just individual users. One person can build a custom agent scoped to a specific dataset or workflow, and the whole team benefits from it. Responses from Quick agents are grounded in your actual business data, not the underlying model's training distribution. Under the hood, Quick is built on Amazon Bedrock AgentCore and uses the Model Context Protocol (MCP) as its standard for connecting to external tools. It runs on AWS IAM and VPC, which means it inherits the same security and compliance posture as the rest of your AWS workloads.

Components

Quick bundles five distinct capabilities. It helps to understand each one separately before thinking about how they compose.

| Component | What it does |
| --- | --- |
| Spaces | Collaborative workspaces where teams pool files, dashboards, and data sources. Agents in a Space are grounded in that Space's data. |
| Agents | Custom, domain-scoped agents built on your team's specific data. One person builds, everyone uses. |
| Research | Multi-source synthesis across internal data, the public web, and third-party datasets. Produces structured reports. |
| Visualize (Quick Sight) | Integrated BI layer. Conversational access to dashboards, charts, and forecasting; no separate BI tool required. |
| Automate (Quick Flows) | Workflow automation from simple daily tasks to complex multi-step processes with cross-app action execution. |

Each component is available through the web app, mobile, and a native desktop app (currently in preview for macOS and Windows) that can read local files and calendar context without requiring browser access.

Where Quick Sits in the AWS Agent Stack

AWS is building in two directions at once. AgentCore is the infrastructure layer for engineers who want to compose their own agent systems (runtime, memory, gateway, observability) with any model and any framework. Quick is the product layer on top: opinionated, team-facing, and deployable without writing orchestration code. The practical implication: if you're an engineer building internal tools or automation pipelines, you'll likely interact with both layers. AgentCore for the infrastructure wiring; Quick as a surface where non-technical teammates interact with the agents you build.

The Integration Architecture

The core question for any engineer evaluating Quick is: how does it actually connect to external systems, and what does the request path look like? Quick uses MCP (Model Context Protocol) as its primary integration standard. This is significant because MCP is an open protocol: Quick agents are not locked into AWS-specific connectors, and any MCP-compatible server can be registered as a tool source.

High-Level Request Flow

The sequence below shows the full lifecycle of a single agent-triggered tool call, from the moment Quick receives a prompt through to the response returning from a downstream API. Quick acts as the MCP client. Your MCP server exposes tools via listTools and callTool. Quick discovers them at registration time and makes them available to any agent or automation in the workspace.
Authentication flows through OAuth 2.0, with support for Dynamic Client Registration (DCR) so Quick can register itself automatically without manual credential setup.

Building an MCP Server for Quick

Here is a minimal Python MCP server using the mcp SDK that exposes two tools Quick can invoke: get_ticket and list_open_tickets. This pattern works whether you host the server yourself or run it on AgentCore Runtime.

Install Dependencies

Shell
pip install "mcp[server]" httpx uvicorn

Server Implementation

Python
# server.py
from mcp.server import Server
from mcp.server.sse import SseServerTransport
from mcp.types import Tool, TextContent
import httpx
from starlette.applications import Starlette
from starlette.responses import Response
from starlette.routing import Mount, Route

app = Server("jira-quick-integration")

JIRA_BASE_URL = "https://yourorg.atlassian.net"
JIRA_TOKEN = "Bearer <your-token>"  # in production, load from AWS Secrets Manager


@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="get_ticket",
            description="Retrieve details for a single Jira ticket by issue key.",
            inputSchema={
                "type": "object",
                "properties": {
                    "issue_key": {
                        "type": "string",
                        "description": "The Jira issue key, e.g. ENG-1234"
                    }
                },
                "required": ["issue_key"]
            }
        ),
        Tool(
            name="list_open_tickets",
            description="List open Jira tickets assigned to a given user.",
            inputSchema={
                "type": "object",
                "properties": {
                    "assignee": {
                        "type": "string",
                        "description": "The Jira username or email of the assignee"
                    }
                },
                "required": ["assignee"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    headers = {"Authorization": JIRA_TOKEN, "Content-Type": "application/json"}
    async with httpx.AsyncClient() as client:
        if name == "get_ticket":
            key = arguments["issue_key"]
            resp = await client.get(
                f"{JIRA_BASE_URL}/rest/api/3/issue/{key}", headers=headers
            )
            resp.raise_for_status()
            data = resp.json()
            summary = data["fields"]["summary"]
            status = data["fields"]["status"]["name"]
            return [TextContent(type="text", text=f"{key}: {summary} [{status}]")]
        elif name == "list_open_tickets":
            # Quote the assignee so emails and names with spaces form valid JQL.
            assignee = arguments["assignee"].replace('"', '\\"')
            jql = f'assignee = "{assignee}" AND status != Done ORDER BY updated DESC'
            resp = await client.get(
                f"{JIRA_BASE_URL}/rest/api/3/search",
                headers=headers,
                params={"jql": jql, "maxResults": 20}
            )
            resp.raise_for_status()
            issues = resp.json().get("issues", [])
            results = [f"{i['key']}: {i['fields']['summary']}" for i in issues]
            return [TextContent(type="text", text="\n".join(results) or "No open tickets found.")]
    raise ValueError(f"Unknown tool: {name}")


# Wire up SSE transport for Quick compatibility
sse = SseServerTransport("/messages/")


async def handle_sse(request):
    async with sse.connect_sse(
        request.scope, request.receive, request._send
    ) as streams:
        await app.run(streams[0], streams[1], app.create_initialization_options())
    return Response()


starlette_app = Starlette(
    routes=[
        Route("/sse", endpoint=handle_sse),
        # The SSE transport also needs its POST endpoint mounted for client messages.
        Mount("/messages/", app=sse.handle_post_message),
    ]
)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(starlette_app, host="0.0.0.0", port=8080)

A few design constraints to be aware of when building for Quick:
• Each MCP tool call has a 300-second hard timeout. Operations that exceed this fail with HTTP 424. Keep individual tool calls narrow and fast.
• The tool list is treated as static after registration. If you add or remove tools on the server, the Quick admin must re-establish the connection to pick up changes.
• Quick supports both Server-Sent Events (SSE) and streamable HTTP as transports. Streamable HTTP is preferred for new implementations.

Registering the MCP Server in Quick

Once your server is running and publicly reachable over HTTPS, registration in Quick takes the following path:

Shell
Quick Console → Integrations → Add Integration → MCP

Fields:
  Server URL: https://your-mcp-server.example.com/sse
  Auth type: OAuth 2.0 (or Service, or None)
  Client ID: <from your identity provider>
  Authorization URL: https://auth.example.com/oauth/authorize
  Token URL: https://auth.example.com/oauth/token

If your identity provider supports OAuth Dynamic Client Registration, Quick will auto-register and you skip the manual client ID step entirely. Quick sends an initial unauthenticated request to the MCP server; if it receives a 401 with a WWW-Authenticate header containing a resource_metadata URL, it fetches the metadata document and proceeds with DCR automatically. Once registered, Quick calls listTools at startup and exposes every discovered tool to agents and automations in the workspace.

The AgentCore Gateway Option

For teams that don't want to write and operate an MCP server from scratch, Amazon Bedrock AgentCore Gateway provides a managed alternative. You point Gateway at a Lambda function or an OpenAPI spec, and it handles the MCP wrapping, auth, logging, and semantic tool discovery automatically. If you use it, Quick never calls your internal APIs directly; everything flows through Gateway's auth and routing layer, as shown in the sequence diagram above. The semantic search capability is worth noting specifically. When an agent has access to dozens or hundreds of tools, passing the full tool list on every turn wastes context and can cause the model to pick the wrong tool.
Gateway's built-in x_amz_bedrock_agentcore_search tool lets Quick find the right tool by semantic similarity rather than scanning the entire registry each turn.

Practical Considerations

A few things worth keeping in mind before integrating:

Tool scope matters. When agents are given too many tools simultaneously, selection accuracy degrades: the model reasons over too many options per turn and picks incorrectly more often. Keeping each agent or MCP server to a focused set of 3–5 tools produces better results than exposing everything through one endpoint. This is a known pattern in multi-agent architectures and applies equally to Quick agents.

The 300-second timeout is real. Design each tool call to complete a single, bounded operation. Avoid chaining multiple downstream API calls inside a single tool invocation. If you need a multi-step workflow, model it as separate tools and let the agent orchestrate the sequence.

Local context on the desktop app. The desktop app reads local files and calendar events directly, without upload. For engineers who work primarily in terminals and local editors, this is a meaningful integration point: meeting context, local documentation, and recent file changes are all available to the assistant without any configuration.

MCP interoperability. Because Quick uses MCP as the standard, the same MCP server you build for Quick can also be consumed by Claude Code, Amazon Q Developer, and other MCP-compatible clients. The integration contract is portable.

References
• Amazon Quick - Product overview and features
• Integrate external tools with Amazon Quick Agents using MCP (AWS ML Blog, Feb 2026)
• MCP integration - Amazon Quick User Guide
• Amazon Bedrock AgentCore - Overview and documentation
• Introducing Amazon Bedrock AgentCore Gateway (AWS ML Blog)
• Top announcements of the What's Next with AWS, 2026 (AWS News Blog, Apr 2026)

By Jubin Abhishek Soni, DZone Core
How to Build an Agentic AI SRE Co-Pilot for Incident Response

Large-scale cloud platforms have reached a level of complexity — spanning multi-region Kubernetes clusters, streaming systems like Kafka, and heterogeneous data stores — that often exceeds human cognitive limits. Failures are no longer isolated events; they are emergent behaviors arising from tightly coupled systems where issues propagate across layers such as networking, orchestration, and data pipelines. Even with modern observability stacks, operators must manually correlate signals across dashboards, making incident response slow, inconsistent, and cognitively taxing. Traditional approaches rely heavily on static runbooks and tribal knowledge. These mechanisms do not scale in modern distributed systems. Agentic AI introduces a fundamentally different paradigm. Rather than merely detecting anomalies (as in traditional AIOps), agentic systems use Large Language Models (LLMs) to reason, plan, and act. These systems can iteratively generate hypotheses, validate them using real data, and execute multi-step remediation workflows. The result is not just faster detection, but a closed-loop system capable of autonomous diagnosis and recovery. This article expands on how to architect a production-grade SRE agent that can safely and effectively automate cloud incident response. The system is organized into three layers: Perception (data ingestion), Cognition (multi-agent reasoning), and Action (guarded execution), all operating over a shared knowledge graph. Establish a Cloud Knowledge Graph At the core of any intelligent SRE agent is context. Raw telemetry alone is insufficient; the system must understand how components relate to each other. This is achieved through a domain-specific cloud knowledge graph. 
The graph models:
• Nodes: Services, pods, clusters, regions, gateways, Kafka topics, and databases
• Edges: Traffic flows, deployment relationships, data lineage, ownership, and failover paths
• Attributes: SLOs, capacity limits, configuration history, and prior incidents

This structure transforms observability data into a causal reasoning substrate. Instead of treating metrics independently, the agent can traverse dependencies and infer propagation paths. For example, a spike in API latency can be traced through upstream gateways to downstream services and eventually to a throttled database. This graph is not static; it evolves continuously with infrastructure changes and incident learnings. Over time, it becomes a living system model enriched with historical context, enabling better hypothesis generation and faster root-cause analysis. In practice, maintaining graph freshness is critical. You should integrate it with service registries, deployment pipelines, and configuration management systems to ensure it reflects real-time topology.

Build the Perception Layer (Observability Pipeline)

The Perception Layer acts as the sensory system of the agent, continuously ingesting telemetry across the stack. This includes:
• Metrics: CPU, memory, I/O, network utilization, Kafka consumer lag
• Logs: Structured and semi-structured application and infrastructure logs
• Traces: End-to-end request paths across microservices

However, raw ingestion is only the first step. The real value lies in transforming this data into structured, actionable signals. A stream-processing pipeline should:
• Normalize data across heterogeneous sources
• Detect anomalies using statistical methods and thresholds
• Emit structured events tied to entities in the knowledge graph

These events act as triggers for the Cognition Layer. Importantly, they should already be enriched with context (e.g., “Service A in region us-east-1 exceeds latency SLO”), reducing the reasoning burden on downstream agents.
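The pipeline steps above (deduplicate, compare against a threshold, emit an event tied to a graph entity) can be sketched roughly as follows. This is an illustration under stated assumptions rather than a reference design: the knowledge graph is a plain in-memory record, latency is the only metric, the SLO is a fixed threshold standing in for a statistical detector, and all names (`GraphEntity`, `toAnomalyEvents`, `latencySloMs`) are hypothetical.

```typescript
// Hypothetical node shape: attributes and dependency edges live on the node.
interface GraphEntity {
  id: string;
  kind: 'service' | 'database' | 'gateway';
  region: string;
  latencySloMs: number; // SLO attribute stored on the node
  dependsOn: string[];  // outgoing dependency edges
}

interface MetricSample {
  entityId: string;
  latencyMs: number;
}

interface AnomalyEvent {
  entityId: string;
  region: string;
  latencyMs: number;
  sloMs: number;
  impacted: string[]; // entities that depend directly on the anomalous one
  summary: string;    // pre-enriched context for downstream agents
}

function toAnomalyEvents(
  samples: MetricSample[],
  graph: Record<string, GraphEntity>
): AnomalyEvent[] {
  // Deduplicate: keep only the worst sample per entity in this window.
  const worst: Record<string, MetricSample> = {};
  for (const s of samples) {
    const prev = worst[s.entityId];
    if (prev === undefined || s.latencyMs > prev.latencyMs) worst[s.entityId] = s;
  }

  const events: AnomalyEvent[] = [];
  for (const entityId of Object.keys(worst)) {
    const sample = worst[entityId];
    const node = graph[entityId];
    if (sample === undefined || node === undefined) continue; // not in the graph: skip
    if (sample.latencyMs <= node.latencySloMs) continue;       // within SLO: no event

    // Enrich with direct dependents so the reasoning layer sees likely propagation.
    const impacted = Object.keys(graph).filter((id) => {
      const other = graph[id];
      return other !== undefined && other.dependsOn.indexOf(entityId) !== -1;
    });

    events.push({
      entityId,
      region: node.region,
      latencyMs: sample.latencyMs,
      sloMs: node.latencySloMs,
      impacted,
      summary: `${node.kind} ${entityId} in ${node.region} exceeds latency SLO (${sample.latencyMs}ms > ${node.latencySloMs}ms)`
    });
  }
  return events;
}

const demoGraph: Record<string, GraphEntity> = {
  'api-gateway': { id: 'api-gateway', kind: 'gateway', region: 'us-east-1', latencySloMs: 200, dependsOn: ['orders'] },
  'orders': { id: 'orders', kind: 'service', region: 'us-east-1', latencySloMs: 300, dependsOn: ['orders-db'] },
  'orders-db': { id: 'orders-db', kind: 'database', region: 'us-east-1', latencySloMs: 50, dependsOn: [] }
};

const demoSamples: MetricSample[] = [
  { entityId: 'orders-db', latencyMs: 120 },
  { entityId: 'orders-db', latencyMs: 80 },
  { entityId: 'orders', latencyMs: 250 },
  { entityId: 'unknown-svc', latencyMs: 999 }
];

const demoEvents = toAnomalyEvents(demoSamples, demoGraph);
```

In this demo only `orders-db` produces an event: its two samples collapse into one, `orders` stays within its SLO, and the sample for an entity missing from the graph is dropped. The event lists `orders` as directly impacted, which gives a downstream agent a starting point for tracing propagation.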
A critical design consideration is balancing sensitivity and noise. Excessive alerting leads to “signal overload,” a well-known issue where operators (and agents) struggle to prioritize meaningful events. Techniques such as event deduplication, correlation, and temporal aggregation are essential to ensure high-quality inputs.

Architect a Multi-Agent Cognition Layer

Instead of using a single massive prompt, build a Cognition Layer utilizing a multi-agent LLM architecture (using GPT-5 or Claude-Opus class models) orchestrated by a control plane (e.g., a serverless orchestration layer). Assign specialized roles to different agents:
• Detector Agent: Monitors the anomaly events and groups related alerts into candidate incidents based on the knowledge graph's dependency structure.
• Hypothesis Agent: Proposes potential root causes by analyzing the graph and recent telemetry data.
• Validator Agent: Acts as the investigator by issuing targeted queries back to the observability tools and cloud APIs to confirm or reject the hypotheses based on hard evidence.
• Planner Agent: Synthesizes an actionable remediation plan. This plan should be an ordered list of operations, complete with preconditions, postconditions, and explicit rollback triggers.
• Critic (Governance) Agent: Reviews the remediation plan against organizational safety policies before execution, ensuring constraints are not violated.

Implement a Guarded Action Layer

The Action Layer is what separates an active agent from a passive AIOps recommendation engine. It executes the Planner Agent's steps via the Kubernetes API (scaling, restarting pods) and Cloud Provider APIs (toggling failovers, adjusting traffic weights). Safety is paramount.
You must wrap this layer in a strict governance framework:
• Enforce hard limits on scaling factors and failover scopes.
• Implement canary rollouts, applying changes to a single zone before expanding.
• Build auto-rollback mechanisms that trigger immediately if Service Level Objectives (SLOs) deteriorate after an action.
• Require explicit human-operator approval for high-risk operations like region-wide failovers.

Rollout and Optimization Strategies

When deploying your SRE agent, start in a "shadow" or assist mode. Allow the agent to observe incidents, propose hypotheses, and draft plans while human operators retain full control and execute the final decisions. As confidence in the system grows, gradually grant it autonomy for low-risk, routine actions. To manage operational costs and latency:
• Optimize prompts: Externalize static system descriptions into retrieved documents.
• Caching: Cache intermediate inferences for reuse across similar recurring incidents.
• Batching: Batch non-urgent tool calls and defer low-impact infrastructure checks to background tasks.

Conclusion

Agentic AI represents a shift from reactive monitoring to proactive, autonomous operations. By combining a real-time observability pipeline, a continuously evolving knowledge graph, and a multi-agent reasoning system, you can build an SRE agent capable of end-to-end incident management. Using this framework can significantly reduce Mean Time To Recovery, improve root-cause accuracy, and decrease reliance on human escalation, all while maintaining strict safety guarantees. More importantly, these systems create a virtuous cycle: every incident enriches the knowledge graph, improves agent reasoning, and strengthens operational resilience. As cloud systems continue to grow in complexity, agentic SRE architectures will likely become a foundational component of modern reliability engineering.
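As an illustration of the guardrails listed for the Guarded Action Layer (hard limits, human approval for high-risk operations, zone-by-zone rollout, and automatic rollback), here is a minimal sketch. It makes several assumptions: the action types, the `GuardPolicy` shape, and the `sloHealthy` callback are all hypothetical, and the apply and rollback functions stand in for real Kubernetes or cloud provider calls, which are not shown.

```typescript
// Hypothetical remediation actions a planner might propose.
type RemediationAction =
  | { kind: 'scale'; service: string; factor: number }
  | { kind: 'restart'; service: string }
  | { kind: 'region_failover'; from: string; to: string };

type GuardVerdict = 'allow' | 'needs_approval' | 'deny';

interface GuardPolicy {
  maxScaleFactor: number;                         // hard limit on scaling
  approvalRequired: RemediationAction['kind'][];  // kinds that need a human
}

// Policy gate: deny over-limit scaling, route high-risk kinds to a human.
function reviewAction(action: RemediationAction, policy: GuardPolicy): GuardVerdict {
  if (action.kind === 'scale' && action.factor > policy.maxScaleFactor) return 'deny';
  if (policy.approvalRequired.indexOf(action.kind) !== -1) return 'needs_approval';
  return 'allow';
}

interface RolloutResult {
  appliedZones: string[];
  rolledBack: boolean;
}

// Canary-style rollout: apply one zone at a time and check the SLO after each.
function rolloutWithCanary(
  zones: string[],
  apply: (zone: string) => void,
  rollback: (zone: string) => void,
  sloHealthy: () => boolean
): RolloutResult {
  const applied: string[] = [];
  for (const zone of zones) {
    apply(zone);
    applied.push(zone);
    if (!sloHealthy()) {
      // SLO deteriorated: undo everything applied so far, newest first.
      applied.slice().reverse().forEach(rollback);
      return { appliedZones: [], rolledBack: true };
    }
  }
  return { appliedZones: applied, rolledBack: false };
}

const demoPolicy: GuardPolicy = { maxScaleFactor: 3, approvalRequired: ['region_failover'] };

// Simulated environment: health degrades once two zones carry the change.
const liveZones: string[] = [];
const rollout = rolloutWithCanary(
  ['us-east-1a', 'us-east-1b', 'us-east-1c'],
  (zone) => { liveZones.push(zone); },
  (zone) => { liveZones.splice(liveZones.indexOf(zone), 1); },
  () => liveZones.length < 2
);
```

In the simulated run the change reaches the second zone, the health check fails, and both zones are rolled back before the third is touched. Keeping the policy gate separate from the rollout function means the Critic agent's review and the execution path can be tested independently.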

By Akshay Pratinav
Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End

AI agents have come a long way. They aren’t just answering simple questions; they’re handling order checks, summarizing support tickets, updating records, routing incidents, approving requests, and even calling internal tools. As these agents slip deeper into real business workflows, just peeking at model logs isn’t enough. Teams need to see everything: what the agent did, why it did it, which systems it poked, and whether the end result actually helped the business.

Agent Observability

That’s where agent observability comes in. Traditional observability lets teams watch over their apps, APIs, databases, and infrastructure. Agent observability goes a step further. It shines a light on the whole AI workflow: it connects the dots from the user’s request to the agent’s decisions, the tools it touches, the systems it interacts with, and all the way to the final outcome.

Let’s see a customer support example. Say a customer messages, “My subscription renewal failed, but I got charged twice.” A human rep checks the account, payment history, billing rules, refund policy, and ticket history before answering. Now, an AI agent might do that job automatically. It’ll spot the billing problem, look up the customer record, call the billing system, check for duplicate payments, and either resolve the issue or escalate it if things get too messy.

On the surface, this whole thing just looks like a simple chat. However, under the hood, it’s a full-on workflow. If you want good observability, you need that behind-the-scenes view.

Why bother? Because the final response doesn’t tell you the whole story. If the customer comes back unhappy, you need to nail down whether the agent checked the right account, used the right billing tool, hit an error, misread the request, or escalated when it couldn’t help.

Don’t Just Watch the Answer: Follow the Whole Journey

When you break down agent interactions, a few basic layers show the full picture.

• First, track the user request. What did the user ask? Was it urgent, fuzzy, sensitive, or bound to a customer contract?
• Second, watch the agent’s action. Did it answer straight away, ask a follow-up question, search a knowledge base, use a tool, or hand off to a human?
• Third, note the context. What sort of information did it use? Did it pull a help article, customer details, invoice, ticket, policy, or product data?
• Fourth, log tool usage. Did the agent call billing APIs, CRM systems, databases, incident tools, or an approval workflow? Did those calls work, or did they fail?
• Lastly, look at the result. Did the agent fix the customer’s problem? Was the ticket reopened? Did a human have to clean up after the agent?

Without these layers, you’ll know when something was slow or incorrect, but not why. Maybe the context was off, a tool call failed, it lacked permissions, the prompt changed, or something further downstream broke.

Use a Single ID to Track Everything

One of the easiest fixes is to tag the whole workflow with a tracking ID. Let that ID travel with the request, from the interface through the agent, tools, APIs, and your business systems.

Now, if a support ticket gets botched, the team can retrace every step: what the customer asked, what the agent understood, which account it checked, what the billing system said back, and why the agent chose to close or escalate.

It’s not just for support. Maybe your SRE team uses an AI agent to help dig into a production alert. The agent scans logs, checks recent deployments, reviews database metrics, and suggests the likely cause. That same tracking ID means you’ll know exactly which systems the agent checked and whether it missed anything crucial.

Don’t Ignore Tool Calls; They’re Real Actions

Here’s where things get serious. When an agent calls a tool, it’s taking action. Looking up customers, updating records, approving requests, creating tickets, and kicking off workflows all need to be watched closely. For each tool call, capture details like tool name, how long it took, success or failure, retries, permission results, error messages, and what actually happened.

Take a finance workflow. Say the agent reviews vendor invoices by extracting details, matching with a purchase order, checking taxes, and routing exceptions to finance. If an invoice gets approved by mistake, did the agent misread the invoice? Match it with the wrong purchase order? Miss a policy update? Or did the finance system return incomplete info?

That’s why tracking tool calls is critical. A wrong answer in chat is one thing, but a wrong move in your business system can lead to real trouble: money lost, operations disrupted, and even compliance issues.

Understand Agent Decisions, But Protect Privacy

Teams need to understand what the agent did, but you don’t want to log every single “thought” it had; that’s just unnecessary noise. Instead, record decision details in a structured way. Example:

• Intent: billing dispute
• Confidence: medium
• Tool: billing lookup
• Reason: account verification needed
• Policy result: escalate
• Final action: handoff to human

Now you have enough to debug the workflow and to build reports, without exposing raw thought streams. You can spot how often agents escalate because of low confidence, where tools fail, or when policy rules stop an action.

Connect Observability to Business Outcomes

Don’t just track the tech stuff; what really matters is whether the agent gets the job done. Watch business metrics like:

• Resolution time
• Escalation rate
• Workflow completion rate
• Tool failures
• Cost per workflow
• SLA hits or misses
• Rework
• How often humans step in

If you’ve got an e-commerce agent helping buyers pick products, check inventory, apply discounts, and guide checkout, you want to know: did the customer actually buy the item? If checkout drops after you tweak a prompt, find out why. Did the agent push out-of-stock items? Apply discounts wrong? Use the wrong tool? Lose customers with confusing answers? Observability at this level helps both engineering and business teams get answers, fast.

Build Dashboards for Different Audiences

Everyone’s got different needs.

• SREs care about latency, failed tools, retries, issues with dependencies, and cost spikes.
• Security teams focus on policy denials, suspicious tool actions, sensitive data flags, or prompt injection attempts.
• Product owners want completion rates, escalations, customer satisfaction, and abandoned workflows.
• Engineers need to see how agent behavior shifts after you change the model, prompt, workflow, or deployment.
• Business folks need throughput, SLAs, cost savings, and improvements to customer experience.

Take security operations. Say an agent checks suspicious logins, identity logs, privilege changes, and endpoint activity. Security needs to know: did the agent just review info, or did it try to lock an account? If it got blocked, you want that visible, too.

Alert on AI-Specific Failures

AI agents fail in new ways. Teams need alerts for things like sudden spikes in tool denials, fallback responses, unexpected tool usage, cost blowups, prompt injection attempts, completion drops, or rising escalations.

If an agent suddenly goes wild with refund actions, it could mean a prompt is off, a policy is weak, or something’s getting abused. If fallback responses shoot up, maybe the knowledge base is broken. Costs spike? Maybe the agent is stuck looping, retrying, or making unnecessary expensive calls.

Tie alerts to deployments, too. Agents change behavior after you update a prompt, switch models, change a schema, adjust policies, or edit a workflow. Teams should compare how the agent behaved before and after.

A Simple Way to Grow Observability

Observability matures in steps.

1. Basic logs: prompts, responses, errors, timestamps
2. Tool visibility: what got used, if it worked, how long it took
3. End-to-end traces: follow the user request through the agent, tools, APIs, systems
4. Business-level result tracking: resolution, escalation, completion, rework, cost, SLA
5. Automated alerts: regressions after updates, anomalies, unusual patterns

Observability is about more than collecting logs: it is about visibility into the whole workflow and making sense of it. Teams need to know what users wanted, what the agent decided, which info it used, which tools it grabbed, which systems it touched, and whether business value was delivered.

As AI agents settle into production, observability has to cover more than just servers and app logs. The teams that win will be the ones who trace agent behavior end to end, spot failures early, explain what happened, and keep improving safely.
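To make the single tracking ID, the per-tool-call details, and the structured decision record concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the class `AgentTrace`, its records, and the `key=value` log format are hypothetical and do not come from any tracing SDK. In practice a standard such as OpenTelemetry trace context would likely carry the ID between systems.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical trace of one agent workflow; field names follow the article's example. */
final class AgentTrace {

    /** Structured decision summary, recorded instead of raw model reasoning. */
    record Decision(String intent, String confidence, String tool,
                    String reason, String policyResult, String finalAction) {}

    /** One tool invocation with the details worth capturing per call. */
    record ToolCall(String tool, long durationMillis, boolean success, int retries, String error) {}

    private final String traceId;
    private final List<ToolCall> toolCalls = new ArrayList<>();
    private Decision decision;

    AgentTrace(String traceId) { this.traceId = traceId; }

    /** Starts a trace with a fresh random ID at the edge of the workflow. */
    static AgentTrace start() { return new AgentTrace(UUID.randomUUID().toString()); }

    void recordToolCall(ToolCall call) { toolCalls.add(call); }

    void recordDecision(Decision d) { this.decision = d; }

    long failedToolCalls() { return toolCalls.stream().filter(c -> !c.success()).count(); }

    /** Every line carries the same trace ID so records from different systems can be joined. */
    List<String> toLogLines() {
        List<String> lines = new ArrayList<>();
        for (ToolCall c : toolCalls) {
            lines.add("trace=" + traceId + " tool=" + c.tool() + " ms=" + c.durationMillis()
                    + " success=" + c.success() + " retries=" + c.retries());
        }
        if (decision != null) {
            lines.add("trace=" + traceId + " intent=" + decision.intent()
                    + " policy=" + decision.policyResult() + " action=" + decision.finalAction());
        }
        return lines;
    }

    public static void main(String[] args) {
        AgentTrace trace = new AgentTrace("t-123");
        trace.recordToolCall(new ToolCall("billing_lookup", 240, true, 0, null));
        trace.recordDecision(new Decision("billing dispute", "medium", "billing lookup",
                "account verification needed", "escalate", "handoff to human"));
        trace.toLogLines().forEach(System.out::println);
    }
}
```

Because the decision is stored as named fields rather than free text, questions such as "how often does a medium-confidence billing dispute end in a handoff?" become simple aggregations over the log lines.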

By Srinivas Chippagiri, DZone Core
Why Your Test Automation Is Always Behind the Code And the Architecture That Fixes It

There is a pattern that repeats itself across engineering organizations regardless of team size, tech stack, or industry. A sprint ends. Features are shipped. The QA team is still writing automation for the previous sprint. The backlog of unautomated scenarios grows. Leadership asks what it would take to close the gap. The answer comes back: more engineers, more time, more tooling budget.

Six months later, the gap is the same size. Sometimes larger.

This is not a resource problem. It is an architectural problem. And until the architecture changes, the gap does not close.

The Upstream Problem Nobody Measures

When engineering teams analyze their automation coverage gaps, they almost always focus on execution: test runs are slow, maintenance is high, and flaky tests waste time. These are real problems. But they are downstream of a more fundamental issue that rarely gets measured: the time between a requirement being written and automation existing for it.

In a traditional QA workflow, that gap looks like this:

1. Requirement lands in Jira
2. Developer builds the feature
3. QA engineer reads the requirement, interprets it, designs test scenarios
4. QA engineer writes test cases
5. QA engineer scripts automation in Playwright or Selenium
6. QA engineer executes, debugs, maintains

Steps 3 through 5 take days. Sometimes weeks. Every sprint adds to the backlog. Every requirement change breaks existing automation. The team runs hard and stays in the same place.

The industry has responded to this by automating step 6, making execution faster, smarter, and more parallelized. But steps 3 through 5 (requirement interpretation, test design, and scripting) remain almost entirely manual in most organizations.

This is the upstream problem. And it is where the real automation opportunity sits in 2026.

What Changes When You Start From Requirements

The architecture shift that actually closes the coverage gap starts much earlier in the pipeline than most automation teams consider. Instead of "requirement arrives → developer builds → QA manually creates coverage," the new model is "requirement arrives → AI evaluates and enhances → AI generates test cases → AI generates scripts → AI executes → results with traceability returned."

The human does not design coverage. The human does not script automation. The human reviews requirements, approves test cases when necessary, and focuses on exploratory testing and quality strategy, the work that actually requires human judgment.

This is what requirement-driven autonomous testing means in practice. The requirement is the input. The executed test result is the output. AI owns everything in between.

The 5 Stages of a Requirement-to-Result Pipeline

Platforms like TestMax implement this model as a connected five-stage pipeline. Understanding each stage explains why the architecture works differently from traditional automation approaches.

Stage 1: Requirement Ingestion

The pipeline accepts requirements from wherever they live: Jira tickets, Azure DevOps work items, Word documents, PDFs, Excel files, or requirements authored directly in the platform. No reformatting required. The requirement enters the system as it exists.

This matters because one of the friction points in traditional QA automation is the translation step, converting a Jira ticket into a format that test tooling can work with. When ingestion is native, that step disappears.

Stage 2: Requirement Intelligence

Before any test generation begins, every requirement is evaluated by AI across five quality dimensions: clarity, completeness, consistency, testability, and correctness.

This stage is the most underestimated in the entire pipeline. Poor requirements produce poor tests, always. A requirement that says "the login form should work correctly" is not testable. A requirement that specifies valid credentials, invalid passwords, empty field behavior, account lockout thresholds, and session persistence rules is.

When AI catches ambiguity at the requirement stage, it costs next to nothing to fix. When that same ambiguity surfaces after automation has been built against it, it costs days. The requirement intelligence layer moves defect detection upstream to where it is cheapest.

Requirements that fail quality review are flagged with specific improvement suggestions. AI offers rewrites. Nothing ambiguous proceeds to test generation.

Stage 3: AI Test Case Generation

Once a requirement passes quality review, the platform generates structured test cases automatically. These are not surface-level happy-path scenarios but complete coverage across positive paths, negative paths, boundary conditions, and edge cases.

For a single requirement, such as "users can reset their password via email verification," the generated coverage includes:

• Valid email address submitted: verification email received
• Invalid email format: appropriate error returned
• Email address not registered: system response without revealing account existence
• Verification link clicked: password reset flow initiated
• Verification link expired: appropriate error with re-send option
• New password does not meet policy requirements: specific validation messages
• Successful reset: session handling, redirect behavior

All of this is generated automatically from the requirement. No human designs the coverage strategy.

Stage 4: Automation Generation

Approved test cases are converted into executable Playwright scripts automatically: production-ready code with appropriate waits, assertions, and selector strategies, generated without a human writing a single line.

This is the step that eliminates the scripting bottleneck. In traditional automation, scripting bandwidth is a hard ceiling on coverage growth. When the team can script 50 test cases per sprint, coverage grows at that rate regardless of how many requirements are produced.

When scripts are generated automatically from approved test cases, that ceiling disappears. Coverage can grow at the rate requirements are produced, not the rate engineers can write code.

Stage 5: Autonomous Execution and Evidence

AI agents execute the generated test suite through Playwright MCP. They manage environment setup, handle retries, capture logs, screenshots, and video per test, and return a complete traceability matrix linking every result to its source requirement.

The output is not a pass/fail count. It is a complete evidence package suitable for audit, governance, and release decision-making, generated automatically from the requirements the team was already writing.

Why This Architecture Closes the Coverage Gap

The traditional automation model has a linear constraint: coverage grows proportionally to engineering effort. More requirements always mean more backlog because the human work required per requirement is roughly constant.

The requirement-driven autonomous model removes the linear constraint. When AI handles test design, scripting, and execution per requirement, the engineering effort per requirement drops dramatically. Coverage can scale with the requirements themselves rather than with team headcount.

There are three concrete consequences:

• Coverage lag is eliminated. When test generation takes minutes rather than days, new features can have automation in the same sprint they are built. The perpetual state of automation backlog, where coverage is always weeks behind the code it is supposed to validate, is a consequence of the manual model, not an inevitability.
• Maintenance burden shifts. In traditional automation, 60 to 80 percent of automation engineering effort goes to maintaining existing scripts. When AI generates scripts from requirements, the maintenance responsibility belongs to the generation layer. UI changes that would previously break dozens of handwritten selectors are addressed at the generation stage.
• Requirement quality improves as a side effect. When every requirement must pass an AI quality evaluation before entering the test pipeline, the incentive to write precise, testable requirements increases. Teams that implement requirement-driven testing typically report improvement in requirement quality within two to three sprints, not because they trained their product managers differently, but because the pipeline now provides immediate, specific feedback on every requirement.

Integrating With Existing Workflows

A practical concern with any architectural change is migration cost. The requirement-driven autonomous model does not require replacing existing infrastructure.

Generated Playwright scripts integrate directly into existing CI/CD pipelines. Teams running Jira or Azure DevOps connect those systems natively, so requirements flow in without manual re-entry. For teams using ATF or other existing test frameworks, the autonomous testing layer runs alongside rather than replacing what already exists.

The practical starting point is a single sprint. Take the new requirements entering your backlog this week. Run them through a requirement-driven platform. Compare the test coverage produced (in time, in scenario depth, and in maintenance overhead) against what your team would have produced manually. The experiment answers the adoption question more convincingly than any benchmark.

The Architectural Question for 2026

The relevant question for QA teams in 2026 is not whether to use AI in testing. Almost every serious testing platform has added AI capabilities in some form.

The question is: where in the pipeline is AI actually doing meaningful work?

At one end of the spectrum, AI heals broken selectors and suggests which tests to run. The human still reads requirements, designs coverage, writes scripts, and manages execution. AI makes individual tasks faster.

At the other end, AI owns the pipeline from requirement evaluation through execution and evidence delivery. The human provides requirements and reviews results. AI does everything in between.

The teams that figure out where they sit on that spectrum, and decide consciously which model their coverage goals require, are the ones that will stop having the same conversation about automation backlogs next quarter.
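The traceability matrix mentioned in Stage 5 is, at its core, a mapping from each requirement to the outcomes of the tests generated for it. The Java sketch below shows one way such a matrix could be computed. It is not how TestMax or any other platform implements it: the class `TraceabilityMatrix`, the `TestResult` record, and the `REQ-n` identifiers are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical traceability matrix: requirement ID mapped to the outcome of its tests. */
final class TraceabilityMatrix {

    /** One executed test, linked back to the requirement it was generated from. */
    record TestResult(String requirementId, String testName, boolean passed) {}

    /** A requirement is marked passing only if every test linked to it passed. */
    static Map<String, Boolean> byRequirement(List<TestResult> results) {
        Map<String, Boolean> matrix = new LinkedHashMap<>();
        for (TestResult r : results) {
            // merge() combines the new result with any earlier result for the same requirement.
            matrix.merge(r.requirementId(), r.passed(), Boolean::logicalAnd);
        }
        return matrix;
    }

    /** Requirements with no executed test at all, i.e. the coverage gap. */
    static List<String> uncovered(List<String> requirementIds, Map<String, Boolean> matrix) {
        return requirementIds.stream().filter(id -> !matrix.containsKey(id)).toList();
    }

    public static void main(String[] args) {
        List<TestResult> results = List.of(
                new TestResult("REQ-1", "valid email submitted", true),
                new TestResult("REQ-1", "verification link expired", false),
                new TestResult("REQ-2", "invalid email format", true));
        Map<String, Boolean> matrix = byRequirement(results);
        System.out.println(matrix);                                              // {REQ-1=false, REQ-2=true}
        System.out.println(uncovered(List.of("REQ-1", "REQ-2", "REQ-3"), matrix)); // [REQ-3]
    }
}
```

The `uncovered` list is the measurable form of the coverage gap discussed above: requirements that exist in the backlog but have no executed test linked to them.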

By Waqar Hashmi
Identity in Action

Switching from one single sign-on (SSO) vendor to another is a complex process that involves more than just changing technologies. It is a high-stakes identity operation that affects security, user experience, compliance, application access, and business continuity.

It's not the same as moving a reporting tool or a collaboration platform, because SSO sits at the front door of every application in your environment. If you set it up wrong, everything stops working.

But the biggest danger of SSO migrations is not an outright failure. The small things that go wrong cause the most pain:

• Users being locked out of business-critical apps
• Orphaned accounts that were never deprovisioned
• MFA enrollments disappearing without warning
• Helpdesk queues growing on the morning of cutover because nobody communicated the change

This guide discusses the best ways to move to cloud SSO and the most important things to keep in mind. It covers everything from preparing the identity estate and moving integrations to phased rollout strategies, keeping the user experience smooth, and planning the MFA migration.

Why Businesses Change SSO Providers

Companies don't usually change their SSO platforms on a whim. One of the following usually triggers it:

• Acquisition of a vendor or announcement of the end of a product's life.
• Cost consolidation or making better use of enterprise licenses.
• Standardizing platforms under a broader cloud strategy.
• Compliance or regulatory requirements that the current platform can't meet.
• Issues with scalability, performance, or missing features in the current platform.
• A merger or acquisition that introduces a second identity domain.

Whatever the reason, migration causes compounding risk since SSO is foundational infrastructure, not an individual application.

3 Types of Migration Approaches and Their Differences

There are three main ways to move to a new SSO platform, and each one has its own risks and effects on governance.

Federated Protocol Swap

Retain the same IdP architecture but replace the vendor platform underneath, for example, moving from PingFederate to Entra ID External Identities. The protocol (SAML, OIDC, SCIM) may remain the same, but attribute mappings, claim transformations, and session behaviors differ in ways that are often not clear until something breaks in production.

Full IdP Replacement

The old IdP is completely removed, and a new one is put in its place. Every connection with a service provider (SP) has to be set up, tested, and cut over again. This type carries the most risk, and it's also the one that most businesses overlook.

Consolidation Migration

A single authoritative platform brings together many IdPs. This typically happens after a merger or acquisition. There are technical and organizational problems, such as different business units having different app owners, SLAs, and levels of tolerance for disruption. Governance alignment needs to happen before any technical work can begin.

Migration Process: The 7 Steps

1. Audit and clean up
2. Plan and prepare
3. MFA migration
4. Communication planning
5. Phased rollout
6. Governance considerations
7. Decommission and close out

Step 1: Audit and Clean Up

Most organizations rush or skip this step and migrate everything, including unused applications, inactive users, orphaned accounts, and integrations that have not been used for three years. These don't break anything, but they leave a security risk behind. The following validations reduce both the inventory and the testing effort.

Create a complete, clean list of applications:

• Validate against the CMDB or application catalog.
• Validate which apps are actually being used.
• Validate access logs from the SIEM.
• Validate against IGA platforms.
• Reduce redundant applications.

Create a complete, clean list of valid users:

• Include active users.
• Exclude accounts with no activity for 90 days.
• Exclude dormant accounts whose passwords were never changed.
• Validate against IGA platforms and HR systems.

Mark the unused applications for the decommissioning process. Note down the protocols used (SAML, OIDC, WS-Federation, or legacy), application owners, attributes and claims, MFA requirements, Conditional Access (CA) policies, and session time-out configurations.

Step 2: Plan and Prepare

Every application that relies on SSO consumes identity attributes passed in SSO protocols. New IdPs rarely use the same attribute names, and they often differ in case sensitivity and format. These mismatches cause silent authentication failures that are extremely difficult to diagnose during cutover.

Application Metadata

• Prepare the claims transformation registry.
• Confirm the case and formats.
• Validate transformation rules.

Redirect URLs

For each application, configure a transparent redirect from the legacy IdP login URL (or intranet homepage) to the new IdP's login endpoint. The user will not experience major changes. The only change a user would notice is the new MFA prompt.

Rollback Process

• Identify when you should roll back.
• Decide who will be able to make the rollback decision.

Rollbacks are generally triggered in the following cases:

• The rate of successful authentications drops below 95%.
• SSO failures are confirmed for major applications.
• The help desk receives more calls than usual during the first 2 days of migration.

Migration Go-Live

• Document the new login flow end to end.
• Plan for extended staffing during the migration.
• Validate helpdesk access to the new platform.
• Identify and set up escalation contacts for issues that the helpdesk cannot resolve.

Step 3: MFA Migration

Prepare a complete inventory of existing MFA enrollments that answers:

• How many users have MFA enrolled vs. password only?
• What factors are in use?
  • Authenticator apps: need to re-enroll.
  • SMS: the same phone number and email can be used.
  • Hardware tokens: FIDO2/WebAuthn keys can be reused if the new vendor supports them.
  • Biometrics: need to re-enroll.
• How many and which users have only a single factor enrolled?

Follow these steps for re-enrollment:

• Open the self-service enrollment portal.
• Reuse phone numbers and emails (since they remain the same).
• Send advance communications at least two weeks out, explaining what will change and why.
• Track re-enrollment completion rates by department and group.
• Send follow-up emails, including deadlines.
• Set up a plan to re-enroll privileged accounts.

Step 4: Communication Plan

Communication is a major step in the migration process and should be tracked as a separate workstream, with its own timeline, owners, deadlines, and success metrics. There are three different audiences involved in an SSO migration:

• End users, who simply need to know what will change and what to do.
• Helpdesk and IT staff, who need operational readiness confirmations.
• Stakeholders, who need status updates and risk visibility.

Major email templates include:

• General updates
• MFA enrollment notices
• Cutover day notification

Step 5: Phased Rollout

Never cut over the entire organization at once. Instead, choose a phased rollout. This reduces risk, helps validate configurations in production with real users and real traffic, and provides time to identify issues before they affect most of the organization.

First Phase: Technology users

• Internal IT staff
• Identity administrators
• Helpdesk personnel
• Power users

Second Phase: High-frequency application users, such as users of

• ERP applications
• CRM applications
• Collaboration platforms
• BI tools

Third Phase: General user population

• Lower-risk departments
• Exceptions and low-activity users
  • Contractors
  • Users who rarely log in
  • Third-party users

Step 6: Governance Considerations

To ensure successful migration and validations, consider the following governance aspects.

Changes to IGA Solutions

• Joiner-mover-leaver (JML) changes
  • Provisioning accounts in the IdP with the attributes required for SSO claims.
  • Disabling or deleting accounts during terminations.
  • User transfers: changes to account attributes and group memberships.
• Changing birthright roles: update them with the new SSO groups.
• Cleanup of legacy vendor applications.

Audit Log Monitoring

• Onboard logs from the new vendor to the SIEM.
• Set up alerts for events including:
  • Authentication failures
  • CA policy failures
  • Password failures
  • Token expiration

Non-Human Identities

Create a separate inventory of non-human accounts and migrate their credentials to the new system. These include accounts with no owners.

Step 7: Decommission and Close Out

The process can move forward once all the checks are done and the MFA enrollments are at acceptable levels. Monitor the new system for 30 days and plan for the decommissioning of the old SSO solution.

Conclusion

SSO is the authentication layer for all the applications in the organization. Migrating without a proper plan carries risk. Most companies follow one or a combination of the approaches described above. With a proper plan, clear communication, and the right strategies, you are far less likely to ever need the rollback plan you prepared.
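Two of the numeric rules in this guide lend themselves to a small worked example: the rollback trigger when successful authentications drop below 95%, and the exclusion of accounts with no activity for 90 days. The Java sketch below encodes both. The class `CutoverChecks`, the `Account` record, and the method names are hypothetical; in a real migration the inputs would come from IdP sign-in logs and the IGA or HR system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical helper encoding two thresholds from the migration steps. */
final class CutoverChecks {

    /** Minimal account view; lastLogin is null when the account has never signed in. */
    record Account(String username, Instant lastLogin) {}

    static final double MIN_SUCCESS_RATE = 0.95;                // rollback below 95% success
    static final Duration MAX_INACTIVITY = Duration.ofDays(90); // exclude after 90 idle days

    /** True when the authentication success rate has fallen below the rollback threshold. */
    static boolean shouldRollback(long successfulLogins, long totalLogins) {
        if (totalLogins == 0) {
            return false; // no traffic yet, nothing to judge
        }
        return (double) successfulLogins / totalLogins < MIN_SUCCESS_RATE;
    }

    /** Keeps only accounts that signed in within the inactivity window ending at 'now'. */
    static List<Account> activeAccounts(List<Account> accounts, Instant now) {
        Instant cutoff = now.minus(MAX_INACTIVITY);
        return accounts.stream()
                .filter(a -> a.lastLogin() != null && !a.lastLogin().isBefore(cutoff))
                .toList();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-01-01T00:00:00Z");
        List<Account> accounts = List.of(
                new Account("alice", Instant.parse("2025-12-01T00:00:00Z")),
                new Account("bob", Instant.parse("2025-06-01T00:00:00Z")),
                new Account("svc-old", null));
        System.out.println(activeAccounts(accounts, now).size()); // 1
        System.out.println(shouldRollback(940, 1000));            // true
        System.out.println(shouldRollback(990, 1000));            // false
    }
}
```

The success rate is best computed per rollout phase and per major application, so that a failure in one critical app is not hidden by healthy traffic elsewhere.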

By Kapil Chakravarthy Sanubala
Getting Started With Agentic Workflows in Java and Quarkus

This post walks through building and running a real-world agentic workflow with Agentican and Quarkus. Specifically, an agentic workflow to automate market research and information sharing:

1. Identify the top vendors within a market category.
2. Research the positioning and strengths of each vendor.
3. Classify the findings as either standard or urgent.
4. Draft a brief to share with others in the company.

Prerequisites

• Quarkus
• Java 25
• Maven (or Gradle)
• LLM provider API key

Step 1: Add the Dependency

Create a Quarkus app, and add the Agentican Quarkus runtime module:

XML

    <dependency>
      <groupId>ai.agentican</groupId>
      <artifactId>agentican-quarkus-runtime</artifactId>
      <version>0.1.0-alpha.3</version>
    </dependency>

Step 2: Define Agents, Skills, and the Workflow

Create an `agentican-catalog.yaml` file on the classpath. This is where you describe:

• Who does the work (agents)
• What they need to do it (skills)
• How they will do it (workflows)

YAML

    agents:
      - id: researcher
        name: researcher
        role: |
          Expert at finding accurate, sourced information about companies and markets.
          Quotes sources. Distinguishes opinion from fact.
      - id: writer
        name: writer
        role: |
          Synthesizes research into structured, concise briefs.
          Avoids hedging language. Cites concrete evidence.

    skills:
      - id: web-search
        name: web-search
        instructions: |
          When a question requires external information, call the search tool first.
          Quote sources in your answer.

Update the `agentican-catalog.yaml` file to define the workflow.

YAML

    workflows:
      - id: market-brief
        name: market-brief
        description: Research vendors in a market and produce a structured brief
        outputStep: deliver
        params:
          - name: topic
            description: Market to research
            required: true
          - name: vendor_count
            description: Number of vendors
            defaultValue: "5"
        steps:
          - name: identify
            agent: researcher
            skills: [web-search]
            instructions: |
              Identify the top {{param.vendor_count} vendors in {{param.topic}.
              Return a JSON array of vendor names (names only, no commentary).
          - name: deep-dive
            type: loop
            over: identify
            steps:
              - name: analyze
                agent: researcher
                skills: [web-search]
                instructions: |
                  Deep-dive vendor {{item}: positioning, key strengths, recent news.
                  Quote sources.
          - name: classify
            agent: writer
            instructions: |
              Read the per-vendor deep-dives below. If any vendor has launched a
              competitive feature in the last 30 days, return the single word
              'urgent'. Otherwise return 'standard'.
              Deep-dives: {{step.deep-dive.output}
            dependencies: [deep-dive]
          - name: deliver
            type: branch
            from: classify
            default: standard
            branches:
              - name: urgent
                steps:
                  - name: urgent-brief
                    agent: writer
                    instructions: |
                      Synthesize a vendor brief flagged URGENT for executive review.
                      Lead with the recent competitive moves.
                      Topic: {{param.topic}
                      Deep-dives: {{step.deep-dive.output}
              - name: standard
                steps:
                  - name: standard-brief
                    agent: writer
                    instructions: |
                      Synthesize a vendor brief.
                      Topic: {{param.topic}
                      Deep-dives: {{step.deep-dive.output}

A few things worth flagging:

• `agent: researcher` references the agent for a step; skills are referenced by name, too.
• `outputStep` designates the step whose output becomes the workflow's typed result.
• `{{param.X}` interpolates workflow inputs into step instructions.
• `{{step.X.output}` interpolates an upstream step's output.
• `{{item}` is the current value inside a loop iteration.
• `type: loop` steps take an `over` reference (a step that produced a list, or a list-typed param).
• `type: loop` steps run their nested steps once per item, in parallel, and on virtual threads.
• `type: branch` steps take a `from` reference (a step whose output is used to select a branch).
• `branches`: mutually exclusive steps (or sets of steps), with `default` for unrecognized values.

The framework loads `agentican-catalog.yaml` from the classpath, or you can define where it's loaded from:

Properties files

    agentican.catalog-config=/etc/agentican/agentican-catalog.yaml

Note: Agents, skills, and workflows can be defined via a fluent builder API as well.
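For readers who find control flow easier to follow in code, the loop and branch behavior described in the notes above can be modeled in a few lines of plain Java. This is a conceptual sketch only, not the Agentican API: the class `WorkflowSemantics` and both methods are hypothetical, the loop runs sequentially here rather than in parallel on virtual threads, and the trimming and lower-casing of the classifier output is an assumption made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Plain-Java model of loop and branch steps; this is not the Agentican API. */
final class WorkflowSemantics {

    /** Loop: run the nested step once per item and collect the outputs in item order. */
    static List<String> loop(List<String> items, Function<String, String> nestedStep) {
        return items.stream().map(nestedStep).toList();
    }

    /** Branch: pick the branch named by the classifier output, else fall back to the default. */
    static String selectBranch(String classifierOutput, Set<String> branchNames, String defaultBranch) {
        // Assumption: the output is normalized before matching; the real framework may differ.
        String key = classifierOutput.trim().toLowerCase(Locale.ROOT);
        return branchNames.contains(key) ? key : defaultBranch;
    }

    public static void main(String[] args) {
        // "identify" produced a list; "deep-dive" runs "analyze" once per vendor.
        List<String> deepDives = loop(List.of("VendorA", "VendorB"), v -> "analysis of " + v);
        System.out.println(deepDives); // [analysis of VendorA, analysis of VendorB]

        // "classify" returned a single word; "deliver" selects the matching branch.
        Set<String> branches = Set.of("urgent", "standard");
        System.out.println(selectBranch("urgent", branches, "standard"));   // urgent
        System.out.println(selectBranch("not sure", branches, "standard")); // standard
    }
}
```

The default branch matters because the classifier is an LLM step: its output may not be exactly one of the expected words, and the fallback keeps the workflow from stalling on an unrecognized value.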
Step 3: Configure the Models

Agentican reads the engine configuration from `application.properties`. The minimum is one LLM:

Properties files

    agentican.llm[0].api-key=${ANTHROPIC_API_KEY}

The provider defaults to `anthropic`, and the model defaults to `claude-sonnet-4-5`. Want OpenAI instead?

Properties files

    agentican.llm[0].provider=openai
    agentican.llm[0].api-key=${OPENAI_API_KEY}
    agentican.llm[0].model=gpt-4o-mini

Want to mix and match? Configure a `name` for each LLM and reference the names per agent in the YAML catalog:

Properties files

    agentican.llm[0].name=default
    agentican.llm[0].api-key=${ANTHROPIC_API_KEY}

    agentican.llm[1].name=efficient
    agentican.llm[1].provider=openai
    agentican.llm[1].api-key=${OPENAI_API_KEY}
    agentican.llm[1].model=gpt-4o-mini

Step 4: Create a Typed Workflow Instance

Define the workflow input and output records:

Java

    import java.util.List;

    // Workflow input: maps to the `topic` and `vendor_count` parameters.
    public record ResearchParams(String topic, int vendorCount) {}

    // Workflow output: parsed from the output step's text.
    public record VendorBrief(String topic, List<Vendor> vendors) {
        public record Vendor(String name, String positioning, List<String> strengths) {}
    }

Then inject the typed workflow, and call it from a REST endpoint:

Java

    import jakarta.inject.Inject;
    import jakarta.ws.rs.POST;
    import jakarta.ws.rs.Path;
    import jakarta.ws.rs.PathParam;
    // Agentican imports (Workflow, AgenticanWorkflow) are omitted here.

    @Path("/market-brief")
    public class VendorBriefResource {

        @Inject
        @AgenticanWorkflow(name = "market-brief")
        Workflow<ResearchParams, VendorBrief> brief;

        @POST
        @Path("/{topic}")
        public VendorBrief generate(@PathParam("topic") String topic) {
            // Start the workflow with five vendors and block until the brief is ready.
            return brief.start(new ResearchParams(topic, 5)).await();
        }
    }

Now, test the endpoint:

Shell

    curl -X POST http://localhost:8080/market-brief/data%20observability%20platforms

A few things worth flagging, since they're what set this apart from a generic "call an LLM" library:

  • `ResearchParams.vendorCount` becomes the workflow parameter `vendor_count` via SNAKE_CASE mapping.
  • `start()` returns a `WorkflowRun<VendorBrief>`, and `await()` parses the output step's text into a `VendorBrief`.
  • `@AgenticanWorkflow(name = "market-brief")` resolves the registered workflow at injection time.
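As a rough illustration of the name mapping mentioned above, the following JavaScript sketch converts camelCase field names into snake_case parameter names. It is not the mapping code Agentican uses; `toSnakeCase` and `toWorkflowParams` are hypothetical helpers, and the sketch only handles simple names such as `vendorCount`.

```javascript
// Hypothetical helpers illustrating camelCase -> snake_case name mapping.
// Handles simple names only (a lowercase letter or digit followed by an uppercase letter).
function toSnakeCase(name) {
  return name.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

// Rename every key of a plain object, keeping the values unchanged.
function toWorkflowParams(record) {
  return Object.fromEntries(
    Object.entries(record).map(([key, value]) => [toSnakeCase(key), value])
  );
}

console.log(toWorkflowParams({ topic: "data observability platforms", vendorCount: 5 }));
```

The point of a convention like this is that the Java record can follow Java naming while the YAML catalog keeps its own style, with no per-field mapping to maintain.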
Note: `WorkflowRun` itself exposes `future()` for a `CompletableFuture<R>`, and there's a `ReactiveWorkflow<P, R>` Mutiny variant for Vert.x stacks.

Step 5: Add Agent Tools

Agentican ships two integrations out of the box:

MCP (Model Context Protocol)

There is one config block per server. Tools are auto-discovered:

Properties files

    agentican.mcp[0].slug=github
    agentican.mcp[0].name=GitHub
    agentican.mcp[0].url=https://mcp.github.com/sse
    agentican.mcp[0].headers.Authorization=Bearer ${GITHUB_TOKEN}

Composio

100+ SaaS toolkits, including Slack, Notion, Linear, Salesforce, GitHub, and Google Workspace:

Properties files

    agentican.composio.api-key=${COMPOSIO_API_KEY}
    agentican.composio.user-id=user-123

Tools are referenced by name within agent steps:

YAML

    steps:
      - name: research
        agent: researcher
        tools: [github_search_repositories]
        instructions: "Profile open-source vendors in {{param.topic}}."

Structured agentic workflows for the JVM.

Where to Go Next

  • Getting Started: install, configure, and run workflows
  • Core Concepts: architecture, terminology, and data flow
  • Workflows & Steps: CDI surface, beans, qualifiers, override patterns
  • Agents: defining agents, skills, and roles
  • Getting Started (Quarkus): dependency setup, config, first task
  • CDI Integration: injection, qualifiers, lifecycle events, bean overrides
  • REST API: endpoints, SSE streaming, WebSocket, error codes
  • Observability: Micrometer metrics, OTel tracing, Prometheus queries

By Shane Johnson
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps

Teams often say they are building one app. A lot of the time, that is not true.

I saw this while reviewing a telemedicine MVP. At first, the plan sounded simple enough: video visits, messaging, scheduling, and basic records. Then the version-one list kept growing:

  • Patient app
  • Provider dashboard
  • Admin panel
  • Messaging
  • Video
  • Billing
  • EHR connection
  • Device support later

At that point, this was no longer one app. It was several systems being planned as one MVP:

  • A patient-facing product
  • A provider-facing product
  • An admin product
  • A set of outside-service connections

When a team treats all of that like one first release, things get messy before development even starts.

The Moment It Stopped Being One App

The problem was not the number of screens. The problem was the number of users, roles, and data rules hiding behind those screens.

A patient needed intake, booking, reminders, and follow-up. A provider needed schedules, patient context, notes, and quick actions during the day. An admin needed visibility, support tools, and role controls. The outside-services side added video vendors, messaging vendors, EHR work, and, later, device data.

That is not one product. That is a group of different systems with different jobs. Once that became obvious, the planning changed.

Split the Product by User First

Before estimating anything, it helps to split the product by who it is for. For this telemedicine project, the first useful split looked like this:

1. Patient Side

This part handled:

  • Intake
  • Booking
  • Reminders
  • Follow-up messaging
  • Joining a visit

The patient's side had to stay simple. It also had to be clear about what the patient could and could not see.

2. Provider Side

This part handled:

  • Schedule view
  • Patient details
  • Visit notes
  • Quick responses
  • Role-based access

This was not just a different set of screens. It had different speed needs, different daily habits, and different data access rules.

3. Admin Side

This part handled:

  • Role setup
  • Support actions
  • Visibility into operations
  • Reporting
  • Non-clinical controls

Admin work often looks small during planning. In real projects, it adds a lot of rules and a lot of testing.

4. Outside-Service Work

This part handled:

  • Video vendor setup
  • Messaging vendor setup
  • EHR-related work
  • Future device data
  • Logging and audit-related movement of data

This is where many teams get surprised. Video, messaging, and EHR are not tiny add-ons. Each one brings its own work.

Start With Access Rules Before the Feature List

In multi-role products, one of the quickest ways to find hidden work is to define access rules early. Before locking the feature list, ask:

  • Who can create this data?
  • Who can read it?
  • Who can change it?
  • Who can delete it?
  • Who can export it?

For the telemedicine project, this made a big difference. A few features looked simple in the scope doc. Once the team asked who could view or change the related data, the work got much larger.

A basic example: "Admins can help fix booking problems." That sounds harmless. But then the real questions start:

  • Can admins see messages?
  • Can they see visit notes?
  • Can they see call history?
  • Can they open uploaded files?

That one sentence can change a big part of the system. Access rules often show hidden work much faster than a feature list does.

Treat Outside Services as Separate Work

Another mistake teams make is treating outside services like small items on a checklist. On paper, it can look like this:

  • Video
  • Messaging
  • EHR later

In practice, each one adds its own work:

  • Vendor setup
  • Request and response formats
  • Error handling
  • Retry rules
  • Logging
  • Replacement cost if the vendor needs to change later

That is why these items should be planned separately. For the telemedicine case, once video, messaging, and EHR work were split out from the main product list, the first release became easier to define. Some items that seemed close to launch were clearly not ready for version one.

Ship One Complete Path First

Once the team stopped calling everything an MVP, the first release got smaller. The version-one path that stayed in looked like this:

  • Patient intake
  • Appointment booking
  • Secure video through the chosen vendor
  • Follow-up messaging
  • Basic provider access controls

That was enough to test whether the product solved a real problem for a clinic.

What moved out of the first release:

  • Deeper EHR work
  • More reporting
  • Detailed billing flows
  • Device support
  • Broader admin tooling

Those things were not bad ideas. They just did not belong in the first build.

4 Simple Documents to Create Before Sprint Planning

When a team starts to suspect that one MVP is several systems, four short documents can help a lot.

1. User-to-System Map

List each part of the product and the main user for it.

2. Permission Matrix

Write down who can create, view, change, delete, and export each type of data.

3. Outside-Service List

Separate core product work from vendor work and data that moves in or out of the system.

4. First-Release Path

Write the one end-to-end path that version one has to get right.

These are short documents, but they make planning much better.

Why This Matters Outside Healthcare, Too

This lesson is not only for telemedicine. It applies to any multi-role product where the team is building for more than one type of user. That includes:

  • Customer apps with admin panels
  • SaaS products with back-office tools
  • Platforms with provider and client sides
  • Products that depend on outside vendors from day one

The moment a team has different users with different goals, the work stops being “just one app.”

Final Point

A lot of MVPs get too big because teams keep calling them one product long after that stops being true. The fix is not always better estimates. Sometimes the fix is much simpler:

  • Split the product by user.
  • Write down the access rules.
  • Separate outside-service work.
  • Ship one complete path first.

That makes the first release easier to plan, easier to build, and easier to test.
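The permission matrix described above can also be written down as data, which makes gaps easy to spot before sprint planning. The JavaScript sketch below is one possible shape under stated assumptions: the roles, data types, and allowed actions are invented for illustration and are not the rules from the telemedicine project, and `can` and `findGaps` are hypothetical helper names.

```javascript
// Hypothetical permission matrix: role -> data type -> allowed actions.
// All roles, data types, and rules here are illustrative assumptions.
const permissionMatrix = {
  patient: {
    booking: ["create", "view", "change"],
    message: ["create", "view"],
    visitNote: [],
  },
  provider: {
    booking: ["view", "change"],
    message: ["create", "view"],
    visitNote: ["create", "view", "change"],
  },
  admin: {
    booking: ["view", "change"],
    message: [],
    visitNote: [],
  },
};

// True only when the matrix explicitly allows the action.
function can(role, action, dataType) {
  const allowed = permissionMatrix[role]?.[dataType] ?? [];
  return allowed.includes(action);
}

// List every role/data-type pair that has no decision recorded yet.
function findGaps(roles, dataTypes) {
  const gaps = [];
  for (const role of roles) {
    for (const dataType of dataTypes) {
      if (permissionMatrix[role]?.[dataType] === undefined) {
        gaps.push(`${role}:${dataType}`);
      }
    }
  }
  return gaps;
}

console.log(can("admin", "view", "message"));
console.log(findGaps(["patient", "provider", "admin"], ["booking", "message", "visitNote", "uploadedFile"]));
```

An empty list means "decided: no access," while a missing entry means "nobody has decided yet." Keeping those two cases distinct is what lets `findGaps` surface questions such as whether admins can open uploaded files.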

By Kajol Shah
The Agentic Agile Office: Streamlining Enterprise Agile With Autonomous AI Agents

In my 30 years of navigating the IT landscape, I’ve seen ‘Agile’ transform from a revolutionary mindset into what often feels like a series of manual project hurdles. In many large projects I’ve led, I’ve noticed we’ve traded innovation for a culture of ‘babysitting’ Jira boards and tracking Excel sheets. I wish to develop the Agentic Agile Office (AAO) not as another layer of automation, but as a fundamental shift in how I believe we must manage project velocity and governance.

The Bottlenecks I’ve Encountered

In my experience, traditional Enterprise Agile often buckles under its own weight. I’ve watched Technical Program Managers (TPMs) and Scrum Masters spend up to 60% of their time on administrative overhead. I’ve seen the "manual tax" of chasing status updates slow down the very speed Agile was designed to create. I believe it’s time to move past this.

How I Define Autonomous AI Agents

The AAO framework I’m proposing moves beyond simple chatbots. I am focusing on agentic AI: systems capable of reasoning, planning, and executing tasks autonomously. Within my framework, these agents don't just answer questions; they take action:

  • The Backlog Agent: This will automatically analyze user feedback and technical debt to suggest prioritization scores for the Product Owner.
  • The Dependency Agent: This agent scans multiple team boards in real time. I want it to identify and flag architectural conflicts before they cause a sprint failure.
  • The Governance Agent: I see this as the ultimate safeguard, ensuring all code commits meet compliance standards without a human auditor needing to manually check every pull request.

Deep Dive: The Architecture of the AAO

While defining these agents is the first step, I believe it is critical to understand the architectural engine that drives this office. To move beyond simple automation, I have structured the AAO as a three-tier system:

1. The Intelligence Layer: Reasoning Over Data

In my three decades in the industry, the biggest issue hasn't been a lack of data, but the "data fog." I designed the AAO to use large action models (LAMs) that don't just read your tickets; they understand the intent behind them.

  • Contextual memory: I want these agents to remember that a delay in a previous quarter was caused by a specific API bottleneck so they can predict similar risks today.
  • Reasoning loops: Instead of a static trigger, I’ve structured these agents to use "Chain of Thought" processing to validate if a story is actually "Ready" based on historical standards.

2. The Workflow: A Day in the Life of an Agentic Sprint

I’ve reimagined the standard sprint cycle to show exactly where I believe these agents provide the most value:

  • Pre-planning: Before the team meets, I have the Backlog Agent scrub requirements. If a user story lacks an acceptance criterion, the agent flags it to the Product Owner immediately, saving us 30 minutes of "discovery" time during the meeting.
  • In-sprint execution: I’ve implemented the Dependency Agent to act as a "digital scout." If a developer changes a schema that another team relies on, the agent detects the conflict in the pull request and notifies both Scrum Masters before the build even fails.
  • The "always-on" retrospective: I believe retrospectives shouldn't just happen every two weeks. My Insight Agent tracks velocity trends daily. If I see a team's burndown stalling, the agent provides me with a root-cause analysis before I even ask.

3. My Strategy: Agentic Over Generative AI

I want to be clear on a point of common confusion: Generative AI writes the email; agentic AI recognizes a project risk, decides an email is necessary, and drafts it for my review. In my framework, I am moving the human from being the operator of the tool to being the editor of the agent's actions. I’m shifting our workload from "doing the work" to "verifying the outcomes."
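As a small illustration of the pre-planning check described in the workflow above, the following JavaScript sketch flags user stories that have no acceptance criteria. It covers only the deterministic part of such a check and says nothing about how an agent would reason about intent; the `findUnreadyStories` function, the story fields, and the sample stories are assumptions made for this example.

```javascript
// Hypothetical "ready" check; the story shape (id, title, acceptanceCriteria) is assumed.
function findUnreadyStories(stories) {
  return stories
    .filter(
      (story) =>
        !Array.isArray(story.acceptanceCriteria) ||
        story.acceptanceCriteria.length === 0
    )
    .map((story) => ({ id: story.id, reason: "missing acceptance criteria" }));
}

// Invented sample backlog for illustration.
const backlog = [
  { id: "STORY-1", title: "Export report", acceptanceCriteria: ["CSV download works"] },
  { id: "STORY-2", title: "Bulk invite", acceptanceCriteria: [] },
  { id: "STORY-3", title: "Audit log" },
];

console.log(findUnreadyStories(backlog));
```

In the human-as-editor model the article describes, a list like this would go to the Product Owner for review rather than trigger any automatic change to the backlog.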
Why I Believe This Redefines Our Roles

This technical shift leads to a natural question: if agents are handling the logistics, what happens to the people? In my view, this shift doesn't diminish our roles; it elevates them. By offloading the "babysitting" of Jira boards to autonomous agents, I want to empower leadership to focus on:

  • Complex problem solving: Negotiating high-level blockers that require a human touch.
  • Mentorship: Spending more time coaching teams to improve their craft.
  • Strategic alignment: Ensuring technical output truly maps to business value.

My Vision for the Future

To me, the Agentic Agile Office represents the transition from Agile-by-process to Agile-by-intelligence. I am confident that by integrating these agents, enterprises can finally achieve continuous delivery without the human burnout I’ve witnessed throughout my career. I no longer ask "How do we scale Agile?" I now ask: "How quickly can I help you integrate the agents that will do the scaling for you?"

By Madhusudhan Chivukula
Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

The role of the enterprise developer has become more complex over time as organizations adopt new technologies and tools, often without retiring their old ones. Add high staff turnover and increasing time and cost pressure, and developers are left to chart their own path through the SDLC. The purpose of internal developer platforms (IDPs) is to create a win-win scenario that benefits developers and their organizations. In this tutorial, you’ll define one golden path for a backend service that covers service setup, deployment, observability, and guardrails end to end.

Step 1: Define the Platform Product and First Golden Path

Successful IDP efforts focus on end-to-end developer workflows: building a new interface, deploying an updated microservice, running a regression suite, or standing up an environment. Ideally, the whole workflow can be supported directly from your IDP as self-service.

Once you have identified the workflow to support, you need to design the “golden path”: which parts you will standardize and which you will expose as configuration. It’s important to get that balance right. Components that have to change often, like service accounts, interfaces, and sizing, should be configurable. Creating templates and patterns helps reduce variability between outputs, making it easier to roll out necessary patching and updates.

For the first golden path, pick one high-value workflow that is common, repeatable, and easy to measure. We will use the deployment of our backend service to an integration test environment because it touches build, deployment, validation, and evidence capture in one flow.

User adoption is the key to success. To measure it, track both adoption, such as how often a workflow is triggered, and outcome metrics, such as the number of compliant application instances, the percentage of deployment failures, and the average deployment duration.

Step 2: Design the Golden Path (Templates and Defaults)

Next, we get to design the golden path. An important factor for the developer experience is to provide documentation with contextual guidance. This can be traditional how-to guides or more advanced features such as AI-enabled chatbots. The documentation should explain how testing, application deployments, and other lifecycle activities happen along the golden path, and provide architectural guidance on embedding any newly developed capability in the existing architecture. Standards and governance are other aspects that should be available for self-service, including naming conventions, common libraries, and reusable services.

On the technical side, the golden path should cover at least the following:

  • Code repo and standard branching structure
  • Skeleton code based on coding standards (e.g., environment config file, logging framework, data layer)
  • CI/CD pipeline into an ephemeral cloud environment, or pointed at a standard persistent dev environment
  • Skeleton quality gates in the CI/CD pipeline (e.g., unit test, functional regression, security scan)
  • Access to common utilities; injection of environment values (e.g., URLs, IP addresses, access and secrets management)
  • Ability to spin up the environment (if cloud based)

And lastly, the IDP needs to be designed with intuitive naming, a search function, tagging methods, and a hierarchical browsing structure so users can easily find the appropriate golden path. Supporting multiple ways of discovery provides a more resilient interface and eases the adoption of new golden path templates as they become available.

For our backend service, choosing the workflow will show a representation of the steps included.

Step 3: Wire Self-Service Workflows (Without Tickets)

Besides golden path templates, IDPs should aim to be a one-stop shop for developers, so common requests should be available for self-service. Your existing ticket/ITSM systems can be a good source for creating the backlog. Identify the most common requests and start automating them in priority order.

In many cases, a ticket continues to be useful even in the self-service model for tracking and approvals, which can be integrated into the automatic workflow. Approvals should be granted automatically based on defined criteria, and human approval should be required only when the request falls outside those parameters, such as access to restricted data, use of expensive resources, and non-standard requests.

Over time, developers should be able to request new features through a transparent feature backlog and voting mechanism to engage the community. When creating new features, keep things common wherever possible and provide ways for users to tailor their requests. For example, the standard deployment process might define a step for secrets injection, but some teams will tailor the process to skip it as necessary. This approach has two advantages: It creates a common language and process across teams, and it reduces the work to build and maintain the IDP. Spending a bit more time up front to create customizability pays off over the medium and long term.

For our backend service, the first service we define is deployment to the integrated test environment.

Step 4: Standardize Delivery With CI/CD + GitOps + IaC in One Flow

The principle of the golden path deployment process remains unchanged: You build a software artifact once, and you deploy it multiple times along the environment path. For our backend service, promotion should happen through a versioned change (think GitOps) to the desired environment state, so application version, infrastructure definition, and deployment evidence remain traceable together.

In the build stage, code is prepared in any pre-compile steps, then compiled and packaged with all necessary configuration files. In the deployment process, environment variables are injected, and the package is deployed to the target environment, which is scripted as Infrastructure as Code. The validation itself is usually layered: a technical validation to confirm that the deployment was correct, functional regression of core functionality, and testing the new changes. This sequence is based on speed of feedback, which is important in an automated IDP service.

When a validation check fails, the golden path needs to have defined failure behavior with clear steps to execute. Pipeline failures like a broken build, failed test, or policy violation will block progression automatically. If the environment is materially impacted, a rollback is automatically initiated. Only in rare cases should a human evaluation be required, for example, when the level of ambiguity is too high and impacts stakeholders who are using the environment.

Some policy violations can be treated with time-bound exceptions, such as allowing a new security vulnerability in a non-production environment. This allows functional testing to continue while the team remediates the security vulnerability. Prior to going live, the exception would be removed so the security vulnerability doesn’t progress to production. These types of exceptions should be set to auto-expire to prevent them from being forgotten later.
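The time-bound exceptions described above could be modeled as records with an expiry timestamp that the pipeline checks on every run. The JavaScript sketch below shows one way to do that; it is not tied to any particular policy engine, and the `evaluateViolation` function, the field names, the policy ID, and the environment names are all assumptions made for this example.

```javascript
// Hypothetical policy-exception check; field names and environment names are assumed.
// Returns "warn" when an unexpired exception covers the violation, otherwise "block".
function evaluateViolation(violation, exceptions, now) {
  // In this sketch, exceptions never apply to production.
  if (violation.environment === "prod") return "block";

  const hasActiveException = exceptions.some(
    (ex) =>
      ex.policyId === violation.policyId &&
      ex.environment === violation.environment &&
      now.getTime() < new Date(ex.expiresAt).getTime()
  );
  return hasActiveException ? "warn" : "block";
}

// Invented sample data for illustration.
const exceptions = [
  {
    policyId: "no-critical-cves",
    environment: "integration-test",
    expiresAt: "2026-07-15T00:00:00Z",
  },
];
const violation = { policyId: "no-critical-cves", environment: "integration-test" };

console.log(evaluateViolation(violation, exceptions, new Date("2026-07-01T00:00:00Z")));
console.log(evaluateViolation(violation, exceptions, new Date("2026-08-01T00:00:00Z")));
```

Because expiry is evaluated at check time, nobody has to remember to delete the exception: once the date passes, the same violation blocks the pipeline again.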
Golden Path Steps and Guardrails

Step | Self-service action | Guardrail | Evidence
---- | ------------------- | --------- | --------
Build | Trigger pipeline via check-in action in source control | Code scan and unit test results | Build log, composition scan result
Promote to non-prod environment | Merge to staging branch, promotion request | Technical validation, regression test | Test results
Promote to prod | Promotion request | Approval and compliance check | Approval and audit trail
Rollback | Automated trigger or manual request | Post-rollback validation and regression test | Test results

Step 5: Bake in Operability for Observability and Day-2 Readiness

IDPs reduce cognitive load and toil as solutions to common concerns are built in. This is especially true for the operational concerns. Each workflow and self-service feature creates the log files and traces for auditability. All code and configuration are driven from version control, and the metrics recorded provide insights into the outcomes and performance of the IDP.

New operational initiatives, like introducing a software bill of materials, can be rolled out across all technologies that use the IDP. When done correctly, templates can be updated centrally, and the log files provide full auditability to identify where old versions are still in use, reducing the overall security exposure. The IDP governance model needs to define the ownership of templates and any inheritance rules. For instance, some teams will tailor the template by adding additional steps required for their technology.

Alongside the IDP instrumentation, standard dashboards and alert definitions ship with the template, pre-wired to the appropriate ownership group. Who responds to what is documented, not assumed. Runbooks and escalation paths are stored in version control alongside the service itself so they evolve with the system rather than rotting in a forgotten wiki page.

Our backend service will include the following with the golden path:

  • Logs, metrics, and traces
  • Alerts
  • Runbook link
  • Ownership metadata

The final piece is the feedback loop. Incidents, near-misses, and recurring friction points are resolved and also used to help continuously improve the platform, each first becoming a backlog item.

Step 6: Add Guardrails and Governance Without Slowing Delivery

The IDP should leverage approved templates where possible and embed basic compliance and policy checks in the workflows. Platform developers will receive immediate feedback on any problems they need to fix. When issue resolution requires a longer time, time-bound exceptions can be allowed. Along the environment path from development to production, the quality gates should become more restrictive as the software quality improves.

For our backend service, we define security scanning prior to deployments, and we don’t accept any deviations from the corporate standard for it. We follow a simple block, warn, escalate paradigm. The goal is to address problems that teams can deal with immediately and provide enough time for more complex work. This balance allows work to flow at pace.

It is important to version templates and workflows so you can track what is in use. When significant problems are identified with a version, you can use the IDP logs to find any items in use and replace them quickly.

Having the right guardrails in place might feel restrictive but in fact reduces the amount of rework over time as there are fewer incidents. Fast feedback reduces the time it takes to resolve problems.

Step 7: Measure Adoption, DevEx, and Platform ROI

One of the key success factors for IDPs is having the ability to measure adoption (covered earlier), developer experience, and platform ROI (e.g., DORA, SPACE). This allows you to break down and distinguish between adoption measures and outcome metrics. Implementing these criteria in the platform from the beginning captures data systematically.
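As an illustration of capturing outcome data systematically, the JavaScript sketch below computes two commonly used outcome metrics, deployment failure rate and mean time to restore (MTTR), from plain records. The record shapes, field names, and sample values are assumptions made for this example; a real platform would read them from pipeline and incident logs.

```javascript
// Hypothetical record shapes; adapt the field names to your own pipeline data.

// Share of deployments that failed, as a number between 0 and 1.
function deploymentFailureRate(deployments) {
  if (deployments.length === 0) return 0;
  const failed = deployments.filter((d) => d.status === "failed").length;
  return failed / deployments.length;
}

// Mean time to restore in minutes, from opened/resolved timestamps.
function meanTimeToRestoreMinutes(incidents) {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, incident) =>
      sum + (new Date(incident.resolvedAt).getTime() - new Date(incident.openedAt).getTime()),
    0
  );
  return totalMs / incidents.length / 60000;
}

// Invented sample data for illustration.
const deployments = [
  { id: 1, status: "succeeded" },
  { id: 2, status: "failed" },
  { id: 3, status: "succeeded" },
  { id: 4, status: "succeeded" },
];
const incidents = [
  { openedAt: "2026-03-01T10:00:00Z", resolvedAt: "2026-03-01T10:30:00Z" },
  { openedAt: "2026-03-05T14:00:00Z", resolvedAt: "2026-03-05T15:30:00Z" },
];

console.log(deploymentFailureRate(deployments));
console.log(meanTimeToRestoreMinutes(incidents));
```

Keeping each metric as a small pure function over logged records makes it straightforward to recompute the numbers per team, per template version, or per time window.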
Good adoption measures to start with: number of executed workflows, number and currency of templates, and number of active users. The following outcome metrics can also be used as part of the business case for IDPs: deployment failure rate, MTTR, incident volumes, number of tickets, and security vulnerabilities.

The team managing the IDP should actively use the metrics together with captured feedback from the user base (e.g., feature requests) to prioritize the backlog. Executive dashboards should be implemented to provide accountability and increase support across the organization.

A Minimal IDP You Can Scale

Bringing it together, take the following actions to kick-start your internal developer platform:

  • Choose a common and not too complex workflow for your first golden path
  • Create the code repository and CI/CD pipeline
  • Define a self-service UI for the workflow
  • Embed quality gates, metrics, and operational tooling into the workflow

Start with one workflow for one pilot team, prove the path, then extend to the next workflow or team. Don’t forget to engage with the pilot users to receive feedback and support adoption.

If you want to dive deeper, explore the CNCF Platforms for Cloud-Native Computing whitepaper and Platform Engineering Maturity Model.

This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

By Mirco Hering, DZone Core
Feature Flag Debt: Performance Impact in Enterprise Applications

Feature flags have become standard practice in enterprise applications, enabling teams to release code into production environments without exposing new features to users. As teams leverage feature flags to increase delivery velocity, technical debt accumulates. Left unchecked, this debt will slowly and silently impact application performance, maintainability, and developer productivity.

What Is Feature Flag Debt?

Feature flag debt occurs when feature flags are left in the codebase after they’ve served their purpose. The most common symptoms of feature flag debt include:

  • Dead code
  • Context switching for developers

Feature flag debt can go unnoticed because it typically doesn’t cause broken features. As a result, developers are often reluctant to clean up flags so they can focus on developing new features.

Impact on Performance

Feature flag debt can have serious consequences for application performance. In front-end applications, this is often overlooked. Once a feature flag has been introduced into a codebase, it incurs a long-term cost every time the application is loaded in the browser.

  • Larger JS bundles: Each feature flag adds logic to the application. When feature flags are not cleaned up, the associated code is typically not removed from the final bundled app. This means more code for users to download and more memory used on the client.
  • Reduced execution speed in client-side rendering: The browser must download, parse, and evaluate the entire bundle, even if certain code paths are never executed. This leads to slower parsing, longer load times, and slower interaction time.

Impact on Developer Productivity

Feature flag debt also negatively impacts developer productivity. Imagine having to read through an if/else statement that checks a feature flag that will never be true. Developers frequently encounter this scenario when working with feature flags. New engineers, in particular, often struggle to know which feature flags are safe to ignore. Should they be commenting out this code? What if they need it later?

Why Aren’t Feature Flags Cleaned Up?

It should be standard practice to remove feature flags from the codebase once they’re no longer needed. However, they often become a long-term liability for the application for several reasons:

  • Nobody takes responsibility for cleaning up flags.
  • People are afraid to remove code.
  • There are no tools to help automate the process.
  • There’s always something more pressing to work on.

We often don’t see a defined feature flag lifecycle, which leads to indefinite accumulation.

Example of Feature Flag Debt

For example, let’s take a look at how a feature would typically look when wrapped in a feature flag:

JavaScript

    const isAIAgentsFeatureFlagEnabled = isFeatureEnabled('ai-agents');

    if (isAIAgentsFeatureFlagEnabled) {
      // lines of code
      // Code to run when the feature flag is enabled
    } else {
      // lines of code
      // Code to run when the feature flag is disabled
    }

When first implemented, this doesn’t look too bad. When this feature is rolled out to production, there’s still the safety net of keeping the original functionality should something go wrong. However, after the feature flag is turned on for everyone and the feature reaches general availability (GA), there is no reason to keep both pathways in the application. The application still ships both pieces of code in the bundle, but only one will ever execute at runtime. The else block now represents dead code that will not get executed, but still takes up space in the bundle and adds to code complexity.

Manage and Eliminate Feature Flag Debt

Organizations need to take measures to prevent feature flag debt from slowing down their applications. Defining a feature flag lifecycle is a great place to start. By enforcing that each feature flag has a description, owner, status, and expiration date, the team can ensure flags aren’t left to become debt. Treat feature flags as temporary and not part of the application's core architecture. When the feature is in GA, remove the flag and delete any code paths that are no longer needed. This results in a cleaner, more maintainable, and performant codebase.

JSON

    [
      {
        "feature_flag_name": "ai-agents",
        "description": "Feature flag that will allow AI agents to assist users with workflows and provide suggestions",
        "owner": "architecture crew",
        "status": "GA",
        "expiration_date": "2026-12-31"
      },
      {
        "feature_flag_name": "smart-checkout",
        "description": "Feature flag that will allow smart checkout features, including dynamic pricing, custom offers",
        "owner": "architecture crew",
        "status": "Dev",
        "expiration_date": "2026-12-31"
      },
      {
        "feature_flag_name": "ai-agents-eval",
        "description": "Feature flag to allow the evaluation framework to execute tests against AI agents to determine how accurate they are",
        "owner": "agent evaluation crew",
        "status": "QA",
        "expiration_date": "2026-10-12"
      },
      {
        "feature_flag_name": "experiment-recommendation-v2",
        "description": "Feature flag for experimenting v2 recommendation version",
        "owner": "agent evaluation crew",
        "status": "GA",
        "expiration_date": "2026-12-31"
      }
    ]

Having the feature flags stored in a format similar to the above can help identify who to contact to clean up old flags.

Performance Gains From Cleanup

Removing unused feature flags reduces bundle size and eliminates unnecessary code execution, resulting in faster load times, improved rendering performance, and a cleaner codebase.

Conclusion

For most enterprise applications, feature flags aren’t the problem; it’s forgetting to take them down. As the application grows over time, old feature flags accumulate, which will silently bloat the bundle size, degrade performance, and clutter the code.
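A registry like the JSON above can also be checked automatically, which addresses the "no tools" and "nobody takes responsibility" problems listed earlier. The JavaScript sketch below is one minimal way to do that, assuming the same field names as the JSON; the `findFlagsToCleanUp` function and the rule that a flag is due for cleanup when its status is GA or its expiration date has passed are assumptions for illustration.

```javascript
// Hypothetical cleanup report based on the registry fields shown in the article.
// A flag is reported when its feature is GA or its expiration date has passed.
function findFlagsToCleanUp(flags, today) {
  return flags
    .filter(
      (flag) =>
        flag.status === "GA" ||
        new Date(flag.expiration_date).getTime() < today.getTime()
    )
    .map((flag) => ({
      name: flag.feature_flag_name,
      owner: flag.owner,
      reason: flag.status === "GA" ? "feature is GA" : "flag expired",
    }));
}

// Three of the registry entries from above, with descriptions omitted for brevity.
const registry = [
  { feature_flag_name: "ai-agents", owner: "architecture crew", status: "GA", expiration_date: "2026-12-31" },
  { feature_flag_name: "smart-checkout", owner: "architecture crew", status: "Dev", expiration_date: "2026-12-31" },
  { feature_flag_name: "ai-agents-eval", owner: "agent evaluation crew", status: "QA", expiration_date: "2026-10-12" },
];

console.log(findFlagsToCleanUp(registry, new Date("2026-11-01T00:00:00Z")));
```

Run on a schedule or in CI, a report like this turns the expiration date from documentation into a reminder that reaches the owning team.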

By Poornakumar Rasiraju

The Latest Culture and Methodologies Topics

A Practical Guide to Temporal Workflow Design Patterns
Learn Temporal workflow design patterns for reliable distributed systems using durable execution, sagas, polling, fan-out/fan-in, signals, and versioning.
June 18, 2026
by Akhil Madineni
· 306 Views
WebSocket Debugging Without a Proxy — A Browser-First Workflow
A proxy-free workflow — online tester for endpoint validation, Chrome extension for live frame interception and transformation, no server access needed.
June 17, 2026
by Dan Pan
· 678 Views
Cutting Data Pipeline Costs and Data Freshness Issues With Netflix Maestro and Apache Iceberg: A Practical Tutorial
Iceberg replaces filesystem state with a metadata tree (cheap queries, ACID snapshots). Maestro replaces cron with event signals (fresh data).
June 16, 2026
by Intiaz Shaik
· 4,374 Views · 1 Like
Workflows vs AI Agents vs Multi-Agent Systems: A Practical Guide for Developers
Use workflows for control, agents for flexibility, and multi-agent systems only when complexity truly demands it. Add intelligence only where it makes a real difference.
June 15, 2026
by Raju Dandigam
· 10,851 Views · 1 Like
From "Vibe Coding" to Production: Setting Up an Evals Loop for Claude Agents
Replacing unreliable “vibe coding” with a rigorous automated evaluation loop using curated datasets, Claude judge agents, and metric tracking for production AI agents.
June 11, 2026
by Nikita Kothari
· 1,848 Views · 1 Like
A Deep Dive into Tracing Agentic Workflows (Part 2)
Tracing agentic systems uses hierarchical IDs to form a System DAG, exposing performance and cost issues. Observer agents automate diagnosis and system self-correction.
June 10, 2026
by VIVEK KATARYA
· 1,165 Views
Orchestrating Zero-Downtime Deployments With Temporal
Temporal provides the durable control plane for safe zero-downtime deployments across canaries, approvals, retries, and rollbacks.
June 10, 2026
by Akhil Madineni
· 1,035 Views
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
A technical deep dive into Amazon Quick — how it works, how it connects to your tools via MCP, and where it sits in the AWS agent stack.
June 9, 2026
by Jubin Abhishek Soni (DZone Core)
· 1,416 Views
How to Build an Agentic AI SRE Co-Pilot for Incident Response
Build an agentic SRE co-pilot using LLMs to autonomously reason, plan, and execute incident response across complex, multi-cloud infrastructure.
June 8, 2026
by Akshay Pratinav
· 1,120 Views
Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End
Learn how to trace AI agents end to end, from prompts and tool calls to business outcomes, with observability practices for production workflows.
June 5, 2026
by Srinivas Chippagiri (DZone Core)
· 2,997 Views · 1 Like
Why Your Test Automation Is Always Behind the Code And the Architecture That Fixes It
Most QA teams are stuck in a manual scripting loop. Here's the requirement-driven architecture that eliminates the coverage gap permanently.
June 5, 2026
by Waqar Hashmi
· 2,029 Views
Identity in Action
A practical guide to SSO migration covering risks, MFA, phased rollout, and governance to ensure secure identity transitions without disruption.
June 3, 2026
by Kapil Chakravarthy Sanubala
· 2,556 Views · 3 Likes
Getting Started With Agentic Workflows in Java and Quarkus
A step-by-step tutorial on how to add agentic workflows to Quarkus applications with the Agentican framework via YAML and annotations.
June 3, 2026
by Shane Johnson
· 2,400 Views · 3 Likes
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
Many MVPs get too big because teams treat several user-facing systems and vendor-dependent workflows as one app instead of planning one complete path first.
June 2, 2026
by Kajol Shah
· 1,535 Views
The Agentic Agile Office: Streamlining Enterprise Agile With Autonomous AI Agents
Agentic Agile Office uses autonomous AI agents to cut admin overhead, detect risks early, and shift teams from manual tracking to intelligent, high-velocity delivery.
June 1, 2026
by Madhusudhan Chivukula
· 1,449 Views · 1 Like
Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines
Learn how to build an internal developer platform with golden paths, GitOps, CI/CD, observability, and governance built into workflows.
May 28, 2026
by Mirco Hering (DZone Core)
· 2,510 Views · 1 Like
Feature Flag Debt: Performance Impact in Enterprise Applications
Feature flags help teams move fast, but when they’re not cleaned up, they quietly add extra code, slow down performance, and make applications harder to maintain.
May 27, 2026
by Poornakumar Rasiraju
· 3,875 Views · 1 Like
DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform
A practical checklist for platform engineering teams to improve DevOps, golden paths, reliability, governance, and developer experience at scale.
May 27, 2026
by Josephine Eskaline Joyce (DZone Core)
· 2,671 Views · 1 Like
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Liquid Clustering replaces rigid partitioning and Z-Order with adaptive clustering in Unity Catalog, improving performance with less maintenance.
May 26, 2026
by Seshendranath Balla Venkata
· 2,536 Views · 1 Like
Architecting an Embedded Efficiency Layer: A Platform Deep Dive into Day-Two Operational Tuning
Learn how platform teams can embed continuous optimization into internal developer platforms using GitOps, HITL workflows, and full-stack tuning.
May 26, 2026
by Graziano Casto (DZone Core)
· 2,055 Views · 1 Like
