DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Testing, Deployment, and Maintenance

The final step in the SDLC, and arguably the most crucial, is the testing, deployment, and maintenance of development environments and applications. DZone's category for these SDLC stages serves as the pinnacle of application planning, design, and coding. The Zones in this category offer invaluable insights to help developers test, observe, deliver, deploy, and maintain their development and production environments.

Functions of Testing, Deployment, and Maintenance

Deployment

Deployment

In the SDLC, deployment is the final lever that must be pulled to make an application or system ready for use. Whether it's a bug fix or new release, the deployment phase is the culminating event to see how something works in production. This Zone covers resources on all developers’ deployment necessities, including configuration management, pull requests, version control, package managers, and more.

DevOps and CI/CD

DevOps and CI/CD

The cultural movement that is DevOps — which, in short, encourages close collaboration among developers, IT operations, and system admins — also encompasses a set of tools, techniques, and practices. As part of DevOps, the CI/CD process incorporates automation into the SDLC, allowing teams to integrate and deliver incremental changes iteratively and at a quicker pace. Together, these human- and technology-oriented elements enable smooth, fast, and quality software releases. This Zone is your go-to source on all things DevOps and CI/CD (end to end!).

Maintenance

Maintenance

A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.

Monitoring and Observability

Monitoring and Observability

Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.

Testing, Tools, and Frameworks

Testing, Tools, and Frameworks

The Testing, Tools, and Frameworks Zone encapsulates one of the final stages of the SDLC as it ensures that your application and/or environment is ready for deployment. From walking you through the tools and frameworks tailored to your specific development needs to leveraging testing practices to evaluate and verify that your product or application does what it is required to do, this Zone covers everything you need to set yourself up for success.

Latest Premium Content
Trend Report
Platform Engineering and DevOps
Platform Engineering and DevOps
Trend Report
Security by Design
Security by Design
Trend Report
Software Supply Chain Security
Software Supply Chain Security
Refcard #290
Getting Started With Log Management
Getting Started With Log Management

DZone's Featured Testing, Deployment, and Maintenance Resources

Identity in Action

Identity in Action

By Kapil Chakravarthy Sanubala
Switching from one single sign-on (SSO) vendor to another is a complex process that involves more than just changing technologies. This is a high-stakes identity operation that impacts security, user experience, following the rules, accessing applications, and keeping things running smoothly. It's not the same as moving a reporting tool or a collaboration platform because SSO is at the front door of every application in your environment. If you set it up wrong, everything will stop working. But the biggest danger of SSO migrations is not that they won't work. The little things that go wrong are the most annoying Users being locked out of apps that are important to the businessAccounts being left alone that were never deprovisionedMFA enrollments disappearing without a word and Helpdesk queues are getting longer on the morning of cutover because there was no communication about the change. This guide discusses the best ways to move to cloud SSO and the most important things to keep in mind. It discusses everything from getting the identity estate ready for the move of integrations to phased rollout strategies, making the user experience as smooth as possible, and planning for MFA migration. Why Businesses Change SSO Providers Companies don't usually change their SSO platforms on a whim. One of the following things usually makes it happen: Acquisition of a vendor or announcement of the end of a product's life. Cost consolidation or figuring out how to use enterprise licenses. Standardizing platforms under a broader cloud strategy. Requirements for compliance or regulation that the current business can't meet. Issues with scalability, performance, or missing features in the current platform.A merger or acquisition that introduces a second identity domain. Whatever the reason, migration causes compounding risk since SSO is foundational infrastructure, not an individual application. 3 Types of Migration Approaches and Their Differences There are three main ways to move to SSO, and each one has its risks and effects on governance. Federated Protocol Swap Retain the same IdP architecture but replace the vendor platform underneath. For example, moving from PingFederate to Entra ID External Identities. The protocol (SAML, OIDC, SCIM) may remain the same, but attribute mappings, claim transformations, and session behaviors differ in ways that are often not clear until something breaks in production. Full IdP Replacement The old IdP is completely removed, and a new one is put in its place. Need to set up, test, and cut over every connection with a service provider (SP) again. This type has the most risk, and it's also the one that most businesses don't consider. Consolidation Migration A single authoritative platform brings together many IdPs. Such an event can happen when companies merge or acquire another. There are technical and organizational problems, such as different business units having different app owners, SLAs, and levels of tolerance for disruption. Governance alignment needs to happen before any technical work can begin. Migration Process: The 7 Steps Audit and clean upPlan and PrepareMFA MigrationCommunication PlanningPhased RolloutGovernance ConsiderationDecommission and close out Step 1: Audit and Clean up Most organizations rush, ignore, and migrate everything, including unused applications, inactive users, orphaned accounts, and integrations that have remained unused for three years. These don't break, but leave a security risk. Following validations reduces testing and inventory. Create a complete, clean list of applications: Validate against the CMDB or application catalog.Validate apps being used.Validate access logs from SIEM.Validate against IGA platforms.Reduce redundant applications. Create a complete, clean list of valid users: Active users.Exclude accounts with no activity for 90 days. Exclude dormant accounts whose passwords were never changed.Validate against IGA platforms and HR systems. Mark the unused applications for the decommissioning process. Note down the protocols used (SAML, OIDC, WS-Federation, or legacy), application owners, attributes and claims, MFA requirements, CA policies, and session time-out configurations. Step 2: Plan and Prepare Every application that relies on SSO consumes identity attributes passed in SSO protocols. New IdPs rarely use the same attributes and often have case-sensitive and format changes. These mismatches cause silent authentication failures and will be extremely difficult to diagnose during cutover. Application Metadata Prepare the claims transformation registry. Confirm the case and formats.Validate transformation rules. Redirect URLs For each application, configure a transparent redirect from the legacy IdP login URL (or intranet homepage) to the new IdP's login endpoint. The user will not experience major changes. The only change a user would notice would be the new MFA prompt. Rollback Process Identify when you should roll back.Who will be able to make the rollback decision? Rollbacks generally occur in the following use cases: The rate of successful authentications drops below 95%.Validate SSO failures for major applications.More calls to the help desk than usual during the first 2 days of migration. Migration go-live Documentation regarding new login flow end-to-endPlan for extended staff during the migration. Validate helpdesk access to the new platform.Identify and set up escalation contacts for issues that the helpdesk cannot resolve. Step 3: MFA Migration Prepare a complete inventory of existing MFA enrollments that includes How many users have MFA enrolled vs. password only? What factors are in use? Authenticator Apps – Need to re-enrollSMS – Same phone number and email can be used. Hardware token – FIDO2/WebAuthn keys can be reused if the new vendor supports itBiometrics – Need to re-enroll.How many and which users have only a single factor enrolled? Follow the steps for re-enrollment: Open the self-service enrollment portal.Phone numbers and emails can be reused (since they remain the same).Send advance communications at least two weeks out, explaining what will change and why.Track re-enrollment completion rates by department and group.Send follow-up emails, including deadlines.Set up a plan to re-enroll privileged accounts. Step 4: Communication Plan Communication is a major step in the migration process and should be tracked as a separate workstream, treated with its timeline, owners, deadline, and success metrics. There are three different audiences involved in SSO migration. End users who simply need to know what will change and what to do.Helpdesk and IT staff who need operational readiness confirmations.Stakeholders who need status updates and risk visibility. Major email templates include: General UpdatesMFA-Enrollment NoticesCut Over Day notification Step 5: Phased Rollout Never perform a cutover for the entire organization. Instead, choose a phased rollout. This reduces risk, helps validate configurations in production with real users and real traffic, and provides time to identify issues before affecting most of the organization. First Phase—Technology users Internal IT staff.Identity administrator.Helpdesk personnel.power users.Second Phase - High-frequency application users like ERP applications CRM applications Collaboration platform BI toolsThird Phase—General user population Lower-risk departmentsExceptions and low-activity users ContractorsUsers who log in very lessThird-party users Step 6: Governance Considerations To ensure successful migration and validations, consider the following governance aspects: Changes to IGA Solutions JML changes Provisioning accounts in IDP with required attributes for SSO claims.Disabling or deletion of accounts during terminations.User transfers: changes to account attributes and group memberships.Changing birthright roles Update with new SSO groups.Cleanup of legacy vendor applications. Audit Log Monitoring Onboard logs from new vendor to SIEMSet up alerts for notifications, including Authentication failuresCA policy failuresPassword failuresToken expiration Non-Human Identities Create a separate inventory of NHA accounts and migrate their credentials to the new system. These include accounts with no owners. Step 7: Decommission and Close Out The process can move forward once all the checks are done and the MFA enrollments are at acceptable levels. Monitor the new system for 30 days and plan for the decommissioning of the old SSO solution. Conclusion SSO is the authentication layer for all the applications in the organization. Performing migration without a proper plan includes risk. Most companies follow one or a combination of the above-described approaches. Adhering to a proper plan with communication and the right strategies will never make you think about rollback strategies. More
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1

Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1

By Sangharsh Agarwal
I set out to build a simple Slack bot that could answer questions about our GitHub repository — open bugs, pending PRs, and recent releases. Straightforward enough. It turned into 400 lines of API glue code. When I asked Claude, ChatGPT, Gemini, and several coding assistants for architecture advice, they all converged on the same conventional pattern: What every AI suggestedWhat it means in practice1. Slack receives the mentionWrite a GitHub REST client2. Bot calls GitHub REST APIRouting logic per question type3. Feed response into Claude/GPTPagination per endpoint4. Model formats the answerMaintain API versions5. Bot posts back to SlackRepeat for every new data source This works. I built it. Three days, 400 lines of API client code, and it answered perhaps 60% of the questions my team asked. Questions like "Are any critical bugs related to PRs merged this week?" required custom correlation logic across multiple endpoints. Every new question type meant new code. Adding error monitoring as a second data source meant a separate integration entirely. After digging deeper into how AWS Bedrock handles tool use, I discovered the Model Context Protocol. I rebuilt the same bot in an afternoon — 150 lines, answering a far wider range of questions, and adding a new data source is a handful of lines in a single function. This article explains what changed and why it matters. The core insight: don't build an API client that feeds a model. Build a model that calls tools. These are fundamentally different architectures. The Architecture: Three Layers, One Loop The system is built in three layers. Each has exactly one responsibility and hands off cleanly to the next: Slack (Socket Mode) User types @mention → question received ↓ question passed to agent AWS Bedrock — Claude (Agent Loop) Reason → decide tools → call → read results → repeat ↓ tool calls routed via registry MCP Servers (GitHub + any other) 40+ tools per server — issues, PRs, releases, code search… ↓ tool results → reasoning → formatted answer → Slack Slack receives the @mention and passes the question down. Bedrock runs the agent loop — Claude reasons about which GitHub MCP tools to call, executes them, reads the results, and loops until it has enough data to answer. The tool registry routes each call to the correct MCP server automatically. The answer travels back up to Slack. Before vs. After: A Real Question To understand why this matters, consider a specific question a developer might ask in Slack: "Are any critical bugs related to PRs merged this week?" On the surface, this seems simple. But answering it correctly requires data from two separate GitHub API endpoints — the issues API for bugs, and the pull requests API for recent merges — and then correlation logic to match issue references in PR descriptions. If you are writing a traditional bot, you need to anticipate this question, write the two API calls, handle pagination on each, and write the join logic. Now imagine a dozen different question types. Each one is a new coding task. Traditional approachMCP approach1. Search GitHub for critical bugsClaude calls list_merged_prs (this week)2. Search for PRs merged this weekClaude calls search_issues (critical bugs)3. Write correlation logic across bothClaude calls get_issue for each candidate4. Handle pagination on each endpointClaude cross-references links in PR bodies5. Feed combined data to model to formatClaude returns correlated, formatted answer6. New question? Write new logic.New question? Model figures out new tools. What makes the MCP approach powerful is not just the line count — it is what the model is doing. Claude receives the full JSON Schema for every available GitHub tool at startup. When the question arrives, it reasons over those tool descriptions, selects the relevant ones, calls them in the right order, and then reasons over the combined results to produce an answer. It does not need to be told: "for bug questions, use search_issues". It reads the tool description and figures that out. The result is that the model can handle questions you never anticipated. "Show me PRs merged this week still linked to open bugs" — a slightly different framing of the same question — works without any code changes, because Claude adapts its tool selection to the new phrasing. Example Slack response: Plain Text :rotating_light: *Critical Bugs Linked to Recent PRs* • <https://github.com/org/repo/issues/1234|#1234> — Payment processing failure (linked to <https://github.com/org/repo/pull/5678|PR #5678>, merged Apr 14) • <https://github.com/org/repo/issues/1290|#1290> — Auth token timeout on mobile (linked to <https://github.com/org/repo/pull/5691|PR #5691>, merged Apr 15) Summary: 2 critical bugs found. Both linked to PRs merged this week. 6 tool calls: list merged PRs, search critical issues, get_issue per candidate. What the Model Context Protocol Does MCP is an open standard that lets AI models discover and call external tools through a uniform interface. Every MCP server exposes a tools/list endpoint returning every available action as a full JSON Schema. The model loads these at startup and reasons over them autonomously. Your application code never routes a single query. GitHub's official MCP server at api.githubcopilot.com/mcp/ exposes 40+ tools — issues, PRs, releases, code search — and a single GitHub token is all the authentication required. The shift is architectural, not cosmetic. The conventional model is a formatter — it receives data you fetched. The MCP model is a reasoning agent — it decides what to fetch, fetches it, and synthesizes the results. The first scales with the API code you write. The second scales with the MCP ecosystem. Why SRE and Platform Teams Should Care This bot started as a developer productivity tool. But when our SRE and platform engineering teams reviewed the architecture, they saw something broader: a pattern that could eliminate an entire category of operational toil. Platform teams spend considerable time maintaining integrations — every API change means updating a client, every new data source means a new integration project. The MCP pattern changes that calculus entirely. Integration toil. MCP server owners maintain compatibility with their own APIs. When GitHub updates its REST API, GitHub's MCP server absorbs that change. You own zero API client code.API drift. Traditional bots silently degrade when response schemas change. With MCP, the server owner tracks those changes — your bot keeps working.Correlation complexity. Linking deploys to errors, PRs to bugs, incidents to changesets — this logic is brittle in code and breaks constantly. Models do this naturally by reasoning across tool results in context.Platform rebuilds for new capabilities. Each new MCP server extends the bot without touching the agent loop. The loop is infrastructure. The servers are plugins. New team joins? New tool added? It is configuration, not development.The compounding effect matters most: every new MCP server registered is immediately available for any question the model asks. Traditional integrations accumulate glue code. MCP integrations accumulate capabilities. Conclusion The conventional approach to building AI-powered developer tools is not wrong — it works, and many teams run it successfully. But it carries a hidden cost: every new capability requires new code, every new data source requires a new integration, and every API change requires maintenance. Over time, that cost compounds. The Model Context Protocol eliminates that cost. By exposing tools through a uniform interface that the model discovers at startup, MCP shifts the integration burden away from your codebase and onto the ecosystem. The model reasons about which tools to call. You reason about what questions to answer. Part 1 has covered the why — the architectural distinction, the before/after comparison on a real question, and why this matters especially for SRE and platform teams. Part 2 puts it into practice with the complete implementation, step-by-step setup, and production lessons that make it reliable for daily use. Continue to Part 2: Implementation, Setup, and Production Patterns. Full project code on GitHub: https://github.com/sangharshcs/slack-github-mcp-bot. More
Using LLMs to Automate Data Cleaning and Transformation Pipelines
Using LLMs to Automate Data Cleaning and Transformation Pipelines
By David Taiwo Balogun
When Snowflake Lies to You: Understanding False Failures in dbt Pipelines
When Snowflake Lies to You: Understanding False Failures in dbt Pipelines
By Janani Annur Thiruvengadam DZone Core CORE
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
Stop Debugging Glue Jobs Manually: Building an Agentic Observability Layer for Data Pipelines
By Vivek Venkatesan
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps

Teams often say they are building one app. A lot of the time, that is not true. I saw this while reviewing a telemedicine MVP. At first, the plan sounded simple enough: video visits, messaging, scheduling, and basic records. Then the version-one list kept growing: Patient appprovider dashboardAdmin panelMessagingVideoBillingEHR connectionDevice support later At that point, this was no longer one app. It was several systems being planned as one MVP. A patient-facing productA provider-facing productAn admin productA set of outside-service connections When a team treats all of that like one first release, things get messy before development even starts. The Moment It Stopped Being One App The problem was not the number of screens. The problem was the number of users, roles, and data rules hiding behind those screens. A patient needed intake, booking, reminders, and follow-up. A provider needed schedules, patient context, notes, and quick actions during the day. An admin needed visibility, support tools, and role controls. The outside-services side added video vendors, messaging vendors, EHR work, and, later, device data. That is not one product. That is a group of different systems with different jobs. Once that became obvious, the planning changed. Split the Product by User First Before estimating anything, it helps to split the product by who it is for. For this telemedicine project, the first useful split looked like this: 1. Patient Side This part handled: IntakeBookingRemindersFollow-up messagingJoining a visit The patient's side had to stay simple. It also had to be clear about what the patient could and could not see. 2. Provider Side This part handled: Schedule viewPatient detailsVisit notesQuick responsesRole-based access This was not just a different set of screens. It had different speed needs, different daily habits, and different data access rules. 3. Admin Side This part handled: Role setupSupport actionsVisibility into operationsReportingNon-clinical controls Admin work often looks small during planning. In real projects, it adds a lot of rules and a lot of testing. 4. Outside-Service Work This part handled: Video vendor setupMessaging vendor setupEHR-related workFuture device dataLogging and audit-related movement of data This is where many teams get surprised. Video, messaging, and EHR are not tiny add-ons. Each one brings its own work. Start With Access Rules Before the Feature List In multi-role products, one of the quickest ways to find hidden work is to define access rules early. Before locking the feature list, ask: Who can create this dataWho can read itWho can change itWho can delete itWho can export it For the telemedicine project, this made a big difference. A few features looked simple in the scope doc. Once the team asked who could view or change the related data, the work got much larger. A basic example: Admins can help fix booking problems. That sounds harmless. But then the real questions start: Can admins see messages?Can they see visit notes?Can they see call history?Can they open uploaded files? That one sentence can change a big part of the system. Access rules often show hidden work much faster than a feature list does. Treat Outside Services as Separate Work Another mistake teams make is treating outside services like small items on a checklist. On paper, it can look like this: VideoMessagingEHR later In practice, each one adds its own work: Vendor setupRequest and response formatsError handlingRetry rulesLoggingReplacement cost if the vendor needs to change later That is why these items should be planned separately. For the telemedicine case, once video, messaging, and EHR work were split out from the main product list, the first release became easier to define. Some items that seemed close to launch were clearly not ready for version one. Ship One Complete Path First Once the team stopped calling everything an MVP, the first release got smaller. The version-one path that stayed in looked like this: Patient intakeAppointment bookingSecure video through the chosen vendorFollow-up messagingBasic provider access controls That was enough to test whether the product solved a real problem for a clinic. What moved out of the first release: Deeper EHR workMore reportingDetailed billing flowsDevice supportBroader admin tooling Those things were not bad ideas. They just did not belong in the first build. 4 Simple Documents to Create Before Sprint Planning When a team starts to suspect that one MVP is several systems, four short documents can help a lot. 1. User-to-System Map List each part of the product and the main user for it. 2. Permission Matrix Write down who can create, view, change, delete, and export each type of data. 3. Outside-Service List Separate core product work from vendor work and data that moves in or out of the system. 4. First-Release Path Write the one end-to-end path that version one has to get right. These are short documents, but they make planning much better. Why This Matters Outside Healthcare, Too This lesson is not only for telemedicine. It applies to any multi-role product where the team is building for more than one type of user. That includes: Customer apps with admin panelsSaaS products with back-office toolsPlatforms with provider and client sidesProducts that depend on outside vendors from day one The moment a team has different users with different goals, the work stops being “just one app.” Final Point A lot of MVPs get too big because teams keep calling them one product long after that stops being true. The fix is not always better estimates. Sometimes the fix is much simpler: Split the product by user.Write down the access rules.Separate outside-service work.Ship one complete path first. That makes the first release easier to plan, easier to build, and easier to test.

By Kajol Shah
Optimizing Databricks Spark Pipelines Using Declarative Patterns
Optimizing Databricks Spark Pipelines Using Declarative Patterns

If you've ever inherited a Spark job that runs in 35 minutes and someone asks you to make it faster, you know the routine. You start by checking partition counts, then file sizes, then shuffle stages, then broadcast hints. You find a handwritten OPTIMIZE schedule from 2022, a Z-ORDER on the wrong column, and a cluster sized for last year's data volume. By the time you've made the job fast, you've absorbed three new things to maintain. The next person to inherit it will absorb four. This pattern — call it the hand-tuning treadmill — is what the declarative optimization story on Databricks is trying to break. It's not a single feature; it's a cluster of capabilities that collectively let teams describe what a table should look like and let the engine handle the physical optimizations. What follows is the practical view of those patterns: where they fit, what they replace, and how to migrate without a rewrite weekend. 1. The Hand-Tuning Treadmill: Why Imperative Optimization Doesn't Scale Before getting into the declarative side, it's worth being concrete about what "imperative Spark optimization" actually means in production. The shape is consistent across teams I've audited: Layout decisions frozen on day one. Somebody picks a partition column when the table is created. The data shape changes a year later. Nobody re-partitions because the migration is scary. Query plans drift toward full scans.Maintenance jobs that nobody owns. An OPTIMIZE / Z-ORDER / VACUUM script lives in a notebook scheduled at 3 AM. It runs on a cluster that's slightly mis-sized. When data volume grows, the job runs into the morning workload, and people complain about latency.Cluster sizing as a guess. Worker count is a heuristic from a senior engineer's memory of last year's spike. Half the time it's too big, half the time it's too small, and the cost discussion gets emotional.Hint-driven plans. Broadcast hints, repartition hints, coalesce (N) — sprinkled through pipelines to fix yesterday's problem, kept indefinitely because removing them feels risky. None of these are bugs. They're symptoms of the imperative model: the team owns the layout, the maintenance, the sizing, and the plan tuning. In small pipelines, ownership is fine. At scale, it becomes the bottleneck that the team can't outsource. 2. What "Declarative" Means in the Spark Optimization Context Declarative is a word that gets used in two different ways here, and it's worth pulling them apart. Within Lakeflow pipelines (formerly DLT), it means "describe the tables, not the steps" — the engine builds the DAG and runs it. But in the broader optimization story, declarative also means "describe the desired property of the table or workload, not the operations to maintain it": Layout: I want this table clustered by these columns; figure out when and how to re-cluster.Maintenance: I want this table optimized and vacuumed; figure out the schedule.Ingestion: I want all new files in this path picked up exactly once; figure out checkpointing and listing.Quality: These rows must satisfy these expectations; enforce them and report what gets dropped.Compute: I want this query fast and not wasteful; size and scale appropriately. Each one of those bullets corresponds to a piece of the declarative stack. Used together, they replace a remarkable amount of the boilerplate that has historically lived in Spark pipelines. The mental shift: You stop writing operations against the table and start writing properties of the table. The engine becomes the actor; you become the editor. 3. The Declarative Optimization Stack on Databricks The chart below maps each thing the team declares to the engine capability that handles it, ending at the physical Delta table. It's the picture I draw on whiteboards when teams ask, "What's the order to adopt these in?" Figure 1. The declarative optimization stack: each user-facing intent at the top maps to a continuous engine behavior, which keeps the underlying Delta tables well-clustered, compacted, and statistically up-to-date — without human intervention. Two things are worth highlighting in this picture. First, every box in the engine row is something that runs continuously, not on a cron — there is no daily "optimization window" anymore. Second, the bottom layer is identical to what you'd get from any well-tuned imperative pipeline: 256 MB Parquet files with current statistics. The declarative path doesn't change what good looks like; it changes who does the work to keep things looking good. 4. Layout: Liquid Clustering Replaces Hand-Maintained Z-ORDER Liquid Clustering is the change with the largest practical impact, because partition-key choices are where most lakehouse pipelines accumulate the most technical debt. The declarative version: you specify the columns the data is most often filtered or joined by, and the engine maintains a layout that supports those access patterns — incrementally, as new data arrives, without a full rewrite. When access patterns change, you change the cluster columns, and the engine re-clusters in the background. Defining Liquid-Clustered Tables SQL -- New table, clustered by the columns most commonly filtered on. -- No more PARTITIONED BY, no more guessing at partition cardinality. CREATE TABLE prod.gold.daily_totals ( account_id STRING, region STRING, ingest_date DATE, daily_total DECIMAL(18,2), txn_count BIGINT ) USING DELTA CLUSTER BY (region, ingest_date, account_id); -- Even better: let the engine pick the clustering columns by -- observing real query patterns over time. CREATE TABLE prod.gold.events_clustered USING DELTA CLUSTER BY AUTO AS SELECT * FROM prod.silver.events; Migrating an Existing Partitioned/Z-ORDER Table SQL -- Convert a legacy partitioned table to liquid clustering. -- Existing data files are not rewritten immediately; the engine -- rebalances incrementally on subsequent writes + maintenance. ALTER TABLE prod.silver.transactions CLUSTER BY (account_id, ingest_date); -- Force the first clustering pass for a freshly converted table OPTIMIZE prod.silver.transactions FULL; Why this matters: the recurring 2 AM Slack thread of "can we re-partition this table?" goes away. Layout becomes a property you change with one DDL statement, not a multi-week rewrite project. 5. Maintenance: Predictive Optimization Replaces Cron-Driven OPTIMIZE/VACUUM Predictive optimization is the part that retired the most legacy code in the pipelines I've migrated. Once enabled at the catalog or schema level, the engine monitors each table's read and write patterns and decides on its own when to compact files, re-cluster, vacuum, and refresh statistics. The big win isn't the operations themselves — the imperative pipeline could already run those — it's that the timing is observed-driven, not schedule-driven. Tables that get heavy ingestion get more frequent maintenance. Cold tables get left alone. SQL -- Turn it on at the catalog level once; new tables inherit. ALTER CATALOG prod SET PREDICTIVE OPTIMIZATION = ENABLED; -- Or at the schema level for a phased rollout ALTER SCHEMA prod.gold SET PREDICTIVE OPTIMIZATION = ENABLED; -- Inspect what the engine has been doing on a given table SELECT operation, operation_metrics.numFilesAdded AS files_added, operation_metrics.numFilesRemoved AS files_removed, operation_metrics.numOutputBytes AS output_bytes, timestamp FROM (DESCRIBE HISTORY prod.gold.daily_totals) WHERE userMetadata IS NULL -- engine-driven, not user AND operation IN ('OPTIMIZE', 'VACUUM') AND timestamp >= current_timestamp() - INTERVAL 7 DAYS ORDER BY timestamp DESC; What you should delete after enabling this: the nightly notebook that runs OPTIMIZE on every table in a schema, the VACUUM cron job, the ANALYZE TABLE wrapper, and the alerting that wakes someone up when those jobs run long. None of them are needed anymore, and leaving them on creates duplicate work that the engine and the cron will fight over. 6. Ingestion: Auto Loader Replaces Listing-Based File Detection Auto Loader is the declarative answer to the perennial "which files have we processed already?" problem. Instead of listing a directory, comparing it to a state file, and figuring out the new bits, you describe the source location and the format and let the engine maintain its own incremental state. It uses cloud-native event notifications (S3 events, ADLS notifications, or efficient directory listing as a fallback), and the checkpoint is just another piece of state the engine owns. Python from pyspark.sql.functions import current_timestamp # Streaming ingest from S3 with schema inference + evolution. # Replaces hand-maintained checkpointing, listing logic, and # whatever file-tracking table the team built two years ago. (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaLocation", "s3://acme-checkpoints/txns_schema") .option("cloudFiles.schemaEvolutionMode", "addNewColumns") .load("s3://landing/txns/") .withColumn("_ingest_ts", current_timestamp()) .writeStream .format("delta") .option("checkpointLocation", "s3://acme-checkpoints/txns_writer") .trigger(availableNow=True) # batch-style; runs to completion .toTable("prod.bronze.txns")) Two notes from production. First, schemaEvolutionMode is the option that prevents the silent-data-loss class of bugs when partner schemas change; pick the policy explicitly rather than letting it default. Second, trigger(availableNow=True) gives you batch ergonomics on a streaming source — the job runs until it has consumed everything and exits, which is what most teams actually want for daily ingestion. 7. Transforms and Quality: Declarative Pipelines Replace Bare Spark + External DQ The final piece is the transformation layer. Lakeflow pipelines (the rebrand of Delta Live Tables) let you declare each table as a Python or SQL definition, and add expectations as a first-class concept. The engine derives the DAG from the dependencies and enforces the expectations on every write — the data quality framework, the lineage layer, and the orchestration glue collapse into a single artifact. Python import dlt from pyspark.sql.functions import sum as _sum, col @dlt.table( name="silver_txns", table_properties={ "delta.enableChangeDataFeed": "true", "delta.tuneFileSizesForRewrites": "true", }, cluster_by=["account_id", "ingest_date"], ) @dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL") @dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP')") @dlt.expect("unique_txn", "txn_id IS NOT NULL") def silver_txns(): return (dlt.read_stream("bronze_txns") .dropDuplicates(["txn_id"])) @dlt.table(name="gold_daily_totals") def gold_daily_totals(): return (dlt.read("silver_txns") .groupBy("ingest_date", "account_id", "region") .agg(_sum("amount").alias("daily_total"))) The decorators do four things at once: define the table, declare its layout (cluster_by), declare its quality rules, and let the engine infer that gold_daily_totals depends on silver_txns from the dlt.read call. There is no DAG file. There is no separate Great Expectations suite. Lineage is generated for free in Unity Catalog, including column-level edges. If you want to query how the expectations have been performing — useful for SLO dashboards or alerting — the event log surfaces it directly: SQL -- Pass / fail / drop counts per expectation, last 24 hours SELECT flow_name, details:flow_progress.data_quality.expectations[0].name AS exp_name, details:flow_progress.data_quality.expectations[0].passed_records AS passed, details:flow_progress.data_quality.expectations[0].failed_records AS failed, details:flow_progress.data_quality.expectations[0].dropped_records AS dropped, timestamp FROM event_log("<pipeline-id>") WHERE event_type = 'flow_progress' AND timestamp >= current_timestamp() - INTERVAL 1 DAY ORDER BY timestamp DESC; 8. Putting It Together: Where to Start, What to Measure Adopting all of this at once is a recipe for pain. The order I've seen work, and a small set of metrics to verify the change is paying off: Step Adopt Retire Verify with 1 Predictive optimization at schema level Nightly OPTIMIZE / VACUUM jobs Reduction in maintenance-cluster cost 2 Liquid clustering on top 5 tables Static partitioning + Z-ORDER p95 query latency on the same workloads 3 Auto loader for 1-2 ingestion pipelines Custom file-tracking + listing logic End-to-end data freshness 4 Lakeflow pipelines for new pipelines only External DQ + DAG glue (for new work) Lines of pipeline code per table 5 Serverless compute for SQL warehouses + DLT Hand-sized job clusters Cost-per-query, scale-up time What you do not need to migrate: imperative pipelines that already work and aren't growing. Declarative patterns are about new work and high-pain hot spots, not a heroic rewrite of every notebook ever shipped. 9. Honest Limitations and Where Imperative Still Wins Three places where the declarative model still bites — worth knowing before you commit: Procedural logic still belongs in Jobs. If your pipeline is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table. Don't try to bend dlt around it.Predictive optimization needs observation time. On a table that's a week old, the engine hasn't seen enough patterns to make great decisions. For tables under heavy initial load, an explicit OPTIMIZE FULL after the first big ingest still helps.Cluster-by-column choice still matters. CLUSTER BY AUTO is great for stable workloads with predictable filters. For tables whose access pattern is genuinely heterogeneous across teams, an explicit cluster-by based on the dominant query is usually faster.Hint-driven escapes are still allowed. If a particular query benefits from a /*+ BROADCAST(t) */ hint and AQE isn't catching it, the hint is fine. Just keep them rare and document why. Conclusion The declarative optimization story isn't a single feature you toggle — it's a quiet shift in who owns the boring parts of a Spark pipeline. Layout, maintenance, ingestion bookkeeping, plan tuning, cluster sizing, data quality enforcement: every one of those was traditionally a thing the team owned and paid for in toil. The current Databricks stack lets you express each as an intent and let the engine handle the operations underneath. Adopt them in order, retire what they replace, and the optimization treadmill slows from a daily concern to a quarterly review. That's the actual win, and it's the reason the declarative paradigm has gone from a Lakeflow detail to the default mental model for new pipelines on Databricks.

By Seshendranath Balla Venkata
Offline-First Patch Management for 10,000 Edge Nodes: A Practical Architecture That Scales
Offline-First Patch Management for 10,000 Edge Nodes: A Practical Architecture That Scales

The Patch That Took Down Black Friday It wasn't malware. It wasn't a zero-day exploit. It was a routine patch cycle. The team had scheduled OS updates across 1,200 retail locations for the Tuesday before the busiest shopping week of the year. Everything looked fine in the test environment. The change advisory board approved it. The maintenance window was set. Then 1,200 stores simultaneously reached out to the central repository and started downloading a 500 MB update bundle. The WAN links — already stressed from pre-holiday inventory syncs—buckled under the load. Patches timed out. Retry logic kicked in, creating a second wave. Point-of-sale systems stalled. Stores opened with degraded systems. The incident lasted six hours and involved every tier of IT support. If you've managed patch operations at scale, this story probably sounds familiar. Maybe not Black Friday, but you've seen the variant: the critical security patch that failed silently on 30% of nodes, the update that caused a two-hour outage at a branch office, and the maintenance window that expanded from two hours to six because of cascading retry storms. The root cause is almost never the patch itself. It's the distribution model. This article walks through a production architecture we built to solve exactly this problem. This offline-first patch management system has been running across a fleet of thousands of edge nodes for several years. We will explain the design principles, the implementation mechanics, the code that powers the system, and the lessons we've learned along the way. Why Patch Management Breaks at Scale Traditional enterprise patching tools were designed for a world that edge infrastructure doesn't live in. They assume: Stable, high-bandwidth connectivity to central repositoriesNodes that are always online when the patch job runsIT staff available on-site to handle failuresCentralized infrastructure with predictable network topology Edge environments operate under the opposite conditions. Retail stores, manufacturing floors, remote branch offices, and distributed kiosks share a common reality: the Wide Area Network (WAN) link is constrained, unreliable, and expensive. There's no on-site IT. And the systems can't afford to be down.The math at scale worsens this. If 1,000 nodes simultaneously download a 500 MB update, that's 500 GB of instantaneous WAN (Wide Area Network) traffic. When you incorporate retry storms, which are a default feature of most package managers, your network will experience multiple waves of this load simultaneously. The result is timeouts, partial installs, dependency conflicts, and configuration drift. The Numbers Before We Redesigned Patch completion rate: ~68% across the fleet on any given cycleAverage time to full fleet coverage: 4–7 daysIncidents triggered by patch cycles: multiple per quarterManual IT interventions per patch event: dozensWAN utilization during patch windows: unpredictable spikes The turning point came when we stopped asking, 'how do we make the patch tool more reliable?' and started asking, 'how do we make the network irrelevant to the install step?' Four Principles That Guided the Redesign Before writing a single line of code, we established constraints that any solution had to satisfy. These aren't theoretical — each one was derived from a failure mode we'd actually experienced. Decouple Distribution from Execution Separation of concerns. The network delivery layer and the installation layer should never depend on each other's availability. If the WAN link drops mid-transfer, the install still completes from the local bundle. Move Complexity to the Center Edge nodes are not servers. They shouldn't be resolving dependency conflicts or reaching out to multiple upstream mirrors. All of that logic lives in the central build pipeline. Prefer Local Operations over Network Calls Every package install that hits the local repo instead of the internet is a failure point removed. At 10,000 nodes, every failure point multiplied by 10,000 becomes a crisis. Design for Failure by Default The assumption isn't 'what if connectivity drops?' — it's 'connectivity will drop.' Idempotent scripts, retry logic, and pre-flight checks are built in from day one, not bolted on later. The Architecture: Pre-Staged Tarball + Local Repository The core idea is straightforward, even if the implementation has nuance. Instead of having each edge node reach out to upstream repositories at patch time, you build a complete, validated patch bundle in a controlled environment and push it out as a single artifact. The node unpacks it, constructs a local repository, and installs from that — never touching the WAN during the install phase. How a Patch Cycle Works Each patch cycle follows a deterministic four-step workflow: Central aggregation: The build pipeline collects OS updates, security fixes, and dependency packages for every OS variant in the fleet. This runs on a build server with internet access, not on production infrastructure.Bundle construction: All packages are assembled into a versioned, compressed tarball. The bundle is GPG-signed, checksummed, and tagged with the target OS variant and patch cycle ID.Rate-limited distribution: The bundle is pushed to each edge location using bandwidth-throttled file transfer (rsync with --bwlimit, or a custom agent with transfer scheduling). Transfer happens days before the install window — during off-peak hours, in the background.Local execution: On patch day, an on-device agent verifies the bundle signature, constructs a local package repository, and runs the install — no WAN connectivity required. If the transfer hasn't completed, the install defers gracefully. Building the Patch Bundle (RHEL/CentOS) Here's the core of the build pipeline for RPM-based systems. This script runs on a build server and produces the artifact that gets distributed to edge nodes: Code GITHUB repo: https://github.com/srinivas-thotakura-eng/offline_patchmanagement/blob/main/build-patch-bundle.sh Distributing the Bundle (Rate-Limited Rsync) Distribution happens well before the maintenance window — typically 48–72 hours in advance. We use rsync with bandwidth limitations to avoid impacting business traffic. Installing on the Edge Node The on-device install script runs during the maintenance window. It verifies the bundle before touching the system — if verification fails, it exits cleanly and logs the failure without leaving the node in a broken state. What Happened When We Deployed This in Production The architecture went live across a fleet of several thousand edge nodes over a phased rollout. We ran it in parallel with the legacy tool for two full patch cycles before cutting over completely. Here's what changed: Metric Traditional Model Offline-First Architecture Peak WAN Usage Unpredictable spikes (500+ GB simultaneous) Controlled, rate-limited (~92% reduction) Patch Success Rate ~68% — failures from timeouts & drops >99% — local execution, no WAN dependency Failure Recovery Manual IT intervention required ~94% automated self-healing Maintenance Windows Variable, often extended Predictable, business-hours safe Configuration Drift Frequent across fleet Eliminated — deterministic inputs On-Site IT Required Yes — for troubleshooting Zero-touch — fully autonomous The improvement in patch success rate—from roughly 68% to consistently above 99%—was the most operationally impactful change. But the secondary effect surprised us more: the reduction in on-call incidents. Patch cycles had previously generated multiple escalations per event. After the redesign, they became routine background operations that nobody noticed. The Result We Didn't Expect Eliminating WAN dependency at install time didn't just improve reliability — it changed the operational culture. Patch cycles stopped being 'events' that engineers had to monitor. They became background jobs that ran, completed, and reported back. The on-call team stopped dreading patch Tuesdays. What Happens When Things Go Wrong No distributed system is failure-free. The goal isn't to eliminate failures — it's to make failures safe, visible, and self-healing wherever possible. Transfer Failures If a bundle doesn't arrive at an edge node before the maintenance window, the install script detects the missing bundle and defers. It logs the event, reports to the central management API, and retries on the next scheduled transfer window. The node doesn't attempt a partial install. Verification Failures If the checksum or GPG signature doesn't match, the script exits immediately with a distinct error code (2 or 3). This is treated as a critical alert — it indicates either a corrupted transfer or a potential tampering event. The node is quarantined from the next patch cycle until the source bundle is re-verified. Install Failures If yum exits with an error, the script logs the failure, reports it centrally, and leaves the system in its pre-patch state. Because we run with --disablerepo='*' --enablerepo='local-patch', dependency resolution is entirely local—there are no external calls that can partially succeed and leave the system inconsistent. Rollback For critical package updates, we pre-capture a snapshot before the install using LVM thin snapshots (on nodes that support it) or filesystem-level snapshots via Timeshift on Ubuntu-based nodes. The install script records the snapshot ID, and rollback can be triggered remotely via the management API if health checks fail post-install. Integrating With GitOps and Kubernetes Workflows If your edge fleet uses Kubernetes — or if you're moving in that direction — the offline-first model fits naturally into a GitOps workflow. Patch bundles can be version-controlled and deployed declaratively, treating infrastructure state as code rather than as an operational procedure. Defining Patch Targets in Git YAML # patch-policy.yaml # Stored in Git — defines what gets patched and when apiVersion: patchmgmt.io/v1 kind: PatchPolicy metadata: name: edge-fleet-q4-2024 namespace: operations spec: bundleRef: version: "20241105-build-42" checksum: "sha256:abc123..." targets: selector: matchLabels: role: edge-node region: us-east schedule: maintenanceWindow: "Tue 02:00-04:00" timezone: "America/New_York" rolloutStrategy: type: RollingUpdate batchSize: 100 batchDelayMinutes: 15 rollback: enabled: true healthCheckUrl: "http://localhost:8080/health" healthCheckTimeoutSeconds: 120 With a CRD like this in place, patch deployments become pull requests. The audit trail lives in Git. Rollbacks are reverted commits. Compliance teams can review the exact bundle version that was applied to every node on any given date. Lessons Learned (the Hard Way) Distribution is the real engineering problem. Installing packages is a solved problem. Getting a 500 MB bundle to 10,000 locations reliably, on a schedule, without impacting business traffic—that's where most of the design effort needs to go.Idempotency isn't optional. Every script in the pipeline must be safe to run twice. Networks are unreliable. Management systems retry. If re-running your install script would cause a problem, you have a design flaw.Sign everything. We added GPG signing after our first attempt at a simpler checksum-only approach. The signing overhead is negligible. The confidence it provides when an edge node validates a bundle at 3 am with no human present is not.Report failures aggressively. Silent failures at scale are invisible failures. Every script exit condition — success, deferred, verification failure, and install failure — writes to the central management API, which is the application programming interface that allows different software components to communicate with each other. The dashboard shows you exactly what state each of 10,000 nodes is in, in real time.Test the offline path explicitly. In development, your test environment has excellent connectivity. Your staging environment has excellent connectivity. Block the network interface on your test node before you test your 'offline' installation path. You'll find bugs that wouldn't surface otherwise. Bundle size matters more than you think. We over-engineered our first bundles — including every available update regardless of whether it was needed. Trimming bundles to the actual delta reduced transfer time by ~60% and dramatically improved transfer completion rates on marginal WAN links. Wrapping Up Patch management at the edge scale is a distribution problem disguised as a software problem. The tools and techniques that work fine for a hundred servers in a data center break in predictable ways when you multiply them across thousands of branch offices, retail stores, or industrial sites with constrained, unreliable WAN links. The offline-first approach — build centrally, distribute early, execute locally — isn't a new idea. It's how software was deployed before the ubiquitous internet. What's changed is that we now have the tooling to make it systematic, auditable, and automated at scale. The architecture described here runs in production across a large fleet of edge nodes. The improvement in patch completion rate (68% → >99%) and the near-elimination of patch-related incidents have made it one of the highest-ROI infrastructure changes the team has shipped. If you're dealing with similar challenges — bandwidth storms, silent failures, unpredictable maintenance windows — the code here is a starting point. The specific implementation will vary by operating system (OS), by fleet size, and by your existing tooling, which refers to the software and tools you currently use. But the principles hold: decouple, centralize, go local, and design for failure. The network will let you down. Build systems that don't care when it does.

By srinivas thotakura
Implementing Observability in Distributed Systems Using OpenTelemetry
Implementing Observability in Distributed Systems Using OpenTelemetry

Modern distributed systems demand observability, the ability to understand internal states from external outputs. Observability is achieved by collecting traces, logs, and metrics to improve performance, reliability, and availability. No single signal is sufficient; it's the combination and correlation of these data that form a narrative for root cause analysis. In monolithic applications, debugging was easier since one service handled a request. In contrast, microservices distribute a request across many services, making it hard to follow a transaction’s path. OTel’s distributed tracing shines here; it propagates context with each request, so you can trace a transaction across service boundaries. This means when Service A calls Service B, they share a common trace ID, allowing you to view a single trace spanning multiple services. Similarly, OpenTelemetry can attach unique identifiers to logs, making it easier to correlate log events across services. Overall, OTel provides a unified API for instrumenting code and an ecosystem of instrumentation libraries for frameworks that can automatically capture common operations. It focuses on data generation and collection, while the actual storage and querying of telemetry is handled by backend tools. Setting Up OpenTelemetry in a Python Microservice Installation To get started, install the OpenTelemetry libraries for Python. At minimum, you'll need the API and SDK, plus exporters/instrumentation for your use case. For example: PowerShell pip install opentelemetry-api opentelemetry-sdk \ opentelemetry-exporter-prometheus \ opentelemetry-instrumentation-flask This installs the core OTel API/SDK and the Prometheus metrics exporter and Flask instrumentation. You might also install the OTel OTLP exporter, which is a generic exporter that can send data to an OpenTelemetry Collector or other backend via the OTLP protocol. Additionally, it's recommended to set a service name for your application so that telemetry from this service is identifiable. This can be done via code or an environment variable. In code, you'll see below how we attach a service name as a resource attribute so that traces and metrics are tagged with service.name. Distributed Tracing With OpenTelemetry Tracing involves capturing spans that represent units of work in the system. In a microservice, a span could represent an incoming HTTP request, a database query, or an external API call. Spans form a trace when linked together via context propagation. Using OpenTelemetry, we can instrument our Python service to create spans for critical operations and automatically propagate the trace context to downstream services. First, let's initialize OpenTelemetry tracing in our Python microservice. We create a tracer provider, configure an exporter, and obtain a tracer instance: Python import time from opentelemetry import trace from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import ConsoleSpanExporter, BatchSpanProcessor from opentelemetry.sdk.resources import Resource, SERVICE_NAME # Set up tracer provider with service name for identification trace.set_tracer_provider(TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"}))) tracer = trace.get_tracer(__name__) # Configure a span processor with a Console exporter (prints trace data to stdout) trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter())) # Example: instrument a code block with a span with tracer.start_as_current_span("process_order"): # Simulate processing (e.g., calling another service or performing work) time.sleep(0.1) # If this code calls another service, OpenTelemetry context propagates via HTTP headers automatically In this snippet, we configured a TracerProvider with a resource attribute service.name="order-service" so that all spans from this service are labeled. We added a BatchSpanProcessor with a ConsoleSpanExporter this will batch and print our spans to the console in JSON for demonstration. In a real system, you might use a Jaeger exporter here to send spans to a Jaeger agent. The tracer = trace.get_tracer(__name__) gives us a tracer we can use to start spans. We then start a span named "process_order" using a context manager (start_as_current_span), which automatically ends the span when the block exits. Inside that span, you would put the operation you want to measure. Metrics Collection and Export (Prometheus Integration) While tracing shows the path of individual requests, metrics provide aggregated insights into system behavior. OpenTelemetry’s metrics API allows you to define instruments like counters and histograms to record these values. First, ensure the Prometheus client/exporter is set up. We’ll use OTel’s Prometheus exporter, which works by exposing a /metrics HTTP endpoint that Prometheus will scrape. In code, this is done by creating a PrometheusMetricReader and starting an HTTP server for metrics. Here’s how you can integrate metrics in a Flask microservice: Python from flask import Flask, request, g from prometheus_client import start_http_server from opentelemetry import metrics from opentelemetry.exporter.prometheus import PrometheusMetricReader from opentelemetry.sdk.metrics import MeterProvider from opentelemetry.sdk.resources import Resource, SERVICE_NAME from opentelemetry.instrumentation.flask import FlaskInstrumentor import time # Initialize metrics provider with Prometheus exporter (reader) resource = Resource(attributes={SERVICE_NAME: "order-service"}) reader = PrometheusMetricReader() # exposes metrics in Prometheus format provider = MeterProvider(resource=resource, metric_readers=[reader]) metrics.set_meter_provider(provider) meter = metrics.get_meter(__name__) # Define metric instruments request_counter = meter.create_counter( name="app_requests_total", description="Total number of requests processed", unit="1" ) request_latency = meter.create_histogram( name="app_request_latency_ms", description="Request latency in milliseconds", unit="ms" ) # Start Prometheus client on an endpoint (e.g., port 8000) for scraping start_http_server(port=8000, addr="0.0.0.0") # Flask app and instrumentation app = Flask(__name__) FlaskInstrumentor().instrument_app(app) # auto-instrument Flask for tracing @app.before_request def before_request(): g.start_time = time.time() # Increment counter for each incoming request, with the route path as a label request_counter.add(1, {"endpoint": request.path}) @app.after_request def after_request(response): # Record the request duration in milliseconds duration_ms = (time.time() - g.start_time) * 1000 request_latency.record(duration_ms, {"endpoint": request.path}) return response # Example route @app.route("/hello") def hello(): return "Hello, World!", 200 In the setup above, we configured a MeterProvider with a PrometheusMetricReader. This essentially registers an HTTP endpoint that exposes our metrics in Prometheus format. We explicitly call start_http_server(port=8000) to start the metrics server on port 8000, Prometheus will scrape this. We created two metric instruments: a counter to count the number of requests, and a histogram to track the distribution of request durations. In the Flask hooks, we use these instruments: at the beginning of each request, we note the start time and increment the counter. After the request is handled, we compute the elapsed time and record it in the histogram again, labeled by the endpoint path. These labels let us break down metrics per route. Log Correlation With OpenTelemetry Logs are the third pillar of observability. They provide detailed event information and error messages. OpenTelemetry can augment logging by injecting trace context into logs, so that you know which trace/span a log entry is associated with. In Python, the package opentelemetry-instrumentation-logging can automatically enrich Python logging records with trace context. After installing it, you can enable it with: Python from opentelemetry.instrumentation.logging import LoggingInstrumentor LoggingInstrumentor().instrument(set_logging_format=True) This will ensure that whenever you call the standard logging functions, if a trace is currently active, the log record will contain the trace and span IDs. For instance, you might see logs like: Plain Text INFO [trace_id=0xf4a3b...] Order 123 processed successfully indicating that the log was emitted during a specific trace. To fully centralize logs, you would forward them to a log backend. One approach is using the OpenTelemetry Collector to collect and export logs. Conclusion Implementing observability in a microservice architecture is no small feat, but OpenTelemetry greatly simplifies the process by providing a one-stop solution for instrumentation. We have shown how to set up distributed tracing to follow requests across services, how to collect metrics and export them to Prometheus for monitoring, and how to correlate logs with trace context. With these in place, you gain deep visibility into your system. You can monitor performance and identify latency bottlenecks, get alerted on anomalies via metrics, trace requests end-to-end to see where failures occur, and dive into logs for detailed errors. This comprehensive observability is crucial for engineers to effectively maintain and optimize distributed systems. In summary, OpenTelemetry enables a consistent, portable way to implement observability across distributed systems. Embracing it in your microservices will lead to faster debugging, better performance insights, and more resilient applications. With traces, metrics, and logs at your fingertips, you are no longer flying blind in your distributed architecture; instead, you have the data to understand and improve your system continually.

By Mugunth Chandran
Event-Driven Pipelines With Apache Pulsar and Go
Event-Driven Pipelines With Apache Pulsar and Go

A Practical Walkthrough Most distributed systems eventually hit a wall with their messaging layer, whether it's Kafka's tight coupling between compute and storage, RabbitMQ's limited replay capabilities, or the operational overhead of managing multiple tools for queuing and streaming. Apache Pulsar was engineered to address these gaps from the ground up. In this article, we'll dissect a working Go-based demo that wires together a Pulsar producer, consumer, and Prometheus monitoring layer into a cohesive, observable messaging pipeline. The full source is on GitHub. Why Pulsar Deserves a Closer Look Pulsar's architecture makes a deliberate trade-off that most messaging systems avoid: it physically separates the broker tier (which handles routing, subscriptions, and protocol) from the storage tier (Apache BookKeeper, which handles persistence). This isn't just an implementation detail. It means you can independently autoscale message routing capacity without touching your storage cluster, and vice versa. Beyond the architecture, a few capabilities stand out for engineering teams: Multi-tenancy at the protocol level – tenants, namespaces, and topics form a three-level hierarchy, making it practical to run a single Pulsar cluster for multiple teams or services without namespace collisions.Four distinct subscription semantics – Exclusive, Shared, Failover, and Key_Shared give you precise control over how messages are distributed across consumer instances, something Kafka's consumer group model doesn't natively offer.Cursor-based message retention – Pulsar retains messages based on subscription cursors, not time-based log compaction. A consumer that falls behind doesn't lose messages; it simply catches up from its last acknowledged position.Native schema enforcement – The built-in schema registry validates message payloads at the broker level before they reach consumers, catching contract violations at the boundary rather than deep inside application logic. What the Demo Builds The project is structured as three independent Go binaries, each with a single responsibility: reStructuredText ├── producer/ │ ├── main.go # HTTP server → Pulsar publisher │ ├── go.mod │ └── go.sum ├── consumer/ │ └── main.go # Pulsar subscriber → message processor ├── monitor/ │ └── main.go # Pulsar producer + Prometheus metrics server ├── prometheus.yml # Scrape configuration └── README.md All three use `github.com/apache/pulsar-client-go/pulsar` — the official, Apache-maintained Go client. The client is not a thin wrapper; it implements the full Pulsar binary protocol, connection pooling, producer batching, and automatic Prometheus metric registration. Component 1: The Producer The producer exposes a single HTTP endpoint. An incoming HTTP request triggers a Pulsar publish operation, decoupling the caller from any direct knowledge of the messaging infrastructure. Go package main import ( "context" "fmt" "log" "net/http" "github.com/apache/pulsar-client-go/pulsar" ) var client pulsar.Client func main() { var err error client, err = pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6650", }) if err != nil { log.Fatal(err) } defer client.Close() http.HandleFunc("/publish", func(w http.ResponseWriter, r *http.Request) { msg := r.URL.Query().Get("msg") err := publishMessage(msg) if err != nil { w.Write([]byte("msg failed to published")) } else { w.Write([]byte("msg successfully published")) } }) if err := http.ListenAndServe(":8080", nil); err != nil { log.Fatal(err) } } func publishMessage(msg string) error { producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "my-topic", }) if err != nil { log.Fatal(err) } _, err = producer.Send(context.Background(), &pulsar.ProducerMessage{ Payload: []byte("Hello"), }) return err } A few architectural observations worth unpacking: The HTTP-to-Pulsar bridge pattern is deliberately pragmatic. Rather than requiring every upstream service to embed a Pulsar client, you expose a thin HTTP adapter. This is particularly valuable when integrating with systems that speak HTTP natively — CI/CD pipelines, third-party webhooks, or legacy services that can't easily adopt a new client library. `pulsar.NewClient` establishes a connection pool, not a single TCP connection. The client maintains persistent connections to the broker and handles reconnection, load balancing across broker nodes, and TLS negotiation transparently. Calling `client.Close()` via `defer` ensures all in-flight messages are flushed before the process exits. `producer.Send` with `context.Background()` submits the message to the producer's internal send queue. The Pulsar client batches outgoing messages by default (configurable via `BatchingMaxMessages` and `BatchingMaxPublishDelay`), which significantly improves throughput under load without any changes to application code. For production use, the producer instance should be created once at startup and reused across requests. Creating a new producer per request incurs connection overhead and bypasses the batching optimization entirely. Component 2: The Consumer The consumer subscribes to a topic and processes messages with explicit acknowledgment. The subscription type chosen here — `pulsar.Shared` — has meaningful implications for how the system scales. Go consumer, err := client.Subscribe(pulsar.ConsumerOptions{ Topic: "topic-1", SubscriptionName: "my-sub", Type: pulsar.Shared, }) if err != nil { log.Fatal(err) } defer consumer.Close() for i := 0; i < 10; i++ { msg, err := consumer.Receive(context.Background()) if err != nil { log.Fatal(err) } fmt.Printf("Received message with Id: %#v -- content: '%s'\n", msg.ID(), string(msg.Payload())) consumer.Ack(msg) } if err := consumer.Unsubscribe(); err != nil { log.Fatal(err) } Subscription Semantics in Depth Pulsar's subscription model is one of its most differentiating features. Here's how the four types behave at the broker level: Exclusive – The broker enforces that only one consumer holds the subscription at any time. A second consumer attempting to subscribe with the same name will receive an error. This guarantees strict message ordering but eliminates horizontal scaling.Shared – The broker distributes messages across all active consumers in round-robin order. Any number of consumers can join or leave the subscription dynamically. This is the right choice for stateless workloads where processing order doesn't matter and throughput is the priority.Failover – The broker designates one consumer as the active receiver. Others remain connected but idle, ready to take over if the active consumer disconnects. This preserves ordering while providing high availability — a pattern common in financial transaction processing.Key_Shared – The broker routes messages with the same key consistently to the same consumer instance. This enables stateful processing (e.g., per-user session aggregation) without external coordination, as long as the consumer count remains stable. Why Explicit Acknowledgment Matters `consumer.Ack(msg)` signals to the broker that the message has been durably processed and can be removed from the subscription's cursor. If the consumer process crashes between `Receive` and `Ack`, the broker will redeliver the message to another consumer in the subscription. This is the mechanism behind at-least-once delivery. For workloads that require exactly-once semantics, Pulsar supports transactional acknowledgment, where the `Ack` and any downstream writes are committed atomically. That's a more advanced topic, but the foundation is the same `Ack` call shown here. Component 3: The Monitor The monitor is architecturally the most interesting component. It runs two HTTP servers concurrently — one for the application endpoint, one for Prometheus metrics — and uses a Pulsar producer to generate observable traffic. Go package main import ( "context" "fmt" "log" "net/http" "strconv" "github.com/apache/pulsar-client-go/pulsar" ) func main() { client, err := pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6605", }) if err != nil { log.Fatal(err) } defer client.Close() prometheusPort := 2112 go func() { if err := http.ListenAndServe(":"+strconv.Itoa(prometheusPort), nil); err != nil { log.Fatal(err) } }() producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "topic-1", }) if err != nil { log.Fatal(err) } defer producer.Close() ctx := context.Background() webPort := 8082 http.HandleFunc("/produce", func(w http.ResponseWriter, r *http.Request) { msgId, err := producer.Send(ctx, &pulsar.ProducerMessage{ Payload: []byte(fmt.Sprintf("hello-world")), }) if err != nil { log.Fatal(err) } else { log.Printf("Published message: %v", msgId) fmt.Fprintf(w, "Message Published: %v", msgId) } }) if err := http.ListenAndServe(":"+strconv.Itoa(webPort), nil); err != nil { log.Fatal(err) } } How the Pulsar Client Registers Prometheus Metrics When `pulsar.NewClient` is called, the Go client automatically registers a set of Prometheus collectors with the default `prometheus.DefaultRegisterer`. No additional instrumentation code is required. The metrics are served at `/metrics` on whatever port you bind `http.DefaultServeMux` to port — in this case, port `2112`. The metrics exposed include: `pulsar_client_producers_opened` / `pulsar_client_producers_closed` – producer lifecycle counters`pulsar_client_consumers_opened` / `pulsar_client_consumers_closed` – consumer lifecycle counters`pulsar_client_messages_published_total` – cumulative publish count per topic`pulsar_client_publish_latency_seconds` – histogram of end-to-end publish latency`pulsar_client_bytes_published_total` – total bytes written to the broker Running the Prometheus metrics server in a goroutine while the main goroutine handles the application HTTP server is idiomatic Go concurrency. Both servers share the same `http.DefaultServeMux`, which is why the Prometheus `/metrics` handler (registered automatically by the client library) is accessible on the metrics port without any explicit route registration. Prometheus Scrape Configuration YAML scrape_configs: - job_name: pulsar-client-go-metrics scrape_interval: 10s static_configs: - targets: - localhost:2112 The `scrape_interval: 10s` is a reasonable starting point for development. In production, you would typically align this with your alerting resolution requirements — a 30-second interval is common for dashboards, while 10 seconds or less is appropriate for latency-sensitive alerting rules. With these metrics flowing into Prometheus, you can build Grafana panels that surface producer throughput, consumer lag, and publish latency percentiles with three signals that matter most when diagnosing messaging pipeline issues. Running the Full Pipeline Prerequisites Apache Pulsar standalone Go 1.18+Prometheus (optional) Startup Sequence 1. Launch Pulsar in standalone mode: Shell bin/pulsar standalone 2. Start the producer service: Shell cd producer && go run main.go 3. Start the consumer service: Shell cd consumer && go run main.go 4. Start the monitor service: Shell cd monitor && go run main.go 5. Start Prometheus: Shell prometheus --config.file=prometheus.yml 6. Publish a message via the producer endpoint: Shell curl "http://localhost:8080/publish?msg=test_pulsar_message_publish_event" # msg successfully published 7. Trigger the monitor producer: Shell curl http://localhost:8082/produce # Message Published: (messageId) 8. Inspect raw Prometheus metrics: Shell curl http://localhost:2112/metrics | grep pulsar Engineering Takeaways Decouple publish triggers from client library dependencies. The HTTP-to-Pulsar adapter pattern used in the producer is not just a demo convenience. It is a legitimate architectural boundary. Services that need to emit events don't need to know anything about Pulsar's protocol, topic naming, or client configuration. They make an HTTP call; the adapter handles the rest.Match subscription type to processing semantics, not just throughput. A common mistake is defaulting to `Shared` for everything because it scales horizontally. If your processing logic is stateful, for example, aggregating events per user session. `Key_Shared` gives you the same horizontal scalability while preserving per-key ordering without any application-level coordination.Treat the acknowledgment boundary as your consistency boundary. Everything between `Receive` and `Ack` is your processing window. Any side effects (database writes, downstream API calls, cache updates) that happen in this window must be idempotent, because Pulsar will redeliver the message if the consumer fails before acknowledging. Design your processing logic around this constraint from the start, not as an afterthought.Zero-cost observability is a genuine advantage. The fact that `pulsar-client-go` registers Prometheus metrics automatically means you get throughput, latency, and connection health data from the moment your application starts, without writing a single line of instrumentation code. This is a meaningful operational advantage over client libraries that require manual metric registration. Extending the Demo The current implementation is intentionally minimal. Here are technically meaningful extensions worth exploring: Schema enforcement – Replace raw `[]byte` payloads with Pulsar's schema-aware producer/consumer API. Using `pulsar.NewAvroSchema` or `pulsar.NewJSONSchema` moves payload validation to the broker, preventing malformed messages from ever reaching consumers.Dead-letter topic routing – Configure `DeadLetterPolicy` on the consumer to automatically route messages that exceed a maximum redelivery count to a separate topic. This prevents poison-pill messages from blocking the subscription indefinitely.Producer batching tuning – Set `BatchingMaxMessages`, `BatchingMaxSize`, and `BatchingMaxPublishDelay` on `ProducerOptions` to optimize the throughput/latency trade-off for your specific workload profile.Graceful shutdown – Add `os/signal` handling to flush in-flight messages and close the producer cleanly before the process exits. The current `defer client.Close()` handles the happy path but won't fire on `SIGKILL`.Kubernetes-native deployment – Package each component as a container and deploy using the official Pulsar Helm chart. The producer and monitor can be exposed as Kubernetes Services; the consumer can run as a Deployment with HPA scaling based on the `pulsar_client_messages_published_total` metric exported to a custom metrics adapter. Conclusion The `apache-pulsar` project demonstrates that building a production-grade messaging pipeline with Apache Pulsar and Go doesn't require much code. It requires understanding the right abstractions. The producer-consumer-monitor triad covers the three concerns that matter in any event-driven system: getting data in, getting data out, and knowing what's happening in between. Pulsar's architecture, decoupled storage, flexible subscription semantics, and built-in observability make it a strong candidate for teams that have outgrown simpler messaging systems and need more precise control over delivery guarantees, scaling behavior, and operational visibility. Source Code https://github.com/shivik/apache-pulsar-demo

By Shivi Kashyap
Zero-Downtime Deployments for Java Apps on Kubernetes
Zero-Downtime Deployments for Java Apps on Kubernetes

This article provides a comprehensive guide to achieving zero-downtime deployments for Java-based applications on Kubernetes. We cover deployment strategies, Kubernetes primitives, Java-specific considerations, session state handling, database migrations, traffic shifting techniques, CI/CD pipelines, GitHub Actions, Jenkins with automated rollbacks, observability (Prometheus, Grafana, Jaeger), Helm/ArgoCD examples, testing strategies (canary analysis, chaos, smoke tests), and troubleshooting. Deployment Strategies Kubernetes offers several strategies for deploying new versions without downtime: Rolling Update Incrementally replace old pods with new ones, maintaining availability. Kubernetes Deployment object uses rolling updates by default. You can control maxUnavailable and maxSurge to tune the rollout. Blue-Green Deployment Run two separate environments: Blue = current, green = new. Only one serves live traffic at a time. Once the Green version is verified, switch the Service or Ingress to point at Green, then scale down Blue. This allows instant rollback by redirecting traffic back to Blue. Argo Rollouts defines a blue/green strategy with an active and preview Service. Traffic flows only to the active version until promotion. Canary Deployment Gradually shift a small percentage of traffic to the new version. Start with a few pods of v2, monitor, then incrementally increase. Tools like Istio or Argo Rollouts can control traffic weights. For instance, sending 10% of traffic to v2 can be done by running 9 v1 pods and 1 v2 pod (10%). Argo defines a canary rollout with setWeight steps and pauses for analysis. Shadow/Mirroring The new version receives a copy of live requests for testing under real load, but its responses are not returned to users. This is low risk but does not assist in rollback decisions since users don’t see the new behavior. Kubernetes Primitives for Zero Downtime Deployment A Deployment naturally performs rolling updates. By default, it creates a new ReplicaSet and scales it up while scaling down the old one controlled by maxUnavailable/maxSurge. This ensures some pods always serve traffic. To use blue/green, you would deploy two separate Deployments (e.g., app-blue, app-green) and switch Services. Service and Ingress A Service fronts pods. For blue/green, you can point a single Service at either the blue or green pods. Ingress can also switch between backend services. E.g., label selectors can be adjusted to redirect traffic from version blue to version green pods. PodDisruptionBudget Ensures a minimum number of pods stay running during voluntary disruptions. For instance, setting minAvailable 1 ensures at least one pod remains during a rolling update. To avoid complete downtime during maintenance. Horizontal Pod Autoscaler (HPA) Scales pods based on CPU/memory or custom metrics. It automatically updates a workload to match demand. An HPA can be attached to the Deployment so that if traffic spikes during a rollout, new pods will be created to handle the load. Example: YAML apiVersion autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: myapp-hpa spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: myapp minReplicas: 2 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 50 Liveness and Readiness Probes Critical for zero downtime. A liveness probe checks if the app is alive; if it fails, K8 restarts the pod. A readiness probe tells if the app is ready to serve traffic. During startup or shutdown, the readiness probe should fail, causing the pod to be removed from the service load balancer. Spring Boot Actuator provides /actuator/health for this. In K8S YAML: YAML livenessProbe: httpGet: path: /actuator/health/liveness port: 8080 initialDelaySeconds: 15 periodSeconds: 10 readinessProbe: httpGet: path: /actuator/health/readiness port: 8080 initialDelaySeconds: 5 periodSeconds: 5 Spring Boot exposes health/liveness and health/readiness groups by default. Quarkus and Micronaut have similar health endpoints. Spring Boot supports graceful shutdown by setting server.shutdown is equals to graceful and tuning spring.lifecycle.timeout-per-shutdown-phase. This causes the embedded server, either Tomcat/Jetty/Undertow, to stop accepting traffic and wait up to the timeout for active requests. Java @Component public class ShutdownListener implements SmartLifecycle { private boolean running = true; @Override public void stop() { running = false; } @Override public boolean isRunning() { return running; } } Quarkus provides graceful shutdown configuration. By setting quarkus.shutdown.timeout=10s, Quarkus will wait up to 10 seconds for current requests to finish before exiting. You can annotate a bean method with @Shutdown to run cleanup code. Micronaut has @EventListener for ShutdownEvent: Java @Singleton public class ShutdownBean { @EventListener void onShutdown(ShutdownEvent event) { } } Kubernetes Hooks You can use a preStop hook in the Deployment spec to run a script before SIGTERM. YAML lifecycle: preStop: exec: command: ["/bin/sh","-c","sleep 5"] terminationGracePeriodSeconds: 30 The grace period (default 30s) should be tuned to let the app finish. K8S doc 77†L99-L107 describes the sequence container enters Terminating, runs preStop, sends SIGTERM, waits terminationGracePeriodSeconds, then SIGKILL. JVM Tuning Set -XX +ExitOnOutOfMemoryError to avoid hanging. Tune thread pools so they drain quickly. Monitor GC pause times, consider using low-latency GC to minimize pause before shutdown. Session and State Handling To maintain zero downtime when pods switch: Stateless services: Best practice is to keep services stateless. Store session state or user data in an external store, such as Redis or a database. This way, any pod can handle any request, and pods can be replaced without losing the user session.Sticky sessions: If an app uses in-memory sessions, you can enforce sticky sessionsService affinity: Set sessionAffinity: ClientIP on the Service. Kubernetes routes requests from the same client IP to the same pod.Ingress affinity: Use Ingress annotations to bind a user’s requests to one pod. However, sticky sessions introduce risk and are not suitable for autoscaling.StatefulSets: For true stateful workloads, use StatefulSet with stable identities. StatefulSets pair pods with PersistentVolumes, which are not zero-downtime by themselves. GitHub Actions CI/CD Pipeline zero-downtime: YAML name: Deploy on: push: branches: [ main ] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-java@v3 with: { java-version: '17' } - name: Build run: mvn clean package -DskipTests name: Docker Build & Push run: | docker build -t ghcr.io/myorg/myapp:${{ github.sha } echo ${{ secrets.GITHUB_TOKEN } | docker login ghcr.io -u ${{ github.actor } --password-stdin docker push ghcr.io/myorg/myapp:${{ github.sha } - name: Set image tag run: echo "::set-output name=image::ghcr.io/myorg/myapp:${{ github.sha } deploy: needs: build runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 with: { path: manifests } - name: Update K8s deployment uses: azure/setup-kubectl@v3 - name: Deploy to Kubernetes run: | kubectl set image deployment/myapp-deployment myapp=ghcr.io/myorg/myapp:${{ needs.build.outputs.image } kubectl rollout status deployment myapp-deployment This workflow builds the image, pushes it, and updates the deployment. The rollout status command waits for all new pods to become ready. If health checks fail, it will abort without downtime. Conclusion Zero-downtime deployment on Kubernetes combines careful architecture and automation, using rolling updates, progressive strategies, ensuring graceful shutdown and health checks in your Java apps, externalizing state, managing database changes, and orchestrating with CI/CD pipelines. Kubernetes primitives like Deployments, Services, Probes, and HPA, along with tools like Istio or Argo Rollouts, provide the building blocks.

By Ramya vani Rayala
Pragmatica Aether: Let Java Be Java
Pragmatica Aether: Let Java Be Java

The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. Update the Java version — rebuild and test every service. Change your message broker from RabbitMQ to Kafka — modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool and update dependencies in every microservice. Switch cloud providers — rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level. This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them. Framework lock-in worsens this. It isn’t a vendor problem — it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production — not during development. This is an architecture problem. Architectural problems have architectural solutions. Aether: The Core Idea What you write: an interface annotated with @Slice, plus business logic implementation. Java @Slice public interface OrderService { Promise<OrderResult> placeOrder(PlaceOrderRequest request); static OrderService orderService(InventoryService inventory, PricingEngine pricing) { return request -> inventory.check(request.items()) .flatMap(available -> pricing.calculate(available)) .map(priced -> OrderResult.placed(priced)); } } What you don’t write: everything else. No HTTP clients — inter-slice calls are direct method invocations via generated proxies. No service discovery — the runtime tracks where every slice instance lives. No retry logic — built-in retry with exponential backoff and node failover. No circuit breakers — the reliability fabric handles failure automatically. No serialization code — request/response types are serialized transparently. A method call via an imported interface is the only visible contract. The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation — it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly. Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level. The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties — performance, latency, memory consumption — but as long as they’re compatible with the interface, business logic works unchanged. You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently. Under The Hood: What Makes It Work Five architectural decisions make this possible. Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm was published in 2021. Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide. Built-in Artifact Repository. DHT-based storage with configurable replication — 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained. ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices. Declarative Deployment. Blueprints — TOML files — describe the desired state: which slices, how many instances. TOML id = "org.example:commerce:1.0.0" [[slices]] artifact = "org.example:inventory-service:1.0.0" instances = 3 [[slices]] artifact = "org.example:order-processor:1.0.0" instances = 5 Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically. Infrastructure Independence. Aether nodes are identical — there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java — roll it out across nodes without touching applications. Update the Aether runtime — same. Update business logic — deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule. Fault Tolerance: The 50% Rule The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact — actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires (N/2) + 1 nodes — as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically — the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence — from failure detection through state restoration to serving traffic — completes without human intervention. When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required. Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing: SQL aether update start org.example:order-processor 2.0.0 -n 3 aether update routing <id> -r 1:3 # 25% to v2, 75% to v1 aether update routing <id> -r 1:1 # 50/50 aether update complete <id> # 100% to v2, drain v1 Deploy during business hours. Shift traffic gradually — 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step. If health degrades — error rate exceeds thresholds, latency spikes — instant rollback with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry. For Every Project: Legacy, Greenfield, And Everything Between Legacy Migration Your legacy Java system doesn’t need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system — something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation: Java private Promise<Report> generateReport(ReportRequest request) { return Promise.lift(() -> legacyReportService.generate(request)); } One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk — the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change — and that's when the 50% rule starts to apply. From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious. No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail — from initial wrapping through gradual peeling to clean JBCT code. Greenfield Development For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method — and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service. You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices — each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default. JBCT patterns — Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects — compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them. The Spectrum Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient. The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility. Scaling: Two Levels, Three Tiers of Intelligence Two-Level Horizontal Scaling Aether scales in two dimensions independently: Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded—scaling takes milliseconds, not seconds.Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work. Independent controls, combined effect. Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. Or start fresh with maximum clarity — lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist — legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought — it's the foundation. Scaling is not your problem — it's the environment’s. Infrastructure is not your code — it's the runtime’s. The heavy winter coat comes off. The application breathes. Resources Pragmatica Aether—project siteGitHub Repository—source code

By Sergiy Yevtushenko
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka

After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: Java { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Production concerns addressed: Key-based partitioning: Using orderId as the key ensures all events for the same order go to the same partition, maintaining orderingAsync publishing with callbacks: Non-blocking publish with explicit success/failure handlingStructured logging: Captures partition and offset for troubleshooting Kafka Consumer With Idempotency At-least-once delivery means duplicates are possible. Consumers must deduplicate: Java @KafkaListener(topics = "orders.order-created.v1", groupId = "billing-service") public void onOrderCreated(OrderCreated event) { // Check if already processed if (processedEventsRepository.existsByEventId(event.getEventId())) { log.debug("Skipping duplicate event: {}", event.getEventId()); return; } // Process event billingService.createInvoice( event.getOrderId(), event.getCustomerId(), event.getItems() ); // Mark as processed processedEventsRepository.save( new ProcessedEvent(event.getEventId(), Instant.now()) ); } Idempotency pattern: Check eventId before processing, store it after processing. If the same event arrives twice, the second one is ignored. Architecture Diagram: Contract-First Flow After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: JSON { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Architecture Diagram: Contract-First Flow Figure 1: Contract-first development flow showing parallel team development Note: Solid arrows represent generation or implementation flow. Dashed arrows represent validation relationships. Contract Type 3: Database Contracts With Flyway Contract Type 3: Database Contracts With Flyway Database schemas are contracts, too. Flyway migrations let you version and evolve schemas with the same contract-first approach. Base Schema Migration File: V1__create_orders.sql SQL CREATE TABLE orders ( id VARCHAR(32) PRIMARY KEY, customer_id VARCHAR(32) NOT NULL, status VARCHAR(16) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now() ); CREATE TABLE order_items ( order_id VARCHAR(32) NOT NULL REFERENCES orders(id), sku VARCHAR(64) NOT NULL, quantity INT NOT NULL CHECK (quantity > 0), PRIMARY KEY (order_id, sku) ); CREATE INDEX idx_orders_customer_id ON orders(customer_id); Schema Evolution: Expand/Migrate/Contract When you need to add a field without breaking old code, use the expand/migrate/contract pattern: Expand (V2__add_order_source.sql): SQL ALTER TABLE orders ADD COLUMN source VARCHAR(32); Migrate (application code): SQL -- Backfill for existing rows UPDATE orders SET source = 'UNKNOWN' WHERE source IS NULL; Contract (V3__enforce_source_not_null.sql): SQL -- After all systems produce source ALTER TABLE orders ALTER COLUMN source SET NOT NULL; This three-phase approach lets you deploy schema changes without downtime. CI/CD Enforcement: Making Contracts Real Contracts only work if they're enforced. Here are the CI gates that prevent breaking changes: REST API: OpenAPI Breaking Change Detection Run openapi-diff in CI to catch breaking changes: YAML # .github/workflows/api-contract-check.yml - name: Check for API breaking changes run: | npx openapi-diff \ main:contracts/openapi/orders-api.v1.yaml \ HEAD:contracts/openapi/orders-api.v1.yaml \ --fail-on-breaking This fails the build if you: Remove required fieldsChange field typesRemove endpointsChange HTTP status codes Kafka: Schema Registry Compatibility Check Configure Schema Registry to enforce backward compatibility: YAML # Schema Registry configuration confluent.schema.registry.url=http://schema-registry:8081 spring.kafka.properties.schema.registry.url=http://schema-registry:8081 # Enforce backward compatibility spring.kafka.producer.properties.auto.register.schemas=true spring.kafka.producer.properties.use.latest.version=true When you try to publish an incompatible schema, the producer fails at startup, before reaching production. Database: Flyway Validation Flyway validates migrations on startup: Properties files spring.flyway.validate-on-migrate=true spring.flyway.baseline-on-migrate=false If someone manually modified the database schema, Flyway detects the mismatch and fails the deployment. Integration Flow Diagram Figure 2: End-to-end flow showing REST → DB → Kafka integration with contracts enforced at each boundary Real-World Benefits: Production Implementation Experience After implementing contract-first across multiple services (orders, billing, inventory), teams consistently observe significant improvements: Integration quality: Substantial reduction in integration bugs reaching productionMost remaining bugs are business logic issues, not contract mismatchesClear contracts eliminate ambiguity in integration expectations Development velocity: Integration cycles measured in weeks rather than monthsConsumer teams start integration work immediately instead of waiting for provider completionMock servers enable realistic integration testing without coordination delays Operational reliability: CI-enforced contract validation prevents breaking changes from reaching productionBreaking changes caught during PR review, not after deploymentAutomated validation ensures compatibility before merge Documentation accuracy: Documentation generated from OpenAPI specs stays synchronized with implementation by designSwagger UI always reflects actual API behavior as both derive from the same contractNo manual documentation maintenance required Common Pitfalls and How to Avoid Them Pitfall 1: Treating Contracts as Documentation Wrong approach: Write code first, generate OpenAPI from annotations later.Right approach: Write OpenAPI first, generate server stubs, or validate implementation against it. Pitfall 2: Skipping CI Validation Wrong approach: Trust developers to manually check compatibility.Right approach: Automate breaking change detection in CI. Make it a required check. Pitfall 3: No Schema Evolution Strategy Wrong approach: Add fields without defaults, breaking old consumers.Right approach: All new fields must be optional or have defaults. Test with multiple schema versions. Pitfall 4: Ignoring Idempotency Wrong approach: Assume exactly-once delivery in Kafka.Right approach: Design consumers to deduplicate by eventId. Assume at-least-once. Pitfall 5: Coupling Contracts to Implementation Wrong approach: Expose internal database schema directly in API contracts.Right approach: Design contracts for consumers, not internal convenience. Use DTOs that map between the external contract and the internal model. When Contract-First Makes Sense Contract-first isn't always the answer. Use it when: ✅ Multiple teams integrate: The coordination cost justifies the upfront contract design effort✅ Public APIs or partner integrations: External consumers need stability and clear documentation✅ Microservices architecture: Services must evolve independently without breaking dependents✅ High change frequency: CI validation catches breaking changes early when changes are frequent Skip contract-first when: ❌ Single small team: Coordination overhead is low, so formal contracts add friction❌ Prototyping: You're exploring the problem space and expect major pivots❌ Internal tools with one consumer: The provider and consumer are maintained by the same person Key Takeaways Contracts enable parallel development: Provider and consumer teams work simultaneously instead of seriallyCI validation prevents breaking changes: Automate OpenAPI diffs, schema compatibility checks, and migration validationIdempotency is not optional: At-least-once delivery in Kafka requires consumers to deduplicate by event IDSchema evolution requires backward compatibility: Use nullable fields with defaults to support gradual rolloutsContracts are design specifications, not afterthoughts: Define contracts first, generate code from them What's Next? If you're implementing contract-first integration: Start with one integration, pick your most painful cross-team dependencyWrite the OpenAPI spec or Avro schema first, before any implementationSet up CI validation for that contract (openapi-diff or Schema Registry)Measure reduction in integration bugs and coordination overheadExpand to other integrations once you've proven the pattern Contract-first isn't a silver bullet, but it transforms integration from a coordination nightmare into a governance problem that CI can solve. When you're coordinating three teams across two time zones, that shift makes the difference between shipping in weeks versus months. Full source code: github.com/wallaceespindola/contract-first-integrations Related reading: OpenAPI 3.0 Specification: spec.openapis.orgConfluent Schema Registry: docs.confluent.io/platform/current/schema-registryFlyway Documentation: flywaydb.org/documentationMastering Contract-First API Development: moesif.com/blogResearch on Microservices Issues: arxiv.org Need more tech insights? Check out my GitHub, LinkedIn, and Speaker Deck. Happy coding!

By Wallace Espindola
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions

When AWS announced Lambda Durable Functions at re: Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test. I decided to build a simple document processing workflow, an ETL pipeline with human-in-the-loop approval using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system. What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet. In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging. The Problem: Approval Workflows Are Expensive If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours. Traditional approaches to handling this waiting are: Polling: Your code keeps checking every 30 seconds — "Is it approved yet? How about now?" making those calls the entire time.Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers — works fine, but it requires you to maintain a state machine you built yourself. What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off. That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern. What We Are Building Here is the workflow we will implement: Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file. The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute. Architecture Overview The project uses three separate SAM stacks: Markdown shared-resources/ # Approval API, DynamoDB, SNS (shared by both systems) durable-functions/ # Lambda Durable Functions ETL pipeline step-functions/ # Step Functions ETL pipeline (for comparison) The shared approval handler serves for both workflow types using a single API. When a job comes in for approval, it checks the workflowType field, and if it is durable-functions, it calls send_durable_execution_callback_success.If step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood. Prerequisites Before we begin, make sure you have the following: AWS SAM CLI (latest version recommended)Python 3.14 runtime AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway accessDocker for local Lambda testing Check your SAM CLI version: Markdown sam --version Step 1: Deploy Shared Resources First Before the ETL pipeline, we need the shared infrastructure — the approval API, DynamoDB table for pending approvals, and SNS topic for notifications. Here is the shared-resources/ SAM template: YAML # shared-resources/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Shared resources for ETL approval workflow Parameters: ApproverEmail: Type: String Description: Email address to receive approval notifications Default: [email protected] Resources: PendingApprovalsTable: Type: AWS::DynamoDB::Table Properties: TableName: etl-pending-approvals BillingMode: PAY_PER_REQUEST AttributeDefinitions: - AttributeName: jobId AttributeType: S KeySchema: - AttributeName: jobId KeyType: HASH TimeToLiveSpecification: AttributeName: ttl Enabled: true ApprovalNotificationTopic: Type: AWS::SNS::Topic Properties: TopicName: etl-approval-notifications Subscription: - Endpoint: !Ref ApproverEmail Protocol: email ApprovalApi: Type: AWS::Serverless::Api Properties: Name: ETL-Approval-API StageName: prod ApprovalHandlerFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETL-Approval-Handler CodeUri: ./src Handler: approval_handler.handler Runtime: python3.14 MemorySize: 256 Timeout: 30 Environment: Variables: APPROVALS_TABLE: !Ref PendingApprovalsTable Policies: - DynamoDBCrudPolicy: TableName: !Ref PendingApprovalsTable - Version: '2012-10-17' Statement: - Effect: Allow Action: - states:SendTaskSuccess - states:SendTaskFailure - lambda:SendDurableExecutionCallbackSuccess - lambda:SendDurableExecutionCallbackFailure Resource: '*' Events: ApproveJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /approve/{jobId} Method: POST RejectJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /reject/{jobId} Method: POST GetJobStatus: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /status/{jobId} Method: GET Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach, one API that works with both workflow types. Deploy it: Markdown cd shared-resources sam build sam deploy --guided Step 2: The Durable Functions SAM Template Now the ETL pipeline itself for the Duration Functions. The key addition is the DurableConfig property. The DurableConfig property tells Lambda to enable durable execution for your function. YAML # durable-functions/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Lambda Durable Functions ETL Pipeline Globals: Function: Runtime: python3.14 Architectures: - arm64 MemorySize: 512 Timeout: 900 Resources: ETLOrchestratorFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETLDurableOrchestrator CodeUri: ./src Handler: handlers/etl_handler.lambda_handler MemorySize: 1024 Timeout: 900 DurableConfig: ExecutionTimeout: 86400 # 24 hours for human approval RetentionPeriodInDays: 14 # Keep execution history for debugging AutoPublishAlias: live Policies: - AWSLambdaBasicExecutionRole - S3CrudPolicy: BucketName: !Sub "${RawBucketName}-${AWS::AccountId}" - S3CrudPolicy: BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}" - DynamoDBCrudPolicy: TableName: !Ref ETLMetadataTable - DynamoDBCrudPolicy: TableName: etl-pending-approvals - SNSPublishMessagePolicy: TopicName: etl-approval-notifications Events: S3Upload: Type: S3 Properties: Bucket: !Ref RawDataBucket Events: s3:ObjectCreated:* Filter: S3Key: Rules: - Name: prefix Value: uploads/ - Name: suffix Value: .csv Environment: Variables: PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}" METADATA_TABLE: !Ref ETLMetadataTable APPROVALS_TABLE: etl-pending-approvals APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl A few things to notice here: MemorySize: 1024 on the orchestrator (overrides the 512 MB global default). Since this single function does all the work, it needs more memory.ExecutionTimeout: 86400 – This is the total workflow duration across all invocations (24 hours). The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.AutoPublishAlias: live – AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.S3 filter with prefix: uploads/ and suffix: .csv – Only CSV files under the uploads/ directory trigger the workflow.The stack imports shared resources via !ImportValue the approval table, SNS topic, and API URL from the shared stack. Step 3: Writing the Durable Function This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code. First, the individual ETL steps. Each one is a regular Python function in a separate file: Extract Python import csv import io import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def extract_data(source_bucket, source_key, step_context=None): logger.info(f"Extracting from s3://{source_bucket}/{source_key}") response = s3_client.get_object(Bucket=source_bucket, Key=source_key) content = response["Body"].read().decode("utf-8") reader = csv.DictReader(io.StringIO(content)) records = list(reader) schema = { "columns": reader.fieldnames, "source_file": source_key, "file_size_bytes": response["ContentLength"] } logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns") return {"data": records, "record_count": len(records), "schema": schema} Transform Python import logging from datetime import datetime logger = logging.getLogger() def transform_data(raw_data, schema_config, step_context=None): logger.info(f"Transforming {len(raw_data)} records") valid_records, rejected_records = [], [] for i, record in enumerate(raw_data): try: cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()} if not cleaned.get("id") or not cleaned.get("name"): rejected_records.append({"index": i, "reason": "Missing required field"}) continue if "date" in cleaned: cleaned["date"] = normalize_date(cleaned["date"]) cleaned["_processed_at"] = datetime.utcnow().isoformat() for key in ["amount", "quantity", "price"]: if key in cleaned and cleaned[key]: try: cleaned[key] = float(cleaned[key]) except ValueError: cleaned[key] = None valid_records.append(cleaned) except Exception as e: rejected_records.append({"index": i, "reason": str(e)}) return { "data": valid_records, "valid_records": len(valid_records), "rejected_records": len(rejected_records), "rejection_details": rejected_records[:100] } def normalize_date(date_str): for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]: try: return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d") except ValueError: continue return date_str Load Python import json import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def load_data(transformed_data, target_bucket, target_key, step_context=None): logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}") output_lines = "\n".join(json.dumps(r) for r in transformed_data) s3_client.put_object( Bucket=target_bucket, Key=target_key, Body=output_lines.encode("utf-8"), ContentType="application/jsonl", Metadata={"record_count": str(len(transformed_data))} ) summary = { "record_count": len(transformed_data), "columns": list(transformed_data[0].keys()) if transformed_data else [], "sample_records": transformed_data[:3] } return {"target_path": f"s3://{target_bucket}/{target_key}", "record_count": len(transformed_data), "summary": summary} Notice the steps are plain Python functions — no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context. Now the main ETL orchestrator that ties it all together: Python import json import os import logging from datetime import datetime from aws_durable_execution_sdk_python import durable_execution, DurableContext from steps.extract import extract_data from steps.transform import transform_data from steps.load import load_data from steps.finalize import finalize_job logger = logging.getLogger() logger.setLevel(logging.INFO) PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET") METADATA_TABLE = os.environ.get("METADATA_TABLE") @durable_execution def lambda_handler(event, context: DurableContext): # Handle both S3 event format and direct invocation if "Records" in event: s3_event = event["Records"][0]["s3"] source_bucket = s3_event["bucket"]["name"] source_key = s3_event["object"]["key"] else: source_bucket = event.get("bucket") source_key = event.get("key") # Generate job_id deterministically using context.step() job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) context.logger.info(f"Starting ETL job: {job_id}") # Step 1: Extract extracted = context.step( lambda _: extract_data(source_bucket, source_key, None), name="extract-data" ) context.logger.info(f"Extracted {extracted['record_count']} records") # Step 2: Transform transformed = context.step( lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None), name="transform-data" ) # Step 3: Load loaded = context.step( lambda _: load_data(transformed["data"], PROCESSED_BUCKET, f"processed/{job_id}/output.jsonl", None), name="load-data" ) # --- EXECUTION PAUSES HERE --- # The submitter function stores the callback_id in DynamoDB # and sends an SNS notification to the reviewer. # No compute charges while waiting for approval. def submit_for_approval(callback_id: str, ctx): return notify_reviewer(job_id, callback_id, loaded["summary"]) approval = context.wait_for_callback( submitter=submit_for_approval, name="quality-check-approval" ) # Parse approval result if isinstance(approval, str): approval = json.loads(approval) if not approval or not approval.get("approved"): return {"status": "REJECTED", "job_id": job_id, "reason": approval.get("reason", "No reason")} # Step 4: Finalize (only runs after approval) final = context.step( lambda _: finalize_job(job_id, source_bucket, source_key, loaded, approval, METADATA_TABLE, None), name="finalize-job" ) return { "status": "COMPLETED", "job_id": job_id, "records_processed": transformed["valid_records"], "output_path": loaded["target_path"], "approved_by": approval.get("reviewer"), "completed_at": final["completed_at"] } Let me break down the important parts: @durable_execution – This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.context.step(lambda _: ..., name="...") – Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.context.wait_for_callback(submitter=..., name="...") – This is the zero-cost waiting magic. The submitter function receives a callback_id which gets stored in DynamoDB. Execution then pauses completely — Lambda saves the state, shuts down, and you stop paying.Determinism matters – Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently. The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. Step 4: The Shared Approval Handler When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution: Python # shared-resources/src/approval_handler.py (key excerpt) if workflow_type == 'durable-functions': callback_id = approval_record.get('callbackId') if approved: lambda_client.send_durable_execution_callback_success( CallbackId=callback_id, Result=json.dumps(approval_response) ) else: lambda_client.send_durable_execution_callback_failure( CallbackId=callback_id, Error='JobRejected', Cause=reason or 'Job rejected by reviewer' ) elif workflow_type == 'step-functions': task_token = approval_record.get('taskToken') if approved: stepfunctions.send_task_success( taskToken=task_token, output=json.dumps(approval_response) ) Same API, same reviewer experience — the underlying callback mechanism is the only thing that differs. Step 5: Deploy and Test Deploy in order (shared resources first, since the other stacks import from it): Markdown # 1. Deploy shared resources cd shared-resources sam build && sam deploy --guided # 2. Deploy Durable Functions cd ../durable-functions sam build && sam deploy --guided Generate test data: Markdown python scripts/generate_test_data.py --count 10 --output test-data/ Upload files to trigger the workflow (note the uploads/ prefix — the S3 filter requires it): Markdown aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive Check approval status and approve: Markdown # Check status curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id> # Approve curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \ -H "Content-Type: application/json" \ -d '{"reviewer": "harpreet", "reason": "Data looks good"}' For bulk approvals during testing, the repo includes a handy script: Markdown ./scripts/approve_all_jobs.sh For local testing, the testing SDK supports pytest: Markdown pip install aws-lambda-durable-execution-sdk-testing pytest durable-functions/tests/ Step 6 (Optional): Deploy Step Functions for Comparison If you want to reproduce my full comparison, deploy the Step Functions stack too: Markdown cd step-functions sam build && sam deploy --guided Here is what the same workflow looks like in Amazon States Language: JSON { "StartAt": "ExtractData", "States": { "ExtractData": { "Type": "Task", "Resource": "${ExtractFunctionArn}", "ResultPath": "$.extractResult", "Next": "TransformData" }, "TransformData": { "Type": "Task", "Resource": "${TransformFunctionArn}", "ResultPath": "$.transformResult", "Next": "LoadData" }, "LoadData": { "Type": "Task", "Resource": "${LoadFunctionArn}", "ResultPath": "$.loadResult", "Next": "WaitForApproval" }, "WaitForApproval": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "${ApprovalFunctionArn}", "Payload": { "taskToken.$": "$$.Task.Token", "jobId.$": "$.loadResult.job_id", "summary.$": "$.loadResult.summary" } }, "TimeoutSeconds": 86400, "ResultPath": "$.approvalResult", "Next": "CheckApproval" }, "CheckApproval": { "Type": "Choice", "Choices": [{ "Variable": "$.approvalResult.approved", "BooleanEquals": true, "Next": "FinalizeJob" }], "Default": "JobRejected" }, "JobRejected": { "Type": "Pass", "Result": { "status": "REJECTED" }, "End": true }, "FinalizeJob": { "Type": "Task", "Resource": "${FinalizeFunctionArn}", "End": true } } } Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing. The Real Cost Numbers Now, here is the part that made me rebuild a mental model I had about serverless orchestration costs. I ran 1,000 CSV files through this exact workflow — both with Durable Functions and with the Step Functions implementation. The approval wait averaged about 20 minutes per document. Cost ComponentDurable FunctionsStep FunctionsDifferenceLambda invocations$0.000358$0.001-64%Lambda duration$0.0308$0.0179+72%State transitions$0.000$0.175-100%DynamoDB$0.003$0.0030%S3 operations$0.010$0.0100%TOTAL$0.044$0.207-79% Source: AWS CloudWatch Metrics The total cost, which is 79% cheaper, is mainly driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. ASL workflow has 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected/FinalizeJob). For 1,000 workflows, that is 7,000 transitions, which costs $0.175. That single line (state transition) item is 84% of the total Step Functions cost. Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018) because the durable function runs with 1,024 MB memory (single function handling all work) compared to Step Functions using 512 MB per function across five smaller functions. At scale, the difference adds up quickly: Daily VolumeDurable Functions/yearStep Functions/yearAnnual Savings1,000/day$16.06$75.56$59.5010,000/day$160.60$755.60$595100,000/day$1,606$7,556$5,950 And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers. Understanding the Replay Model One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why. The checkpoint/replay model means each workflow requires a minimum of 2 invocations: Initial invocation — S3 trigger fires, function runs generate-job-id → extract → transform → load → submit-for-approval → pauseResume invocation — Callback received, function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step So the theoretical minimum is 2,000 invocations for 1,000 workflows. The actual number was 1,788 because some workflows were still pending approval when I collected the metrics over the 24-hour measurement window. The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed. Python job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) That is exactly why the job_id generation in our code uses context.step().Without it, the timestamp would change on every replay. Here are some other examples where your code must be deterministic and how to avoid that: Deterministic IsssuesWhy It BreaksSolutionMath.random()Different value on every replayWrap in context.step()Date.now()Time keeps moving forwardUse context.timestamp or wrap in a stepGlobal variablesMight change between replaysPass state through function argumentsExternal API callsNetwork is a lieAlways wrap in context.step()Iterating over Map or SetIteration order can vary by runtimeUse arrays or ensure stable ordering When Not to Use Durable Functions I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story. Choose Step Functions when: Visual debugging matters. The step function state machine execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. With Durable functions, AWS did provide visual analysis, monitoring, and debugging as well, but its little more developer-friendly. Multi-service orchestration. Step Functions has 220+ native AWS service integrations. DynamoDB, SQS, SNS, ECS, and Glue without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive. Choose Durable Functions when: Cost optimization is the priority (79% savings at scale)Workflows are Lambda-centric (your logic lives in Lambda code anyway)You prefer writing orchestration in Python/TypeScript/over Amazon States Language. AWS just now released Lambda Duration functions with Java in developer preview.Your logic is complex, and dynamic programming language is preferred by developers over the declarative ASL. AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services. Concurrency Planning — A Quick Note One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand. In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services — the difference is just where the concurrency demand concentrates. Wrapping Up Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting. The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real, which is driven entirely by the elimination of state transition charges. That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here, and it depends on what your team values more. The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.

By Harpreet Siddhu
Your AI Agent Tests Are Passing, But Your Agent Is Still Broken
Your AI Agent Tests Are Passing, But Your Agent Is Still Broken

I was building an AI agent that reads log files, calls APIs, and runs tools based on user instructions. Standard stuff for anyone working with LLM-based automation today. I wrote Playwright tests for it. The tests were green. The agent was lying. Here is what happened, and what I had to build to fix it. The Trap I Walked Into As covered in Building a New Testing Mindset for AI-Powered Web Apps, "unlike a rules-based form, the AI agent might phrase the same question differently each time — making it impossible to write a single pass/fail test script." I hit this immediately. My first test looked like this: TypeScript expect(output).toBe("I read logs/test-results.log. Summary: 2 tests failed, 8 passed."); It passed last week. It failed this week. The model said: Plain Text I checked logs/test-results.log. Summary: 8 passed, 2 failed. Same meaning, but different words, different order, and Test broken. So I switched to snapshots - same problem, bigger diffs. Then, regex is fragile and impossible to maintain. Then I checked only HTTP status and "no crash" — tests went green while the agent picked the wrong tool entirely or gave a confident, wrong answer. After all of that, I realized the issue: I was treating LLM output like fixed copy. I was testing the model's writing style, not the agent's behavior. The Bug That Changed How I Think About This This is the one that made the problem concrete for me. The task: "Read notes/meeting.txt and give me a one-line summary." My test: TypeScript expect(reply.trim().length).toBeGreaterThan(0); The agent returned a perfectly normal sentence. Test passed. What actually happened: the model never read the file. It guessed a plausible summary from the prompt alone and returned it as if it had done the work. The reply was non-empty, so the assertion was satisfied. That test wasn't checking agent behavior. It was checking that the model could generate a sentence, which it always can. The question I needed to answer was not "did it return text?" but "did it actually call the file-reader tool?" Those are different questions entirely. What to Test Instead Effectively Managing AI Agents for Testing puts it well: agents are best understood as a system prompt combined with state, memory, and a selection of tools. That definition is exactly why testing them requires a different approach — you are testing decisions, not return values. When I stepped back, I realized agent testing has three distinct layers that traditional assertions don't cover: Decisions – which tool did it pick, and did it pick the right one?Sequence – for multi-step tasks, did it follow a valid order?Output rules – does the answer satisfy flexible behavioral rules, not a frozen string? None of these maps cleanly to expect(output).toBe(...) What I Built I built AgentAssert - a Playwright-based reference implementation of five testing patterns for agents that call tools. The core idea: instead of asserting on the final text, assert on the trace — a complete log of every decision the agent made, every tool it called, and every result it received. TypeScript const trace = await agent.run("Read logs/app.log and summarize errors"); // Did it actually use the tool? AgentAssert.toolWasInvoked(trace, 'file-reader', { filePath: /.*\.log$/ }); // Did it say the right kind of thing? AgentAssert.satisfiesContract(trace.output, BehaviorContract.SUMMARIZATION); The five patterns the repo demonstrates: Pattern 1 – Tool Invocation: Did the agent call the right tool? This catches the meeting.txt class of bug - a confident-sounding answer with no actual work behind it. Pattern 2 – Behavior Contracts: Does the output satisfy flexible rules (required fields, must-include concepts, forbidden phrases) without requiring exact wording? The contract matcher is rule-based - keywords and patterns - not a second AI model. It is inspectable and cheap to run. Pattern 3 – Multi-Step Trace Verification: For tasks that require two tools in sequence, did the agent follow the right order? Browser tests check page state. These tests check the agent's internal reasoning path. Pattern 4 – Boundary Enforcement: Did the agent stay within its allowed tools, or did it hallucinate tool names and try to call things it shouldn't? This one catches scope creep early. Pattern 5 – Failure Observability: When a tool errors, does the agent report the failure honestly or claim success anyway? Most agent test suites never simulate tool failures. This pattern forces it. Why Playwright and not Jest This repo uses Playwright as the test runner, which surprised a few people who reviewed it. Playwright is usually a browser testing tool. The reason is practical. Agent tests are slow and flaky by nature — LLM responses vary, API calls take time. Playwright gives you per-test timeouts, built-in retries, HTML reports with attachments, and worker-level isolation. Jest requires plugins or manual configuration for all of that. When a behavioral test fails, the HTML report shows the full agent trace attached directly to the failure — which tool ran, in what order, and what the model said at each step. Playwright's capabilities go well beyond browser testing. Master API Testing with Playwright covers how it handles retries, timeouts, and network interception for backend flows. AgentAssert builds on those same strengths - applied to LLM tool-call loops instead of HTTP endpoints. Using Playwright without a browser is unconventional. But for this problem, it fits better than the alternatives. What This Doesn't Solve The contract matcher works on keywords and patterns. If the agent says "unable to locate the file" instead of "file not found" and your contract only lists one phrasing, it may fail even though the meaning is the same. This is a real limitation. More sophisticated approaches exist. 5 Agent CI/CD Evaluation Best Practices describes using an LLM-as-judge with soft and hard failure thresholds. That approach is more powerful but adds cost and latency. The contract matcher here is deliberately simpler - inspectable rules you can read and tune in one file. This repo also does not test security, production monitoring, or external system behavior. It tests what you define rules for. The value lies in catching common failures — wrong tool, wrong order, false success, scope violations— at a low cost and with repeatability in CI. The Shift in Mindset When I finished building this, the thing that stuck was not the code. It was the reframe. Software Testing in the LLM Era describes how the tester's role is moving from executing scripts to validating AI decisions. The five patterns in this repo are one practical step in that direction. Agents are not functions. You cannot test them the way you test a function that returns a fixed value for fixed input. An agent makes decisions. You need to test the decisions — what it chose to do, in what order, and whether it stayed honest when things went wrong. The code is at github.com/bireshpatel/agent-assert. It is a reference implementation, not a published library. Copy the patterns, adapt the framework to your agent, and replace the sample tools with your own.

By Biresh Patel

The Latest Testing, Deployment, and Maintenance Topics

article thumbnail
Conversational Risk Accumulation: Stateful Guardrails Beyond Single-Turn LLM Checks
Learn how Conversational Risk Accumulation (CRA) helps detect session-level risks in long AI chats using telemetry, drift tracking, and soft guardrails.
June 15, 2026
by Sanjay Mishra
· 492 Views
article thumbnail
From ETL to Lakeflow: Shifting to a Declarative Data Paradigm
The article focuses on moving away from traditional, "imperative" ETL processes to a modern, "declarative" approach using the Databricks Lakeflow platform.
June 15, 2026
by Seshendranath Balla Venkata
· 445 Views
article thumbnail
How to Build a Local LLM Agent to Automate Work List Generation from Monthly Reports (With Jira Integration)
Learn how a local LLM agent automates work list generation from reports, enriches tasks from Jira, detects duplicates, and keeps enterprise data secure.
June 11, 2026
by Sergey Laptick
· 1,823 Views · 1 Like
article thumbnail
Can We Build Elite Search Agents Without Massive Industrial RL Pipelines?
OpenSeeker-v2 is intended to push the limits of search agents working with informative and also high-difficulty trajectories.
June 11, 2026
by mike labs
· 1,134 Views
article thumbnail
The Repo Tracker: Automating My Daily GitHub Catch-Up
Automate GitHub repo tracking with a local agent using Python, SQLite, and cron. Learn how to build a lightweight monitoring system for open-source projects.
June 11, 2026
by Alain Airom
· 1,200 Views
article thumbnail
Deployment Lessons You Only Learn the Hard Way
Learn how to reduce deployment risk with effective monitoring, rollback strategies, incident response playbooks, and recovery practices.
June 10, 2026
by Sandesh Basrur
· 1,008 Views
article thumbnail
Building a RAG-Powered Bug Triage Agent With AWS Bedrock and OpenSearch k-NN
Learn how a RAG-powered bug triage agent uses AWS Bedrock, OpenSearch, and dynamic scoring to automate crash analysis and routing.
June 9, 2026
by Rajasekhar sunkara
· 822 Views
article thumbnail
Frame Buffer Hashing for Visual Regression on Embedded Devices
Learn how frame buffer hashing reduced visual regression storage from 18GB to 19KB while speeding up CI and eliminating flaky image diffs.
June 9, 2026
by Rajasekhar sunkara
· 608 Views
article thumbnail
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
A technical deep dive into Amazon Quick — how it works, how it connects to your tools via MCP, and where it sits in the AWS agent stack.
June 9, 2026
by Jubin Abhishek Soni DZone Core CORE
· 1,067 Views
article thumbnail
Agentic AI Has an Observability Blind Spot Nobody Is Talking About
Production AI agents can trigger cascading failures when observability tracks what broke, but not whether the system can safely absorb remediation actions.
June 8, 2026
by Sayali Patil
· 1,173 Views · 2 Likes
article thumbnail
The Big Data Architecture Blueprint: Core Storage, Integration, and Governance Patterns
This comprehensive technical guide breaks down the essential architectural, storage, and integration patterns required to scale enterprise big data platforms.
June 8, 2026
by Ram Ghadiyaram DZone Core CORE
· 1,463 Views
article thumbnail
How to Build an Agentic AI SRE Co-Pilot for Incident Response
Build an agentic SRE co-pilot using LLMs to autonomously reason, plan, and execute incident response across complex, multi-cloud infrastructure.
June 8, 2026
by Akshay Pratinav
· 1,047 Views
article thumbnail
How to Interpret the Number of Spring ApplicationContexts in Integration Tests
When optimizing Spring Boot integration tests, developers often focus on obvious metrics, but they do not always explain why an integration test suite is slow.
June 8, 2026
by Constantin Kwiatkowski
· 1,092 Views
article thumbnail
Mastering Fluent Bit: Beginners' Guide for Contributing to our CNCF Project Docs
This intro to mastering Fluent Bit covers the entry point for developers that want to contribute to a CNCF documentation project but are not sure how.
June 8, 2026
by Eric D. Schabell DZone Core CORE
· 908 Views · 1 Like
article thumbnail
Mastering Fluent Bit: Beginners' Guide for Contributing to Our CNCF Project Website
This intro to mastering Fluent Bit covers the entry point for developers that want to contribute to a CNCF project website but are not sure how.
June 5, 2026
by Eric D. Schabell DZone Core CORE
· 2,331 Views · 1 Like
article thumbnail
From 24 Hours to 2 Hours: How We Fixed a Broken BI System With Apache Airflow
Broken pipelines, inaccurate data, frustrated stakeholders. Here is what we did about it and what I wish I had known before we started.
June 5, 2026
by Chinni krishna Abburi
· 1,936 Views
article thumbnail
Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End
Learn how to trace AI agents end to end, from prompts and tool calls to business outcomes, with observability practices for production workflows.
June 5, 2026
by Srinivas Chippagiri DZone Core CORE
· 2,705 Views · 1 Like
article thumbnail
Why Your Test Automation Is Always Behind the Code And the Architecture That Fixes It
Most QA teams are stuck in a manual scripting loop. Here's the requirement-driven architecture that eliminates the coverage gap permanently.
June 5, 2026
by Waqar Hashmi
· 1,925 Views
article thumbnail
Good Data, Bad Metric: A Mutation Testing Pattern for Analytics Engineering
A mutation testing pattern for analytics metrics that checks if validation catches realistic business logic errors early.
June 4, 2026
by Prateek Arora
· 2,368 Views · 1 Like
article thumbnail
Advanced Error Handling and Retry Patterns in Enterprise REST Integrations
Blind retries amplify outages fast. Classify failures first, jitter your backoff, and circuit-break early before cascading.
June 4, 2026
by Anil guntupalli
· 2,505 Views · 1 Like
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×