DZone
Culture and Methodologies

In our Culture and Methodologies category, dive into Agile, career development, team management, and methodologies such as Waterfall, Lean, and Kanban. Whether you're looking for tips on how to integrate Scrum theory into your team's Agile practices or you need help prepping for your next interview, our resources can help set you up for success.

Functions of Culture and Methodologies

Agile


The Agile methodology is a project management approach that breaks larger projects into several phases. It is a process of planning, executing, and evaluating with stakeholders. Our resources provide information on processes and tools, documentation, customer collaboration, and adjustments to make when planning meetings.

Career Development


There are several paths to starting a career in software development, including the more non-traditional routes that are now more accessible than ever. Whether you're interested in front-end, back-end, or full-stack development, we offer more than 10,000 resources that can help you grow your current career or *develop* a new one.

Methodologies


Agile, Waterfall, and Lean are just a few of the project-centric methodologies for software development that you'll find in this Zone. Whether your team is focused on goals like achieving greater speed, having well-defined project scopes, or using fewer resources, the approach you adopt will offer clear guidelines to help structure your team's work. In this Zone, you'll find resources on user stories, implementation examples, and more to help you decide which methodology is the best fit and apply it in your development practices.

Team Management


Development team management involves a combination of technical leadership, project management, and the ability to grow and nurture a team. These skills have never been more important, especially with the rise of remote work both across industries and around the world. The ability to delegate decision-making is key to team engagement. Review our inventory of tutorials, interviews, and first-hand accounts of improving the team dynamic.

Latest Premium Content
Trend Report
Platform Engineering and DevOps
Trend Report
Developer Experience
Refcard #399
Platform Engineering Essentials
Refcard #008
Design Patterns

DZone's Featured Culture and Methodologies Resources

Getting Started With Agentic Workflows in Java and Quarkus


By Shane Johnson
This post walks through building and running a real-world agentic workflow with Agentican and Quarkus. Specifically, it builds an agentic workflow that automates market research and information sharing:

  • Identify the top vendors within a market category.
  • Research the positioning and strengths of each vendor.
  • Classify the findings as either standard or urgent.
  • Draft a brief to share with others in the company.

Prerequisites

  • Quarkus
  • Java 25
  • Maven (or Gradle)
  • LLM provider API key

Step 1: Add the Dependency

Create a Quarkus app, and add the Agentican Quarkus runtime module:

XML

    <dependency>
      <groupId>ai.agentican</groupId>
      <artifactId>agentican-quarkus-runtime</artifactId>
      <version>0.1.0-alpha.3</version>
    </dependency>

Step 2: Define Agents, Skills, and the Workflow

Create an `agentican-catalog.yaml` file on the classpath. This is where you describe:

  • Who does the work (agents)
  • What they need to do it (skills)
  • How they will do it (workflows)

YAML

    agents:
      - id: researcher
        name: researcher
        role: |
          Expert at finding accurate, sourced information about companies and markets.
          Quotes sources. Distinguishes opinion from fact.
      - id: writer
        name: writer
        role: |
          Synthesizes research into structured, concise briefs.
          Avoids hedging language. Cites concrete evidence.

    skills:
      - id: web-search
        name: web-search
        instructions: |
          When a question requires external information, call the search tool first.
          Quote sources in your answer.

Update the `agentican-catalog.yaml` file to define the workflow.

YAML

    workflows:
      - id: market-brief
        name: market-brief
        description: Research vendors in a market and produce a structured brief
        outputStep: deliver
        params:
          - name: topic
            description: Market to research
            required: true
          - name: vendor_count
            description: Number of vendors
            defaultValue: "5"
        steps:
          - name: identify
            agent: researcher
            skills: [web-search]
            instructions: |
              Identify the top {{param.vendor_count}} vendors in {{param.topic}}.
              Return a JSON array of vendor names: names only, no commentary.
          - name: deep-dive
            type: loop
            over: identify
            steps:
              - name: analyze
                agent: researcher
                skills: [web-search]
                instructions: |
                  Deep-dive vendor {{item}}: positioning, key strengths, recent news.
                  Quote sources.
          - name: classify
            agent: writer
            instructions: |
              Read the per-vendor deep-dives below. If any vendor has launched a
              competitive feature in the last 30 days, return the single word 'urgent'.
              Otherwise return 'standard'.
              Deep-dives: {{step.deep-dive.output}}
            dependencies: [deep-dive]
          - name: deliver
            type: branch
            from: classify
            default: standard
            branches:
              - name: urgent
                steps:
                  - name: urgent-brief
                    agent: writer
                    instructions: |
                      Synthesize a vendor brief flagged URGENT for executive review.
                      Lead with the recent competitive moves.
                      Topic: {{param.topic}}
                      Deep-dives: {{step.deep-dive.output}}
              - name: standard
                steps:
                  - name: standard-brief
                    agent: writer
                    instructions: |
                      Synthesize a vendor brief.
                      Topic: {{param.topic}}
                      Deep-dives: {{step.deep-dive.output}}

A few things worth flagging:

  • agent: researcher references the agent for a step by name; skills are referenced by name, too.
  • outputStep designates the step whose output becomes the workflow's typed result.
  • {{param.X}} interpolates workflow inputs into step instructions.
  • {{step.X.output}} interpolates an upstream step's output.
  • {{item}} is the current value inside a loop iteration.
  • type: loop steps take an over reference (a step that produced a list, or a list-typed param).
  • type: loop steps run their nested steps once per item, in parallel, on virtual threads.
  • type: branch steps take a from reference (a step whose output is used to select a branch).
  • branches are mutually exclusive steps (or sets of steps), with default used for unrecognized values.

The framework loads agentican-catalog.yaml from the classpath, or you can define where it's loaded from:

Properties files

    agentican.catalog-config=/etc/agentican/agentican-catalog.yaml

Note: Agents, skills, and workflows can be defined via a fluent builder API as well.
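To make the placeholder rules concrete, here is a minimal JavaScript sketch of how `{{param.X}}`, `{{step.X.output}}`, and `{{item}}` placeholders (written with balanced double braces) could be resolved against a context object. It is illustrative only: `renderInstructions` and the shape of `context` are hypothetical names chosen for this example, and nothing here is Agentican's API or implementation.

```javascript
// Hypothetical helper, not part of Agentican: resolves {{...}} placeholders
// in a step's instructions against a plain context object.
function renderInstructions(template, context) {
  return template.replace(/\{\{\s*([^}]+?)\s*\}\}/g, (match, path) => {
    let value;
    if (path === "item") {
      // Current element inside a loop iteration.
      value = context.item;
    } else if (path.startsWith("param.")) {
      // Workflow input, looked up by parameter name.
      value = context.params?.[path.slice("param.".length)];
    } else {
      // Output of an upstream step, e.g. step.deep-dive.output.
      const m = /^step\.(.+)\.output$/.exec(path);
      if (m) value = context.steps?.[m[1]]?.output;
    }
    // Leave unknown placeholders untouched so mistakes stay visible.
    return value === undefined ? match : String(value);
  });
}

// Assumed example context; the values are made up for illustration.
const context = {
  params: { topic: "data observability platforms", vendor_count: "5" },
  steps: { identify: { output: '["A","B"]' } },
  item: "A",
};

const identifyPrompt = renderInstructions(
  "Identify the top {{param.vendor_count}} vendors in {{param.topic}}.",
  context
);
const analyzePrompt = renderInstructions("Deep-dive vendor {{item}}.", context);

console.log(identifyPrompt);
console.log(analyzePrompt);
```

Leaving unresolved placeholders in the text, rather than replacing them with an empty string, is a deliberate choice in this sketch: a typo in a step name then shows up in the rendered prompt instead of disappearing silently.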
Step 3: Configure the Models

Agentican reads the engine configuration from `application.properties`. The minimum is one LLM:

Properties files

    agentican.llm[0].api-key=${ANTHROPIC_API_KEY}

The provider defaults to `anthropic`, and the model defaults to `claude-sonnet-4-5`. Want OpenAI instead?

Properties files

    agentican.llm[0].provider=openai
    agentican.llm[0].api-key=${OPENAI_API_KEY}
    agentican.llm[0].model=gpt-4o-mini

Want to mix and match? Configure a `name` for each LLM and reference the names per agent in the YAML catalog:

Properties files

    agentican.llm[0].name=default
    agentican.llm[0].api-key=${ANTHROPIC_API_KEY}
    agentican.llm[1].name=efficient
    agentican.llm[1].provider=openai
    agentican.llm[1].api-key=${OPENAI_API_KEY}
    agentican.llm[1].model=gpt-4o-mini

Step 4: Create a Typed Workflow Instance

Define the workflow input and output records:

Java

    import java.util.List;

    // Workflow input: mapped to the catalog params topic and vendor_count.
    public record ResearchParams(String topic, int vendorCount) {}

    // Workflow output: parsed from the output step's text.
    public record VendorBrief(String topic, List<Vendor> vendors) {
        public record Vendor(String name, String positioning, List<String> strengths) {}
    }

Then inject the typed workflow, and call it from a REST endpoint:

Java

    @Path("/market-brief")
    public class VendorBriefResource {

        @Inject
        @AgenticanWorkflow(name = "market-brief")
        Workflow<ResearchParams, VendorBrief> brief;

        @POST
        @Path("/{topic}")
        public VendorBrief generate(@PathParam("topic") String topic) {
            // Start the workflow with 5 vendors and block until the typed result is ready.
            return brief.start(new ResearchParams(topic, 5)).await();
        }
    }

Now, test the endpoint:

Shell

    curl -X POST http://localhost:8080/market-brief/data%20observability%20platforms

A few things worth flagging, since they're what set this apart from a generic "call an LLM" library:

  • ResearchParams.vendorCount becomes the workflow parameter vendor_count via SNAKE_CASE mapping.
  • start() returns a WorkflowRun<VendorBrief>, and await() parses the output step's text into a VendorBrief.
  • @AgenticanWorkflow(name = "market-brief") resolves the registered workflow at injection time.
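The SNAKE_CASE mapping mentioned above can be pictured with a small JavaScript sketch. It only illustrates the naming convention (a camelCase field such as `vendorCount` becoming the parameter `vendor_count`); `toSnakeCase` and `toWorkflowParams` are hypothetical helpers for this example and are not how Agentican implements the mapping.

```javascript
// Hypothetical illustration of camelCase -> snake_case name mapping;
// this is not Agentican's code.
function toSnakeCase(name) {
  // Insert an underscore at each lower-to-upper boundary, then lowercase.
  return name.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toLowerCase();
}

// Renames an input object's fields to workflow parameter names,
// leaving the values unchanged.
function toWorkflowParams(input) {
  return Object.fromEntries(
    Object.entries(input).map(([key, value]) => [toSnakeCase(key), value])
  );
}

const params = toWorkflowParams({
  topic: "data observability platforms",
  vendorCount: 5,
});

console.log(params);
```

Under this convention, the Java record component name and the YAML parameter name have to agree only after conversion, which is why the catalog declares `vendor_count` while the record declares `vendorCount`.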
Note: WorkflowRun itself exposes future() for a CompletableFuture<R>, and there's a ReactiveWorkflow<P, R> Mutiny variant for Vert.x stacks.

Step 5: Add Agent Tools

Agentican ships two integrations out of the box:

MCP (Model Context Protocol)

There is one config block per server. Tools are auto-discovered:

Properties files

    agentican.mcp[0].slug=github
    agentican.mcp[0].name=GitHub
    agentican.mcp[0].url=https://mcp.github.com/sse
    agentican.mcp[0].headers.Authorization=Bearer ${GITHUB_TOKEN}

Composio

100+ SaaS toolkits, including Slack, Notion, Linear, Salesforce, GitHub, and Google Workspace:

Properties files

    agentican.composio.api-key=${COMPOSIO_API_KEY}
    agentican.composio.user-id=user-123

Tools are referenced by name within agent steps:

YAML

    steps:
      - name: research
        agent: researcher
        tools: [github_search_repositories]
        instructions: "Profile open-source vendors in {{param.topic}}."

Agentican: structured agentic workflows for the JVM.

Where to Go Next

  • Getting Started: install, configure, and run workflows
  • Core Concepts: architecture, terminology, and data flow
  • Workflows & Steps: CDI surface, beans, qualifiers, override patterns
  • Agents: defining agents, skills, and roles
  • Getting Started (Quarkus): dependency setup, config, first task
  • CDI Integration: injection, qualifiers, lifecycle events, bean overrides
  • REST API: endpoints, SSE streaming, WebSocket, error codes
  • Observability: Micrometer metrics, OTel tracing, Prometheus queries
Identity in Action


By Kapil Chakravarthy Sanubala
Switching from one single sign-on (SSO) vendor to another is a complex process that involves more than just changing technologies. It is a high-stakes identity operation that affects security, user experience, compliance, application access, and business continuity. It's not the same as moving a reporting tool or a collaboration platform, because SSO sits at the front door of every application in your environment. If you set it up wrong, everything stops working.

But the biggest danger of SSO migrations is not an outright failure. The small things that go wrong do the most damage:

  • Users are locked out of business-critical apps.
  • Orphaned accounts that were never deprovisioned are carried over.
  • MFA enrollments disappear without warning.
  • Helpdesk queues grow on the morning of cutover because the change was never communicated.

This guide discusses the best ways to move to cloud SSO and the most important things to keep in mind. It covers everything from preparing the identity estate and migrating integrations to phased rollout strategies, keeping the user experience as smooth as possible, and planning for MFA migration.

Why Businesses Change SSO Providers

Companies don't usually change their SSO platforms on a whim. One of the following usually triggers the move:

  • The vendor is acquired, or the product's end of life is announced.
  • Costs are being consolidated, or enterprise license usage is being rationalized.
  • Platforms are being standardized under a broader cloud strategy.
  • Compliance or regulatory requirements can't be met by the current platform.
  • The current platform has scalability, performance, or feature gaps.
  • A merger or acquisition introduces a second identity domain.

Whatever the reason, migration carries compounding risk because SSO is foundational infrastructure, not an individual application.
3 Types of Migration Approaches and Their Differences

There are three main ways to migrate SSO, and each one has its own risks and effects on governance.

Federated Protocol Swap

Retain the same IdP architecture but replace the vendor platform underneath, for example, moving from PingFederate to Entra ID External Identities. The protocol (SAML, OIDC, SCIM) may remain the same, but attribute mappings, claim transformations, and session behaviors differ in ways that are often not clear until something breaks in production.

Full IdP Replacement

The old IdP is completely removed, and a new one is put in its place. Every connection with a service provider (SP) has to be set up, tested, and cut over again. This type carries the most risk, and it's also the one that most businesses don't account for.

Consolidation Migration

A single authoritative platform brings together many IdPs. This typically happens when companies merge or one acquires another. The problems are both technical and organizational: different business units have different app owners, SLAs, and levels of tolerance for disruption. Governance alignment needs to happen before any technical work can begin.

Migration Process: The 7 Steps

  1. Audit and clean up
  2. Plan and prepare
  3. MFA migration
  4. Communication plan
  5. Phased rollout
  6. Governance considerations
  7. Decommission and close out

Step 1: Audit and Clean Up

Most organizations rush this step and migrate everything as is, including unused applications, inactive users, orphaned accounts, and integrations that have gone unused for three years. These don't break anything, but they carry a security risk into the new platform. The following validations also reduce the testing effort and the size of the inventory.

Create a complete, clean list of applications:

  • Validate against the CMDB or application catalog.
  • Validate which apps are actually being used.
  • Validate against access logs from the SIEM.
  • Validate against IGA platforms.
  • Reduce redundant applications.

Create a complete, clean list of valid users:

  • Include active users.
  • Exclude accounts with no activity for 90 days.
  • Exclude dormant accounts whose passwords were never changed.
  • Validate against IGA platforms and HR systems.

Mark the unused applications for the decommissioning process. For each remaining application, note down the protocol used (SAML, OIDC, WS-Federation, or legacy), application owners, attributes and claims, MFA requirements, conditional access (CA) policies, and session time-out configurations.

Step 2: Plan and Prepare

Every application that relies on SSO consumes identity attributes passed in SSO protocols. New IdPs rarely use the same attributes, and they often differ in case sensitivity and format. These mismatches cause silent authentication failures that are extremely difficult to diagnose during cutover.

Application Metadata

  • Prepare the claims transformation registry.
  • Confirm the case and formats.
  • Validate transformation rules.

Redirect URLs

For each application, configure a transparent redirect from the legacy IdP login URL (or intranet homepage) to the new IdP's login endpoint. Users will not experience major changes; the only change they should notice is the new MFA prompt.

Rollback Process

  • Identify when you should roll back.
  • Decide who is able to make the rollback decision.

Rollbacks generally occur in the following cases:

  • The rate of successful authentications drops below 95%.
  • SSO fails for major applications.
  • The help desk receives more calls than usual during the first 2 days of migration.

Migration Go-Live

  • Document the new login flow end to end.
  • Plan for extended staffing during the migration.
  • Validate helpdesk access to the new platform.
  • Identify and set up escalation contacts for issues that the helpdesk cannot resolve.

Step 3: MFA Migration

Prepare a complete inventory of existing MFA enrollments that answers the following:

  • How many users have MFA enrolled vs. password only?
  • What factors are in use?
      • Authenticator apps: need to re-enroll.
      • SMS and email: the same phone number and email address can be reused.
      • Hardware tokens: FIDO2/WebAuthn keys can be reused if the new vendor supports them.
      • Biometrics: need to re-enroll.
  • How many and which users have only a single factor enrolled?

Follow these steps for re-enrollment:

  • Open the self-service enrollment portal.
  • Reuse phone numbers and email addresses (since they remain the same).
  • Send advance communications at least two weeks out, explaining what will change and why.
  • Track re-enrollment completion rates by department and group.
  • Send follow-up emails, including deadlines.
  • Set up a plan to re-enroll privileged accounts.

Step 4: Communication Plan

Communication is a major step in the migration process and should be tracked as a separate workstream with its own timeline, owners, deadlines, and success metrics. There are three different audiences involved in an SSO migration:

  • End users, who simply need to know what will change and what to do.
  • Helpdesk and IT staff, who need operational readiness confirmations.
  • Stakeholders, who need status updates and risk visibility.

The main email templates include:

  • General updates
  • MFA enrollment notices
  • Cutover day notification

Step 5: Phased Rollout

Never cut over the entire organization at once. Instead, choose a phased rollout. This reduces risk, helps validate configurations in production with real users and real traffic, and provides time to identify issues before they affect most of the organization.
First phase: technology users

  • Internal IT staff
  • Identity administrators
  • Helpdesk personnel
  • Power users

Second phase: high-frequency application users, such as users of

  • ERP applications
  • CRM applications
  • Collaboration platforms
  • BI tools

Third phase: general user population

  • Lower-risk departments

Exceptions and low-activity users

  • Contractors
  • Users who log in rarely
  • Third-party users

Step 6: Governance Considerations

To ensure a successful migration and validation, consider the following governance aspects.

Changes to IGA Solutions

  • Joiner, mover, and leaver (JML) changes
      • Provision accounts in the IdP with the attributes required for SSO claims.
      • Disable or delete accounts during terminations.
      • For user transfers, update account attributes and group memberships.
  • Birthright role changes
      • Update roles with the new SSO groups.
      • Clean up legacy vendor applications.

Audit Log Monitoring

  • Onboard logs from the new vendor to the SIEM.
  • Set up alerts, including for:
      • Authentication failures
      • CA policy failures
      • Password failures
      • Token expiration

Non-Human Identities

Create a separate inventory of non-human identity (NHI) accounts and migrate their credentials to the new system. This inventory should include accounts with no owners.

Step 7: Decommission and Close Out

The process can move forward once all the checks are done and MFA enrollments are at acceptable levels. Monitor the new system for 30 days, then plan the decommissioning of the old SSO solution.

Conclusion

SSO is the authentication layer for all the applications in the organization, so migrating without a proper plan introduces risk. Most companies follow one or a combination of the approaches described above. With a proper plan, clear communication, and the right strategy, you should rarely need to invoke the rollback plan.
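The rollback triggers listed in Step 2 are easier to act on when they are written down as an explicit check before cutover. The JavaScript sketch below is one possible encoding, under stated assumptions: the 95% threshold and the 2-day window come from the guide, while the field names (`successfulAuths`, `failingCriticalApps`, `baselineHelpdeskCalls`, and so on) and the idea of comparing call volume against a numeric baseline are illustrative choices, not features of any SSO product. The result only recommends a rollback; the decision stays with whoever owns it.

```javascript
// Sketch of the Step 2 rollback triggers. Field names and the help desk
// baseline are illustrative assumptions, not part of any SSO product.
function evaluateRollback(metrics) {
  const reasons = [];

  // Trigger 1: successful authentication rate drops below 95%.
  if (metrics.totalAuths > 0) {
    const successRate = metrics.successfulAuths / metrics.totalAuths;
    if (successRate < 0.95) {
      reasons.push(
        `auth success rate ${(successRate * 100).toFixed(1)}% is below 95%`
      );
    }
  }

  // Trigger 2: SSO fails for major applications.
  if (metrics.failingCriticalApps.length > 0) {
    reasons.push(
      `SSO failing for critical apps: ${metrics.failingCriticalApps.join(", ")}`
    );
  }

  // Trigger 3: more help desk calls than usual in the first 2 days.
  if (
    metrics.daysSinceCutover <= 2 &&
    metrics.helpdeskCalls > metrics.baselineHelpdeskCalls
  ) {
    reasons.push("help desk call volume is above baseline");
  }

  return { rollbackRecommended: reasons.length > 0, reasons };
}

// Made-up sample readings for illustration.
const healthy = evaluateRollback({
  successfulAuths: 9800,
  totalAuths: 10000,
  failingCriticalApps: [],
  daysSinceCutover: 1,
  helpdeskCalls: 40,
  baselineHelpdeskCalls: 50,
});

const degraded = evaluateRollback({
  successfulAuths: 9300,
  totalAuths: 10000,
  failingCriticalApps: ["ERP"],
  daysSinceCutover: 1,
  helpdeskCalls: 120,
  baselineHelpdeskCalls: 50,
});

console.log(healthy.rollbackRecommended, degraded.reasons);
```

Returning the list of reasons alongside the flag gives the decision owner and the helpdesk escalation contacts the same evidence to work from.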
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
By Kajol Shah
The Agentic Agile Office: Streamlining Enterprise Agile With Autonomous AI Agents
By Madhusudhan Chivukula
Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines
By Mirco Hering (DZone Core)
Feature Flag Debt: Performance Impact in Enterprise Applications

Feature flags have become standard practice in enterprise applications, enabling teams to release code into production environments without exposing new features to users. As teams leverage feature flags to increase delivery velocity, technical debt accumulates. Left unchecked, this debt will slowly and silently impact application performance, maintainability, and developer productivity.

What Is Feature Flag Debt?

Feature flag debt occurs when feature flags are left in the codebase after they’ve served their purpose. The most common symptoms of feature flag debt include:

  • Dead code
  • Context switching for developers

Feature flag debt can go unnoticed because it typically doesn’t cause broken features. As a result, developers often skip flag cleanup in favor of developing new features.

Impact on Performance

Feature flag debt can have serious consequences for application performance. In front-end applications, this is often overlooked. Once a feature flag has been introduced into a codebase, it incurs a long-term cost every time the application is loaded in the browser.

  • Larger JS bundles: Each feature flag adds logic to the application. When feature flags are not cleaned up, the associated code is typically not removed from the final bundled app. This means more code for users to download and more memory used on the client.
  • Reduced execution speed in client-side rendering: The browser must download, parse, and evaluate the entire bundle, even if certain code paths are never executed. This leads to slower parsing, longer load times, and slower time to interaction.

Impact on Developer Productivity

Feature flag debt also negatively impacts developer productivity. Imagine having to read through an if/else statement that checks a feature flag that will never be true. Developers frequently encounter this scenario when working with feature flags. New engineers, in particular, often struggle to know which feature flags are safe to ignore. Should they comment out this code? What if they need it later?

Why Aren’t Feature Flags Cleaned Up?

It should be standard practice to remove feature flags from the codebase once they’re no longer needed. However, they often become a long-term liability for the application for several reasons:

  • Nobody takes responsibility for cleaning up flags.
  • People are afraid to remove code.
  • There are no tools to help automate the process.
  • There’s always something more pressing to work on.

Teams rarely define a feature flag lifecycle, which leads to indefinite accumulation.

Example of Feature Flag Debt

Let’s take a look at how a feature typically looks when wrapped in a feature flag:

JavaScript

    const isAIAgentsFeatureFlagEnabled = isFeatureEnabled('ai-agents');

    if (isAIAgentsFeatureFlagEnabled) {
      // lines of code
      // Code to run when the feature flag is enabled
    } else {
      // lines of code
      // Code to run when the feature flag is disabled
    }

When first implemented, this doesn’t look too bad. When this feature is rolled out to production, there’s still the safety net of keeping the original functionality should something go wrong. However, after the feature flag is turned on for everyone and the feature reaches general availability (GA), there is no reason to keep both pathways in the application. The application still ships both pieces of code in the bundle, but only one will ever execute at runtime. The else block now represents dead code that will never be executed, but it still takes up space in the bundle and adds to code complexity.

Manage and Eliminate Feature Flag Debt

Organizations need to take measures to prevent feature flag debt from slowing down their applications. Defining a feature flag lifecycle is a great place to start. By enforcing that each feature flag has a description, owner, status, and expiration date, the team can ensure flags aren’t left to become debt.
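For the GA scenario described in the example, the cleanup itself is usually small. The sketch below shows the same function before and after the flag is removed; `renderAgentPanel`, `renderLegacyHelp`, and the two `renderHelp*` functions are hypothetical stand-ins invented for this illustration, and the flag lookup is passed in as a parameter so the snippet runs on its own.

```javascript
// Hypothetical stand-ins for the two code paths behind the 'ai-agents' flag.
const renderAgentPanel = () => "agent panel";
const renderLegacyHelp = () => "legacy help";

// Before cleanup: both paths ship, although the flag is on for everyone.
function renderHelpBefore(isFeatureEnabled) {
  if (isFeatureEnabled("ai-agents")) {
    return renderAgentPanel();
  } else {
    return renderLegacyHelp(); // dead code once the flag is permanently on
  }
}

// After cleanup: the flag check and the legacy branch are deleted.
function renderHelpAfter() {
  return renderAgentPanel();
}

// With the flag on for everyone, both versions behave the same.
const alwaysOn = () => true;
console.log(renderHelpBefore(alwaysOn) === renderHelpAfter()); // true
```

In a real cleanup, `renderLegacyHelp` and anything only it imports would be deleted as well, which is where the bundle size reduction comes from.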
Treat feature flags as temporary and not part of the application's core architecture. When the feature is in GA, remove the flag and delete any code paths that are no longer needed. This results in a cleaner, more maintainable, and performant codebase.

JSON

    [
      {
        "feature_flag_name": "ai-agents",
        "description": "Feature flag that will allow AI agents to assist users with workflows and provide suggestions",
        "owner": "architecture crew",
        "status": "GA",
        "expiration_date": "2026-12-31"
      },
      {
        "feature_flag_name": "smart-checkout",
        "description": "Feature flag that will allow smart checkout features, including dynamic pricing, custom offers",
        "owner": "architecture crew",
        "status": "Dev",
        "expiration_date": "2026-12-31"
      },
      {
        "feature_flag_name": "ai-agents-eval",
        "description": "Feature flag to allow the evaluation framework to execute tests against AI agents to determine how accurate they are",
        "owner": "agent evaluation crew",
        "status": "QA",
        "expiration_date": "2026-10-12"
      },
      {
        "feature_flag_name": "experiment-recommendation-v2",
        "description": "Feature flag for experimenting v2 recommendation version",
        "owner": "agent evaluation crew",
        "status": "GA",
        "expiration_date": "2026-12-31"
      }
    ]

Having the feature flags stored in a format similar to the above can help identify who to contact to clean up old flags.

Performance Gains From Cleanup

Removing unused feature flags reduces bundle size and eliminates unnecessary code execution, resulting in faster load times, improved rendering performance, and a cleaner codebase.

Conclusion

For most enterprise applications, feature flags aren’t the problem; forgetting to take them down is. As the application grows over time, old feature flags accumulate, silently bloating the bundle size, degrading performance, and cluttering the code.
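A registry in this shape can also be scanned automatically to produce a cleanup list. The following JavaScript sketch assumes a simple policy that is not stated in the article: a flag is a cleanup candidate when its status is GA or its expiration date has passed. The function name `findCleanupCandidates` and the reference date are hypothetical, and the descriptions are omitted from the sample data for brevity.

```javascript
// Flags in the registry format shown above (descriptions omitted).
const registry = [
  { feature_flag_name: "ai-agents", owner: "architecture crew", status: "GA", expiration_date: "2026-12-31" },
  { feature_flag_name: "smart-checkout", owner: "architecture crew", status: "Dev", expiration_date: "2026-12-31" },
  { feature_flag_name: "ai-agents-eval", owner: "agent evaluation crew", status: "QA", expiration_date: "2026-10-12" },
  { feature_flag_name: "experiment-recommendation-v2", owner: "agent evaluation crew", status: "GA", expiration_date: "2026-12-31" },
];

// Assumed policy: a flag is a cleanup candidate when it is GA or past its
// expiration date. Adjust to the team's own lifecycle rules.
function findCleanupCandidates(flags, today) {
  return flags
    .filter((f) => f.status === "GA" || new Date(f.expiration_date) < today)
    .map((f) => ({ flag: f.feature_flag_name, owner: f.owner }));
}

// Example reference date, chosen so that one flag has already expired.
const candidates = findCleanupCandidates(registry, new Date("2026-11-01"));
console.log(candidates);
```

Run on a schedule or in CI, a check like this turns the owner field into an actionable reminder instead of documentation that nobody reads.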

By Poornakumar Rasiraju
DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

High-performing engineering organizations don’t scale through heroics. They scale through repeatable platform capabilities backed by evidence. This checklist reflects the shift from tool-centric DevOps to product-oriented platform engineering, focused on scale, reliability, and developer outcomes. It is intended for platform teams, cloud architects, and engineering leaders building internal developer platforms (IDPs) that deliver consistency, velocity, and control.

Architecture and Platform Foundations

Establishing standardized, versioned platform foundations makes workloads deployable, observable, and scalable by default while preventing drift and reducing risk.

  • Core platform primitives are standardized: identity, networking, compute, storage, and secrets
  • Standard blueprints exist and are version-controlled for common workloads with clear evolution paths
  • Infrastructure is provisioned via reusable IaC modules with policy validation
  • Environments and clusters follow consistent topology and access models
  • Networking and service communication follow secure, consistent patterns
  • Secrets and configurations are centrally managed and injected securely
  • Architectures define scalability mechanisms and fault boundaries
  • Resilience is built in through redundancy and failover
  • Shared services are centrally managed with defined ownership and SLAs
  • Platform capabilities are versioned for backward compatibility

Platform Ownership and Operating Model

A product-oriented operating model enables scale without slowing teams. Define clear ownership, interfaces, and governance so the platform evolves without becoming a delivery bottleneck.
  • A dedicated platform team owns roadmap, usability, reliability, and adoption
  • Ownership boundaries are defined (platform standardizes; app teams own service logic)
  • Platform capabilities are easy to discover and use (e.g., templates, workflows, golden paths)
  • A structured intake and support model exists (e.g., requests, issues, exceptions)
  • Standards are enforced with governed exceptions
  • Platform success is measured through adoption and delivery outcomes
  • Usage data and feedback drive continuous improvement
  • Capabilities are versioned and evolved predictably

Environments and Golden Paths

Translate platform architecture into opinionated, self-service workflows driven by organizational standards that reduce complexity and enforce best practices by default. Golden paths are effective only when they are widely adopted.

  • Environment conventions are standardized across naming, configuration, and access
  • Environment state is enforced through IaC/GitOps to prevent drift
  • Golden paths provide curated, reusable templates for common workloads
  • Security, observability, and policy defaults are built into golden paths
  • Golden paths balance strong defaults with controlled flexibility
  • Self-service workflows enable scaffolding, provisioning, and deployment
  • Environment lifecycle is automated across provisioning, promotion, and teardown
  • Documentation and onboarding are well integrated into workflows
  • Adoption is measured through usage and coverage
  • Feedback and production learnings drive continuous evolution

Pipelines and Release Reliability

Standardize delivery pipelines so every change is validated, traceable, and safely releasable, making delivery more predictable and recoverable, not just faster.
  • Pipelines follow a standardized flow: build, test, package, deploy, and promote
  • Quality, security, and policy checks are embedded
  • Artifact promotion across environments is controlled and consistent
  • Each release produces traceable, auditable evidence
  • Rollback and recovery paths are implemented and tested
  • Failures provide fast, actionable diagnostics
  • Reliability metrics are tracked (e.g., success rate, change failure, rollbacks)
  • Release ownership and escalation paths are clearly defined

Toolchain and Self-Service Automation

Provide consistent self-service automation through curated tools and embedded guardrails that reduce fragmentation, risk, and operational complexity.

  • A unified developer point of entry exists through an IDP or developer portal
  • Standard workflows exist for deployment, environment setup, and access
  • Reusable modules and templates prevent copy-paste sprawl and reduce cognitive load
  • Provisioning and deployments are automated with guardrails
  • RBAC and approvals are embedded into automation
  • High-risk actions require audited approvals
  • Workflow reliability, usage, and failures are measured
  • Automation evolves continuously based on usage and feedback

Observability and Operability

Embed observability and operational guardrails into self-service automation so systems are consistent, measurable, diagnosable, and operable by default.
  • Logs, metrics, and traces are included by default through templates and golden paths
  • Minimum observability standards are enforced for promotion
  • Dashboards and alerts are preconfigured and actionable
  • Telemetry supports debugging, capacity planning, and optimization
  • Service health targets (e.g., SLOs) guide operations
  • Operational ownership is defined across on-call, escalation, and boundaries
  • Runbooks guide incident response and recovery
  • Incident learnings feed platform and template improvements

Reliability, Resilience, and Recovery

Design for failure up front so systems fail safely, degrade gracefully, and recover predictably, proving resilience through recovery, not uptime alone.

  • Architectures isolate failures to limit blast radius
  • Dependencies are evaluated for availability and fallback strategies
  • Resilience patterns are built in by default (e.g., retries, timeouts, circuit breakers, degradation)
  • Non-critical features degrade without impacting core functionality
  • Recovery objectives are defined and validated
  • Backup and recovery mechanisms are implemented and tested
  • Recovery is automated to minimize manual intervention
  • Game days, chaos experiments, or failure drills are conducted to validate system behavior under stress
  • Reliability metrics are tracked and optimized (e.g., recovery time, failure rate)

Security Guardrails and Governance

Enforce security and compliance through codified guardrails embedded in delivery workflows, with continuous monitoring to improve security posture over time.
• Access follows least-privilege principles
• Secrets are centrally managed and securely injected
• Policies are codified and enforced consistently through Policy as Code
• Security controls are embedded in pipelines, including scanning and config checks
• High-risk actions require controlled approvals
• Exceptions are time-bound, tracked, and reviewed
• All changes are auditable and traceable
• Compliance requirements map to enforceable controls

Developer Experience, Adoption, and ROI

Improve DevEx by reducing friction, driving platform adoption, and linking usage to measurable delivery outcomes and business impact.

• Developer experience is consistent across services and environments
• Platform abstracts common concerns (e.g., infra, security, observability) through standardized defaults
• Onboarding to first deploy is fast and frictionless
• Documentation, examples, and enablement drive consistent adoption
• Platform and golden path adoption are measured through usage, onboarding, and coverage
• Key DevEx metrics are tracked (e.g., lead time, change failure rate, MTTR, time to first deploy)
• Workflow usability and reliability are continuously optimized
• Feedback and usage data drive platform improvements
• ROI is measured through delivery outcomes (e.g., reduced toil, incidents, faster releases)

Platform Engineering Maturity and Assessment

Platform engineering maturity can be assessed across three practical stages that reflect the consistent application, adoption, and improvement of platform capabilities:

• Foundation focuses on baseline standardization, safety, and operability, with reusable capabilities in place but adoption still uneven.
• Scale enables reliable self-service through guardrailed golden paths, improving delivery without increasing operational overhead.
• Optimize treats platform engineering as a strategic differentiator, using data-driven decisions to continuously improve resilience, developer experience, cost efficiency, and measurable ROI.
Use the Maturity Scoring Matrix to assess maturity across core platform engineering capabilities. Rate each category once, on a scale of 1–5, based on available evidence rather than aspiration. Overall maturity is determined by the dominant scoring pattern across the matrix, with higher maturity requiring consistent strength across Foundation, Scale, and Optimize. The progression bar maps scores from Ad Hoc to Strategic and groups them across the Foundation, Scale, and Optimize stages. Repeat the assessment periodically to identify gaps, track progress, and guide platform roadmap priorities.

Conclusion

Treat this checklist as a baseline gate and a recurring review mechanism, not a one-time exercise. High-performing platforms evolve through continuous refinement of architecture, automation, governance, and developer experience. Use it to identify gaps, strengthen golden paths, and align platform capabilities with measurable delivery outcomes.

This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.
Read the Free Report

By Josephine Eskaline Joyce, DZone Core
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance, and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering, introduced with Delta Lake 3.0, goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management.

Recap: Partitioning and Z-Order Limitations

Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables:

• Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult, and the scheme is usually fixed once chosen. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts, and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.
• Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or on many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.
• Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain: it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly, and they must be re-run as new data arrives to maintain clustering.
• Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning.
Data engineers must monitor query patterns and manually decide how to partition or when to re-run Z-Ordering. This ongoing maintenance is time-consuming and error-prone.

In summary, traditional partitioning and Z-Ordering yield performance benefits, but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution.

What Is Liquid Clustering?

Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name “liquid” signifies flexibility: data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include:

• Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.
• Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, key order, or file size tuning; the platform handles optimal file sizing and clustering internally. Even high-cardinality columns, which would be impractical as partition keys, can be used effectively.
• Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys.
There’s no massive upfront cost of re-partitioning the entire dataset; past data doesn’t need an immediate rewrite.
• Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.
• Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks.

Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering with “set it and forget it” clustering. In fact, thousands of customers have already adopted it: as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables.

Liquid Clustering vs Traditional Methods

Liquid Clustering addresses the limitations of partitions and Z-Ordering in several ways:

• No Rigid Partition Boundaries: Unlike Hive partitions, Liquid Clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.
• Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data.
Maintenance operations like OPTIMIZE still play a role, but they can operate more efficiently since the incoming data is already clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE with ZORDER: it only rearranges data that isn’t well clustered yet rather than always rewriting everything.
• Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.
• Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-Order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.
• Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less I/O and faster execution. In one benchmark, Databricks observed that clustering a 1 TB warehouse dataset with Liquid Clustering was 2.5× faster than clustering it with Z-Ordering, and that it yielded significantly better query performance than either partitioning or Z-Order.
In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly.

How Liquid Clustering Works (Under the Hood)

At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms:

• Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.
• Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files; it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-Order jobs.
• Data Skipping with Statistics: Delta Lake maintains per-file statistics (such as minimum and maximum column values) that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters.

Enabling Automatic Clustering

To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations).
Many new Databricks accounts have had this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward:

• SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering:

SQL
-- Creating a Unity Catalog managed table with Automatic Liquid Clustering
CREATE TABLE main.analytics.user_events (
  user_id STRING,
  event_type STRING,
  event_date DATE,
  details STRING
)
CLUSTER BY AUTO; -- enables automatic liquid clustering on this table

To enable it on an existing table:

SQL
ALTER TABLE main.analytics.user_events CLUSTER BY AUTO;

This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.)

• PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark:

Python
# df is a DataFrame we want to save as a Delta table with auto clustering
df.write.format("delta") \
    .option("clusterByAuto", "true") \
    .mode("overwrite") \
    .saveAsTable("main.analytics.user_events_auto")

The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required; the system will choose keys on its own if you leave them unspecified.)

Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on serverless compute in the background.
The benefit is that you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times.

Using Manual Liquid Clustering (Custom Clustering Keys)

In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it:

• Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example:

SQL
-- Create a Delta table clustered by specific columns (manual clustering)
CREATE OR REPLACE TABLE main.analytics.sales_data (
  sale_id BIGINT,
  region STRING,
  product STRING,
  sale_date DATE,
  amount DECIMAL(10,2)
)
CLUSTER BY (region, sale_date);

In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date.

• Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance:

SQL
ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date);

This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize.

• Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example:

SQL
OPTIMIZE main.analytics.sales_data;

The above will compact small files and cluster data incrementally.
If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data FULL. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time.

• PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example:

Python
# Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys
df.write.format("delta") \
    .mode("append") \
    .clusterBy("region", "sale_date") \
    .saveAsTable("main.analytics.sales_data")

Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys.

Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version, which older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers or writers that might be using older Delta Lake libraries.

Conclusion

Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidity of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For data engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further: clustering becomes a self-driving process, leveraging query insights to continuously improve performance.
In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start; it can simplify your architecture and help your tables stay optimized as your data (and its use cases) grow.
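The data-skipping benefit described in the article can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not Delta Lake or Databricks code: the `DataFile` class, the `write_files` and `files_to_scan` helpers, the file size of 500 rows, and the filter range are all hypothetical names and values chosen for illustration. A plain sort stands in for whatever layout the engine actually produces, which the article does not specify. The point it tries to show is only that per-file min/max statistics prune more files when each file holds a narrow range of the filter column.

```python
import random
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file holding values of a single column."""
    rows: list

    @property
    def min_value(self):
        # Per-file minimum, playing the role of a stored column statistic.
        return min(self.rows)

    @property
    def max_value(self):
        # Per-file maximum, playing the role of a stored column statistic.
        return max(self.rows)


def write_files(values, rows_per_file):
    """Split values into fixed-size files, preserving the order they arrive in."""
    return [
        DataFile(values[i:i + rows_per_file])
        for i in range(0, len(values), rows_per_file)
    ]


def files_to_scan(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter lo <= x <= hi."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


random.seed(7)
values = [random.randint(1, 1000) for _ in range(10_000)]

# Arrival order: every file spans almost the whole value range.
unclustered = write_files(values, rows_per_file=500)
# Ordered by the filter column: every file spans a narrow value range.
clustered = write_files(sorted(values), rows_per_file=500)

# Equivalent of a filter such as: WHERE x BETWEEN 100 AND 120
scanned_unclustered = files_to_scan(unclustered, 100, 120)
scanned_clustered = files_to_scan(clustered, 100, 120)

print(f"total files:            {len(unclustered)}")
print(f"scanned (unclustered):  {len(scanned_unclustered)}")
print(f"scanned (clustered):    {len(scanned_clustered)}")
```

In this toy setup the ordered layout should narrow the scan to one or two files, while the arrival-order layout will typically have to read nearly all of them, even though both return the same matching rows. Real tables cluster on several columns at once and keep statistics in the transaction log, so treat this only as intuition for why aligning file ranges with query filters reduces I/O.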

By Seshendranath Balla Venkata
Architecting an Embedded Efficiency Layer: A Platform Deep Dive into Day-Two Operational Tuning

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

I am developing a reference guide for platform teams that want continuous optimization embedded directly into their internal developer platforms. In this proposed model, “done” means automated, full-stack tuning recommendations that fit safely and seamlessly into existing engineering workflows.

Building golden paths for pre-deployment tasks is relatively straightforward because engineering teams share the primary goal of shipping applications faster. However, after deployment, sustained efficiency frequently becomes a neglected task that is “someone else’s job.” Developers prioritize shipping, SREs protect safety buffers, and FinOps pushes for cost reduction. The reference model proposes a dedicated efficiency layer as a required platform capability designed to reconcile those priorities without requiring a replatform.

In this one-layer deep dive, we focus only on the embedded efficiency layer: its interfaces, interaction model, and what it requires to be credible.

Project Constraints

I anchor my design on the assumption that engineering teams are already managing their production deployments through established IaC and GitOps practices. Unlike pre-deployment pipelines that often enforce strict corporate standards, a post-deployment efficiency optimizer cannot be rigidly opinionated. Every microservice possesses unique architectural characteristics and operational requirements that demand a highly configurable approach to system optimization. I recommend allowing teams to define explicit parameters based on the workload context, dictating whether a particular service requires a specific operational profile.
Profile | Intent | Tradeoff
Cost-first | Aggressive cloud cost reduction | Less headroom, higher reliability risk
Performance-first | Maximum throughput | Potentially higher cost, tighter buffers
Reliability-first | Expanded reliability buffer for unpredictable traffic spikes | Higher baseline spend

Architecting the Day-Two Golden Path

Effective efficiency optimization requires an architectural deep dive beyond superficial cloud scaling metrics. The framework I recommend orchestrates continuous tuning across the entire technology stack, cascading from the underlying infrastructure nodes down through Kubernetes configurations and directly into the application runtime. Adjusting CPU requests and memory limits at the container level is insufficient if the underlying Java Virtual Machine or application runtime parameters remain poorly calibrated for those newly allocated resources. Consequently, the guide treats the underlying correlation engine as a mandatory architectural component for producing holistic configuration recommendations.

FLOW: infrastructure metrics + Kubernetes signals + app monitoring → correlation engine → recommendations (infra/k8s/runtime)

Figure 1: Full-Stack Optimization Layers

The Interaction Model

The foundational principle governing this architectural layer is an explicit human-in-the-loop (HITL) model. Fully autonomous, black-box changes erode trust when operators can’t see the reasoning behind configuration updates. Instead, the multi-dimensional tuning recommendations surface inside the developer’s GitOps workflow, with a clear explanation of how a change affects latency, reliability, and cost. HITL ensures engineers retain final approval over critical production changes, but it introduces review latency and requires significantly more comprehensive explainability documentation for every recommendation.

Scenario Walkthrough

A critical microservice begins experiencing rising cloud costs alongside escalating p95 latency.
The embedded optimization engine detects the drift, correlates the cross-stack metrics, and proposes two runtime adjustments via an automated GitOps pull request. The application owner reviews the generated explainability visuals, verifies that the tuning resolves the latency issue without violating any existing rule, and manually merges the request. The platform seamlessly applies the validated configuration and continuously tracks the resulting operational benefits.

Figure 2: The Interaction Model

That workflow holds only if the following choices are in place:

Capability | Tradeoff | What makes it workable
Tuning profiles | Requires explicit rules definition | Profile selection per service or category
Full-stack tuning | More complexity than infra-only | Correlation across infra + app metrics
GitOps surfacing | Adds workflow touchpoints | PR-based delivery in existing process
Human in the loop | Review PRs and recommendation docs | Explainability visuals + approval step

Takeaways

Based on the framework in this reference guide, here is what I would tell someone building an embedded efficiency layer next, based on their involvement:

• Designing the interaction model: Prioritize operator trust and mathematical transparency over fully autonomous, unexplainable actions.
• Defining the technical scope: Ensure your engine tunes the entire stack, from the underlying infrastructure down to the application runtime, rather than settling for superficial cloud resource constraints.
• Navigating the sociotechnical divide: Treat the optimization layer as a collaborative platform capability that grounds the competing priorities of developers, reliability engineers, and FinOps, not a financial audit mechanism.

This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

By Graziano Casto, DZone Core
Product-Led Software Delivery: Intelligent Platforms for DevOps at Scale

Editor’s Note: The following is an article written for and published in DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

Recent advances in tooling and automation have moved DevOps beyond a collection of siloed frameworks and tools toward a more unified delivery model. But the sprawl of disconnected tools and the cognitive load of constant context switching have also created analysis paralysis, slowing delivery and shifting attention away from technical progress toward coordination challenges. In response, platform engineering has become the delivery backbone for organizations. In 2026, scaling delivery and adopting AI successfully will require platforms to operate through a product-led model.

This article explores how practitioners and leaders can adopt product-led approaches, using real examples and practical best practices to measure the impact of DevOps at scale, where reliability and compliance are both critical. It examines tradeoffs such as speed vs. standardization and autonomy vs. integration.

What Breaks as DevOps Scales

As DevOps scales across multiple teams and systems, challenges emerge across infrastructure, security, compliance, and observability. These challenges are not only technical or skills-based. A technical solution may work at a smaller scale but will likely fail at a larger one. In a regulated organization, responsibilities such as auditing, logging, data processing, and managing suppliers and contractors are often handled by different teams. This can lead to slower response times and increased errors in deployment and testing.

At the same time, the growing number of tools, environments, and versions increases cognitive load and creates tool sprawl, both of which slow delivery. Context switching between disconnected systems adds further friction, reducing velocity and making it harder for teams to work effectively.
Over time, these pressures affect delivery outcomes, contribute to burnout, and limit critical thinking.

Platform Engineering as the Scaling Mechanism

A common misconception is that teams and systems can be optimized individually. While this may be true in smaller organizations, it is not practical at scale. In this context, the platform-led model provides an umbrella under which systems and teams can be optimized as one unified unit, supported by self-service capabilities. If the platform is treated as a product, it comprises all the necessary components, including users, processes, and measurable outcomes. The goal is to simplify and standardize processes so nothing breaks down as DevOps scales. In practice, this creates a shared operating model in which DevOps, SRE, platform engineering, and security teams align around common defaults, guardrails, and delivery expectations.

Figure 1

This can be implemented in practice through golden paths. For example, when a new service is requested, a workflow template can be created to add a new repository with all the required steps, including CI/CD pipelines, environment configuration, security, and alert checks. This path can then be replicated and integrated with other services with minimal deployment effort. At the same time, compliance, resilience, and regulatory steps are implemented automatically. Instead of relying on tickets or legacy knowledge, teams can use these paved paths as self-service workflows with built-in defaults and guardrails.

Golden paths reduce error and failure rates because each stage is predefined for release, deployment, and rollback. These pipelines require consistency across tools, environments, and release frameworks. Without it, incidents, cases, and handovers become more difficult to manage. At scale, standardization and integration make these workflows repeatable, reliable, and easier to adopt across teams. The following table compares the two approaches.

Old vs. Platform-led DevOps

Old DevOps model | Why it breaks at scale | Platform-led DevOps
Individual teams and pipelines | Inconsistency and drift | Replicated golden paths
Documentation per team/system | Outdated knowledge | Centralized documentation
High autonomy | Missing interoperability | High consistency
Low standardization | Expensive to maintain | High standardization
Challenging integration | Increased error rates | High integration

Developer Experience Becomes a First-Class Delivery Metric

Developer experience (DevEx) helps identify friction across tools, teams, and workflows, while also providing a way to measure quantitative and qualitative productivity. This is critical for any platform at scale, where slow onboarding, manual approvals, and persistent development constraints can delay delivery. DevEx measures such as time to first deploy, failure rate, lead time, and MTTR can help uncover bottlenecks in DevOps. Improving them leads to better developer satisfaction, smoother scaling, and clearer platform priorities.

Success criteria become even more important at scale, where multiple teams work closely together to produce similar services with similar pipelines under the same or similar compliance conditions. In those environments, friction is reduced, and practitioners benefit directly from a stronger developer experience.

Automation and AI: Leverage With Guardrails

Automation supports standardization and integration by handling repetitive tasks and default configurations. With the adoption of AI, its value is seen most clearly in assisting rather than replacing decision-making. Combined with automation, AI shortens feedback loops and makes processes easier to audit and monitor, reducing failure rates and improving the developer experience. In practice, platform teams can use AI to intelligently automate triage, reduce alert noise, provide context-aware suggestions, and support guided remediation.
However, applying automation and AI requires guardrails so systems and tools operate within clear boundaries, avoid incorrect outputs, and allow immediate rollback where necessary. There is a significant tradeoff between risk and speed, and finding the right balance is one of the first concerns organizations must address when integrating AI.

Measuring Platform Value

Platform value should be demonstrated through outcomes, with recommendations supporting teams rather than replacing them. Increased platform adoption can act as a leading indicator that teams are choosing to follow golden paths and standardization and integration practices. A low adoption rate, by contrast, may signal growing friction and silos across teams and tools.

When done well, the platform’s value becomes apparent in the ability to deliver releases without unnecessary overhead or disruption. The focus should always be on measuring outcomes that reflect integrated and repeatable pipelines, strengthening service continuity, and raising the standard for auditing and compliance. Outcome-based measures validate adoption: reduced operational toil, fewer incidents, faster recovery, and more reliable delivery. These outcomes translate directly into service continuity and audit confidence. Counting tools or templates, however, says little about impact.

Two Failure Modes to Avoid

Not all failures are obvious. If teams continue to use old methods and approaches despite the introduction of golden paths, DevEx, automation, and AI, the result can be platform theater, where outcomes do not improve and no value is added. Here, the illusion of productivity is often caused by cultural resistance: Teams adopt new tools but continue using old methods, leading to minimal or no improvement. For example, a team may adopt an internal platform but still rely on tickets, manual approvals, and older team-specific processes to move work forward.
Another less visible failure is platform paralysis, where teams are pushed to build pipelines in parallel, leading to slower delivery and more controlled decision-making rather than flexibility, enablement, and repeatability. Here, the loss of velocity is often caused by over-engineering or too many competing solutions, with complex parallel approaches slowing delivery rather than accelerating it. For instance, multiple teams may create overlapping workflows and tooling for the same problem, increasing complexity instead of reducing it.

Avoiding these two failure modes requires a clear shift from treating the platform as a project with milestones to treating it as a unified product-led model, with DevEx, automation, and AI focused on improving how work is actually done.

What Product-led Delivery Looks Like in 2026

In 2026, delivery is increasingly shaped by standardization, integration, automation, and AI adoption. The goal is to help teams move faster without increasing complexity or raising the risk of bottlenecks and pipeline failure. In platform-led models, golden paths become the norm, allowing teams to follow repeatable processes with a greater degree of confidence in the outcome.

Many of the same tools and methods that were introduced to increase speed have also added cognitive strain, fragmentation, and delivery friction. The next step is to reduce that complexity through a platform-led model, where golden paths improve speed and reliability while lowering cognitive load. For organizations looking at the next quarter, two practical priorities are to establish a small number of reusable golden paths and to baseline a focused set of DevEx measures so bottlenecks can be identified and removed earlier.

This is an excerpt from DZone’s 2026 Trend Report, Platform Engineering and DevOps: How Internal Platforms, Developer Experience, and Modern DevOps Practices Accelerate Software Delivery.

By Fawaz Ghali, PhD, DZone Core
A Deep Dive into Tracing Agentic Workflows (Part 1)

Asking Claude, ChatGPT or any other advanced LLM “What is AI?” produces a well structured response seemingly in a matter of seconds. But between the user keystrokes, and the first token appearing, a tightly coordinated system is in play to generate this output. Your request first hits an ingestion layer. It verifies your session, checks rate limits, and runs the query through a trust filter. Your location quietly determines which compliance policies apply. The request is then stamped with a trace ID — an immutable identifier that follows it through every step of execution (this becomes important later). From there, an orchestrator takes over. It doesn’t just read your message — it interprets intent. Are you looking for a conceptual explanation, a research-style answer, or something more procedural? Based on that, it selects both the model and the strategy for generating a response. The full prompt is then constructed with the help of a context assembler. It pulls in prior conversation history, layers in user preferences from memory, and shapes everything into something the model can reason over. Only then is the LLM invoked. Before a single token is streamed back, the response is checked again for policy and compliance issues. Meanwhile, every step along the way is being recorded — spans nested within spans, each carrying timing data, cost attribution, and links to its parent in the execution chain. All of this happens under the hood in seconds. A More Complex Scenario Now let’s change the question to: “How do I transition from software engineering to product management?” This is no longer a single, well-formed LLM call. The system begins to branch. It might fetch course recommendations, look up profiles of people who’ve made similar transitions, scan community discussions, and query external knowledge through a retrieval pipeline. Multiple agents operate at once, reading from and writing to a shared context object. 
A UI-facing layer, informed by user preferences, decides how the response should be structured and presented. What comes back is no longer the output of a single model call, but a response synthesized from several agents, tools, and reasoning decisions made along the way.

That’s an agentic system in motion. And without proper tracing, it’s operating without visibility.

What to Trace and Why?

Before getting into mechanics, it’s worth being precise about what tracing actually gives you. Simply saying “logs are useful” doesn’t justify the investment. A more accurate framing: without tracing, improvement is just guesswork, possibly misaligned with the actual state of the system.

Span Timings, for Latency

When a response is slow in an agentic system, the cause is rarely obvious. It could be a delayed model call, an upstream API under load, an agent stuck in a reasoning loop, or work that was executed sequentially when it could have been parallelized. Tracing separates these scenarios by exposing the critical path (the sequence of spans whose combined latency actually determined the response time) and makes it clear where time is really being spent. These insights help identify the latency hotspots to target when improving system latency.

Token Counts per Span, for Usage and Cost

In an agentic workflow, cost is not tied to a single computation. One user query can cascade into multiple model calls, each with different context sizes and complexity. Some are essential, some are nice to have, and a few may simply be mismatched to the task. With proper tracing, token usage becomes attributable. You can see which agent triggered which call, how much context was included, and whether that cost was justified. Over time, patterns emerge: query types that are consistently expensive, agents that tend to over-reason or cut corners, or unnecessary use of a larger model where a cheaper one would suffice.
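To make the two ideas above concrete, here is a minimal sketch of how a critical path and per-agent token totals could be derived from recorded spans. The `Span` fields, the agent names, and the sample timings are all assumptions made up for illustration; they are not the schema of any particular tracing library, and the critical-path walk is a simplified heuristic rather than a definitive algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One traced step; every field name here is an illustrative assumption."""
    span_id: str
    agent: str
    start_ms: int
    end_ms: int
    tokens: int = 0
    parent_id: Optional[str] = None


def index_children(spans):
    """Group spans by parent so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span)
    return children


def critical_path(span, children):
    """Return the chain of spans that the parent was actually waiting on.

    Children are visited from latest-finishing to earliest. A child is kept
    only if it ended before the previously kept child started, so children
    that overlap an already kept sibling are skipped.
    """
    latest_first = sorted(
        children.get(span.span_id, []), key=lambda s: s.end_ms, reverse=True
    )
    cursor = span.end_ms
    segments = []
    for child in latest_first:
        if child.end_ms <= cursor:
            segments.append(critical_path(child, children))
            cursor = child.start_ms
    path = [span]
    for segment in reversed(segments):  # restore chronological order
        path.extend(segment)
    return path


def tokens_by_agent(spans):
    """Attribute token usage to the agent that owns each span."""
    totals = defaultdict(int)
    for span in spans:
        totals[span.agent] += span.tokens
    return dict(totals)


# Hypothetical trace for the career-transition query.
SPANS = [
    Span("request", "orchestrator", 0, 1000),
    Span("retrieve", "retriever", 50, 400, parent_id="request"),
    Span("courses", "course_agent", 50, 300, tokens=800, parent_id="request"),
    Span("profiles", "profile_agent", 50, 700, parent_id="request"),
    Span("profiles.llm", "profile_agent", 100, 680, tokens=1200, parent_id="profiles"),
    Span("synthesize", "synthesizer", 700, 980, tokens=1500, parent_id="request"),
]

children = index_children(SPANS)
root = next(s for s in SPANS if s.parent_id is None)
path = critical_path(root, children)
print(" -> ".join(s.span_id for s in path))
# request -> profiles -> profiles.llm -> synthesize
print(tokens_by_agent(SPANS))
```

In this made-up trace, the retrieval and course lookups ran in parallel with the slower profile lookup, so they do not appear on the critical path. Speeding them up would not change the response time, while the profile agent's model call would be the hotspot to look at first.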
Execution Replay, for Pipeline Debugging

Failures in agentic systems tend to surface as outputs that are subtly wrong, incomplete, or misaligned with intent, not as crashes. Without a trace, there is no reliable way to understand how that output came to be. With one, you can reconstruct the entire execution: which agents were invoked, what they returned, what context was assembled, and what the model produced before any filtering or formatting. What would otherwise be guesswork becomes a step-by-step replay, and that replay is also your audit trail when a user or regulator challenges a response.

Model Config and Invocation, for Quality Debugging

When a system produces incorrect or fabricated output, the reason may have nothing to do with the model's capability. Small parameter choices have outsized effects: a model temperature set too high for a task that requires precision, a key piece of context missing, or a poorly structured prompt. Tracing the full invocation (model version, parameters, prompt composition, and token usage) makes it possible to connect these inputs to the outputs they produce, and to adjust them with intent rather than trial and error.

Agent Transition Counters, for Detecting Loops and Inefficient Invocations

Agentic systems introduce failure modes that don’t exist in traditional pipelines. Agents can enter retry loops or bounce between each other without making progress. Each step may appear valid in isolation, but the system as a whole stalls. Tracing makes these patterns visible as repeated transitions, enabling detection and control through limits, backoff, or circuit breaking before they become production issues that silently burn through tokens and GPU cycles.

State Mutations, for Shared State Debugging

The hardest bugs in agentic systems are inconsistencies in shared state.
When agents share data, critical context can be overwritten, wiped out before it is read, or read from a stale state by a task that requires precision. None of these scenarios necessarily produces an explicit error. They produce outputs that appear coherent but are slightly off, in ways subtle enough to escape notice. Without visibility into how the shared state evolved (what changed, when, and which component made the change), these issues are extremely difficult to diagnose. Tracing state mutations provides that missing layer.

Compliance, for Trust and Security

Sensitive data flows through tool outputs, gets assembled into prompts, and surfaces in generated responses. Many things can go wrong there:
  • PII exposed where it shouldn't be
  • A security check skipped, leading to unauthorized access
  • A compliance rule evaluated too late, violating legal terms

Tracing validates that the required safeguards actually ran: which policy checks were applied, which ruleset was in effect, and how data was handled at each stage. This level of visibility is essential for auditing system behavior and preventing compliance issues in production.

Conclusion

Without extensive tracing, an agentic system is effectively a black box making decisions on your behalf. You see the input and the output, but everything in between is opaque. That makes it difficult to debug, hard to optimize, and nearly impossible to audit with confidence.

Tracing changes that. It turns the system into something you can inspect, reason about, and improve with intent.

In Part 2, we’ll move from motivation to implementation: how to structure a trace context that propagates across agent boundaries, what to capture at each step (from orchestration to state mutations to model calls), and how to instrument the kinds of failures that don’t announce themselves, including silent loops, partial updates, and implicit checks like policy enforcement and PII handling.
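As a small preview of the agent transition counters and state-mutation tracing discussed above, the following sketch shows one way a shared context could record every write and stop a ping-pong between agents. The `TracedContext` class, its method names, and the `max_repeats` limit are hypothetical; this is an illustration under those assumptions, not the implementation Part 2 will describe.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any


class LoopDetected(RuntimeError):
    """Raised when the same agent-to-agent handoff repeats too often."""


@dataclass
class TracedContext:
    """Shared state that logs every write and counts every handoff (hypothetical)."""
    trace_id: str
    max_repeats: int = 3
    state: dict = field(default_factory=dict)
    mutations: list = field(default_factory=list)
    transitions: Counter = field(default_factory=Counter)

    def handoff(self, source: str, target: str) -> None:
        """Count a transition and trip a simple circuit breaker on repeats."""
        self.transitions[(source, target)] += 1
        if self.transitions[(source, target)] > self.max_repeats:
            raise LoopDetected(
                f"{self.trace_id}: {source} -> {target} repeated "
                f"{self.transitions[(source, target)]} times"
            )

    def write(self, agent: str, key: str, value: Any) -> None:
        """Record who changed what, and what the value was before."""
        self.mutations.append(
            {
                "step": len(self.mutations),
                "agent": agent,
                "key": key,
                "old": self.state.get(key),
                "new": value,
            }
        )
        self.state[key] = value

    def history(self, key: str) -> list:
        """Return every recorded mutation of one key, in order."""
        return [m for m in self.mutations if m["key"] == key]


ctx = TracedContext(trace_id="trace-123")
ctx.write("planner", "target_role", "product manager")
ctx.write("profile_agent", "target_role", "program manager")  # silent overwrite

for mutation in ctx.history("target_role"):
    print(mutation["step"], mutation["agent"], mutation["old"], "->", mutation["new"])

try:
    for _ in range(5):
        ctx.handoff("planner", "researcher")
        ctx.handoff("researcher", "planner")
except LoopDetected as exc:
    print("stopped:", exc)
```

The overwrite of `target_role` raises no error on its own, which is the point: only the mutation log shows that a second agent replaced the value and what it replaced. The handoff counter plays the same role for loops, turning a stall into an explicit, inspectable event.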

By VIVEK KATARYA
11 Agentic Testing Tools to Know in 2026

Agentic testing tools help teams plan, generate, adapt, and run tests with far less manual effort. They’re quickly becoming part of how modern QA scales without slowing delivery. One thing to get right from the start is scope. Not all agentic testing tools operate at the same level of scope or strategic impact. They vary significantly in what they do and where they fit. Some are point solutions that help you author or run tests faster. Others sit inside broader AI-driven quality platforms that prioritize risk, optimize test portfolios, and enforce quality gates across the pipeline. This post covers 11 agentic testing tools to know about in 2026. They’re grouped so you can compare them based on scope, strengths, and fit for your organization. What Is an Agentic Testing Tool? An agentic testing tool is software that uses AI agents to autonomously plan, generate, maintain, and execute tests. It often makes decisions based on context, such as requirements, code changes, risk signals, or past results. It goes beyond AI-assisted automation by adding initiative and workflow-level decision-making. Instead of only suggesting what to do next, it takes action within defined boundaries. Here are 11 agentic testing tools grouped by scope. Each includes a summary and key strengths and considerations. Let’s go! Enterprise AI-Driven Quality Platforms These platforms extend beyond test creation to orchestrate automation, intelligence, and governance at scale. They are suited for organizations that require stability, risk prioritization, and release confidence across complex environments. 1. Tricentis Tosca Tricentis Tosca is designed for enterprise test automation where stability, scale, and governance matter. In an agentic context, the shift is moving from “write and maintain scripts” to “orchestrate outcomes,” especially across complex apps and high-change environments. Tricentis enables AI-driven testing and agentic quality engineering across your delivery pipeline. 
It also positions MCP as a way to bridge AI and testing tools through a universal integration approach, which matters if you’re thinking about agentic workflows that span multiple systems.

Strengths
  • Suitable for large regression suites and complex end-to-end workflows.
  • AI-assisted resilience helps reduce long-term maintenance costs.

Considerations
  • The highest value shows up when teams commit to governance and standardization (not “ad hoc scripts”).
  • Adoption typically requires alignment across QA, engineering, and release stakeholders.

2. SmartBear

SmartBear is best viewed as a broad testing portfolio vendor that has been positioning around AI across testing workflows.

Strengths
  • Covers multiple testing disciplines.
  • Suitable for consolidated vendor strategies.

Considerations
  • AI depth varies across products.
  • Portfolio integration matters.

3. UiPath Test Suite

UiPath Test Suite extends testing into broader automation ecosystems. In an agentic context, it is relevant for teams that want testing integrated into AI-driven business process automation and orchestration environments.

Strengths
  • Aligns testing with broader automation initiatives.
  • Fits organizations standardizing around enterprise automation platforms.

Considerations
  • Strongest value when already invested in the UiPath ecosystem.
  • Organizations must evaluate how deeply autonomous testing workflows integrate with CI/CD.

AI-Native Testing Platforms

AI-native testing platforms are built with AI at the core of test creation and execution workflows. They aim to reduce friction from requirements to automation and help teams maintain speed and stability as systems evolve.

4. ACCELQ

ACCELQ positions itself around AI-powered automation and end-to-end testing acceleration. For agentic buyers, the key question is whether the platform reduces friction from requirements to automation to execution and whether it can keep pace as systems change.

Strengths
  • Faster ramp-up for automation.
  • Structured automation workflows.
Considerations
  • Like any platform, success depends on fit with your stack and operating model.
  • Ensure governance and explainability are strong enough for enterprise release standards.

5. mabl

mabl is an AI-native testing vendor geared toward continuous testing and reducing maintenance overhead. For agentic tool evaluation, focus on whether AI helps you run reliably at speed, not just generate tests during setup.

Strengths
  • CI/CD integration.
  • Automation resilience focus.

Considerations
  • Primarily web-centric workflows.
  • Enterprise governance depth varies.

6. Functionize

Functionize is commonly positioned as AI-forward test automation focused on reducing manual work across authoring, execution, and maintenance. In a practical agentic sense, tools like this aim to do more of the work for you, especially around test upkeep as systems evolve.

Strengths
  • Lifecycle focus: value isn’t only authoring, but also keeping tests healthy over time.
  • AI-forward orientation fits teams pushing toward higher autonomy.

Considerations
  • Scope depends on team maturity.
  • Organizations may need to evaluate governance needs more deeply.

Point-Solution Agentic Tools

Point-solution agentic tools focus on solving a specific testing bottleneck rather than managing the full quality lifecycle. They are often used to accelerate test authoring, execution, or UI interaction without requiring a broader platform shift.

7. testRigor

testRigor is typically associated with natural-language-driven test creation and reducing scripting complexity. For agentic buyers, it often lands in the “make authoring easier” category.

Strengths
  • Lower barrier to authoring.
  • Rapid initial automation.

Considerations
  • Primarily focused on UI regression.
  • Potential trade-off between depth and creation speed.

8. QA Wolf

QA Wolf is often positioned around fast test creation and managed execution models for teams that want results without building everything in-house.
In an agentic tooling conversation, this fits as a way to compress time-to-value, especially when internal bandwidth is limited.

Strengths
  • Fast time to coverage.
  • Managed execution support.

Considerations
  • The operational model differs from in-house-only tools.
  • Evaluate long-term scaling fit.

9. Virtuoso QA

Virtuoso is frequently grouped with AI-led UI testing approaches that aim to reduce manual scripting and increase resilience. Its relevance depends on whether it meaningfully adapts and maintains tests as the app changes, not just how quickly it creates them.

Strengths
  • Faster UI automation creation.
  • Reduced scripting complexity.

Considerations
  • Validate the reality of flake handling and maintenance in your environment (dynamic UIs expose gaps quickly).
  • Ensure pipeline integration and evidence output meet enterprise needs.

10. AskUI

AskUI approaches automation through UI perception and interaction. That can matter when you test across varied front ends, remote desktops, or environments where DOM-level automation is not always feasible.

Strengths
  • Useful for UI-driven automation challenges.
  • Works across heterogeneous UI surfaces.

Considerations
  • Typically narrower in scope than end-to-end platforms.
  • Validate stability and evidence outputs for long-running regression usage.

11. CoTester by TestGrid

CoTester lands in the agentic assistant space for testing workflows. Tools in this category typically let you offload specific tasks, helping your team by generating tests, suggesting validations, or scaling coverage with less effort.

Strengths
  • Assistant-style support for testing tasks.
  • Accelerates defined QA activities.

Considerations
  • Not a full end-to-end platform.
  • Best as a complementary capability.

How Agentic Technology Applies to Modern Testing

Agentic testing brings the agent loop into quality workflows. It decides what to test, executes the work, evaluates results, and adjusts based on context.
Here’s what that looks like in real delivery pipelines:
  • Planning: Interpreting requirements, code changes, and risk signals to select the right tests.
  • Execution: Running tests and collecting evidence.
  • Adaptation: Repairing brittle selectors and managing flakiness as systems change.
  • Governance: Enforcing quality gates based on measurable signals such as coverage and change impact.

Agentic testing is not AI that writes tests. It is AI that runs a quality workflow.

How to Choose the Right Agentic Testing Tool

Buying decisions usually fail for one of two reasons: teams choose a point tool when they actually need a platform, or they buy a platform when they need quick, targeted relief. Use this checklist to avoid both mistakes.

1. Start With Scope: Assistant, Point Solution, or Platform?

Ask one blunt question: Do you need help authoring tests, or do you need help governing release confidence?

2. Demand Measurable Outcomes, Not Demos

Demos can look impressive, but real value shows up in production metrics. Look for clear improvements in regression time, maintenance effort, flake rate, defect escapes, and coverage visibility. If success cannot be measured, ROI will be hard to prove.

3. Validate Governance: Explainability, Auditability, Control

Agentic systems take action, so your team must understand why. You should be able to explain test selection, recent changes, and the evidence behind a release decision, especially in regulated and enterprise environments.

If you want agentic testing that scales beyond a single team or application, you need more than a test generator. You need an AI-driven approach that connects automation, intelligence, and governance.

FAQ: Agentic Testing Tools in 2026

What Makes a Testing Tool Truly Agentic?

A testing tool is truly agentic if it can independently plan and execute testing actions based on context, such as code changes, requirements, or risk signals. It does not just suggest next steps.
It selects tests after a pull request, generates tests from requirements, repairs broken locators, and enforces quality gates with minimal human input.

Are Agentic Testing Tools the Same as AI Test Automation?

No. AI test automation typically assists with parts of automation, such as smarter locators or faster script creation. Agentic testing tools go further by automating decision-making across workflows. They can decide which tests to run for a build, identify untested code changes, and prioritize high-risk areas without manual triage.

What Results Should I Expect From Agentic Testing?

Most teams see measurable improvements in regression cycle time and maintenance effort when agentic workflows are implemented correctly. A realistic benchmark is reducing regression runtime by 30–70% through change-based test selection and cutting maintenance effort by 30–50% through self-healing automation and flake reduction.
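To illustrate what change-based test selection means in practice, here is a minimal sketch, assuming you already have a map from each test to the source files it exercises. The test names, file paths, durations, and the `select_tests` helper are all invented for the example; none of the tools listed above is known to work this way, and the resulting percentage is arithmetic on made-up numbers, not a measured result.

```python
def select_tests(changed_files, coverage_map, always_run=()):
    """Pick only the tests that touch a changed file, plus a fixed smoke set.

    If any changed file is not covered by a known test, fall back to the
    full suite rather than silently skipping an untested change.
    """
    changed = set(changed_files)
    covered = {path for files in coverage_map.values() for path in files}
    if changed - covered:
        return sorted(coverage_map)
    selected = {test for test, files in coverage_map.items() if changed & set(files)}
    selected.update(always_run)
    return sorted(selected)


def runtime_reduction_pct(selected, durations_s):
    """Percentage of full-suite runtime saved by running only `selected`."""
    total = sum(durations_s.values())
    chosen = sum(durations_s[test] for test in selected)
    return 100 * (total - chosen) / total


# Hypothetical coverage map and per-test durations in seconds.
COVERAGE_MAP = {
    "test_cart": ["checkout/cart.py"],
    "test_payment": ["checkout/payment.py", "checkout/cart.py"],
    "test_login": ["auth/login.py"],
    "test_search": ["search/index.py"],
    "test_smoke": [],
}
DURATIONS_S = {
    "test_cart": 120,
    "test_payment": 200,
    "test_login": 60,
    "test_search": 180,
    "test_smoke": 40,
}

selected = select_tests(["checkout/cart.py"], COVERAGE_MAP, always_run=["test_smoke"])
print(selected)
print(runtime_reduction_pct(selected, DURATIONS_S))
```

With these invented numbers, three of five tests run and the suite takes 360 of 600 seconds. The fallback to the full suite for unmapped files is a deliberately conservative design choice: it trades some speed for not missing a change that no test is known to cover.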

By Alvin Lee, DZone Core
Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing

TL;DR: Why A Former Micromanager Will Make AI Adoption Work Twenty years of Agile coaching failed to fix the micromanager who meddles with every draft, every meeting, every decision. This article shows where their distrust stops damaging teams and starts producing the verification work AI adoption actually needs. Welcome the Verification Architect! What Is a Verification Architect? A Verification Architect is the person responsible for deciding which AI tasks belong in Assist mode, which belong in Automate mode, and which belong in Avoid mode of the A3 framework; defining what review means in each mode; and running the verification loop that converts each AI failure into a sharper prompt, eval, or acceptance criterion. The role is not a compliance auditor: compliance asks whether rules were followed, while verification asks whether the system produces the claimed outcome under the conditions in which it operates. In smaller organizations, the work is often a responsibility carried by a Product Manager, Scrum Master, QA lead, or technical lead, rather than by someone holding the title. Learn more about why a micromanager might be an excellent fit for this role below. The Micromanager You know the type of manager: The micromanagers ask to see the draft before the team talks to the customer. They rewrite the acceptance criteria after refinement. They join the Slack thread “just to clarify” and leave with the decision back in their hands. They are not malicious. They genuinely believe the work needs their eyes before it ships. For 20-plus years, Agile coaches have tried to convince these people to trust the team, the people they hired themselves. The psychological safety workshops did not work. The servant-leadership reading lists did not work. Much of the coaching industry learned to work around this population and focus on the trainable middle. The micromanagers stayed. Now the same manager is being asked to delegate work to AI. They will not delegate without asking. 
But this time, their skepticism deserves a hearing. The Micromanagement Disposition Is Not the Defect There is a reason the AI industry uses the phrase human in the loop. Probabilistic systems running autonomously should not be trusted by default with consequential decisions in their current form. They hallucinate citations. They produce confident wrong code. They will follow an under-specified instruction into a wall and report success. The instinct to verify before accepting consequential output is not a defect in this domain. It is reliability engineering. This context exposes the problem with the standard Agile framing. Telling a chronic skeptic that they need to trust more works against the evidence. The skeptic micromanager looking at agentic AI sees what the engineers building it see: a powerful tool with known failure modes that has to be wrapped in observability, harnesses, evals, and verification before it produces reliable value. The skeptic’s posture toward AI is closer to reliability engineering than to the optimism that much AI adoption theater demands. Where the same instinct fails is with human colleagues, not because humans are reliably better than generative AI systems. Humans fail differently. The reason inspection often damages human work but can improve AI work is that inspection changes the system being inspected. People learn, adapt, withdraw, hide information, and protect themselves in response to how they are treated. Surveillance degrades the very capability the manager claims to protect. With AI, verification does not demotivate the model. The model produces what it produces, and the verification loop sharpens over time, as we feed back findings to improve prompts, skills, evals, constraints, and operating rules. From that perspective, the problem was never the micromanager’s distrust. The problem was where it was pointed: at humans. 
Two Patterns Wearing the Same Costume

Two very different micromanager motives can produce the same behavior. The distinction matters because they respond to different interventions, and one of them is genuinely useful in an AI context while the other is not:
  • The first pattern shows up as authority maintenance: The distrust is about keeping the decision in the manager’s hands, not about improving the output. Ask this manager what would count as evidence that a teammate’s work is trustworthy, and the answer is often operational nonsense: “I need to see it first.” The verification, when it happens, is performative. What gets inspected is compliance, not risk. AI tooling does not help this person because they do not actually want better evidence. They want to be the one who decides.
  • The second pattern shows up as accumulated experience: The distrust is grounded in specific past failures. This manager can describe in detail what they have seen go wrong, what was promised and not delivered, and which verification step was skipped before the failure. With human teammates, this manifests as micromanagement because verifying human judgment is socially costly. You cannot run a unit test on a colleague’s reasoning. So they over-supervise, the team feels controlled, and the relationship degrades. With AI, verification is structured and cheap. The same disposition that damages a team produces useful work when pointed at a probabilistic system that actually benefits from repeated checks.

A small diagnostic helps distinguish them:

| Question | Authority maintenance | Accumulated experience |
| --- | --- | --- |
| What would make this output trustworthy? | “I need to see it first.” | “It has to pass these three checks.” |
| What failure are you trying to prevent? | Vague loss of control. | A specific failure mode they can name. |
| When would you stop reviewing every step? | Never. | When the system demonstrates reliability under defined conditions. |
| What do you inspect? | The person’s compliance. | The work product’s risk. |
| What changes after your review? | The decision returns to me. | The system gets a sharper check, rule, prompt, or acceptance criterion. |

The difference is not whether the person distrusts. The difference is whether their distrust leaves behind better evidence, better criteria, and a sharper system, or merely a returned decision right.

This is not permission to allow the micromanager to “direct” humans. Human work still needs verification, but the verification must be designed as a social contract: clear intent, explicit constraints, agreed-upon review points, and decision rights that do not silently migrate upward whenever the manager feels anxious. The same person who becomes useful in AI verification may still be destructive in a team context if they cannot make that shift. The disposition is not the license. The redirected target, however, provides a new perspective for the micromanager.

A3 Is the Sorting Mechanism

The A3 Framework (Assist, Automate, Avoid) is one way to test which pattern you are looking at. Authority maintenance can fill in the A3 boxes. It cannot use A3 honestly. The answers stay vague, reversible, and dependent on the micromanager’s comfort rather than on named risks. The accumulated-experience pattern can categorize a task in seconds, because the suspicion is grounded in specific past failures that map to specific risk profiles.

In Assist, where AI drafts and a human decides, the contribution is defining what a genuine review looks like. Most teams using AI in Assist mode are rubber-stamping. The experienced skeptic refuses to. They will read the draft and tell you which two of the five suggestions contradict a constraint the model could not have known about.

In Automate, where AI executes under explicit rules and audit cadences, the same person designs the audit. They will write the acceptance criteria with teeth, the failure modes worth alerting on, the rollback conditions, and the sample size for the weekly check.
The team may look slower for two weeks because the work is finally visible. Six months later, that visibility is what prevents the incident everyone else would have called “unexpected.”

In Avoid, where AI should not be used at all, the skeptic is the person qualified to make the call. Most organizations lack this authority. Optimistic adopters struggle to say no. Blanket skeptics say no too cheaply. The experienced skeptic can distinguish a stakeholder relationship in which one wrong AI-drafted phrase costs six months of trust from a low-stakes draft in which Assist is fine. The value in this case is not the categorization; it is the decision authority. Many AI adoption initiatives lack a qualified person with the authority to say “we should not use this here,” and they produce predictable failure modes as a result.

Summary: AI Task Types and the Verification Mode Each Requires

  • Bound drafts a human reviews
    A3 mode: Assist.
    What the Verification Architect does: Defines the specific criteria the draft must pass before acceptance.
  • Repeated execution under explicit rules
    A3 mode: Automate.
    What the Verification Architect does: Designs audit cadences, rollback conditions, and drift detection.
  • High-trust or irreversible work
    A3 mode: Avoid.
    What the Verification Architect does: Protects the boundary against convenience-driven AI adoption.

Name the New Role for the Micromanager: The Verification Architect

The piece this article has been circling is that AI creates a role the Agile movement never learned to name. Call it the Verification Architect.

A Verification Architect does not ask: “Can AI do this?” They ask: “What would have to be true for AI to do this safely, repeatedly, and measurably in our context?” Their unit of work is not the prompt.
It is the loop, the day-to-day work that compounds over months:
  • Turn vague AI use cases into Assist, Automate, or Avoid decisions before anyone opens a prompt window.
  • Define what review means in Assist mode, not as a vibe check, but as specific criteria the draft has to pass.
  • Design audit cadences in Automate, including sample sizes, drift detection, and rollback conditions.
  • Protect Avoid zones from convenience-driven erosion, which is the failure mode of every governance regime that lacks an enforcer.
  • Convert each failure into a sharper prompt, a new eval, a tightened acceptance criterion, or an updated Definition of Done.
  • Track drift over time, because models, data, and use cases all move.

In smaller organizations, this may not be a job title. It may be a responsibility carried by a Product Manager or Owner, a Scrum Master/Agile Coach, a QA lead, a product operations person, or a technical lead. The title matters less than the loop.

The Verification Architect is not a compliance role. Compliance asks whether the rules were followed. Verification asks whether the system produces the claimed outcome under the conditions in which it operates, with the named failure modes. The first is bureaucracy. The second is engineering judgment.

The role is not new in the strict sense. Reliability engineers, design verification architects, and rigorous product operations leaders have been performing this work on traditional software for years. What is new is the application to AI-enabled work systems in non-technical organizational settings, where agentic workflows with non-deterministic outputs and rapid deployment cycles make verification load-bearing rather than nice-to-have. The organizations that ship AI without this capability produce demos. The organizations that build it produce systems that compound.
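For readers who want to see what an audit cadence with a sample size and a rollback condition could look like in code, here is a deliberately small sketch. The `audit_sample` function, the `cites_source` check, and the 10% threshold are assumptions chosen for illustration; the article does not prescribe any of them, and a real audit would use criteria specific to the task.

```python
import random


def audit_sample(outputs, check, sample_size, max_failure_rate, seed=None):
    """Review a random sample of automated outputs against one explicit check.

    Returns the failures (candidates for new eval cases) and whether the
    observed failure rate crosses the agreed rollback threshold.
    """
    rng = random.Random(seed)
    sample = rng.sample(outputs, min(sample_size, len(outputs)))
    failures = [item for item in sample if not check(item)]
    failure_rate = len(failures) / len(sample) if sample else 0.0
    return {
        "sampled": len(sample),
        "failures": failures,
        "failure_rate": failure_rate,
        "rollback": failure_rate > max_failure_rate,
    }


# Hypothetical weekly batch: ten AI-written summaries, two without a source.
weekly_outputs = [{"id": i, "cites_source": i not in (3, 7)} for i in range(10)]

report = audit_sample(
    weekly_outputs,
    check=lambda item: item["cites_source"],  # the explicit acceptance criterion
    sample_size=10,                           # here the whole batch is reviewed
    max_failure_rate=0.10,                    # rollback condition agreed up front
    seed=42,
)

eval_cases = []                       # failures become new eval cases
eval_cases.extend(report["failures"])

print(report["failure_rate"], report["rollback"])
print([case["id"] for case in eval_cases])
```

The useful part is less the code than what it forces the team to write down in advance: what counts as a failure, how many outputs get reviewed, and at what rate the automation is rolled back. Feeding the failures into `eval_cases` mirrors the step of converting each failure into a sharper eval.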
The Work Inside the Dip The AI Spending Trap argued that organizations are often stuck in the J-curve dip because they buy tools and skip the intangible-capital investment that drives the eventual rise. The argument has a missing piece. The intangibles do not invest themselves. They need process redesign, retraining, restructuring, data plumbing and governance, and change management. Every category gets paid for by specific humans doing specific work. The part of the dip organizations most consistently underprice is verification work, eval design, output review, prompt or skill refinement, acceptance-criteria sharpening, and failure-mode cataloging. This is the place where the Verification Architect earns their salary. Done well, the loop becomes a compounding system. Each verification cycle encodes a little more organizational judgment about what good looks like in this specific context; the evals get sharper, and the acceptance criteria get more specific. The agent’s effective competence in this organization increases over time, not because the underlying model improves, but because the surrounding system encodes accumulated knowledge of where it fails. The trusting person ships v1 and moves on. The Verification Architect ships v1, watches it, catches the failures, refines the prompts, tightens the evals, updates the Definition of Done, and runs the loop again. Without this person, the deployment stays at v1 and degrades as conditions shift. With them, the system gets better while the headcount stays flat. That is the curve “The AI Spending Trap” described, and this is who pulls it upward. The work is currently underpriced. Eval design does not ship on Monday. Output review does not produce a launch announcement. Refining prompts in month four produces nothing that the quarterly board deck can show. That is exactly why the disposition is a competitive advantage for organizations that recognize it before the rest of the market does. 
A Warning About the Label The label “Verification Architect” will be hollowed out, as every useful role title in this industry eventually is. (Remember: Agile Coach, Product Owner, and Scrum Master?) Ask what the person last sent back for revision and why. Ask what they last protected from AI involvement and what would have to change for that decision to flip. Ask what their longest-running audit loop has caught. The genuine Verification Architect answers with names, dates, and specific failures. The fake one answers with frameworks and vocabulary. Conclusion: Move the Work, Not the Person If you have spent your career being told your skepticism was a problem, consider that the people telling you were trying to fit you as a micromanager into a role that does not need you. The agentic AI stack needs people who refuse to trust output they did not verify. It needs people who design the evals, who run the audit loop, who notice the failure that everyone else celebrated as a launch. The work is currently underpriced. That is the opportunity. The micromanager disposition was never the problem; shoehorning it into an unfitting role was. Pick a teammate you struggled to delegate to in the last six months. Pick an AI task that frustrated you in the same window. Compare the instructions you gave each. If the pattern is the same, you have found the problem. One system is being damaged by your inspection. The other may finally be receiving the discipline it needs. Does your distrust produce evidence, or does it merely preserve authority? My suggestion: Move the work, not the person. Key Questions This Article on Micromanagers Answers What Is a Verification Architect in AI Adoption? A Verification Architect is the person who decides which AI tasks belong in Assist, Automate, or Avoid mode, defines what review means in each mode, and runs the verification loop that converts each AI failure into a sharper prompt, eval, or acceptance criterion. 
Their unit of work is not the prompt; it is the loop. In smaller organizations, the responsibility may be carried by a Product Manager, Scrum Master, QA lead, or technical lead rather than someone holding the title. Why Do Micromanagers Struggle to Delegate to AI? Most do not, because their underlying distrust of probabilistic systems is engineering common sense, not a character defect. The reason inspection damages human teams but improves AI systems is that inspection changes the system being inspected: people adapt and withdraw under surveillance, models do not. The skeptic’s posture toward AI is closer to reliability engineering than to the optimism that much AI adoption theater demands. How Can I Tell If My Distrust Is Useful Verification or Authority Maintenance? Apply a five-question diagnostic. Useful verification can name a specific failure mode it prevents, define operational criteria for when to stop reviewing, assess the work product’s risk rather than the person’s compliance, and leave behind a sharper rule, prompt, or acceptance criterion after each review. Authority maintenance cannot answer those questions in operational terms; its only output is returning the decision to the reviewer. Who Does the Verification Work that Makes AI Adoption Compound over Time? The Verification Architect. The work includes eval design, output review, prompt and skill refinement, acceptance criteria sharpening, and failure-mode cataloging. Each cycle encodes more organizational judgment about what “good” looks like in a specific context, so the system’s effectiveness improves over time even when the underlying model does not. Without this person, deployments stay at v1 and degrade as conditions shift.

By Stefan Wolpers DZone Core CORE
Retesting Best Practices for Agile Teams: A Quick Guide to Bug Fix Verification

Agile teams ship fast. Two-week sprints, daily standups, and continuous deployment pipelines have made speed the default. But speed without verification is just organized chaos. When a developer marks a bug as "fixed" and the ticket moves to QA, what happens next determines whether that fix actually reaches production or quietly breaks something else.

Retesting is often treated as a checkbox. It shouldn't be. In modern agile environments, retesting is a discipline that, when done well, catches regressions before users do, builds confidence in your release pipeline, and keeps velocity sustainable rather than suicidal. This guide walks through the practical retesting steps that high-performing agile teams follow to manage bug fix verification without slowing down their release cycles.

Why Retesting Deserves More Attention Than It Gets

Most teams conflate retesting with regression testing. They're related but not the same. Retesting is the act of re-executing a specific test that previously failed, after a bug fix has been applied, to confirm the fix works. Regression testing is the broader process of running the existing test suite to ensure that new changes haven't broken previously working functionality. You need both. But retesting is the more surgical, targeted activity, and it's where a lot of agile teams cut corners under sprint pressure.

The cost of that shortcut surfaces quickly: the same bug reopens in production, trust between devs and QA erodes, and hotfixes eat into the next sprint's capacity. According to IBM's Systems Sciences Institute, the cost of fixing a bug in production is up to 30x higher than fixing it during development. Retesting is the last cheap checkpoint.

Step 1: Reproduce the Original Failure Before the Fix

Before a QA engineer can verify a fix, they need to be able to reproduce the original bug reliably.
This sounds obvious, but in practice, many teams move to testing the fix without confirming that the defect is consistently reproducible in the test environment.

What to do:
  • Check out the codebase before the fix is applied (or use a tagged build from the bug-filing sprint).
  • Execute the exact test steps documented in the bug report.
  • Confirm the defect manifests as described.

If the bug can't be reproduced before the fix, you're not testing a fix; you're testing in the dark. Either the test environment differs from production, the steps in the bug report are incomplete, or the bug is environment-specific.

Agile tip: Insist that bug reports include a "Reproduction Steps" section as a Definition of Done requirement for filing. No steps, no ticket.

Step 2: Understand the Fix Before Testing It

QA engineers who blindly run the original failing test after a patch is applied will catch only the most obvious failures. To test effectively, you need to understand what changed and why.

Checklist before testing:
  • Read the diff or PR description.
  • Ask the developer: "What was the root cause, and what exactly did you change?"
  • Identify any edge cases the fix might introduce.
  • Note any dependent modules, APIs, or services the fix touches.

This conversation between dev and QA (ideally a brief 5-minute sync during triage) dramatically improves the quality of retesting. It also surfaces cases where a fix is technically correct but introduces a new failure mode.

Step 3: Retest the Exact Failing Scenario

This is the core of retesting: execute the specific test case that originally failed, using the same inputs, environment, and conditions, and verify that the expected behavior now occurs.

What "verified" looks like:
  • The test passes in the current build.
  • The output matches the acceptance criteria in the original ticket.
  • No error messages, unexpected behavior, or degraded performance appear.
Common mistakes to avoid:
  • Testing a slightly different scenario than what was documented.
  • Retesting only the happy path when the original bug was an edge case.
  • Testing in a different environment than where the bug was reported.

Document the result explicitly. "Tested and passed" is insufficient. Log: build number, test environment, tester name, date, and a brief description of what was verified.

Step 4: Run Boundary and Negative Tests Around the Fix

A fix that works for the main scenario may still break under boundary conditions or invalid inputs. After verifying the primary scenario, broaden your coverage.

Boundary testing for bug fixes:
  • Test the minimum and maximum values for any data the fix touches.
  • Test empty inputs, null values, and unexpected data types.
  • Test concurrent requests if the fix touches shared state.

Negative testing:
  • Attempt to trigger the original bug with slightly different inputs.
  • Test that appropriate error handling occurs when inputs are invalid.
  • If the bug was a security issue, probe related attack vectors.

This step is where automated testing pays dividends. If you have a test framework in place, write parameterized tests that cover boundary conditions and commit them alongside the fix. Future sprints benefit from this coverage automatically.

Step 5: Perform Targeted Regression Testing on Affected Components

Once the original fix is verified, expand your scope to the components the fix touches. This is targeted regression testing: not a full suite run, but a deliberate sweep of adjacent functionality.

How to scope it:
  • Use code coverage tools or dependency graphs to identify which modules the fix modifies.
  • Map those modules to existing test cases.
  • Run only the test cases relevant to the impacted components.

In a mature CI/CD pipeline, this happens automatically via change-impact analysis tools. In less mature environments, this is a manual judgment call that benefits from close communication between dev and QA.
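The scoping steps for targeted regression testing can be sketched in a few lines. This is a minimal illustration, not a real change-impact tool: the `TEST_COVERAGE` mapping, the module names, and the test names are all hypothetical, and in practice the mapping would come from a coverage report or dependency graph.

```python
# Hypothetical mapping from test case IDs to the modules they exercise.
# In a real project this would be derived from coverage data or a
# dependency graph rather than written by hand.
TEST_COVERAGE = {
    "test_checkout_total": {"cart", "pricing"},
    "test_apply_coupon": {"pricing", "coupons"},
    "test_login": {"auth"},
    "test_order_history": {"orders", "auth"},
}


def select_regression_tests(changed_modules, coverage=TEST_COVERAGE):
    """Return the sorted test IDs that touch at least one changed module."""
    changed = set(changed_modules)
    return sorted(
        test_id for test_id, modules in coverage.items() if modules & changed
    )


# Example: a bug fix that modified only the (hypothetical) pricing module.
selected = select_regression_tests(["pricing"])
print(selected)
```

A fix that touches a module no test covers returns an empty list, which is itself a useful signal that the regression sweep has a coverage gap.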
The goal: Verify that fixing bug A did not break feature B, where B shares code with A.

Step 6: Validate in a Production-Like Environment

Test environments lie. Configurations differ, third-party service mocks behave differently than the real thing, and database states diverge from production over time. For critical bug fixes, especially those related to data integrity, performance, or security, validation in a staging environment that mirrors production is essential.

What to verify in staging:
  • End-to-end flows that include the fixed component.
  • Integration with real external services (or the closest available approximation).
  • Performance under realistic data volumes.

For teams using containerized deployments, spinning up a production-like environment per PR is increasingly achievable. Tools like Docker Compose, Kubernetes namespaces, or platform-native review environments (e.g., Heroku Review Apps, Vercel Preview Deployments) make this accessible even for smaller teams.

Step 7: Close the Loop With Documentation

Retesting that isn't documented didn't happen, at least not in any way that's auditable, transferable, or useful for future sprints.

Minimum documentation per retested bug:
  • Bug ID and description.
  • Build/commit SHA where the fix was applied.
  • Environment tested.
  • Test cases executed (with pass/fail status).
  • Tester and date.
  • Any observations or follow-up items.

Update the original bug ticket with a "Verified Fixed" status and attach the relevant test evidence. If the retest reveals that the fix is incomplete or introduces a new issue, reopen the ticket with clear notes and escalate before the sprint closes.

Integrating Retesting Into Your Agile Workflow

Ad hoc retesting doesn't scale. As sprint velocity increases and team size grows, you need retesting to be a structured part of the development lifecycle, not something that happens informally at the end of a sprint.
Practical integration points:
  • Definition of Done: Include "bug fix verified by QA in test environment" as a DoD item for any ticket filed as a bug. This prevents developers from closing tickets unilaterally.
  • Bug Fix PRs: Require that bug fix PRs include a test case (automated or manual test script) that reproduces the original failure and passes after the fix. This makes regression coverage self-generating.
  • Sprint Review Checklist: Add a retesting summary to your sprint review. How many bugs were fixed? How many were retested and verified? How many regressions were caught? Track this over time; it's a leading indicator of test quality.
  • Shift-Left Retesting: Don't wait for QA to catch a fix in the QA phase. If developers write unit tests that reproduce the bug before fixing it (TDD-style), the fix is verified before it even reaches QA. This compresses cycle time significantly.

Automating Retesting in CI/CD Pipelines

Manual retesting is a bottleneck. For bugs with well-defined reproduction steps, automation is the right long-term answer.

The workflow:
  • Bug is filed with reproduction steps.
  • Developer writes a failing test that reproduces the bug.
  • Developer implements the fix; the test now passes.
  • The test is committed to the repo and becomes part of the CI suite.
  • Every subsequent build runs that test automatically.

This approach converts bugs into permanent regression guards. The cost of writing the test is paid once; the coverage benefit persists indefinitely.

For teams using API testing tools, contract testing frameworks, or behavior-driven development (BDD) tools, this workflow integrates naturally. Each fixed bug becomes a scenario in your test suite, a living record of issues your codebase has encountered and solved.

Common Retesting Anti-Patterns to Avoid

"It worked on my machine." Developer self-testing is not a substitute for independent QA verification. Fix and retest should involve different people, or at minimum, different environments.
Retesting only the ticket, not the risk. QA engineers should ask: "What else could this change have broken?" Every fix carries blast radius. Don't test the fix in isolation.

Closing bugs before staging validation. Moving a ticket to "Verified" after testing only in a local or dev environment is premature. Production-like validation is required for high-severity fixes.

Skipping retesting under sprint pressure. This is the most common and most costly anti-pattern. The pressure to close tickets before a sprint ends is real, but retesting debt accumulates quickly and surfaces as production incidents.

Final Thoughts

Fast release cycles don't have to mean fragile ones. The teams that ship confidently at high velocity aren't testing less; they're testing smarter. Retesting, when treated as a structured, documented, and automated process rather than an afterthought, is one of the highest-leverage activities a QA team can invest in.

The steps outlined here, from reproducing the original failure, to understanding the fix, to targeted regression sweeps, to production-like validation, create a repeatable process that scales with your team. Every bug that gets properly retested is a bug that doesn't come back. Build that muscle, and your sprint reviews start to look a lot less like incident retrospectives.
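As a closing illustration of the bug-to-regression-guard workflow and the boundary and negative tests from Step 4, here is a minimal sketch using Python's standard `unittest` module. The bug itself is invented for the example: assume a hypothetical `parse_quantity` function once accepted "0" as a valid order quantity. The function name, the `MAX_QUANTITY` limit, and the inputs are all assumptions, not part of any real codebase.

```python
import unittest

MAX_QUANTITY = 100  # hypothetical business limit, assumed for this sketch


def parse_quantity(raw):
    """Fixed version: rejects empty, non-integer, and out-of-range input."""
    if raw is None or not str(raw).strip():
        raise ValueError("quantity is required")
    try:
        value = int(str(raw).strip())
    except ValueError:
        raise ValueError(f"quantity must be an integer: {raw!r}") from None
    if not 1 <= value <= MAX_QUANTITY:
        raise ValueError(f"quantity out of range: {value}")
    return value


class QuantityRegressionTest(unittest.TestCase):
    def test_original_bug_zero_quantity_rejected(self):
        # Reproduces the (hypothetical) original report: "0" was accepted.
        with self.assertRaises(ValueError):
            parse_quantity("0")

    def test_boundary_values_accepted(self):
        # Minimum, maximum, and whitespace-padded valid inputs.
        for raw, expected in [("1", 1), ("100", 100), (" 42 ", 42)]:
            with self.subTest(raw=raw):
                self.assertEqual(parse_quantity(raw), expected)

    def test_invalid_inputs_rejected(self):
        # Empty, null, just-outside-range, and wrong-type inputs.
        for raw in ["", None, "-1", "101", "abc", "1.5"]:
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_quantity(raw)
```

Saved as a test module and run with `python -m unittest`, the first test plays the role of the failing test written before the fix, and the two parameterized tests carry the boundary coverage forward into every later build.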

By Alok Kumar
Securing Everything: Mapping the Right Identity and Access Protocol (OIDC, OAuth2, and SAML) to the Right Identity

Overview

Identity and access security is built on two fundamental requirements:
  • Authentication (AuthN): who you are
  • Authorization (AuthZ): what you are allowed to do

Every secure system must answer both questions clearly and consistently. In modern architecture, these questions are posed to two primary categories of actors trying to access applications:
  • Humans: challenged to provide direct credentials or to delegate their authority to another application
  • Machines: challenged to prove their own programmatic identity and permissions

Spanning these requirements and actors, the vast majority of identity and access patterns align to four common workflows.

Machine
  • Machine-to-Machine (OAuth2 Client Credentials)

Human
  • Human User Authentication (OIDC)
  • Delegated Third-Party Applications (OAuth2 Authorization Code)
  • Enterprise SSO Federation (SAML 2.0)

Together, these four workflow models account for nearly all modern enterprise application access patterns.

Some Key Terms: Quick Reference

Before we go into the identity workflows, let's go over some key terms to get familiar with the identity and access jargon.

Core Concepts
  • AuthN (Authentication): Establishes identity; verifies who the actor (human or machine) is.
  • AuthZ (Authorization): Defines permissions; determines what actions the actor is allowed to perform.

Protocols
  • OAuth 2.0: Authorization framework that issues access tokens so applications can securely access APIs on their own behalf or on behalf of a user.
  • OIDC (OpenID Connect): Authentication layer built on OAuth 2.0 that introduces ID tokens and standardized identity claims.
  • SAML (Security Assertion Markup Language): XML-based federation protocol used primarily for enterprise single sign-on across organizational domains.
  • FIDO2 / WebAuthn: Modern authentication standard enabling phishing-resistant, passwordless login using asymmetric cryptography and hardware-backed credentials.
OAuth Flows
  • 3LO (Three-Legged OAuth): User + Client + Authorization Server; used when user identity and consent are involved.
  • 2LO (Two-Legged OAuth): Client + Authorization Server; used for machine-to-machine communication without human interaction.

Key Roles
  • IdP (Identity Provider): System that authenticates identities and issues tokens.
  • Client: Application, service, or AI agent requesting access to protected resources.
  • Resource Server: API or system that validates tokens and enforces fine-grained access control.
  • Resource Owner: Human user whose data or permissions are being accessed.
  • RP (Relying Party) / SP (Service Provider): Application that relies on the IdP to authenticate the actor (RP in OIDC, SP in SAML).

Tokens & Security Plumbing
  • ID Token: Identity token intended for the client to confirm who the user is. To use an analogy, the ID Token is the passport that contains your identity claims.
  • Access Token: Authorization token sent to APIs to grant specific permissions. Access Tokens have short TTLs. To use an analogy, the Access Token is the visa that contains your access claims.
  • Refresh Token: Long-lived credential used to obtain new access tokens without re-authentication.
  • JWT (JSON Web Token): Digitally signed JSON token containing identity and authorization claims. ID Tokens are JWTs. Access Tokens may be JWTs or opaque strings.

Authorization Controls
  • Claims: Assertions inside a token (user ID, roles, audience, expiration, etc.).
  • Scopes: Permission boundaries defining what a client can access.
Typically these are claims in tokens.

Below is a diagram that illustrates some of the terms above:

Machine-to-Machine (M2M) Authentication

Machine-to-Machine authentication is designed for non-interactive clients (such as microservices, daemons, background jobs, and AI agents) that need to access APIs with their own established identity and permissions. Unlike human flows, there is no browser and no “user” to provide a second factor. The system must ask the machine to prove its identity programmatically.

The recommended standard for M2M authentication is the OAuth 2.0 Client Credentials Grant, used to obtain an Access Token. M2M Auth is a 2LO flow.

Key Characteristics of M2M
  • Identity Verified: The machine/application itself (e.g., a billing service or search agent).
  • Token Issued: Access Token only. (No ID Token is issued, as there is no human identity involved.)
  • Goal: To verify which machine is making the request and grant it permissions to perform tasks independently.

While the OAuth 2.0 Client Credentials flow is the standard, the method of client authentication determines the strength of the security posture. There are four methods of authentication, and as we move from shared secrets to cryptographic binding, we increase the assurance level.

Human User Authentication (OIDC)

This is the standard consumer login where a person is present and interacting with a client application. Direct human authentication is designed for interactive users accessing an application via a browser or mobile device. In this model, the application doesn’t just need permission to act; it needs to know who the user is.

The recommended standard for human user authentication is OpenID Connect (OIDC), built as an identity layer on top of OAuth 2.0. OIDC allows the system to ask the user for proof of identity through a trusted Identity Provider (IdP).

Thus, OIDC = OAuth 2.0 (Authorization: Access Token) + Identity Layer (Authentication: ID Token)

OIDC is a 3LO flow.
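To make the "passport" idea concrete, the sketch below shows what the claims inside an ID Token look like once decoded. It relies only on the general JWT layout (three base64url segments separated by dots) and builds its own unsigned demo token. Everything here is illustrative: the issuer URL, client ID, and helper names are assumptions, and the code deliberately does NOT verify a signature, which any real client must do with a vetted library before trusting a token.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode without padding, as used in compact JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def make_demo_id_token(claims: dict) -> str:
    """Build an UNSIGNED demo token (header.payload.) for illustration only."""
    header = {"alg": "none", "typ": "JWT"}
    parts = [b64url_encode(json.dumps(p).encode("utf-8")) for p in (header, claims)]
    return ".".join(parts) + "."  # empty signature segment


def read_unverified_claims(jwt: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    _header_b64, payload_b64, _signature_b64 = jwt.split(".")
    return json.loads(b64url_decode(payload_b64))


def basic_claims_ok(claims: dict, expected_issuer: str, client_id: str, now: int) -> bool:
    """Check issuer, audience, and expiry; a real client checks more than this."""
    audiences = claims.get("aud", [])
    if isinstance(audiences, str):
        audiences = [audiences]
    return (
        claims.get("iss") == expected_issuer
        and client_id in audiences
        and claims.get("exp", 0) > now
    )


# Hypothetical values, assumed for the example.
demo_claims = {
    "iss": "https://idp.example.com",
    "sub": "user-123",
    "aud": "portal-client",
    "exp": 2_000_000_000,
    "email": "alice@example.com",
}
token = make_demo_id_token(demo_claims)
claims = read_unverified_claims(token)
print(claims["sub"], basic_claims_ok(claims, "https://idp.example.com", "portal-client", now=1_900_000_000))
```

The `aud` check is what ties the passport to one client: an ID Token issued for a different application fails it even when the issuer and expiry are fine.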
Key Characteristics of OIDC
  • Identity Verified: The End-User (e.g., a customer logging into a portal).
  • Tokens Issued: ID Token (contains user profile info) + Access Token (to call APIs).
  • Goal: To establish a secure session and obtain a verifiable “passport” (the ID Token) containing claims like name, email, and subject ID.

The strength of an OIDC implementation is defined by the Authentication Method. As we move up this ladder, we shift from simple knowledge-based proof to cryptographic, phishing-resistant protocols.

Delegated Third-Party Authorization (Third-Party Access)

Delegated authorization is the process of granting a third-party application (an external client) scoped, limited access to a user’s resources without exposing the user’s credentials. This workflow covers scenarios where an application needs limited permission to access a user’s resources, but the application is not the owner of those resources (e.g., a photo printing service accessing your Google Photos, a calendar app reading your Outlook events, or a ChatGPT agent needing to access your Confluence pages).

The recommended standard for this workflow is the OAuth 2.0 Authorization Code Flow. It is functionally identical to the OIDC flow, with one critical distinction: the ID Token is not returned (the openid scope is omitted from the request). The user first authenticates with the Identity Provider (IdP) and then explicitly approves the specific permissions requested by the third-party client (e.g., photos.read). The application receives an Access Token representing only those approved permissions, allowing it to act on the user's behalf within those strict boundaries.

The Delegated Authorization flow uses the state parameter and PKCE but not nonce: nonce protects the ID Token, and delegated OAuth 2.0 flows do not return an ID Token.
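A minimal sketch of the state and PKCE pieces follows, assuming the S256 challenge method defined in RFC 7636. The authorization endpoint, client ID, and redirect URI are hypothetical placeholders; only the parameter names and the challenge derivation come from the OAuth 2.0 and PKCE specifications. Note that the scope omits openid, so no ID Token and no nonce are involved.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode


def make_code_verifier() -> str:
    """32 random bytes -> 43-char base64url string (RFC 7636 allows 43-128)."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope, state, challenge):
    """Assemble the front-channel authorization request URL."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,  # no "openid" here, so no ID Token is requested
        "state": state,  # echoed back by the server; the client compares it
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
    return f"{authorize_endpoint}?{urlencode(params)}"


# Hypothetical endpoint and client values, assumed for illustration.
verifier = make_code_verifier()          # kept secret by the client
state = secrets.token_urlsafe(16)        # stored in the user's session
url = build_authorization_url(
    "https://idp.example.com/authorize",
    "photo-printer",
    "https://printer.example.com/callback",
    "photos.read",
    state,
    make_code_challenge(verifier),
)
print(url)
```

The client later sends the original `verifier` with the token request, so an attacker who intercepts only the authorization code cannot redeem it.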
(Refer to my OIDC blog to understand state, PKCE, and nonce.)

Thus, OAuth 2.0 Authorization Code Flow = OIDC without the ID Token request

This workflow is a 3LO flow.

Key Characteristics of Delegated Access
  • Identity Verified: Technically, the user authenticates with the Identity Provider, but the focus is on the user giving consent to the third-party app.
  • Token Issued: Access Token. No ID Token is issued.
  • Goal: To grant “scoped” access to specific resources without sharing the user’s actual credentials or identity profile.

Enterprise SSO Federation via SAML 2.0 (Human-to-Service SSO)

SAML (Security Assertion Markup Language) is the established, XML-based veteran standard for Enterprise Federation. It allows a corporate user to authenticate once with their central Identity Provider (IdP), such as Ping or Azure AD, and gain seamless access to external SaaS applications (Salesforce, AWS, Slack) or internal tools without re-entering credentials.

Many enterprise applications, especially heavyweights like AWS Console, Salesforce, ServiceNow, and SAP, rely on SAML 2.0. In this model, when a user attempts to access a Service Provider (SP), such as Atlassian Confluence, the SP redirects the user to the IdP. The IdP then issues a SAML assertion containing user attributes, which the SP trusts to verify the user.

This is the technology behind the familiar “Tile” experience, where enterprise apps appear as “tiles” in your IdP portal. Because the IdP assigns users to specific applications and exchanges assertions, these apps appear as ready-to-use icons in a corporate portal.

Key Characteristics of SAML
  • Identity Verified: The Corporate Identity (Employee/Contractor).
  • Token Issued: SAML Assertion (an XML document containing the user’s identity and attributes/roles).
  • Goal: To establish a “Circle of Trust” between an Identity Provider (IdP) and a Service Provider (SP), enabling Enterprise SSO for corporate users.
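To show what "an XML document containing the user's identity and attributes" means in practice, here is a small sketch that reads the issuer, subject, and attributes from a hand-written, heavily simplified assertion. The sample values are invented, and the assertion is unsigned: a real Service Provider must validate the XML signature, conditions, and audience using a maintained SAML library, and should not parse untrusted XML with the standard-library parser used here.

```python
import xml.etree.ElementTree as ET

# Namespace of SAML 2.0 assertions.
NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# Simplified, unsigned demo assertion with hypothetical values.
DEMO_ASSERTION = """
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                ID="_demo" Version="2.0" IssueInstant="2026-01-01T00:00:00Z">
  <saml:Issuer>https://idp.example.com</saml:Issuer>
  <saml:Subject>
    <saml:NameID>alice@example.com</saml:NameID>
  </saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="role">
      <saml:AttributeValue>admin</saml:AttributeValue>
      <saml:AttributeValue>auditor</saml:AttributeValue>
    </saml:Attribute>
    <saml:Attribute Name="department">
      <saml:AttributeValue>finance</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
"""


def read_assertion(xml_text: str) -> dict:
    """Extract issuer, subject, and attributes. Performs NO trust validation."""
    root = ET.fromstring(xml_text.strip())
    attributes = {}
    for attr in root.findall(".//saml:Attribute", NS):
        values = [v.text for v in attr.findall("saml:AttributeValue", NS)]
        attributes[attr.get("Name")] = values
    return {
        "issuer": root.findtext("saml:Issuer", namespaces=NS),
        "subject": root.findtext("saml:Subject/saml:NameID", namespaces=NS),
        "attributes": attributes,
    }


print(read_assertion(DEMO_ASSERTION))
```

The attribute values (roles, department) are what the SP can use for its own authorization decisions once the assertion has been validated.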
Why SAML Persists in the Enterprise

SAML is older than OIDC but remains widely used because many enterprise platforms were built before modern OAuth/OIDC standards existed. While OIDC is lighter, SAML persists in the enterprise because it is deeply embedded in legacy SaaS integrations and enterprise identity providers, with mature federation trust models already in place. Despite newer protocols like OIDC, its broad vendor support, stability, and long-standing interoperability keep it operationally entrenched.

However, it is fundamentally browser-based and XML-driven, relying on front-channel redirects and verbose assertion exchanges that reflect an earlier web architecture. As applications modernize toward API-first, mobile, and SPA-native models, many are gradually migrating to OIDC and OAuth 2.0 for lighter-weight tokens, JSON-based claims, and better support for modern client patterns.

Conclusion: The Right Key for the Right Door

Remember:
  • OAuth2 = authorization only
  • OIDC = authentication + authorization (OAuth2)
  • SAML = authentication + attribute sharing (which the client can use for determining authorization)

The selection of the correct identity protocol is not merely a technical detail but a foundational architectural security decision. By mapping each identity type to its appropriate protocol (Human User to OIDC, Machine-to-Machine to OAuth2 Client Credentials, Delegated Third-Party Access to OAuth2 Authorization Code, and Enterprise SSO to SAML 2.0), and by standardizing all API-bound access into a single, validated JWT Access Token at the API Gateway, architects create a scalable and trustworthy end-to-end security model.

The rise of agentic AI frameworks and protocols like the Model Context Protocol (MCP) transforms AI from passive assistants into active agents.
This means robust OAuth 2.0 flows are essential for treating these agents as distinct identities, ensuring their autonomous actions are governed by strict, token-based authorization and the principle of least privilege.
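For an agent or service acting under its own identity, the Machine-to-Machine flow described earlier comes down to one token request. The sketch below assembles (but does not send) an OAuth 2.0 Client Credentials request using a shared client secret over HTTP Basic, the simplest of the client authentication methods. The token endpoint, client ID, secret, and scope names are hypothetical, and the sketch assumes the ID and secret contain no characters that would need form-encoding before the Basic header is built.

```python
import base64
from urllib.parse import urlencode


def build_client_credentials_request(token_endpoint, client_id, client_secret, scope):
    """Assemble (not send) an OAuth 2.0 client credentials token request."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return {"method": "POST", "url": token_endpoint, "headers": headers, "body": body}


def has_scope(granted_scope: str, required: str) -> bool:
    """Least-privilege check: scopes are a space-delimited string."""
    return required in granted_scope.split()


# Hypothetical agent identity and scope, assumed for illustration.
request = build_client_credentials_request(
    "https://idp.example.com/oauth2/token",
    "billing-agent",
    "s3cret",
    "invoices.read",
)
print(request["body"])
print(has_scope("invoices.read", "invoices.write"))
```

Requesting only `invoices.read` means a resource server running a check like `has_scope` refuses a write attempt by the same agent, which is the least-privilege behavior the conclusion calls for.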

By Ananth Iyer

The Latest Culture and Methodologies Topics

From "Vibe Coding" to Production: Setting Up an Evals Loop for Claude Agents
Replacing unreliable “vibe coding” with a rigorous automated evaluation loop using curated datasets, Claude judge agents, and metric tracking for production AI agents.
June 11, 2026
by Nikita Kothari
· 672 Views · 1 Like
A Deep Dive into Tracing Agentic Workflows (Part 2)
Tracing agentic systems uses hierarchical IDs to form a System DAG, exposing performance and cost issues. Observer agents automate diagnosis and system self-correction.
June 10, 2026
by VIVEK KATARYA
· 520 Views
Orchestrating Zero-Downtime Deployments With Temporal
Temporal provides the durable control plane for safe zero-downtime deployments across canaries, approvals, retries, and rollbacks.
June 10, 2026
by Akhil Madineni
· 491 Views
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
A technical deep dive into Amazon Quick — how it works, how it connects to your tools via MCP, and where it sits in the AWS agent stack.
June 9, 2026
by Jubin Abhishek Soni DZone Core CORE
· 782 Views
How to Build an Agentic AI SRE Co-Pilot for Incident Response
Build an agentic SRE co-pilot using LLMs to autonomously reason, plan, and execute incident response across complex, multi-cloud infrastructure.
June 8, 2026
by Akshay Pratinav
· 985 Views
Observability for Agents and Workflows: Tracing Prompts, Tool Calls, and Business Outcomes End-to-End
Learn how to trace AI agents end to end, from prompts and tool calls to business outcomes, with observability practices for production workflows.
June 5, 2026
by Srinivas Chippagiri DZone Core CORE
· 2,217 Views · 1 Like
Why Your Test Automation Is Always Behind the Code And the Architecture That Fixes It
Most QA teams are stuck in a manual scripting loop. Here's the requirement-driven architecture that eliminates the coverage gap permanently.
June 5, 2026
by Waqar Hashmi
· 1,813 Views
Identity in Action
A practical guide to SSO migration covering risks, MFA, phased rollout, and governance to ensure secure identity transitions without disruption.
June 3, 2026
by Kapil Chakravarthy Sanubala
· 2,343 Views · 3 Likes
Getting Started With Agentic Workflows in Java and Quarkus
A step-by-step tutorial on how to add agentic workflows to Quarkus applications with the Agentican framework via YAML and annotations.
June 3, 2026
by Shane Johnson
· 2,156 Views · 3 Likes
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
Many MVPs get too big because teams treat several user-facing systems and vendor-dependent workflows as one app instead of planning one complete path first.
June 2, 2026
by Kajol Shah
· 1,333 Views
The Agentic Agile Office: Streamlining Enterprise Agile With Autonomous AI Agents
Agentic Agile Office uses autonomous AI agents to cut admin overhead, detect risks early, and shift teams from manual tracking to intelligent, high-velocity delivery.
June 1, 2026
by Madhusudhan Chivukula
· 1,316 Views · 1 Like
Building a DevOps-Ready Internal Developer Platform: A Hands-On Guide to Golden Paths, Self-Service, and Automated Delivery Pipelines
Learn how to build an internal developer platform with golden paths, GitOps, CI/CD, observability, and governance built into workflows.
May 28, 2026
by Mirco Hering DZone Core CORE
· 2,336 Views · 1 Like
Feature Flag Debt: Performance Impact in Enterprise Applications
Feature flags help teams move fast, but when they’re not cleaned up, they quietly add extra code, slow down performance, and make applications harder to maintain.
May 27, 2026
by Poornakumar Rasiraju
· 3,533 Views · 1 Like
DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform
A practical checklist for platform engineering teams to improve DevOps, golden paths, reliability, governance, and developer experience at scale.
May 27, 2026
by Josephine Eskaline Joyce, DZone Core
· 2,495 Views · 1 Like
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Liquid Clustering replaces rigid partitioning and Z-Order with adaptive clustering in Unity Catalog, improving performance with less maintenance.
May 26, 2026
by Seshendranath Balla Venkata
· 2,447 Views · 1 Like
Architecting an Embedded Efficiency Layer: A Platform Deep Dive into Day-Two Operational Tuning
Learn how platform teams can embed continuous optimization into internal developer platforms using GitOps, HITL workflows, and full-stack tuning.
May 26, 2026
by Graziano Casto, DZone Core
· 1,981 Views · 1 Like
Product-Led Software Delivery: Intelligent Platforms for DevOps at Scale
Platform engineering helps DevOps teams scale with golden paths, DevEx metrics, automation, and AI guardrails that reduce friction and improve delivery.
May 25, 2026
by Fawaz Ghali, PhD, DZone Core
· 2,049 Views
A Deep Dive into Tracing Agentic Workflows (Part 1)
Agentic systems fail silently — loops, hallucinations, corrupted state. You can't debug or improve what you don't trace.
May 22, 2026
by VIVEK KATARYA
· 2,835 Views
11 Agentic Testing Tools to Know in 2026
This article is a review of tools used to autonomously plan, generate, maintain, and execute tests.
May 22, 2026
by Alvin Lee, DZone Core
· 2,197 Views
Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing
Dear micromanager, your distrust has a job — it’s just not the one you’re doing; learn about the role of the Verification Architect.
May 21, 2026
by Stefan Wolpers, DZone Core
· 2,143 Views
