DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Coding

Also known as the build stage of the SDLC, coding focuses on the writing and programming of a system. The Zones in this category take a hands-on approach to equip developers with the knowledge about frameworks, tools, and languages that they can tailor to their own build needs.

Functions of Coding

Frameworks

Frameworks

A framework is a collection of code that is leveraged in the development process by providing ready-made components. Through the use of frameworks, architectural patterns and structures are created, which help speed up the development process. This Zone contains helpful resources for developers to learn about and further explore popular frameworks such as the Spring framework, Drupal, Angular, Eclipse, and more.

Java

Java

Java is an object-oriented programming language that allows engineers to produce software for multiple platforms. Our resources in this Zone are designed to help engineers with Java program development, Java SDKs, compilers, interpreters, documentation generators, and other tools used to produce a complete application.

JavaScript

JavaScript

JavaScript (JS) is an object-oriented programming language that allows engineers to produce and implement complex features within web browsers. JavaScript is popular because of its versatility and is preferred as the primary choice unless a specific function is needed. In this Zone, we provide resources that cover popular JS frameworks, server applications, supported data types, and other useful topics for a front-end engineer.

Languages

Languages

Programming languages allow us to communicate with computers, and they operate like sets of instructions. There are numerous types of languages, including procedural, functional, object-oriented, and more. Whether you’re looking to learn a new language or trying to find some tips or tricks, the resources in the Languages Zone will give you all the information you need and more.

Tools

Tools

Development and programming tools are used to build frameworks, and they can be used for creating, debugging, and maintaining programs — and much more. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools and can help ensure engineers are writing clean code.

Latest Premium Content
Trend Report
Platform Engineering and DevOps
Platform Engineering and DevOps
Trend Report
Developer Experience
Developer Experience
Trend Report
Low-Code Development
Low-Code Development
Refcard #400
Java Application Containerization and Deployment
Java Application Containerization and Deployment

DZone's Featured Coding Resources

Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch

Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch

By Krishnaveni Musku
Threat intelligence becomes operationally valuable when indicator data can be collected continuously, normalized into a consistent schema, and queried fast enough to support enrichment and detection workflows. Standardized exchange formats such as STIX and transport protocols such as TAXII exist specifically to make machine-readable cyber threat intelligence easier to distribute at scale, while preserving enough structure for downstream correlation and context. Operational Requirements That Shape Intelligence Pipelines A threat intelligence pipeline is best treated as data engineering with security-specific constraints: provenance must remain intact, updates and revocations must be representable, and “freshness” should be measurable rather than assumed. STIX is explicitly designed to model cyber threat intelligence using typed objects with attributes, and it supports building richer context by linking objects through relationships rather than shipping flat indicator lists. A practical pipeline design often separates raw ingestion from normalized storage. Raw ingestion preserves the original feed payload for auditability and reversibility, while normalized storage produces documents that are easy to match against telemetry. This split aligns with STIX’s modeling approach, where producers may publish Indicators expressed as STIX patterns and connect them to other objects through relationship constructs, enabling consumers to choose between lightweight atom extraction for matching and graph-style context for analysis. Pulling From TAXII and Other APIs Without Losing Provenance TAXII 2.1, published by OASIS Open, defines a RESTful API and related requirements for TAXII clients and servers to exchange cyber threat information in a scalable manner, with STIX 2.1 support described as mandatory to implement in the TAXII context. The IANA media type registration for application/taxii+json also documents that the older application/vnd.oasis.taxii+json name is a deprecated alias, which matters in real integrations because content negotiation and strict header validation vary by server implementation. TAXII 2.1 also formalized mechanics that directly affect pipeline correctness under load. The CTI documentation notes that TAXII 2.1 added limit and next URL parameters and updated content negotiation and media types, reflecting a move toward pagination patterns that can handle large or rapidly changing datasets more safely than item-based offset pagination. A Python pipeline can either implement paging logic directly or delegate it to a client library; the taxii2client project documents a TAXII 2.1 client API that uses application/taxii+json;version=2.1 for Accept handling and provides an as_pages helper for TAXII 2.1 endpoints that support pagination, including “Get Objects” and “Get Manifest.” Python def iter_taxii_objects(collection, cursor, page_size=2000): accept = "application/taxii+json;version=2.1" for page in as_pages(collection.get_objects, per_request=page_size, added_after=cursor, accept=accept): envelope = page if isinstance(page, dict) else page.json() for obj in envelope.get("objects", []): yield obj This pattern avoids embedding server-specific pagination tokens into pipeline logic while still enabling incremental collection reads. The cursor argument can be persisted as an ISO-8601 timestamp when the upstream provides a timestamp filter, a model commonly used by TAXII-feed vendors; for example, ESET documents STIX 2.1 feeds delivered via TAXII 2.1 collections and describes an added_after filter parameter for retrieving objects added after a specified timestamp, alongside retention constraints that make incremental pulls operationally necessary. Not all threat intelligence sources are TAXII-first. MISP Project documentation describes a REST-accessible STIX export capability and explicitly notes that STIX XML export can be slow and lead to timeouts with large events or collections, while STIX JSON avoids that regime, making JSON a more stable transport choice for high-volume automation. The same ecosystem provides a published OpenAPI specification and a dedicated converter library, misp-stix, which supports bidirectional conversion across STIX versions, including STIX 2.1, and includes features such as pattern parsing and indicator-observable fingerprinting, reducing the cost of maintaining bespoke mapping logic for every upstream source. Normalization Into ECS and STIX-Aware Semantics Normalization is where a pipeline either becomes queryable or becomes another archive. The Elastic Common Schema (ECS) threat field guidance explicitly frames threat.* as the mapping layer that normalizes threat intelligence indicators from many structures into consistent fields, and it links that normalization to detection and enrichment workflows such as indicator match rules. In particular, the guidance calls out normalizing indicators into threat.indicator.* so that disparate feeds can be queried consistently and used to build indicator matching logic without treating every provider as a special case. Atomic indicators benefit from being stored both as “typed value” and as vendor identifiers. ECS defines threat.indicator.type values aligned with cyber observable types and documents threat.indicator.id as a place to store indicator IDs, noting that a STIX 2.x indicator ID is a common approach and that the field can hold multiple values to represent the same indicator across systems. The practical implication is that a pipeline can preserve the upstream STIX identifier, attach a stable provider-local identifier when necessary, and still normalize the matchable indicator value into fields such as threat.indicator.ip or other threat.indicator.* subfields. Python def stix_confidence_to_nlmh(value): if value is None: return "Not Specified" v = int(value) if v == 0: return "None" if 1 <= v <= 29: return "Low" if 30 <= v <= 69: return "Medium" if 70 <= v <= 100: return "High" return "Not Specified" def extract_atomic_from_pattern(pattern): p = (pattern or "").strip() if "ipv4-addr:value" in p and "'" in p: return ("ipv4-addr", p.split("'")[1]) if "domain-name:value" in p and "'" in p: return ("domain-name", p.split("'")[1]) if "url:value" in p and "'" in p: return ("url", p.split("'")[1]) return (None, None) def stix_indicator_to_ecs(indicator_obj, provider, fetched_at_iso): itype, ivalue = extract_atomic_from_pattern(indicator_obj.get("pattern")) if not itype: return None doc = { "@timestamp": fetched_at_iso, "event": {"kind": "enrichment", "category": ["threat"], "type": ["indicator"]}, "threat": { "indicator": { "type": itype, "provider": provider, "name": indicator_obj.get("name") or ivalue, "description": indicator_obj.get("description"), "confidence": stix_confidence_to_nlmh(indicator_obj.get("confidence")), "reference": indicator_obj.get("id"), "id": [indicator_obj.get("id")], } }, "labels": {"feed": provider}, } if itype in {"ipv4-addr", "ipv6-addr"}: doc["threat"]["indicator"]["ip"] = ivalue return doc The extraction logic deliberately scopes itself to common “atomic” patterns to keep parsing deterministic and to minimize the risk of silently incorrect field derivation. This constraint matches the operational intent of ECS indicator guidance, which emphasizes consistent querying and reuse for indicator match rules after normalization, rather than attempting to fully interpret every possible composite STIX pattern in real time. Indexing Strategy in Elasticsearch That Avoids Accidental Cost Explosion Elasticsearch storage decisions are not purely operational preferences because they alter what update patterns are safe. Data streams consist of one or more hidden backing indices and require a matching index template; every document indexed into a data stream must include an @timestamp field mapped as a date-type (or date_nanos). Data streams are described as a good fit for most time-series use cases, while the documentation explicitly flags that frequent reuse of the same _id expecting last-write-wins can indicate a better fit for an index alias with a write index rather than a data stream. Threat intelligence pipelines often straddle that boundary: indicator state changes and revocations benefit from upsert semantics, while ingestion audits benefit from append-only history. Retention should be tied to query strategy. Elastic Security documentation warns that indicator match rules can consume significant resources and recommends limiting the indicator index query time range to the minimum necessary for coverage, with a default example query of the past 30 days. Even outside an alerting engine, a time-bounded indicator set tends to be operationally safer: it reduces scan cost, makes cache behavior more predictable, and avoids matching against long-expired infrastructure that is no longer relevant. When vendor retention is narrower, such as the 14-day retention window described for some TAXII feeds, the pipeline should persist that constraint as a policy and avoid relying on “full historical replay” as a recovery mechanism. Ingestion-Time Guardrails With Python, Ingest Pipelines, and Bulk Writes Ingest pipelines provide an explicit place to enforce normalization rules at ingest time. Elastic documentation describes ingest pipelines as a sequence of processors that run sequentially to transform data before it is indexed into a data stream or index, supporting operations such as removal, extraction, and enrichment. In addition, ingest processors can access ingest metadata under the _ingest key, and Elasticsearch notes that pipelines create _ingest.timestamp by default and that indexing ingest metadata requires explicitly setting it via a processor. JSON PUT /_ingest/pipeline/ti_normalize { "description": "Normalize threat intel indicators into ECS threat.indicator.*", "processors": [ { "set": { "field": "event.kind", "value": "enrichment" } }, { "set": { "field": "event.category", "value": ["threat"] } }, { "set": { "field": "event.type", "value": ["indicator"] } }, { "set": { "field": "event.ingested", "value": "{{{_ingest.timestamp}}" } }, { "fingerprint": { "fields": ["threat.indicator.provider", "threat.indicator.type", "threat.indicator.ip"], "target_field": "threat.indicator.fingerprint", "method": "SHA-256", "ignore_missing": true } } ] } Bulk ingestion should align with Elasticsearch’s wire format rules. The bulk API documentation describes NDJSON requirements, including that the final line must end with a newline character and that JSON actions and sources should not be pretty printed because newlines are literal delimiters. A Python producer can serialize documents into bulk batches, assign a deterministic _id derived from provider and atomic indicator value to make writes idempotent, and optionally route documents through the normalization pipeline configured above. Python def build_indicator_id(provider, itype, ivalue): return (provider + ":" + itype + ":" + ivalue).lower() def bulk_index_indicators(es_http, index_name, docs): lines = [] for d in docs: ti = d.get("threat", {}).get("indicator", {}) doc_id = build_indicator_id(ti.get("provider", "unknown"), ti.get("type", "unknown"), ti.get("ip", ti.get("name", "unknown"))) lines.append(encode_json({"index": {"_index": index_name, "_id": doc_id, "pipeline": "ti_normalize"})) lines.append(encode_json(d)) payload = "\n".join(lines) + "\n" return es_http.post("/_bulk", body=payload, headers={"Content-Type": "application/x-ndjson"}) The NDJSON newline termination is not optional, so building the payload in a way that always emits a trailing newline avoids a class of partial-ingest failures that are hard to diagnose under load. For enrichment use cases, ingest-time join behavior should be applied cautiously: Elastic warns that the enrich processor can impact ingest speed, recommends benchmarking, and explicitly states that it is not recommended for appending real-time data, instead working best with reference data that does not change frequently. This guidance aligns with threat intelligence practice: fast-changing indicators typically work better as a queried dataset, joined at search or detection time, rather than as an ingest-time enrichment applied to every event. Conclusion A threat intelligence pipeline built on Python, APIs, and Elasticsearch becomes reliable when it treats schemas, media types, and update semantics as core engineering concerns instead of integration details. STIX and TAXII provide standard object modeling and transport expectations, including content negotiation and pagination mechanics, while ECS provides a target schema that makes indicators consistently queryable and directly usable by matching workflows such as indicator match rules. High-quality implementations preserve provenance, normalize into threat.indicator.* with STIX-aligned confidence semantics, choose an indexing strategy that matches expected update patterns, and enforce ingestion guardrails through ingest pipelines, simulation, and NDJSON-correct bulk writes. More
Getting Started With Agentic Workflows in Java and Quarkus

Getting Started With Agentic Workflows in Java and Quarkus

By Shane Johnson
This post walks through building and running a real-world agentic workflow with Agentican and Quarkus. Specifically, an agentic workflow to automate market research and information sharing: Identify the top vendors within a market category.Research the positioning and strengths of each vendor.Classify the findings as either standard or urgent.Draft a brief to share with others in the company. Prerequisites QuarkusJava 25Maven (or Gradle)LLM provider API key Step 1: Add the dependency Create a Quarkus app, and add the Agentican Quarkus runtime module: XML <dependency> <groupId>ai.agentican</groupId> <artifactId>agentican-quarkus-runtime</artifactId> <version>0.1.0-alpha.3</version> </dependency> Step 2: Define Agents, Skills, and the Workflow Create an `agentican-catalog.yaml` file on the classpath. This is where you describe: Who does the work (agents)What they need to do it (skills)How they will do it (workflows) YAML agents: - id: researcher name: researcher role: | Expert at finding accurate, sourced information about companies and markets. Quotes sources. Distinguishes opinion from fact. - id: writer name: writer role: | Synthesizes research into structured, concise briefs. Avoids hedging language. Cites concrete evidence. skills: - id: web-search name: web-search instructions: | When a question requires external information, call the search tool first. Quote sources in your answer. Update the `agentican-catalog.yaml` file to define the workflow. YAML workflows: - id: market-brief name: market-brief description: Research vendors in a market and produce a structured brief outputStep: deliver params: - name: topic description: Market to research required: true - name: vendor_count description: Number of vendors defaultValue: "5" steps: - name: identify agent: researcher skills: [web-search] instructions: | Identify the top {{param.vendor_count} vendors in {{param.topic}. Return a JSON array of vendor names — names only, no commentary. - name: deep-dive type: loop over: identify steps: - name: analyze agent: researcher skills: [web-search] instructions: | Deep-dive vendor {{item}: positioning, key strengths, recent news. Quote sources. - name: classify agent: writer instructions: | Read the per-vendor deep-dives below. If any vendor has launched a competitive feature in the last 30 days, return the single word 'urgent'. Otherwise return 'standard'. Deep-dives: {{step.deep-dive.output} dependencies: [deep-dive] - name: deliver type: branch from: classify default: standard branches: - name: urgent steps: - name: urgent-brief agent: writer instructions: | Synthesize a vendor brief flagged URGENT for executive review. Lead with the recent competitive moves. Topic: {{param.topic} Deep-dives: {{step.deep-dive.output} - name: standard steps: - name: standard-brief agent: writer instructions: | Synthesize a vendor brief. Topic: {{param.topic} Deep-dives: {{step.deep-dive.output} A few things worth flagging: agent: researcher references the agent for a step, skills referenced by name, too.outputStep designates the step whose output becomes the workflow's typed result.{{param.X} interpolates workflow inputs into step instructions.{{step.X.output} interpolates an upstream step's output.{{item} is the current value inside a loop iteration.type: loop steps take an over reference (a step that produced a list, or a list-typed param).type: loop steps run their nested steps once per item, in parallel, and on virtual threads.type: branch steps take a from reference (a step whose output is used to select a branch).branches: mutually exclusive steps (or sets of steps) with default for unrecognized values. The framework loads agentican-catalog.yaml from the classpath, or you can define where it's loaded from: Properties files agentican.catalog-config=/etc/agentican/agentican-catalog.yaml Note: Agents, skills, and workflows can be defined via a fluent builder API as well. Step 3: Configure the Models Agentican reads the engine configuration from `application.properties`. The minimum is one LLM: Properties files agentican.llm[0].api-key=${ANTHROPIC_API_KEY} The provider defaults to `anthropic`, and the model defaults to `claude-sonnet-4-5`. Want OpenAI instead? Properties files agentican.llm[0].provider=openai agentican.llm[0].api-key=${OPENAI_API_KEY} agentican.llm[0].model=gpt-4o-mini Want to mix and match? Configure `name`s and reference them per-agent in the YAML catalog: Properties files agentican.llm[0].name=default agentican.llm[0].api-key=${ANTHROPIC_API_KEY} agentican.llm[1].name=efficient agentican.llm[1].provider=openai agentican.llm[1].api-key=${OPENAI_API_KEY} agentican.llm[1].model=gpt-4o-mini Step 4: Create a Typed Workflow Instance Define the workflow input and output records: Java public record ResearchParams(String topic, int vendorCount) {} public record VendorBrief(String topic, List<Vendor> vendors) { public record Vendor(String name, String positioning, List<String> strengths) {} } Then inject the typed workflow, and call it from a REST endpoint: Java @Path("/market-brief") public class VendorBriefResource { @Inject @AgenticanWorkflow(name = "market-brief") Workflow<ResearchParams, VendorBrief> brief; @POST @Path("/{topic}") public VendorBrief generate(@PathParam("topic") String topic) { return brief.start(new ResearchParams(topic, 5)).await(); } } Now, test the endpoint: Shell curl -X POST http://localhost:8080/market-brief/data%20observability%20platforms A few things worth flagging — they're what set this apart from a generic "call an LLM" library: ResearchParams.vendorCount becomes the workflow parameter vendor_count via SNAKE_CASE mapping.start() returns a WorkflowRun<VendorBrief> and await() parses the output step's text into a VendorBrief.@AgenticanWorkflow(name = "vendor-brief") resolves the registered workflow at injection time. Note: WorkflowRun itself exposes future() for a CompletableFuture<R>, and there's a ReactiveWorkflow<P, R> Mutiny variant for Vert.x stacks. Step 5: Add Agent Tools Agentican ships two integrations out of the box: MCP (Model Context Protocol) There is one config block per server. Tools are auto-discovered: Properties files agentican.mcp[0].slug=github agentican.mcp[0].name=GitHub agentican.mcp[0].url=https://mcp.github.com/sse agentican.mcp[0].headers.Authorization=Bearer ${GITHUB_TOKEN} Composio 100+ SaaS toolkits — Slack, Notion, Linear, Salesforce, GitHub, Google Workspace: Properties files agentican.composio.api-key=${COMPOSIO_API_KEY} agentican.composio.user-id=user-123 Tools are referenced by name within agent steps: YAML steps: - name: research agent: researcher tools: [github_search_repositories] instructions: "Profile open-source vendors in {{param.topic}." Structured agentic workflows for the JVM. Where to Go Next Getting Started — install, configure, and run workflowsCore Concepts — architecture, terminology, and data flowWorkflows & Steps — CDI surface, beans, qualifiers, override patterns.Agents — defining agents, skills, and rolesGetting Started (Quarkus) — dependency setup, config, first taskCDI Integration — injection, qualifiers, lifecycle events, bean overridesREST API — endpoints, SSE streaming, WebSocket, error codesObservability — Micrometer metrics, OTel tracing, Prometheus queries More
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
By Sangharsh Agarwal
Building AI-Powered Java Applications With Jakarta EE and LangChain4j
Building AI-Powered Java Applications With Jakarta EE and LangChain4j
By Otavio Santana DZone Core CORE
MuleSoft MCP and A2A in Production: What 17 Recipes Reveal
MuleSoft MCP and A2A in Production: What 17 Recipes Reveal
By Balachandra Shakar Bisetty
Migrate a Hardcoded LangGraph Agent to LaunchDarkly AI Configs in 20 Minutes
Migrate a Hardcoded LangGraph Agent to LaunchDarkly AI Configs in 20 Minutes

In this tutorial, you’ll run a small LangGraph agent locally, then migrate its hardcoded prompts, model choice, and tools into LaunchDarkly AI Configs. After the migration, every prompt tweak, model swap, or tool change ships as a LaunchDarkly update instead of a code deploy. The migration takes about 20 minutes. When you finish, the codebase will: Pull its system prompt, model name, and parameters from a LaunchDarkly AI Config on every request.Load its Tavily search tool definition from the same Config instead of a hardcoded module-level list.Emit duration, token, success, and error metrics to LaunchDarkly on each user turn.Have one offline-eval dataset staged for pre-rollout regression testing in the LaunchDarkly Playground.Fail gracefully by falling back to the original hardcoded values if LaunchDarkly is unreachable.Run A/B tests on models, prompts, parameters, and tool sets by creating variations and targeting them at user segments. Tutorial Summary The agent you’ll run is the official langchain-ai/react-agent template: a single-node React agent that uses Claude Sonnet and a Tavily search tool. The migration will pull three files into LaunchDarkly: The prompt in prompts.py,The model name in context.py, andthe tool list in tools.py. The aiconfig-migrate agent skill completes the work in five stages (audit, wrap, move tools, instrument, and attach evaluators). It pauses at the end of each stage for you to review. The provider call and the routing logic stay where they are. react-agent is one LLM that decides, one ToolNode that runs the tools the LLM asks for, and one conditional edge that loops between them. When you add a second agent with a handoff, you move the topology into a LaunchDarkly Agent Graph. This is a reviewer’s workflow, not a coding exercise. You ask your agent to run the aiconfig-migrate skill, then read the diffs and verify the skill got the audit, fallback, and tool schemas right. Every code sample below is an example of what your agent should produce, not something you should copy and paste. If you’d rather compare your migration to a finished one, the aiconfig-migrate branch of launchdarkly-labs/react-agent is the reference end state for this tutorial: the five stages applied against the upstream template, with AI Config-driven model, prompt, and tool wiring already in place. Prerequisites You’ll need: Python 3.11 or higher with uvA LaunchDarkly account with an AI project and access to your LaunchDarkly SDK keyAn Anthropic API key for Claude SonnetA Tavily API key for the search toolClaude Code (or another Claude Agent SDK client) with the LaunchDarkly agent skills installed and the LaunchDarkly MCP server configured. If you haven’t used skills before, the agent skills quickstart completes the setup in under 10 minutes. Clone the hardcoded starting point: Shell git clone https://github.com/langchain-ai/react-agent cd react-agent uv sync cp .env.example .env Specify an ANTHROPIC_API_KEY and TAVILY_API_KEY in .env. Then identify what’s hardcoded. The aiconfig-migrate skill’s first step is a read-only audit. Knowing the shape from the beginning makes the audit output easier to read. Here’s a table of the hardcoded values in react-agent: TitleFile:lineCurrent valueSystem promptsrc/react_agent/prompts.py:3"You are a helpful AI assistant.\n\nSystem time: {system_time}"Default modelsrc/react_agent/context.py:25"anthropic/claude-sonnet-4-5-20250929"max_search_resultssrc/react_agent/context.py:3310Toolsrc/react_agent/tools.py:17Tavily search function.bind_tools(TOOLS)src/react_agent/graph.py:37Binds the module-level listToolNode(TOOLS)src/react_agent/graph.py:73Runs the same list Skill Stage 1: Audit the Hardcoded Values Open Claude Code inside the cloned repo and run: Plain Text Migrate this app to LaunchDarkly AI Configs using the aiconfig-migrate skill. The skill starts by performing a read-only audit. It scans for hardcoded model and prompt values, identifies your package manager and provider, and produces a structured summary. For react-agent, the summary will look similar to this example: Python Language: Python 3.11+ Package manager: uv LLM provider: LangChain (init_chat_model) -> Anthropic Existing LD SDK: none Target mode: agent (LangGraph custom StateGraph) Hardcoded targets: - src/react_agent/prompts.py:3 SYSTEM_PROMPT (templated with {system_time}) - src/react_agent/context.py:25 model = "anthropic/claude-sonnet-4-5-20250929" - src/react_agent/context.py:33 max_search_results = 10 - src/react_agent/tools.py:29 TOOLS = [search] - src/react_agent/graph.py:37 .bind_tools(TOOLS) - src/react_agent/graph.py:73 ToolNode(TOOLS) Proposed plan: - Single AI Config key `react-agent` in agent mode - Stage 3 (tools) required, one tool (search) with schema extracted from the function signature via StructuredTool.from_function - Stage 4 (tracking) inline via LangChain callback handler - Stage 5 (evals) attached programmatically via create_judge - Existing Context dataclass becomes the fallback shape The skill stops here. Reply “continue” (or whatever affirmative response is appropriate for your shape) to begin Stage 2. Audit Output Can Vary If your audit output doesn’t match this, don’t continue without making improvements. The skill is designed to adapt. Read what it produces, reconcile that output against the table in Step 1, and tell the skill where it’s wrong. Iterate until the audit output addresses all the hardcoded values in the table. Skill Stage 2: Wrap the Call in the AI SDK This is the first stage where the skill writes code. It installs the SDK, creates the AI Config in LaunchDarkly, rewrites the hardcoded prompt to Mustache syntax, and adds a new ld_client.py module. To read the finished file, visit ld_client.py. Three things to check in the diff: The fallback mirrors the audit exactly. Every value you captured in Step 1 appears in FALLBACK with the same model name, provider, instruction text, and knob values. A drifted fallback silently changes behavior when LaunchDarkly is unreachable. max_search_results belongs in ModelConfig(custom={...}), not parameters={...}. parameters is forwarded to the provider SDK, and Anthropic, OpenAI, and Gemini all reject unknown kwargs.Model construction goes through create_langchain_model(ai_config), not a hand-rolled init_chat_model or load_chat_model wrapper. Hand-rolled builders only pass the model name, so variation parameters such as temperature, max_tokens, and top_p silently drop. If the template’s utils.load_chat_model is still present, have the skill delete it.{{ system_time } interpolation goes through the SDK, not a manual .replace(). The fourth argument to agent_config(...) is {"system_time": system_time}. If you see .replace("{{ system_time }", ...) at the call site, the skill missed the built-in interpolation. Verify both paths run before continuing. The skill won’t move to Stage 3 until both work. Here’s how to do that: In one terminal, start the dev server with your SDK key: Shell LD_SDK_KEY=sdk-... uv run --with "langgraph-cli[inmem]" langgraph dev --no-browser In a second terminal, invoke the graph once via the local API: Shell curl -s http://127.0.0.1:2024/runs/wait \ -H "Content-Type: application/json" \ -d '{ "assistant_id": "agent", "input": {"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}]} }' | jq '.messages[-1].content' A natural-language answer should appear. To make the LaunchDarkly-served path visually distinct from the fallback path, open the react-agent AI Config in LaunchDarkly, edit the default variation’s instructions, and append a sentence like: Always respond in over-the-top 1980s slang. Use words like “totally,” “rad,” “gnarly,” and “tubular.” Drop a “righteous!” somewhere. Save the variation, then re-run the curl command. Within a few seconds, you should see the answer come back with added 80s slang. That’s proof the LaunchDarkly-served prompt is winning over the hardcoded fallback. Next, stop the server, unset LD_SDK_KEY, restart it, and run the same curl call again. The slang should disappear, and the answer should read in the original neutral voice. That’s proof that the fallback, which still follows the pre-migration prompt exactly, runs when LaunchDarkly is unreachable. If you’d rather click through a chat UI, LangGraph Studio (free LangSmith login) and the hosted Agent Chat UI (point it at http://127.0.0.1:2024 with the graph id agent) both work against the same local server. Skill Stage 3: Move the Tool into the Config Stage 3 attaches the tool schema to the LaunchDarkly variation and rewires graph.py and tools.py to read the tool list from the AI Config using the skill’s tool factory pattern. Each tool is built by a factory that takes the per-run ai_config and returns a closure. The closure captures max_search_results, or any other model.custom knob, one time at the start of the turn, so the tool body never re-evaluates the AI Config. For the finished shape, visit tools.py and graph.py. The pattern, drawn verbatim from the reference repo: Python # Source of truth: launchdarkly-labs/react-agent@aiconfig-migrate src/react_agent/tools.py:15-42 def make_search(ai_config: AIAgentConfig) -> Callable[..., Any]: """Build a search tool that closes over this run's max_search_results. Capturing the value at run setup keeps it stable across the turn, so a mid-run flag flip won't change it between two tool calls. The tool body never re-evaluates the AI Config, which would emit an extra $ld:ai:agent_config event per tool call. """ max_results = ai_config.model.get_custom("max_search_results") or 10 async def search(query: str) -> dict: """Search for general web results. This function performs a search using the Tavily search engine, which is designed to provide comprehensive, accurate, and trusted results. It's particularly useful for answering questions about current events. """ return await TavilySearch(max_results=max_results).ainvoke({"query": query}) return search # Registry of tool factories keyed by the LD AI Tool name. Each factory takes # the per-run AI Config and returns the actual callable. graph.py materializes # this into {name: callable} on the first call_model tick. TOOL_FACTORIES: Dict[str, Callable[[AIAgentConfig], Callable[..., Any]]] = { "search": make_search, } graph.py materializes the factories inside call_model’s first-tick branch: built = {name: factory(ai_config) for name, factory in TOOL_FACTORIES.items()}, then update["tools"] = build_structured_tools(ai_config, built). Subsequent ticks read state.tools and pass it to create_langchain_model(ai_config).bind_tools(tools). For an exact sample, visit graph.py:50-63. Verify three things: The registry exports TOOL_FACTORIES and not a plain TOOL_REGISTRY of callables,Each factory returns a closure that reads model.custom values at construction time, not from inside the tool body, andbind_tools reads the materialized tool list off state instead of referencing the registry directly. build_structured_tools from ldai_langchain.langchain_helper wraps each built callable as a LangChain StructuredTool with the LD-served schema. Why the Factory Pattern Matters Reading ai_config.model.get_custom(...) from inside a tool body fires get_agent_config() on every tool invocation, inflating $ld:ai:agent_config event counts proportional to tool-call volume and letting a mid-turn flag change swap max_search_results between the first and second tool call. The factory captures the value one time at the start of the turn, preserves turn-level atomicity, and keeps agent_config evaluations at one per turn. Skill Stage 4: Wire the Tracker This is the stage where the graph topology changes. The migration adds a finalize node so every metric event for a user turn shares one runId, the unit LaunchDarkly bills and groups by in the Monitoring tab. A React agent turns loops through call_model several times to pick a tool, execute, and summarize. The at-most-once events, such as duration, tokens, success, and error, fire one time across that whole loop, not one time per tick. The three things to understand: Run-scoped state. On the first call_model tick of a turn, the migration resolves the AI Config, mints one tracker with ai_config.create_tracker(), materializes the tool factories into concrete callables, starts a perf_counter_ns timer, and stashes all of it on state. Every subsequent tick reuses what’s on state. The same tracker uses the same runId and results appear in one row per turn in Monitoring.Per-step events stay in call_model. tracker.track_tool_calls(...) is explicitly not at-most-once. It runs every tick that the LLM dispatches tools. Token usage accumulates into Annotated[int, add] state fields across ticks.Run-level events move to a new finalize node. track_duration, track_tokens, track_success, and track_error all fire there, one time per turn, reading totals off state. Read state.py for the run-scoped fields (ai_config, tracker, tools, start_perf_ns, three token counters, errored) and graph.py for the lazy-init prelude in call_model, the finalize node, and other details. Two SDK Details You Should Know ai_config.create_tracker() is a factory method as of launchdarkly-server-sdk-ai 0.18.0. If your skill emits ai_config.tracker instead of ai_config.create_tracker, regenerate. This migration workflow uses get_ai_usage_from_response rather than get_ai_metrics_from_response so the graph can accumulate tokens across ticks into state fields rather than tracking them synchronously per-call. Test this yourself by sending one request through the graph, then opening the AI Config in LaunchDarkly and reviewing the Monitoring tab. Within one or two minutes, you should see one row per user question with non-zero duration and token counts. If the tab fills up with multiple rows per question, the skill minted a tracker inside call_model instead of threading one through state. The Monitoring tab shows duration, token, and generation metrics for a migrated AI Config. Two Simplifications Compared to the Skill This repo collapses the setup steps of resolving the config, minting the tracker, and building the tools into the first tick of call_model instead of a dedicated setup_run node. It also skips track_metrics_of_async around ainvoke, which would fire duration and success per call rather than per turn. This helps produce a legible code diff, but production code should follow the skills setup_run and finalize factoring. If your app has a thumbs-up/down UI, the skill will also wire tracker.track_feedback(...). Feedback usually arrives in a later request from a different process, so pass tracker.resumption_token out to your frontend at call time and rebuild the tracker with LDAIClient.create_tracker(token, context) in the feedback handler. react-agent doesn’t have a feedback UI, so we’ve intentionally skipped this step. Keep Going The migration is done. The payoff is what you can do next without another code deploy: Reference implementation. Diff your own run against launchdarkly-labs/react-agent on the aiconfig-migrate branch to validate fallback shape, tool wiring, and tracker placement.Regression-test before rollout. Agent-mode Configs don’t support UI-attached automatic judges, so run an offline evaluation against a fixed dataset. The skill generates a starter datasets/react-agent-tests.csv from your audit; take it to the Offline Evaluation of RAG-Grounded Answers tutorial. The Accuracy judge at threshold 0.85, on a different model family than the agent, is the right starting point.Zero-code changes in production. Swap models per cohort, A/B test prompts or tool sets on 50/50 traffic, disable a tool for a segment, or watch duration, token spend, and eval scores land in the Monitoring tab in real time. All from the LaunchDarkly UI.Scale to a second agent. The moment you add a supervisor plus specialists or any routing handoff, move the topology itself into LaunchDarkly via ai_client.agent_graph("key", ld_context). The Beyond n8n tutorial walks the full pattern, and launchdarkly-labs/devrel-agents-tutorial (agent-skills branch) is the production-grade reference with three agents, per-user targeting, and dynamic routing.

By Scarlett Attensil
Alternative Structured Concurrency
Alternative Structured Concurrency

Java structured concurrency has been under development for a span of 5 years, weaving through 8 (!) distinct JEPs (JEP 428, JEP 437, JEP 453, JEP 462, JEP 480, JEP 499, JEP 505, JEP 525). To me, this feels rather excessive for what could be considered a fairly concise feature. My goal here is to experiment with an alternative approach that leverages Java's tried-and-tested, robust functionality available since JDK 1.5. It's possible this pathway could achieve better outcomes than what is proposed in JEP 505, which, from my perspective, introduces a suite of redundant interfaces and classes that replicate pre-existing ones. No doubt, developers need some governance, even in a relatively safe development environment like Java, with its automatic garbage collection, memory management, and strict typing. No matter how safe the provided path is, developers will still make mistakes, such as dereferencing nulls, using out-of-bound indexes, swallowing exceptions, and who knows what else. And, undoubtedly, concurrency is the hardest thing to get right — it's an endless source of bugs. But first, let me introduce some helper code that we will use throughout the article. Java // Example Proto package net.tascalate.concurrentx; // imports here public class FuturesDemo { static final ScopedValue<String> DEMO_SV = ScopedValue.newInstance(); // This emulates long-running calls // we need to execute asynchronously -- // all we do is returning value after the delay // or throw a supplied exception to emulate error private static <T> Callable<T> produceValue(T value, long delay) { return () -> { var start = System.currentTimeMillis(); try { System.out.println(">> Waiting value: " + value + " (SCOPED VALIUE IS " + DEMO_SV.orElse("<UNBOUND>") + ")"); Thread.sleep(delay); System.out.println(">> Producing value: " + value); if (value instanceof Exception) { throw (Exception)value; } else { return value; } } finally { var finish = System.currentTimeMillis(); System.out.println(">> Exiting " + value + ", " + Thread.currentThread() + ", done in " + (finish - start) + "ms, vs " + delay + "ms specified"); } }; } public static void main(String[] argv) { // implementation will be here } } According to Oracle, the majority of Java developers tend to approach concurrency execution in the following way (excerpt courtesy JEP 505, modified to use a helper code from above): Java // Example A - "unstructured concurrency" public static void main(String[] argv) throws InterruptedException, ExecutionException { var executor = Executors.newVirtualThreadPerTaskExecutor(); var start = System.currentTimeMillis(); try { Future<String> a = executor.submit( produceValue("A", 1000)); Future<LocalDateTime> b = executor.submit( produceValue(LocalDateTime.now(), 1500)); Future<BigInteger> c = executor.submit( produceValue(BigInteger.valueOf(42), 500)); var result = List.of(a.get(), b.get(), c.get()); System.out.println("*** ALL result: " + result); } finally { var finish = System.currentTimeMillis(); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); executor.shutdownNow(); } } Here, a range of critical problems lurk, several of which are detailed in the "Motivation" section of the JEP: In contrast to the above example, Oracle proposes the use of its structured concurrency API as a solution that, hypothetically, addresses these concerns: Java // Example B -- structured concurrency @SuppressWarnings("preview") public static void main(String[] argv) throws InterruptedException, ExecutionException { var start = System.currentTimeMillis(); try (var scope = StructuredTaskScope.open( StructuredTaskScope.Joiner.allSuccessfulOrThrow())) { var a = scope.fork(produceValue("A", 1000)); var b = scope.fork(produceValue(LocalDateTime.now(), 1500)); var c = scope.fork(produceValue(BigInteger.valueOf(42), 500)); scope.join(); var result = List.of(a.get(), b.get(), c.get()); System.out.println("*** ALL result: " + result); } catch (StructuredTaskScope.FailedException ex) { System.out.println("*** ALL exception: " + ex.getCause()); } finally { var finish = System.currentTimeMillis(); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); } } Let’s shift our focus back to the original code. After putting in diligent QA efforts, writing useful tests with good code coverage, and completing a thorough code review, what’s the developer’s next move? Most likely, they'll refine the initial code block to resemble the updated version below: Java // Example C - fixed "unstructured concurrency" from Example A public static void main(String[] argv) throws InterruptedException, ExecutionException { Future<String> a = null; Future<LocalDateTime> b = null; Future<BigInteger> c = null; var executor = Executors.newVirtualThreadPerTaskExecutor(); var start = System.currentTimeMillis(); try { a = executor.submit(produceValue("A", 1000)); b = executor.submit(produceValue(LocalDateTime.now(), 1500)); c = executor.submit(produceValue(BigInteger.valueOf(42), 500)); var result = List.of(a.get(), b.get(), c.get()); System.out.println("ALL result: " + result); } finally { var finish = System.currentTimeMillis(); Stream.of(a, b, c) .filter(Objects::nonNull) .forEach(f -> f.cancel(true)); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); executor.shutdownNow(); } } At a glance, this approach seems fairly effective — any remaining Features are canceled in the instance of an intermediate error, and all execution threads are properly terminated. However, there's still a fair amount of boilerplate code, which remains cumbersome to implement consistently. No problem, let's extract common functionality into some reusable class. Please see the TaskScope class in the Gist. By doing so, the code undergoes a noticeable transformation: Java // Example D - fixed "unstructured concurrency" from Example A // with a reusable TaskScope class public static void main(String[] argv) throws InterruptedException, ExecutionException { var start = System.currentTimeMillis(); try (var scope = new TaskScope( Executors.newVirtualThreadPerTaskExecutor())) { var a = scope.fork(produceValue("A", 1000)); var b = scope.fork(produceValue(LocalDateTime.now(), 1500)); var c = scope.fork(produceValue(BigInteger.valueOf(42), 500)); var result = List.of(a.get(), b.get(), c.get()); System.out.println("*** ALL result: " + result); } finally { var finish = System.currentTimeMillis(); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); } } Upon inspecting the Gist sources — which you absolutely should for understanding — you’ll notice something important: this implementation relies on Java version 1.8, released over 12 years ago. Furthermore, if it does not use java/util/stream/Stream, it can even run seamlessly on JDK 1.5! But hold on — why incorporate java/util/stream/Stream here? Quite frankly, it's the core of the proposal. Take example D above: it efficiently handles just one scenario, namely, waiting for all tasks to finish while throwing an error if any fail along the way. Support for different scenarios requires something a bit more sophisticated. The TaskScope implementation shared in the Gist translates a queue of completed Futures (irrespective of whether completion came via a result, error, or cancellation) directly into a Stream. Curious why this may be useful? Let's rewrite this boring example once again: Java // Example E - same as Example D but with Stream pipeline public static void main(String[] argv) { var start = System.currentTimeMillis(); try (var scope = new TaskScope( Executors.newVirtualThreadPerTaskExecutor())) { scope.fork(produceValue("A", 1000)); scope.fork(produceValue(LocalDateTime.now(), 1500)); scope.fork(produceValue(BigInteger.valueOf(42), 500)); var result = scope.completions() .map(Future::resultNow) .toList(); System.out.println("*** ALL result: " + result); } finally { var finish = System.currentTimeMillis(); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); } } This way, we just convert all the completed features into the list of results and keep our fingers crossed that there were no errors. Let’s turn all successfully completed futures into a result list, disregarding potential errors entirely. No exceptions will ever be thrown within this scope: Java var result = scope.completions() .filter(f -> f.state() == Future.State.SUCCESS) .map(Future::resultNow) .toList(); Or simply find the first result available: Java var result = scope.completions() .filter(f -> f.state() == Future.State.SUCCESS) .map(Future::resultNow) .findAny() .orElse("<NONE>"); Or, alternatively, select no more than the first N results: Java var N = 5; var result = scope.completions() .filter(f -> f.state() == Future.State.SUCCESS) .map(Future::resultNow) .limit(N) .toList(); In these two recent examples, any remaining futures will automatically be terminated once the try-with-resources block in the main method exits. Clearly, we can also handle errors while gathering results and terminate prematurely — if the code logic doesn't permit intermediate errors: Java var result = scope.completions() .peek(f -> { if (f.state() == Future.State.FAILED) throw new CompletionException(f.exceptionNow()); }) .map(Future::resultNow) .limit(2) .toList(); If you're already acquainted with JEP 505, you’ll understand what is being replaced here: StructuredTaskScope.Joiner. Now, you can mimic any type of "join" behavior without the need to subclass/implement StructuredTaskScope.Joiner. The Stream pipeline API over the completions queue serves as an expressive tool to achieve this out of the box. Plus, with the introduction of Gatherers, there’s room for truly ad hoc scenarios, such as managing result windows — think fixed-size batches of completed results processed as soon as they are ready. It’s also worth noting that in JEP 505, a certain StructuredTaskScope.Joiner implementations produce streams as their output. However, it’s the Joiner that determines when all forks have finished processing and opens the resulting stream post-join. In the alternative methodology described here, the decision of where and how joins occur resides within user-defined scope-flow logic. It’s a lazy, on-demand process — guided by conditions that may take more into account than just Future results. For instance, elements like internal object state or in-scope variables can directly influence decisions about which results to collect and which errors, if any, can be disregarded in the operation. Now to the real challenge. A notable limitation with the code given is its inability to propagate context, namely, the current ScopedValue-s bindings. This characteristic is sometimes cited as a primary strength of JEP 505 StructuredTaskScope. To be fair, one might argue it's an unfair advantage, one that exists solely because JDK-internal mechanisms make it achievable. Current bindings are captured and propagated by using jdk/internal/misc/ThreadFlock — a utility inaccessible to code outside of the JDK. Perhaps, in a more ideal universe, there is a JDK 25, equipped with the following official API for java/util/concurrent/ThreadFactory, introducing possibilities for bridging this gap: Java public interface ThreadFactory { abstract Thread newThread(Runnable code); default ThreadFactory captureContext() { ThreadFactory delegate = this; Object currentScopedValueBindings = SomeInternalClass.captureValueBindingsForTheCurrentThread(); return new ThreadFactory() { public Thread newThread(Runnable code) { Thread result = delegate.newThread(code); SomeInternalClass.applyValueBindings(result); return result; } }; } } But that's not the case for us. Thankfully, the classes from the java/util/concurrent package offer immense customizability and are remarkably adaptable tools (a big nod to Dr. Douglas S. Lea for this). So you can find another class, TaskScopeContextual, in the same Gist. This class adopts StructuredTaskScope to the ExecutorService API, solely aimed at promoting ScopedValue bindings for forked tasks. The following example highlights all the advantages of employing this alternative structured scope design: Java // Example F - true structured concurrency with context passing public static void main(String[] argv) { var start = System.currentTimeMillis(); ScopedValue.where(DEMO_SV, "VALUE_DEFINED_IN_MAIN").call(() -> { try (var scope = new TaskScopeContextual()) { scope.fork(produceValue("A", 1000)); scope.fork(produceValue("B", 2000)); scope.fork(produceValue("C", 2000)); scope.fork(produceValue("D", 2000)); var timeout = scope.fork(produceValue(null, 2500)); scope.fork(produceValue("E", 2000)); scope.fork(produceValue("F", 3000)); scope.fork(produceValue("G", 3000)); var result = scope.completions() .takeWhile(f -> f != timeout) .filter(f -> f.state() == Future.State.SUCCESS) .limit(6) .map(Future::resultNow) .sorted() .toList(); System.out.println("*** ALL result: " + result); } finally { var finish = System.currentTimeMillis(); System.out.println( "*** Exiting main, executed in " + (finish - start) + "ms"); } return null; }); } Take note of the elegant handling of timeouts with Streams. Unlike the current approach in JEP 505, there's no necessity to incorporate it into the API. In summary, here’s a recap: There's no requirement for StructuredTaskScope.Subtask — the existing java/util/concurrent/Future API already does the job adequately. Consequently, the inclusion of StructuredTaskScope.Subtask.State is redundant — even with the current JEP 505, Future.State is more than sufficient. StructuredTaskScope.Joiners demand subclassing for all but the simplest cases. A java/util/stream/Stream pipeline over the completed futures would serve as a much more convenient solution. The StructuredTaskScope.FailedException feels unnecessary — even in the current API, java/util/concurrent/CompletionException fulfills the same purpose just fine. Built-in StructuredTaskScope timeouts possess timing characteristics that are challenging to predict (e.g., try adding lengthy blocking calls before the initial fork). It's far simpler and more controlled to handle timeouts explicitly. I'm really interested to hear readers' opinions. Do you share my ideas or do you support JDK team's statement that Futures "are counterproductive in structured concurrency" (see the "Alternatives" section of JEP 505)? Would you say that the well-known and adaptable Stream API is superior to Joiners or strict set of Joiners is simpler?

By Valery Silaev
Optimizing Databricks Spark Pipelines Using Declarative Patterns
Optimizing Databricks Spark Pipelines Using Declarative Patterns

If you've ever inherited a Spark job that runs in 35 minutes and someone asks you to make it faster, you know the routine. You start by checking partition counts, then file sizes, then shuffle stages, then broadcast hints. You find a handwritten OPTIMIZE schedule from 2022, a Z-ORDER on the wrong column, and a cluster sized for last year's data volume. By the time you've made the job fast, you've absorbed three new things to maintain. The next person to inherit it will absorb four. This pattern — call it the hand-tuning treadmill — is what the declarative optimization story on Databricks is trying to break. It's not a single feature; it's a cluster of capabilities that collectively let teams describe what a table should look like and let the engine handle the physical optimizations. What follows is the practical view of those patterns: where they fit, what they replace, and how to migrate without a rewrite weekend. 1. The Hand-Tuning Treadmill: Why Imperative Optimization Doesn't Scale Before getting into the declarative side, it's worth being concrete about what "imperative Spark optimization" actually means in production. The shape is consistent across teams I've audited: Layout decisions frozen on day one. Somebody picks a partition column when the table is created. The data shape changes a year later. Nobody re-partitions because the migration is scary. Query plans drift toward full scans.Maintenance jobs that nobody owns. An OPTIMIZE / Z-ORDER / VACUUM script lives in a notebook scheduled at 3 AM. It runs on a cluster that's slightly mis-sized. When data volume grows, the job runs into the morning workload, and people complain about latency.Cluster sizing as a guess. Worker count is a heuristic from a senior engineer's memory of last year's spike. Half the time it's too big, half the time it's too small, and the cost discussion gets emotional.Hint-driven plans. Broadcast hints, repartition hints, coalesce (N) — sprinkled through pipelines to fix yesterday's problem, kept indefinitely because removing them feels risky. None of these are bugs. They're symptoms of the imperative model: the team owns the layout, the maintenance, the sizing, and the plan tuning. In small pipelines, ownership is fine. At scale, it becomes the bottleneck that the team can't outsource. 2. What "Declarative" Means in the Spark Optimization Context Declarative is a word that gets used in two different ways here, and it's worth pulling them apart. Within Lakeflow pipelines (formerly DLT), it means "describe the tables, not the steps" — the engine builds the DAG and runs it. But in the broader optimization story, declarative also means "describe the desired property of the table or workload, not the operations to maintain it": Layout: I want this table clustered by these columns; figure out when and how to re-cluster.Maintenance: I want this table optimized and vacuumed; figure out the schedule.Ingestion: I want all new files in this path picked up exactly once; figure out checkpointing and listing.Quality: These rows must satisfy these expectations; enforce them and report what gets dropped.Compute: I want this query fast and not wasteful; size and scale appropriately. Each one of those bullets corresponds to a piece of the declarative stack. Used together, they replace a remarkable amount of the boilerplate that has historically lived in Spark pipelines. The mental shift: You stop writing operations against the table and start writing properties of the table. The engine becomes the actor; you become the editor. 3. The Declarative Optimization Stack on Databricks The chart below maps each thing the team declares to the engine capability that handles it, ending at the physical Delta table. It's the picture I draw on whiteboards when teams ask, "What's the order to adopt these in?" Figure 1. The declarative optimization stack: each user-facing intent at the top maps to a continuous engine behavior, which keeps the underlying Delta tables well-clustered, compacted, and statistically up-to-date — without human intervention. Two things are worth highlighting in this picture. First, every box in the engine row is something that runs continuously, not on a cron — there is no daily "optimization window" anymore. Second, the bottom layer is identical to what you'd get from any well-tuned imperative pipeline: 256 MB Parquet files with current statistics. The declarative path doesn't change what good looks like; it changes who does the work to keep things looking good. 4. Layout: Liquid Clustering Replaces Hand-Maintained Z-ORDER Liquid Clustering is the change with the largest practical impact, because partition-key choices are where most lakehouse pipelines accumulate the most technical debt. The declarative version: you specify the columns the data is most often filtered or joined by, and the engine maintains a layout that supports those access patterns — incrementally, as new data arrives, without a full rewrite. When access patterns change, you change the cluster columns, and the engine re-clusters in the background. Defining Liquid-Clustered Tables SQL -- New table, clustered by the columns most commonly filtered on. -- No more PARTITIONED BY, no more guessing at partition cardinality. CREATE TABLE prod.gold.daily_totals ( account_id STRING, region STRING, ingest_date DATE, daily_total DECIMAL(18,2), txn_count BIGINT ) USING DELTA CLUSTER BY (region, ingest_date, account_id); -- Even better: let the engine pick the clustering columns by -- observing real query patterns over time. CREATE TABLE prod.gold.events_clustered USING DELTA CLUSTER BY AUTO AS SELECT * FROM prod.silver.events; Migrating an Existing Partitioned/Z-ORDER Table SQL -- Convert a legacy partitioned table to liquid clustering. -- Existing data files are not rewritten immediately; the engine -- rebalances incrementally on subsequent writes + maintenance. ALTER TABLE prod.silver.transactions CLUSTER BY (account_id, ingest_date); -- Force the first clustering pass for a freshly converted table OPTIMIZE prod.silver.transactions FULL; Why this matters: the recurring 2 AM Slack thread of "can we re-partition this table?" goes away. Layout becomes a property you change with one DDL statement, not a multi-week rewrite project. 5. Maintenance: Predictive Optimization Replaces Cron-Driven OPTIMIZE/VACUUM Predictive optimization is the part that retired the most legacy code in the pipelines I've migrated. Once enabled at the catalog or schema level, the engine monitors each table's read and write patterns and decides on its own when to compact files, re-cluster, vacuum, and refresh statistics. The big win isn't the operations themselves — the imperative pipeline could already run those — it's that the timing is observed-driven, not schedule-driven. Tables that get heavy ingestion get more frequent maintenance. Cold tables get left alone. SQL -- Turn it on at the catalog level once; new tables inherit. ALTER CATALOG prod SET PREDICTIVE OPTIMIZATION = ENABLED; -- Or at the schema level for a phased rollout ALTER SCHEMA prod.gold SET PREDICTIVE OPTIMIZATION = ENABLED; -- Inspect what the engine has been doing on a given table SELECT operation, operation_metrics.numFilesAdded AS files_added, operation_metrics.numFilesRemoved AS files_removed, operation_metrics.numOutputBytes AS output_bytes, timestamp FROM (DESCRIBE HISTORY prod.gold.daily_totals) WHERE userMetadata IS NULL -- engine-driven, not user AND operation IN ('OPTIMIZE', 'VACUUM') AND timestamp >= current_timestamp() - INTERVAL 7 DAYS ORDER BY timestamp DESC; What you should delete after enabling this: the nightly notebook that runs OPTIMIZE on every table in a schema, the VACUUM cron job, the ANALYZE TABLE wrapper, and the alerting that wakes someone up when those jobs run long. None of them are needed anymore, and leaving them on creates duplicate work that the engine and the cron will fight over. 6. Ingestion: Auto Loader Replaces Listing-Based File Detection Auto Loader is the declarative answer to the perennial "which files have we processed already?" problem. Instead of listing a directory, comparing it to a state file, and figuring out the new bits, you describe the source location and the format and let the engine maintain its own incremental state. It uses cloud-native event notifications (S3 events, ADLS notifications, or efficient directory listing as a fallback), and the checkpoint is just another piece of state the engine owns. Python from pyspark.sql.functions import current_timestamp # Streaming ingest from S3 with schema inference + evolution. # Replaces hand-maintained checkpointing, listing logic, and # whatever file-tracking table the team built two years ago. (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaLocation", "s3://acme-checkpoints/txns_schema") .option("cloudFiles.schemaEvolutionMode", "addNewColumns") .load("s3://landing/txns/") .withColumn("_ingest_ts", current_timestamp()) .writeStream .format("delta") .option("checkpointLocation", "s3://acme-checkpoints/txns_writer") .trigger(availableNow=True) # batch-style; runs to completion .toTable("prod.bronze.txns")) Two notes from production. First, schemaEvolutionMode is the option that prevents the silent-data-loss class of bugs when partner schemas change; pick the policy explicitly rather than letting it default. Second, trigger(availableNow=True) gives you batch ergonomics on a streaming source — the job runs until it has consumed everything and exits, which is what most teams actually want for daily ingestion. 7. Transforms and Quality: Declarative Pipelines Replace Bare Spark + External DQ The final piece is the transformation layer. Lakeflow pipelines (the rebrand of Delta Live Tables) let you declare each table as a Python or SQL definition, and add expectations as a first-class concept. The engine derives the DAG from the dependencies and enforces the expectations on every write — the data quality framework, the lineage layer, and the orchestration glue collapse into a single artifact. Python import dlt from pyspark.sql.functions import sum as _sum, col @dlt.table( name="silver_txns", table_properties={ "delta.enableChangeDataFeed": "true", "delta.tuneFileSizesForRewrites": "true", }, cluster_by=["account_id", "ingest_date"], ) @dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL") @dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP')") @dlt.expect("unique_txn", "txn_id IS NOT NULL") def silver_txns(): return (dlt.read_stream("bronze_txns") .dropDuplicates(["txn_id"])) @dlt.table(name="gold_daily_totals") def gold_daily_totals(): return (dlt.read("silver_txns") .groupBy("ingest_date", "account_id", "region") .agg(_sum("amount").alias("daily_total"))) The decorators do four things at once: define the table, declare its layout (cluster_by), declare its quality rules, and let the engine infer that gold_daily_totals depends on silver_txns from the dlt.read call. There is no DAG file. There is no separate Great Expectations suite. Lineage is generated for free in Unity Catalog, including column-level edges. If you want to query how the expectations have been performing — useful for SLO dashboards or alerting — the event log surfaces it directly: SQL -- Pass / fail / drop counts per expectation, last 24 hours SELECT flow_name, details:flow_progress.data_quality.expectations[0].name AS exp_name, details:flow_progress.data_quality.expectations[0].passed_records AS passed, details:flow_progress.data_quality.expectations[0].failed_records AS failed, details:flow_progress.data_quality.expectations[0].dropped_records AS dropped, timestamp FROM event_log("<pipeline-id>") WHERE event_type = 'flow_progress' AND timestamp >= current_timestamp() - INTERVAL 1 DAY ORDER BY timestamp DESC; 8. Putting It Together: Where to Start, What to Measure Adopting all of this at once is a recipe for pain. The order I've seen work, and a small set of metrics to verify the change is paying off: Step Adopt Retire Verify with 1 Predictive optimization at schema level Nightly OPTIMIZE / VACUUM jobs Reduction in maintenance-cluster cost 2 Liquid clustering on top 5 tables Static partitioning + Z-ORDER p95 query latency on the same workloads 3 Auto loader for 1-2 ingestion pipelines Custom file-tracking + listing logic End-to-end data freshness 4 Lakeflow pipelines for new pipelines only External DQ + DAG glue (for new work) Lines of pipeline code per table 5 Serverless compute for SQL warehouses + DLT Hand-sized job clusters Cost-per-query, scale-up time What you do not need to migrate: imperative pipelines that already work and aren't growing. Declarative patterns are about new work and high-pain hot spots, not a heroic rewrite of every notebook ever shipped. 9. Honest Limitations and Where Imperative Still Wins Three places where the declarative model still bites — worth knowing before you commit: Procedural logic still belongs in Jobs. If your pipeline is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table. Don't try to bend dlt around it.Predictive optimization needs observation time. On a table that's a week old, the engine hasn't seen enough patterns to make great decisions. For tables under heavy initial load, an explicit OPTIMIZE FULL after the first big ingest still helps.Cluster-by-column choice still matters. CLUSTER BY AUTO is great for stable workloads with predictable filters. For tables whose access pattern is genuinely heterogeneous across teams, an explicit cluster-by based on the dominant query is usually faster.Hint-driven escapes are still allowed. If a particular query benefits from a /*+ BROADCAST(t) */ hint and AQE isn't catching it, the hint is fine. Just keep them rare and document why. Conclusion The declarative optimization story isn't a single feature you toggle — it's a quiet shift in who owns the boring parts of a Spark pipeline. Layout, maintenance, ingestion bookkeeping, plan tuning, cluster sizing, data quality enforcement: every one of those was traditionally a thing the team owned and paid for in toil. The current Databricks stack lets you express each as an intent and let the engine handle the operations underneath. Adopt them in order, retire what they replace, and the optimization treadmill slows from a daily concern to a quarterly review. That's the actual win, and it's the reason the declarative paradigm has gone from a Lakeflow detail to the default mental model for new pipelines on Databricks.

By Seshendranath Balla Venkata
MuleSoft IDP: Enhancing Efficiency and Accuracy in Data Extraction
MuleSoft IDP: Enhancing Efficiency and Accuracy in Data Extraction

This article will help developers, architects, and readers understand MuleSoft's Intelligent Document Processing capabilities and functionality. After reading this article, the reader will understand how to use MuleSoft Intelligent Document Processing and the different use cases where it can be used. MuleSoft Intelligent Document Processing (IDP) helps you read and understand documents like invoices, purchase orders, and other structured or unstructured files. Using AI, it analyzes these documents and extracts key information, converting it into a clean, structured format. It uses AWS Textract to pull data from PDFs and images, making it easy to handle different document types. The extracted information can then be seamlessly integrated with tools like Anypoint Platform, MuleSoft RPA, Salesforce Flow, and Anypoint Composer, helping automate processes and improve efficiency. Use Cases of MuleSoft Intelligent Document Processing In many organizations, documents such as purchase orders and invoices arrive in a variety of formats and layouts, which makes manual processing time-consuming and error-prone. MuleSoft Intelligent Document Processing (IDP) addresses this challenge by automating the extraction and processing of information from both structured and unstructured documents. Reducing repetitive manual tasks helps teams save time, improve accuracy, and scale operations efficiently as document volumes increase. Beyond invoices and purchase orders, MuleSoft IDP can be applied across multiple industries and use cases. It is commonly used for processing healthcare documents and patient information, managing contracts, handling loan and insurance claim documents, and working with education and government records. It is also useful for organizing and extracting insights from legal documents, making it a versatile solution for any document-heavy workflow. MuleSoft IDP supports commonly used file formats such as PDF, PNG, and JPEG, allowing businesses to work with both digital and scanned documents. It also supports multiple languages, including English, Spanish, and German, making it suitable for global use cases. To ensure secure and efficient operations, MuleSoft IDP follows defined data retention policies along with certain limits and quotas. These controls help manage how long data is stored and how much processing can be performed, depending on the specific configuration and usage. For precise details, refer to MuleSoft’s official documentation. The following are the data retention policy and limits/quotas supported by the MuleSoft Intelligent Document Processing: Figure 1: MuleSoft intelligent document processing limit and supported The following is the data retention policy for the MuleSoft Intelligent Document Processing: Document Action Editor Modified files are temporarily stored in the Document Action Editor while testing is underway.When the editor is closed or a new file is uploaded, the extracted data is automatically deleted. Document Action Execution Endpoint Data is safely stored while it is being executed, and files are kept in an S3 bucket.Keeps data on successful executions for 7 days.Data is kept for seven days following task completion for executions that need human review. Keeps unfinished work for 60 days.Users are unable to set retention periods. IDP extracts data by default using its natural language processing model (IDP NLP) in response to the preset prompts. You can choose Einstein to examine the document and extract the data when you create a document action. Use Einstein to extract information from unstructured documents, such a driver’s license, insurance claim record or a medical record with handwriting, or to find the answer to complicated queries concerning the document, like how much an invoice will cost after taxes and other considerations. To analyze and understand documents, MuleSoft Intelligent Document Processing (IDP) uses a combination of AI models rather than relying on a single one. Through Salesforce Einstein, it leverages multiple advanced large language and multimodal models via the Einstein Trust Layer, which ensures secure and governed access to AI capabilities. Some of the key supported models include: Einstein OpenAI GPT-4o – A strong general-purpose model suitable for most document processing tasks. It performs well even with non-Latin languages and can identify layout elements like font sizes and styles. However, it has lower accuracy when reading checkboxes in forms, so prompting it clearly (for example, asking it not to assume missing data) helps improve results.Einstein OpenAI GPT-4o Mini – A faster model designed for more focused tasks. While it delivers quick responses, it may sometimes show less detailed reasoning. It also has limitations in accurately interpreting checkboxes in forms.Einstein Gemini 2.0 Flash 001 – Particularly effective for image-heavy documents, offering better accuracy in visual analysis. It provides moderate accuracy for checkbox detection, especially when documents are processed one page at a time, and supports structured outputs.Einstein Gemini 2.5 Flash – An improved version of the Gemini Flash model, offering faster performance and higher accuracy, especially for image-based documents and complex layouts. In addition to these models, MuleSoft IDP uses AWS Textract to extract text, tables, and key-value pairs from documents such as PDFs and images. It may also incorporate other AI services from Salesforce and AWS to improve tasks like document classification, entity recognition, and data extraction. By combining these models and services, MuleSoft IDP can not only extract information from documents but also understand context and structure, making the data ready for seamless integration and automation across platforms. A document action is a multi-step procedure that scans a document, filters out fields, and returns a structured response in the form of a JSON object using several AI engines. Every document action specifies the fields to be extracted, the fields to be filtered out of the response, and the kinds of documents it accepts as input. The following are the components of document action: Document types: outlines the kinds of documents that are acceptable for input.To extract fields: specifies which document fields should be extracted.Fields to filter out: Specifies which response fields should be excluded.Confidence score: Indicates the likelihood that the value was accurately extracted by IDP.Prompts: Asks questions in natural language to improve the data extraction procedure.Reviewers: Specifies who examines documents that fall below the threshold for confidence scores. You can set the minimum confidence score that is acceptable for each field to be extracted, mark fields as necessary, hide fields, and set up prompts to improve and hone the data-extraction process by posing natural language queries. The probability that IDP correctly extracted the value from a document is indicated by the confidence score. A 100% confidence score, for instance, indicates that IDP extracted the value completely accurately. A 70% confidence level, on the other hand, indicates that there is a 20% possibility that the extracted value is incorrect. Publishing to Anypoint Exchange MuleSoft Intelligent Document Processing allows you to publish the document action into the Anypoint Exchange as an asset that provides the following endpoints. POST/executions  –  This endpoint allows you to post the document to MuleSoft Intelligent Document Processing for scanning and extracting data. The following curl command can be used to post the document: Shell curl -H "authorization: Bearer <Bearer_token>" \ -F "file=@\"test.pdf\"" \ https://{idp_domain}.us-east-1.anypoint.mulesoft.com/api/v1/organizations/{organizationId}/actions/{documentActionId}/versions/{assetVersion}/executions GET /executions/{executionId}  –  This endpoint will allow you to retrieve the execution status (Success or Manual Validation Required) with the fields and prompt response for the document that has been posted to the IDP. The following curl command can be used to retrieve IDP execution status: Shell curl -H "Authorization: Bearer <Bearer_Token>" \ https://<idp_domain>.us-east-1.anypoint.mulesoft.com/api/v1/organizations/{organizationId}/actions/{documentActionId}/versions/{assetVersion}/executions/{executionId} To access the above APIs, you need to generate an authorization token. To generate an authorization (Bearer) token, you need to create a connected app in the Anypoint Platform with the scope “Execute Document Actions.” Once you have registered the connected app, it will provide the clientId and clientSecret, which can be used in the curl command below to generate the Authorization (Bearer) token. Shell curl -X POST -H "content-type: application/json" -d "{\"grant_type\": \"client_credentials\", \"client_id\": \"<Client_Id>\", \"client_secret\": \"<Client_Secret>\"}" \ https://anypoint.mulesoft.com/accounts/api/v2/oauth2/token Bearer token received in the response that can be used in the above POST and GET requests. Benefits of MuleSoft Intelligent Document Processing The following are the benefits of MuleSoft Intelligent Document Processing: Reduce cost: Intelligent document processing can lower the cost by automating the manual document data extraction.Improve efficiency and productivity: Intelligent document processing can work around the clock and process documents more quickly than manual methods. Intelligent document processing can work around the clock and process documents more quickly than manual methods.Reduce human errors: By using automated extraction and validation, Intelligent document processing can reduce human error and guarantee data consistency.Improve accuracy: By using automated extraction and validation, Intelligent document processing can reduce human error and guarantee data consistency.Easy to integrate: MuleSoft IDP can be easily integrated with Robotic Automation Process (RPA) by using the REST APIs provided by the IDP. Conclusion In conclusion, MuleSoft Intelligent Document Processing (IDP) offers a powerful and practical way for organizations to modernize how they handle documents. By combining AI-driven extraction with seamless integration capabilities, it helps reduce manual effort, improve data accuracy, and accelerate business processes. As businesses continue to deal with large volumes of unstructured data, solutions like IDP become increasingly important. They not only streamline operations but also support better compliance, scalability, and decision-making. By adopting MuleSoft IDP, organizations can enhance productivity, lower operational costs, and ultimately deliver a better experience to their customers. MuleSoft Intelligent Document Processing (Invoice Processing Using MuleSoft IDP): Part I MuleSoft Intelligent Document Processing (Invoice Processing via IDP REST APIs):  Part II MuleSoft Intelligent Document Processing: Generic Document Processing With IDP: Part III MuleSoft Intelligent Document Processing: Generic Document Processing via IDP REST APIs: Part IV MuleSoft Intelligent Document Processing: IDP Callback and Automation: Part V MuleSoft Intelligent Document Processing: Supercharge Document Automation With Einstein AI: Part VI

By Jitendra Bafna
Jakarta EE 12: Entering the Data Age of Enterprise Java
Jakarta EE 12: Entering the Data Age of Enterprise Java

For decades, Jakarta EE has addressed the challenge of building enterprise systems that endure technological change. The platform has evolved from monoliths to microservices, from application servers to Kubernetes, and from relational databases to distributed data platforms, all while maintaining its core strength: compatibility. Jakarta EE 12 marks another significant transition, shifting the focus beyond cloud-native infrastructure and APIs to prioritize data. Modern enterprise systems now operate in diverse environments that extend beyond relational databases and synchronous CRUD applications. Current architectures integrate SQL, document databases, graph engines, key-value stores, event streams, vector databases, and AI-driven workflows. The primary challenge is to provide a unified programming model that manages fragmented data ecosystems without vendor lock-in or frequent application rewrites. Jakarta EE 12 addresses this by elevating querying, data access, initialization, and semantic consistency to platform-level concerns. This release marks the beginning of the “Data Age” for Enterprise Java. Central to this evolution is Jakarta Query, a unified semantic query model that connects Jakarta Persistence, Jakarta Data, and Jakarta NoSQL through a common abstraction. Rather than having each specification define its own querying semantics, Jakarta EE 12 introduces a shared language that spans multiple persistence technologies while supporting specialized execution models. This architectural shift reduces ecosystem fragmentation and delivers a more consistent developer experience for polyglot persistence systems. Jakarta EE 12 also extends beyond traditional dependency injection and request processing. CDI now offers more predictable startup and lifecycle management, which is essential for cloud-native deployments, serverless runtimes, AI orchestration, and agent-based architectures. With Java 21 as the new platform baseline, Jakarta EE is positioned as a modern platform that supports long-lived, adaptive systems in a data- and AI-driven world. This article will examine how Jakarta EE 12 transforms the enterprise Java ecosystem through Jakarta Query, Jakarta Data, Jakarta NoSQL, CDI 5.0, Persistence 4.0, and new initiatives such as Jakarta Agentic AI. We will also discuss how these specifications form a unified platform strategy that simplifies enterprise development while maintaining the stability and interoperability that have made Java a leading software ecosystem. The Evolution of Enterprise Complexity Software architecture has consistently evolved to address complexity. Initially, organizations relied on centralized mainframes, where applications, infrastructure, and data resided in a single environment. The shift to client-server and three-tier models introduced distributed systems, separating presentation, business logic, and persistence into distinct layers. Today, cloud-native systems span clusters, distributed networks, containers, Kubernetes, edge devices, and globally replicated databases. Modern enterprise software functions as an ecosystem of interconnected services across infrastructure that developers may not fully control. This evolution has significantly increased the cognitive demands on software engineers and architects. Today’s technology landscape includes a wide array of frameworks, runtimes, databases, APIs, messaging systems, orchestration platforms, and AI-driven tools. Developer experience is now a competitive market, with platforms promising productivity, simplicity, and scalability. Engineers must continually balance trade-offs among performance, consistency, scalability, operational complexity, and vendor lock-in. The industry also faces the “hype effect,” where technologies gain popularity before their long-term impacts are fully understood. As systems became more distributed, architectural styles proliferated. Traditional layered architectures now exist alongside microservices, event-driven systems, CQRS, orchestration platforms, microkernels, and domain-driven designs. Each style addresses specific challenges. Microservices enhance deployment independence, event-driven systems improve scalability and resilience, and CQRS manages complex read-and-write workloads. However, this variety has led to fragmentation. Developers must now master not only programming languages and frameworks, but also distributed systems theory, consistency models, observability, fault tolerance, asynchronous communication, and operational automation. Data complexity has evolved similarly. For decades, enterprise applications relied primarily on relational databases and SQL. Today, organizations use document databases, graph databases, key-value stores, wide-column engines, streaming systems, vector databases, and combinations of these. This trend, known as polyglot persistence, reflects the fact that different data models address different business needs. For example, recommendation engines may require graph traversal, financial systems depend on transactional consistency, and AI systems increasingly use vector similarity search. As a result, enterprise development now extends beyond writing business logic. Engineers must manage distributed architectures, multiple persistence models, cloud-native infrastructure, security, asynchronous communication, and increasingly, AI-driven workflows. In this environment, standards are essential. Growing complexity makes fragmentation a significant long-term risk. Without common abstractions and interoperable APIs, organizations risk costly migrations, vendor lock-in, and operational instability. Jakarta EE 12 addresses these challenges. Instead of treating persistence, querying, dependency injection, and runtime behavior as separate concerns, the platform adopts a unified model for modern distributed systems. Its goal is not to eliminate architectural diversity, but to offer a stable and coherent foundation that supports it. Why Jakarta EE Still Matters Enterprise Java has evolved for nearly three decades. Launched in the late 1990s, Java EE aimed to standardize enterprise application development amid a fragmented landscape of proprietary technologies. The ecosystem progressed from J2EE to Java EE and, now, to Jakarta EE under the Eclipse Foundation. Each transition mirrored broader industry shifts, including the emergence of web applications, distributed systems, cloud-native computing, and AI-driven architectures. Java’s dominance in enterprise environments stems from more than the language itself. Its success lies in uniting two elements that rarely coexist: open standards and open source. Many ecosystems offer only one. Some are open source but lack governance and interoperability. Others provide standards but evolve slowly or lose touch with developer needs. Jakarta EE bridges these worlds, delivering both specification-driven consistency and open-source innovation. Historically, standards have been essential for human scalability. Shared languages enabled cooperation, writing systems preserved knowledge, and standard units like the metric system supported global trade and science. Software faces similar challenges. As systems expand and teams become more distributed, shared abstractions and interoperability are crucial. Standards reduce ambiguity, improve team communication, and allow technologies to evolve without frequent rewrites. This is especially important in enterprise environments, where systems often outlast the technologies used to build them. Enterprise applications are rarely rewritten. Banks, governments, healthcare providers, airlines, and retailers operate systems that may persist for decades while evolving internally. In this context, open standards and open source are strategic choices. They reduce operational lock-in, improve vendor portability, support long-lived systems, and enable incremental modernization rather than risky rewrites. Jakarta EE addresses these needs by not imposing a single architecture, runtime, or deployment model. The platform supports monoliths, modular systems, microservices, reactive architectures, and cloud-native deployments. It integrates seamlessly with modern frameworks and runtimes, including those many developers use daily, often without realizing Jakarta EE specifications underpin them. Technologies such as Spring, Quarkus, Micronaut, Hibernate, Tomcat, and Payara implement, extend, or depend directly on Jakarta EE specifications. This is precisely what makes Jakarta EE uniquely relevant today. In a market flooded with, this unique combination makes Jakarta EE especially relevant today. In a market filled with rapidly changing frameworks and infrastructure trends, Jakarta EE offers stability without stagnation. The platform evolves thoughtfully, maintaining compatibility while adapting to new realities such as cloud-native computing, polyglot persistence, and AI-driven systems. Jakarta EE 11 established a modern foundation, with specifications such as Jakarta Data. Jakarta EE 12 builds on this, moving Enterprise Java into what can be called the Data Age. Jakarta EE 12 and the Rise of Unified Data Access A key change in Jakarta EE 12 is the acknowledgment that data access can no longer be limited to relational databases. Modern enterprise applications now span SQL databases, NoSQL engines, distributed caches, event streams, and AI-focused data stores. The primary challenge has shifted from persistence alone to ensuring consistent developer interaction across diverse data systems. Jakarta EE 12 addresses this by introducing a unified semantic model for querying and data access. Central to this is Jakarta Query, a new abstraction that serves as a common query foundation for Jakarta Persistence, Jakarta Data, and Jakarta NoSQL. Rather than each specification defining separate query semantics, Jakarta Query provides a shared language for filtering, ordering, restrictions, and query composition across multiple persistence technologies. Enterprise Java has evolved through several generations of query languages, from JDBC’s direct SQL focus to JPA’s JPQL and various framework-specific abstractions. These independent developments have led to fragmentation. Jakarta EE 12 seeks to address this by separating semantic intent from execution strategy, enabling developers to use a common conceptual model for queries while allowing each technology to optimize execution as needed. This is especially important in polyglot persistence architectures. Relational databases optimize joins and transactions, document databases offer schema flexibility, and graph databases emphasize relationship traversal. Jakarta Query does not eliminate these differences but provides a consistent developer experience across technologies, reducing reliance on vendor-specific APIs. Jakarta Data 1.1 exemplifies this approach with its fluent, type-safe query model. Developers can dynamically compose queries using semantic restrictions and ordering rules in Java, rather than relying on string-based query construction. Java List<Product> found = products.findAll( Restrict.all( _Product.type.equalTo(ProductType.PHYSICAL), _Product.price.greaterThan(10.00f), _Product.name.contains("Jakarta") ), Order.by( _Product.price.desc(), _Product.name.asc() ) ); This approach enhances readability and reduces runtime query errors often found in string-based query languages. More importantly, it aligns queries with the domain model, supporting a core principle of domain-driven enterprise applications. Jakarta Data 1.1 also extends the repository model beyond basic CRUD operations. Stateful repositories now include lifecycle-oriented operations, such as persist, merge, refresh, detach, and remove, within their abstractions. Java @Repository public interface Products extends DataRepository<Product, String> { @Persist void add(Product product); @Merge Product merge(Product product); @Remove void remove(Product product); @Refresh void reload(Product product); @Detach void detach(Product product); } This evolution is significant because repositories are no longer just convenience wrappers for persistence operations. They now serve as standardized data access contracts, consistently supporting both query semantics and entity lifecycle management across implementations. More broadly, Jakarta EE 12 is guiding enterprise Java toward a unified data platform. Instead of requiring developers to switch mental models between persistence technologies, the platform unifies how applications express intent for querying, filtering, lifecycle management, and data interaction. As distributed systems and polyglot persistence become more prevalent, this semantic consistency may become a key architectural advantage for Enterprise Java.

By Otavio Santana DZone Core CORE
A Hands-On ABAP RESTful Programming Model Guide
A Hands-On ABAP RESTful Programming Model Guide

In the SAP ABAP world, BAPIs (Business Application Programming Interfaces) have long been the standard for programmatic access to business functionality. These are essentially RFC-enabled function modules representing operations on business objects. However, with the advent of SAP S/4HANA and the new ABAP RESTful Application Programming Model (RAP), the classic BAPI approach is being phased out in favor of modern, RAP-based APIs. SAP now provides a new generation of released APIs built on RAP, often referred to as RAP BO interfaces or behavior definitions, which align with the clean core and cloud-ready strategy of S/4HANA. Why Replace BAPIs With RAP-Based APIs? For expert ABAP developers, the motivation to migrate from BAPIs to RAP interfaces comes down to modernization, maintainability, and future-proofing. In fact, SAP considers RAP business object interfaces the modern alternative to classic BAPIs. Here are some key reasons to make the switch: Clean Core and Upgrade Safety RAP interfaces are part of SAP’s clean core strategy. They are released and maintained by SAP, ensuring that your custom code calls stable, upgrade-safe APIs instead of custom wrappers or unofficial function modules. This means less retrofitting when you upgrade your S/4HANA system. Cloud Readiness In SAP S/4HANA Cloud, many traditional BAPIs are not whitelisted or available. RAP-based APIs, on the other hand, are designed for cloud use and adhere to strict SAP release contracts. They can be used in both private and public cloud editions as officially supported integration points. Modern API Design RAP uses contemporary ABAP development paradigms, CDS views for data modeling, behavior definitions (BDEF) for business logic, and OData/HTTP exposure by design. Instead of procedural function calls, you interact with entities through a standardized interface. This leads to clearer, more maintainable code and aligns with how SAP Fiori apps communicate. In essence, RAP interfaces provide a modern, RESTful facade over business objects, whereas BAPIs are older RPC-style calls. Built-in Transaction Handling Classic BAPIs often require manual transaction control. RAP introduces the Entity Manipulation Language (EML), which handles operations in a transactional context. You can bundle multiple operations on related entities and then perform one commit for all — simplifying the code and reducing errors. We’ll see this in the example below. Extensibility and Lifecycle Management RAP business object interfaces allow you to expose only what is needed to consumers, supporting versioning and gradual enhancements. They act as a stable facade while the underlying logic can evolve without breaking external contracts. This is analogous to how BAPIs provided stable interfaces in the past, but RAP makes it more flexible. In short, RAP BO interfaces let you keep your core objects clean and internal, exposing a controlled API layer to the outside world. Given these benefits, SAP recommends using a released RAP API whenever one is available for your scenario. Next, let's walk through a hands-on example of replacing a common BAPI with an RAP-based API in an S/4HANA 2022 system. ABAP RAP in S/4HANA 2022 at a Glance SAP S/4HANA 2022 includes the ABAP Platform 2022, which significantly expands RAP’s capabilities and the number of standard business objects exposed via RAP. Many operations that used to be performed with BAPIs now have equivalent released RAP BO interfaces provided by SAP. These interfaces are essentially pre-built RAP business objects that SAP delivers for developers to use instead of calling older BAPIs. For example: Bank creation: BAPI_BANK_CREATE to I_BANKTP (RAP Business Object interface)Cost center creation: BAPI_COSTCENTER_CREATE to I_COSTCENTERTP_2 (RAP BO interface)Sales order creation: BAPI_SALESORDER_CREATEFROMDAT2 to I_SALESORDERTP (RAP BO interface successor) These are just a few examples; SAP provides a growing list of such RAP-based APIs for S/4HANA 2022 and beyond. Each interface typically comes with documentation and code snippets to demonstrate how to consume it. If you are unsure whether a particular BAPI has an RAP equivalent, SAP’s Clean Core Object search tool can help identify if a released RAP interface (BDEF) exists for that business object. Hands-On Example: Modernizing a BAPI to RAP To make this concrete, let’s walk through replacing a classic BAPI with a RAP-based implementation. We’ll use the task of creating a Sales Order as our example. In the past, an ABAP developer might use the BAPI BAPI_SALESORDER_CREATEFROMDAT2 to programmatically create a sales order in SAP ECC or S/4HANA. In S/4HANA 2022, SAP provides the RAP business object interface I_SALESORDERTP as the official successor for this BAPI. We’ll demonstrate how to identify the replacement and implement the create operation using RAP. Step 1: Identify the RAP Replacement Interface First, determine if a released RAP API exists for the BAPI in question. In our case, we discovered that I_SALESORDERTP is the released RAP interface corresponding to the Sales Order business object (as a replacement for BAPI_SALESORDER_CREATEFROMDAT2). This interface is delivered by SAP as part of S/4HANA 2022. You can find such mappings via SAP documentation or tools; for instance, SAP's help docs and community posts confirm that I_SALESORDERTP covers create/read/update operations for sales orders via RAP. Once the appropriate interface is identified, you can plan to refactor your code to use it. Step 2: Review the Classic BAPI Usage (For Comparison) Let's briefly recap how the classic BAPI would be used to create a sales order. Typically, one would call the function module and then commit the work, for example: Plain Text DATA: lt_return TYPE TABLE OF bapiret2. CALL FUNCTION 'BAPI_SALESORDER_CREATEFROMDAT2' EXPORTING order_header_in = ls_order_header " header data structure TABLES order_items_in = lt_items " item lines to create return = lt_return. " return messages IF sy-subrc = 0 AND ( NOT line_exists( lt_return[ type = 'E' ] ) ). " If no errors, commit the transaction to finalize the create CALL FUNCTION 'BAPI_TRANSACTION_COMMIT' EXPORTING wait = 'X'. ENDIF. In the above pseudocode, we populate structures for the sales order header and items, call the BAPI, then explicitly call BAPI_TRANSACTION_COMMIT to ensure the new order is saved. We also have to check the return table for errors or messages. This imperative style works, but it requires handling commits manually and ensuring the data structures match what the BAPI expects. It also doesn’t automatically integrate with OData or modern Fiori UI frameworks without creating a custom OData service on top. Step 3: Implement the RAP-Based Create via EML Now, let's perform the same operation using the RAP interface I_SALESORDERTP and ABAP’s Entity Manipulation Language. With RAP, you don't call a function module; instead, you use EML statements on the interface’s entities. Suppose we want to create a sales order with one line item. We can do it in one shot as follows: Plain Text DATA(ls_failed) = REF #( ). DATA(ls_reported) = REF #( ). MODIFY ENTITIES OF i_salesordertp ENTITY salesorder CREATE FIELDS ( salesordertype salesorganization distributionchannel organizationdivision soldtoparty purchaseorderbycustomer ) WITH VALUE #( ( %cid = 'H001' " temporary ID for the new order %data = VALUE #( salesordertype = 'TA' " Order type salesorganization = '1010' " Sales Org distributionchannel = '10' " Distribution Channel organizationdivision = '00' " Division soldtoparty = '0010100001' " Sold-to Customer ID purchaseorderbycustomer= 'PO-12345' " Customer PO reference ) ) ) CREATE BY \_item " now specify the item(s) FIELDS ( product requestedquantity ) WITH VALUE #( ( %cid_ref = 'H001' " link to the order with temp ID %target = VALUE #( ( %cid = 'I001' " temp ID for new item product = 'MAT-1000' requestedquantity = 5 ) ) ) ) FAILED DATA(ls_failed) REPORTED DATA(ls_reported). In this code snippet, we use the MODIFY ENTITIES OF i_salesordertp statement to perform a deep create of a SalesOrder along with one SalesOrderItem. A few things to note from an engineer’s perspective: We specify the fields we want to set for the sales order and its item, similar to filling out the BAPI’s parameters, but here it's done with a structured syntax using WITH VALUE #(...) expressions. This ensures compile-time checking of field names and data types.The use of %cid (client ID) and %cid_ref is a mechanism to link the parent and child in one request. We assign a temporary ID 'H001' to the new sales order, and then reference it (%cid_ref = 'H001') when creating the item, so that the framework knows this item belongs to the new order. This allows creating a header and its items together in one call (a deep insert), which is handled seamlessly by RAP.The FAILED DATA(ls_failed) and REPORTED DATA(ls_reported) clauses serve to capture any errors or messages from the operation. This is analogous to checking the BAPI return table. If the create fails the ls_failed structure would contain the failed keys and error details. The ls_reported can hold messages or generated keys. At this point, the sales order is created in memory. To finalize and commit it to the database, we need to execute a commit in the RAP context. Step 4: Finalize With Commit In an interactive RAP scenario (like a Fiori app), the commit is managed by the framework when the user saves their changes. But in our manual ABAP code (above), we explicitly trigger the commit. In RAP, this is done using COMMIT ENTITIES rather than the classic BAPI_TRANSACTION_COMMIT. For example: Plain Text COMMIT ENTITIES EXPORTING RESPONSE OF i_salesordertp FAILED DATA(ls_save_failed) REPORTED DATA(ls_save_reported). IF sy-subrc <> 0. " Handle the error (e.g., log the messages in ls_save_reported) ENDIF. COMMIT ENTITIES will finalize all pending changes made by EML statements. Here, we also capture any final errors or messages in ls_save_failed/ls_save_reported for error handling. After this commit, the sales order is persisted, and the actual Sales Order number is available. SAP RAP supports late numbering, meaning the real keys might only be assigned at commit time, a detail to be aware of. You can retrieve the final keys using the CONVERT KEY statement if needed, but for our purposes, assume the commit gives us the new order number in the reported data. Now we have successfully created a sales order using the RAP API instead of the classic BAPI. The code is more declarative and leverages the RAP framework for integrity checks and relationships. We didn't have to call any function modules directly; everything was done through the RAP interface of the Sales Order business object. Engineer’s Perspective: Key Takeaways From the above process, a few important differences stand out when replacing BAPIs with RAP-based APIs: Finding the right interface: It may require a bit of research to find the correct RAP interface name for a given legacy BAPI. SAP provides documentation to help with this. Once found, using the RAP interface ensures you are calling an SAP-supported API that will remain stable across updates.Refactoring effort: Replacing a BAPI call with a RAP call is not a one-to-one code replacement; it often involves refactoring. You will define data structures according to RAP’s CDS-based types and use EML syntax. There is a learning curve, but the benefit is cleaner ABAP code that is easier to maintain. Tools in ABAP Development Tools (ADT) can help by autocompleting structure components and EML keywords, since the interface and its types are known in the dictionary.Behavior and validation: When you use RAP interfaces, you automatically leverage SAP’s implemented behavior logic. For instance, any checks or defaulting logic SAP built into the Sales Order RAP BO will execute. You no longer manually call multiple BAPIs or perform intermediate checks; the RAP framework triggers all the necessary business logic as part of the modify/commit process. This leads to fewer errors and a more consistent outcome with standard SAP behavior.Transaction management: As shown, instead of BAPI_TRANSACTION_COMMIT, you use COMMIT ENTITIES. This aligns with the unit of work paradigm in RAP. It also allows bundling multiple operations.Upgrade and extensibility: By adopting RAP-based APIs, your custom code is positioned for the future. SAP will continue to enhance these interfaces, and you can opt in to those changes by switching to a new interface version or extending the CDS views. In contrast, classic BAPIs in S/4HANA are largely in maintenance mode; they exist mainly for backward compatibility and might not cover new business scenarios introduced in S/4HANA. Moving to RAP ensures you can take advantage of new SAP innovations with minimal effort. Conclusion Replacing BAPIs with RAP-based APIs is a strategic move for any ABAP developer working on S/4HANA 2022 and beyond. It aligns your custom developments with SAP’s modern, cloud-ready programming model. As we’ve seen, a classic scenario like sales order creation can be accomplished with RAP’s EML in a way that is more in line with modern development practices, yielding cleaner and more robust code. SAP itself positions RAP Business Object interfaces as the evolution of BAPIs, a way to build stable, public APIs that keep the core clean and extensible. While there is an upfront effort to learn RAP and refactor existing code, the payoff is significant in the long run, with better maintainability, fewer upgrade headaches, and the ability to run your extensions in the cloud. Importantly, whenever SAP provides a released RAP API to replace a classic BAPI, you should take that path. By doing so, you leverage SAP’s latest technology and ensure your solutions remain supported in future releases.

By Deepika Paturu
Event-Driven Pipelines With Apache Pulsar and Go
Event-Driven Pipelines With Apache Pulsar and Go

A Practical Walkthrough Most distributed systems eventually hit a wall with their messaging layer, whether it's Kafka's tight coupling between compute and storage, RabbitMQ's limited replay capabilities, or the operational overhead of managing multiple tools for queuing and streaming. Apache Pulsar was engineered to address these gaps from the ground up. In this article, we'll dissect a working Go-based demo that wires together a Pulsar producer, consumer, and Prometheus monitoring layer into a cohesive, observable messaging pipeline. The full source is on GitHub. Why Pulsar Deserves a Closer Look Pulsar's architecture makes a deliberate trade-off that most messaging systems avoid: it physically separates the broker tier (which handles routing, subscriptions, and protocol) from the storage tier (Apache BookKeeper, which handles persistence). This isn't just an implementation detail. It means you can independently autoscale message routing capacity without touching your storage cluster, and vice versa. Beyond the architecture, a few capabilities stand out for engineering teams: Multi-tenancy at the protocol level – tenants, namespaces, and topics form a three-level hierarchy, making it practical to run a single Pulsar cluster for multiple teams or services without namespace collisions.Four distinct subscription semantics – Exclusive, Shared, Failover, and Key_Shared give you precise control over how messages are distributed across consumer instances, something Kafka's consumer group model doesn't natively offer.Cursor-based message retention – Pulsar retains messages based on subscription cursors, not time-based log compaction. A consumer that falls behind doesn't lose messages; it simply catches up from its last acknowledged position.Native schema enforcement – The built-in schema registry validates message payloads at the broker level before they reach consumers, catching contract violations at the boundary rather than deep inside application logic. What the Demo Builds The project is structured as three independent Go binaries, each with a single responsibility: reStructuredText ├── producer/ │ ├── main.go # HTTP server → Pulsar publisher │ ├── go.mod │ └── go.sum ├── consumer/ │ └── main.go # Pulsar subscriber → message processor ├── monitor/ │ └── main.go # Pulsar producer + Prometheus metrics server ├── prometheus.yml # Scrape configuration └── README.md All three use `github.com/apache/pulsar-client-go/pulsar` — the official, Apache-maintained Go client. The client is not a thin wrapper; it implements the full Pulsar binary protocol, connection pooling, producer batching, and automatic Prometheus metric registration. Component 1: The Producer The producer exposes a single HTTP endpoint. An incoming HTTP request triggers a Pulsar publish operation, decoupling the caller from any direct knowledge of the messaging infrastructure. Go package main import ( "context" "fmt" "log" "net/http" "github.com/apache/pulsar-client-go/pulsar" ) var client pulsar.Client func main() { var err error client, err = pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6650", }) if err != nil { log.Fatal(err) } defer client.Close() http.HandleFunc("/publish", func(w http.ResponseWriter, r *http.Request) { msg := r.URL.Query().Get("msg") err := publishMessage(msg) if err != nil { w.Write([]byte("msg failed to published")) } else { w.Write([]byte("msg successfully published")) } }) if err := http.ListenAndServe(":8080", nil); err != nil { log.Fatal(err) } } func publishMessage(msg string) error { producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "my-topic", }) if err != nil { log.Fatal(err) } _, err = producer.Send(context.Background(), &pulsar.ProducerMessage{ Payload: []byte("Hello"), }) return err } A few architectural observations worth unpacking: The HTTP-to-Pulsar bridge pattern is deliberately pragmatic. Rather than requiring every upstream service to embed a Pulsar client, you expose a thin HTTP adapter. This is particularly valuable when integrating with systems that speak HTTP natively — CI/CD pipelines, third-party webhooks, or legacy services that can't easily adopt a new client library. `pulsar.NewClient` establishes a connection pool, not a single TCP connection. The client maintains persistent connections to the broker and handles reconnection, load balancing across broker nodes, and TLS negotiation transparently. Calling `client.Close()` via `defer` ensures all in-flight messages are flushed before the process exits. `producer.Send` with `context.Background()` submits the message to the producer's internal send queue. The Pulsar client batches outgoing messages by default (configurable via `BatchingMaxMessages` and `BatchingMaxPublishDelay`), which significantly improves throughput under load without any changes to application code. For production use, the producer instance should be created once at startup and reused across requests. Creating a new producer per request incurs connection overhead and bypasses the batching optimization entirely. Component 2: The Consumer The consumer subscribes to a topic and processes messages with explicit acknowledgment. The subscription type chosen here — `pulsar.Shared` — has meaningful implications for how the system scales. Go consumer, err := client.Subscribe(pulsar.ConsumerOptions{ Topic: "topic-1", SubscriptionName: "my-sub", Type: pulsar.Shared, }) if err != nil { log.Fatal(err) } defer consumer.Close() for i := 0; i < 10; i++ { msg, err := consumer.Receive(context.Background()) if err != nil { log.Fatal(err) } fmt.Printf("Received message with Id: %#v -- content: '%s'\n", msg.ID(), string(msg.Payload())) consumer.Ack(msg) } if err := consumer.Unsubscribe(); err != nil { log.Fatal(err) } Subscription Semantics in Depth Pulsar's subscription model is one of its most differentiating features. Here's how the four types behave at the broker level: Exclusive – The broker enforces that only one consumer holds the subscription at any time. A second consumer attempting to subscribe with the same name will receive an error. This guarantees strict message ordering but eliminates horizontal scaling.Shared – The broker distributes messages across all active consumers in round-robin order. Any number of consumers can join or leave the subscription dynamically. This is the right choice for stateless workloads where processing order doesn't matter and throughput is the priority.Failover – The broker designates one consumer as the active receiver. Others remain connected but idle, ready to take over if the active consumer disconnects. This preserves ordering while providing high availability — a pattern common in financial transaction processing.Key_Shared – The broker routes messages with the same key consistently to the same consumer instance. This enables stateful processing (e.g., per-user session aggregation) without external coordination, as long as the consumer count remains stable. Why Explicit Acknowledgment Matters `consumer.Ack(msg)` signals to the broker that the message has been durably processed and can be removed from the subscription's cursor. If the consumer process crashes between `Receive` and `Ack`, the broker will redeliver the message to another consumer in the subscription. This is the mechanism behind at-least-once delivery. For workloads that require exactly-once semantics, Pulsar supports transactional acknowledgment, where the `Ack` and any downstream writes are committed atomically. That's a more advanced topic, but the foundation is the same `Ack` call shown here. Component 3: The Monitor The monitor is architecturally the most interesting component. It runs two HTTP servers concurrently — one for the application endpoint, one for Prometheus metrics — and uses a Pulsar producer to generate observable traffic. Go package main import ( "context" "fmt" "log" "net/http" "strconv" "github.com/apache/pulsar-client-go/pulsar" ) func main() { client, err := pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6605", }) if err != nil { log.Fatal(err) } defer client.Close() prometheusPort := 2112 go func() { if err := http.ListenAndServe(":"+strconv.Itoa(prometheusPort), nil); err != nil { log.Fatal(err) } }() producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "topic-1", }) if err != nil { log.Fatal(err) } defer producer.Close() ctx := context.Background() webPort := 8082 http.HandleFunc("/produce", func(w http.ResponseWriter, r *http.Request) { msgId, err := producer.Send(ctx, &pulsar.ProducerMessage{ Payload: []byte(fmt.Sprintf("hello-world")), }) if err != nil { log.Fatal(err) } else { log.Printf("Published message: %v", msgId) fmt.Fprintf(w, "Message Published: %v", msgId) } }) if err := http.ListenAndServe(":"+strconv.Itoa(webPort), nil); err != nil { log.Fatal(err) } } How the Pulsar Client Registers Prometheus Metrics When `pulsar.NewClient` is called, the Go client automatically registers a set of Prometheus collectors with the default `prometheus.DefaultRegisterer`. No additional instrumentation code is required. The metrics are served at `/metrics` on whatever port you bind `http.DefaultServeMux` to port — in this case, port `2112`. The metrics exposed include: `pulsar_client_producers_opened` / `pulsar_client_producers_closed` – producer lifecycle counters`pulsar_client_consumers_opened` / `pulsar_client_consumers_closed` – consumer lifecycle counters`pulsar_client_messages_published_total` – cumulative publish count per topic`pulsar_client_publish_latency_seconds` – histogram of end-to-end publish latency`pulsar_client_bytes_published_total` – total bytes written to the broker Running the Prometheus metrics server in a goroutine while the main goroutine handles the application HTTP server is idiomatic Go concurrency. Both servers share the same `http.DefaultServeMux`, which is why the Prometheus `/metrics` handler (registered automatically by the client library) is accessible on the metrics port without any explicit route registration. Prometheus Scrape Configuration YAML scrape_configs: - job_name: pulsar-client-go-metrics scrape_interval: 10s static_configs: - targets: - localhost:2112 The `scrape_interval: 10s` is a reasonable starting point for development. In production, you would typically align this with your alerting resolution requirements — a 30-second interval is common for dashboards, while 10 seconds or less is appropriate for latency-sensitive alerting rules. With these metrics flowing into Prometheus, you can build Grafana panels that surface producer throughput, consumer lag, and publish latency percentiles with three signals that matter most when diagnosing messaging pipeline issues. Running the Full Pipeline Prerequisites Apache Pulsar standalone Go 1.18+Prometheus (optional) Startup Sequence 1. Launch Pulsar in standalone mode: Shell bin/pulsar standalone 2. Start the producer service: Shell cd producer && go run main.go 3. Start the consumer service: Shell cd consumer && go run main.go 4. Start the monitor service: Shell cd monitor && go run main.go 5. Start Prometheus: Shell prometheus --config.file=prometheus.yml 6. Publish a message via the producer endpoint: Shell curl "http://localhost:8080/publish?msg=test_pulsar_message_publish_event" # msg successfully published 7. Trigger the monitor producer: Shell curl http://localhost:8082/produce # Message Published: (messageId) 8. Inspect raw Prometheus metrics: Shell curl http://localhost:2112/metrics | grep pulsar Engineering Takeaways Decouple publish triggers from client library dependencies. The HTTP-to-Pulsar adapter pattern used in the producer is not just a demo convenience. It is a legitimate architectural boundary. Services that need to emit events don't need to know anything about Pulsar's protocol, topic naming, or client configuration. They make an HTTP call; the adapter handles the rest.Match subscription type to processing semantics, not just throughput. A common mistake is defaulting to `Shared` for everything because it scales horizontally. If your processing logic is stateful, for example, aggregating events per user session. `Key_Shared` gives you the same horizontal scalability while preserving per-key ordering without any application-level coordination.Treat the acknowledgment boundary as your consistency boundary. Everything between `Receive` and `Ack` is your processing window. Any side effects (database writes, downstream API calls, cache updates) that happen in this window must be idempotent, because Pulsar will redeliver the message if the consumer fails before acknowledging. Design your processing logic around this constraint from the start, not as an afterthought.Zero-cost observability is a genuine advantage. The fact that `pulsar-client-go` registers Prometheus metrics automatically means you get throughput, latency, and connection health data from the moment your application starts, without writing a single line of instrumentation code. This is a meaningful operational advantage over client libraries that require manual metric registration. Extending the Demo The current implementation is intentionally minimal. Here are technically meaningful extensions worth exploring: Schema enforcement – Replace raw `[]byte` payloads with Pulsar's schema-aware producer/consumer API. Using `pulsar.NewAvroSchema` or `pulsar.NewJSONSchema` moves payload validation to the broker, preventing malformed messages from ever reaching consumers.Dead-letter topic routing – Configure `DeadLetterPolicy` on the consumer to automatically route messages that exceed a maximum redelivery count to a separate topic. This prevents poison-pill messages from blocking the subscription indefinitely.Producer batching tuning – Set `BatchingMaxMessages`, `BatchingMaxSize`, and `BatchingMaxPublishDelay` on `ProducerOptions` to optimize the throughput/latency trade-off for your specific workload profile.Graceful shutdown – Add `os/signal` handling to flush in-flight messages and close the producer cleanly before the process exits. The current `defer client.Close()` handles the happy path but won't fire on `SIGKILL`.Kubernetes-native deployment – Package each component as a container and deploy using the official Pulsar Helm chart. The producer and monitor can be exposed as Kubernetes Services; the consumer can run as a Deployment with HPA scaling based on the `pulsar_client_messages_published_total` metric exported to a custom metrics adapter. Conclusion The `apache-pulsar` project demonstrates that building a production-grade messaging pipeline with Apache Pulsar and Go doesn't require much code. It requires understanding the right abstractions. The producer-consumer-monitor triad covers the three concerns that matter in any event-driven system: getting data in, getting data out, and knowing what's happening in between. Pulsar's architecture, decoupled storage, flexible subscription semantics, and built-in observability make it a strong candidate for teams that have outgrown simpler messaging systems and need more precise control over delivery guarantees, scaling behavior, and operational visibility. Source Code https://github.com/shivik/apache-pulsar-demo

By Shivi Kashyap
Zero-Downtime Deployments for Java Apps on Kubernetes
Zero-Downtime Deployments for Java Apps on Kubernetes

This article provides a comprehensive guide to achieving zero-downtime deployments for Java-based applications on Kubernetes. We cover deployment strategies, Kubernetes primitives, Java-specific considerations, session state handling, database migrations, traffic shifting techniques, CI/CD pipelines, GitHub Actions, Jenkins with automated rollbacks, observability (Prometheus, Grafana, Jaeger), Helm/ArgoCD examples, testing strategies (canary analysis, chaos, smoke tests), and troubleshooting. Deployment Strategies Kubernetes offers several strategies for deploying new versions without downtime: Rolling Update Incrementally replace old pods with new ones, maintaining availability. Kubernetes Deployment object uses rolling updates by default. You can control maxUnavailable and maxSurge to tune the rollout. Blue-Green Deployment Run two separate environments: Blue = current, green = new. Only one serves live traffic at a time. Once the Green version is verified, switch the Service or Ingress to point at Green, then scale down Blue. This allows instant rollback by redirecting traffic back to Blue. Argo Rollouts defines a blue/green strategy with an active and preview Service. Traffic flows only to the active version until promotion. Canary Deployment Gradually shift a small percentage of traffic to the new version. Start with a few pods of v2, monitor, then incrementally increase. Tools like Istio or Argo Rollouts can control traffic weights. For instance, sending 10% of traffic to v2 can be done by running 9 v1 pods and 1 v2 pod (10%). Argo defines a canary rollout with setWeight steps and pauses for analysis. Shadow/Mirroring The new version receives a copy of live requests for testing under real load, but its responses are not returned to users. This is low risk but does not assist in rollback decisions since users don’t see the new behavior. Kubernetes Primitives for Zero Downtime Deployment A Deployment naturally performs rolling updates. By default, it creates a new ReplicaSet and scales it up while scaling down the old one controlled by maxUnavailable/maxSurge. This ensures some pods always serve traffic. To use blue/green, you would deploy two separate Deployments (e.g., app-blue, app-green) and switch Services. Service and Ingress A Service fronts pods. For blue/green, you can point a single Service at either the blue or green pods. Ingress can also switch between backend services. E.g., label selectors can be adjusted to redirect traffic from version blue to version green pods. PodDisruptionBudget Ensures a minimum number of pods stay running during voluntary disruptions. For instance, setting minAvailable 1 ensures at least one pod remains during a rolling update. To avoid complete downtime during maintenance. Horizontal Pod Autoscaler (HPA) Scales pods based on CPU/memory or custom metrics. It automatically updates a workload to match demand. An HPA can be attached to the Deployment so that if traffic spikes during a rollout, new pods will be created to handle the load. Example: YAML apiVersion autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: myapp-hpa spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: myapp minReplicas: 2 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 50 Liveness and Readiness Probes Critical for zero downtime. A liveness probe checks if the app is alive; if it fails, K8 restarts the pod. A readiness probe tells if the app is ready to serve traffic. During startup or shutdown, the readiness probe should fail, causing the pod to be removed from the service load balancer. Spring Boot Actuator provides /actuator/health for this. In K8S YAML: YAML livenessProbe: httpGet: path: /actuator/health/liveness port: 8080 initialDelaySeconds: 15 periodSeconds: 10 readinessProbe: httpGet: path: /actuator/health/readiness port: 8080 initialDelaySeconds: 5 periodSeconds: 5 Spring Boot exposes health/liveness and health/readiness groups by default. Quarkus and Micronaut have similar health endpoints. Spring Boot supports graceful shutdown by setting server.shutdown is equals to graceful and tuning spring.lifecycle.timeout-per-shutdown-phase. This causes the embedded server, either Tomcat/Jetty/Undertow, to stop accepting traffic and wait up to the timeout for active requests. Java @Component public class ShutdownListener implements SmartLifecycle { private boolean running = true; @Override public void stop() { running = false; } @Override public boolean isRunning() { return running; } } Quarkus provides graceful shutdown configuration. By setting quarkus.shutdown.timeout=10s, Quarkus will wait up to 10 seconds for current requests to finish before exiting. You can annotate a bean method with @Shutdown to run cleanup code. Micronaut has @EventListener for ShutdownEvent: Java @Singleton public class ShutdownBean { @EventListener void onShutdown(ShutdownEvent event) { } } Kubernetes Hooks You can use a preStop hook in the Deployment spec to run a script before SIGTERM. YAML lifecycle: preStop: exec: command: ["/bin/sh","-c","sleep 5"] terminationGracePeriodSeconds: 30 The grace period (default 30s) should be tuned to let the app finish. K8S doc 77†L99-L107 describes the sequence container enters Terminating, runs preStop, sends SIGTERM, waits terminationGracePeriodSeconds, then SIGKILL. JVM Tuning Set -XX +ExitOnOutOfMemoryError to avoid hanging. Tune thread pools so they drain quickly. Monitor GC pause times, consider using low-latency GC to minimize pause before shutdown. Session and State Handling To maintain zero downtime when pods switch: Stateless services: Best practice is to keep services stateless. Store session state or user data in an external store, such as Redis or a database. This way, any pod can handle any request, and pods can be replaced without losing the user session.Sticky sessions: If an app uses in-memory sessions, you can enforce sticky sessionsService affinity: Set sessionAffinity: ClientIP on the Service. Kubernetes routes requests from the same client IP to the same pod.Ingress affinity: Use Ingress annotations to bind a user’s requests to one pod. However, sticky sessions introduce risk and are not suitable for autoscaling.StatefulSets: For true stateful workloads, use StatefulSet with stable identities. StatefulSets pair pods with PersistentVolumes, which are not zero-downtime by themselves. GitHub Actions CI/CD Pipeline zero-downtime: YAML name: Deploy on: push: branches: [ main ] jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - uses: actions/setup-java@v3 with: { java-version: '17' } - name: Build run: mvn clean package -DskipTests name: Docker Build & Push run: | docker build -t ghcr.io/myorg/myapp:${{ github.sha } echo ${{ secrets.GITHUB_TOKEN } | docker login ghcr.io -u ${{ github.actor } --password-stdin docker push ghcr.io/myorg/myapp:${{ github.sha } - name: Set image tag run: echo "::set-output name=image::ghcr.io/myorg/myapp:${{ github.sha } deploy: needs: build runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 with: { path: manifests } - name: Update K8s deployment uses: azure/setup-kubectl@v3 - name: Deploy to Kubernetes run: | kubectl set image deployment/myapp-deployment myapp=ghcr.io/myorg/myapp:${{ needs.build.outputs.image } kubectl rollout status deployment myapp-deployment This workflow builds the image, pushes it, and updates the deployment. The rollout status command waits for all new pods to become ready. If health checks fail, it will abort without downtime. Conclusion Zero-downtime deployment on Kubernetes combines careful architecture and automation, using rolling updates, progressive strategies, ensuring graceful shutdown and health checks in your Java apps, externalizing state, managing database changes, and orchestrating with CI/CD pipelines. Kubernetes primitives like Deployments, Services, Probes, and HPA, along with tools like Istio or Argo Rollouts, provide the building blocks.

By Ramya vani Rayala
Pragmatica Aether: Let Java Be Java
Pragmatica Aether: Let Java Be Java

The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. Update the Java version — rebuild and test every service. Change your message broker from RabbitMQ to Kafka — modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool and update dependencies in every microservice. Switch cloud providers — rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level. This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them. Framework lock-in worsens this. It isn’t a vendor problem — it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production — not during development. This is an architecture problem. Architectural problems have architectural solutions. Aether: The Core Idea What you write: an interface annotated with @Slice, plus business logic implementation. Java @Slice public interface OrderService { Promise<OrderResult> placeOrder(PlaceOrderRequest request); static OrderService orderService(InventoryService inventory, PricingEngine pricing) { return request -> inventory.check(request.items()) .flatMap(available -> pricing.calculate(available)) .map(priced -> OrderResult.placed(priced)); } } What you don’t write: everything else. No HTTP clients — inter-slice calls are direct method invocations via generated proxies. No service discovery — the runtime tracks where every slice instance lives. No retry logic — built-in retry with exponential backoff and node failover. No circuit breakers — the reliability fabric handles failure automatically. No serialization code — request/response types are serialized transparently. A method call via an imported interface is the only visible contract. The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation — it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly. Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level. The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties — performance, latency, memory consumption — but as long as they’re compatible with the interface, business logic works unchanged. You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently. Under The Hood: What Makes It Work Five architectural decisions make this possible. Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm was published in 2021. Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide. Built-in Artifact Repository. DHT-based storage with configurable replication — 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained. ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices. Declarative Deployment. Blueprints — TOML files — describe the desired state: which slices, how many instances. TOML id = "org.example:commerce:1.0.0" [[slices]] artifact = "org.example:inventory-service:1.0.0" instances = 3 [[slices]] artifact = "org.example:order-processor:1.0.0" instances = 5 Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically. Infrastructure Independence. Aether nodes are identical — there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java — roll it out across nodes without touching applications. Update the Aether runtime — same. Update business logic — deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule. Fault Tolerance: The 50% Rule The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact — actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires (N/2) + 1 nodes — as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically — the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence — from failure detection through state restoration to serving traffic — completes without human intervention. When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required. Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing: SQL aether update start org.example:order-processor 2.0.0 -n 3 aether update routing <id> -r 1:3 # 25% to v2, 75% to v1 aether update routing <id> -r 1:1 # 50/50 aether update complete <id> # 100% to v2, drain v1 Deploy during business hours. Shift traffic gradually — 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step. If health degrades — error rate exceeds thresholds, latency spikes — instant rollback with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry. For Every Project: Legacy, Greenfield, And Everything Between Legacy Migration Your legacy Java system doesn’t need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system — something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation: Java private Promise<Report> generateReport(ReportRequest request) { return Promise.lift(() -> legacyReportService.generate(request)); } One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk — the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change — and that's when the 50% rule starts to apply. From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious. No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail — from initial wrapping through gradual peeling to clean JBCT code. Greenfield Development For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method — and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service. You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices — each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default. JBCT patterns — Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects — compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them. The Spectrum Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient. The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility. Scaling: Two Levels, Three Tiers of Intelligence Two-Level Horizontal Scaling Aether scales in two dimensions independently: Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded—scaling takes milliseconds, not seconds.Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work. Independent controls, combined effect. Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. Or start fresh with maximum clarity — lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist — legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought — it's the foundation. Scaling is not your problem — it's the environment’s. Infrastructure is not your code — it's the runtime’s. The heavy winter coat comes off. The application breathes. Resources Pragmatica Aether—project siteGitHub Repository—source code

By Sergiy Yevtushenko
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka

After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: Java { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Production concerns addressed: Key-based partitioning: Using orderId as the key ensures all events for the same order go to the same partition, maintaining orderingAsync publishing with callbacks: Non-blocking publish with explicit success/failure handlingStructured logging: Captures partition and offset for troubleshooting Kafka Consumer With Idempotency At-least-once delivery means duplicates are possible. Consumers must deduplicate: Java @KafkaListener(topics = "orders.order-created.v1", groupId = "billing-service") public void onOrderCreated(OrderCreated event) { // Check if already processed if (processedEventsRepository.existsByEventId(event.getEventId())) { log.debug("Skipping duplicate event: {}", event.getEventId()); return; } // Process event billingService.createInvoice( event.getOrderId(), event.getCustomerId(), event.getItems() ); // Mark as processed processedEventsRepository.save( new ProcessedEvent(event.getEventId(), Instant.now()) ); } Idempotency pattern: Check eventId before processing, store it after processing. If the same event arrives twice, the second one is ignored. Architecture Diagram: Contract-First Flow After implementing contract-first integration across three different microservices architectures, I've learned that the biggest bottleneck in distributed systems isn't technical; it's coordination between teams. When Team A waits for Team B to finish their API before starting integration work, you're throwing away weeks of productivity. Contract-first development flips this model. By defining your integration contracts upfront (OpenAPI specs, Avro schemas, database migrations), you enable teams to work in parallel, catch breaking changes early through CI validation, and treat contracts as the single source of truth. This isn't theoretical; this is how Netflix, Uber, and Amazon scale their engineering organizations. In this article, I'll show you production-ready contract-first patterns using Java 21, Spring Boot 3, OpenAPI 3, Apache Kafka with Avro, and Flyway migrations. You'll see real code from a working system that handles the three critical integration boundaries: REST APIs, event-driven messaging, and database schemas. The Problem: Why Traditional Integration Fails at Scale When systems integrate without contracts, you hit three major problems: 1. Serial development bottlenecks: Team A builds an API endpoint. Team B waits. Team B builds a consumer. Team A discovers the payload doesn't match what Team B expected. Both teams spend days debugging mismatched assumptions. 2. Late discovery of breaking changes: You deploy a service update that changes a response field from customerId to customer_id. Your API consumers break in production. No tests caught it because there was no contract to validate against. 3. Documentation drift: The Swagger docs say the endpoint returns a 201. The actual code returns a 200. The integration tests expect a 404. Nobody knows which one is right because there's no single source of truth. Contract-first development solves all three by making the contract the authoritative specification that generates code, mocks, tests, and documentation. What Contract-First Actually Means Contract-first means you define the integration boundary first (the contract) and then write code that conforms to it. The contract is not an afterthought or generated documentation. It's the design specification. A complete contract includes: Operations: Endpoints (REST), topics (Kafka), or tables (database)Data shapes: Request/response DTOs, event schemas, column definitionsValidation rules: Required fields, constraints, data typesError model: HTTP status codes, error payloads, dead-letter queuesNon-functional rules: Idempotency, retries, compatibility policies, SLAs Here's the mental model: Agree on the contract → generate tools → build independently → let CI enforce alignment. Contract Type 1: REST API Contracts With OpenAPI OpenAPI 3 specs are the gold standard for REST API contracts. You define endpoints, request/response schemas, validation rules, and error responses in YAML. Then you generate server stubs, client SDKs, mocks, and documentation from that single source. OpenAPI Contract Example Here's a production OpenAPI contract for an order management API: YAML openapi: 3.2.0 info: title: Orders API version: 1.0.0 description: Contract-first REST API for order management paths: /v1/orders: post: operationId: createOrder summary: Create a new order requestBody: required: true content: application/json: schema: $ref: '#/components/schemas/CreateOrderRequest' responses: '201': description: Order created successfully content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '400': description: Validation error content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' '409': description: Idempotency conflict content: application/json: schema: $ref: '#/components/schemas/ErrorResponse' /v1/orders/{orderId}: get: operationId: getOrder parameters: - name: orderId in: path required: true schema: type: string responses: '200': description: Order found content: application/json: schema: $ref: '#/components/schemas/OrderResponse' '404': description: Order not found components: schemas: CreateOrderRequest: type: object required: [customerId, items] properties: customerId: type: string example: CUST-123 idempotencyKey: type: string description: Optional key for safe retries items: type: array minItems: 1 items: $ref: '#/components/schemas/OrderItem' OrderItem: type: object required: [sku, quantity] properties: sku: type: string example: SKU-001 quantity: type: integer minimum: 1 OrderResponse: type: object required: [orderId, customerId, status, items, timestamp] properties: orderId: type: string customerId: type: string status: type: string enum: [CREATED, REJECTED] items: type: array items: $ref: '#/components/schemas/OrderItem' timestamp: type: string format: date-time ErrorResponse: type: object required: [code, message, traceId, timestamp] properties: code: type: string enum: [VALIDATION_ERROR, NOT_FOUND, CONFLICT, INTERNAL_ERROR] message: type: string traceId: type: string timestamp: type: string format: date-time Implementing the Provider Side (Spring Boot) The contract drives the implementation. Your Spring Boot controller implements what the contract specifies: Java @RestController @RequestMapping("/v1/orders") @RequiredArgsConstructor @Slf4j public class OrderController { private final OrderService orderService; /** * POST /v1/orders * Contract: contracts/openapi/orders-api.v1.yaml */ @PostMapping public ResponseEntity<OrderResponse> createOrder( @Valid @RequestBody CreateOrderRequest request) { log.info("Creating order for customer: {}", request.customerId()); OrderResponse response = orderService.createOrder(request); return ResponseEntity .status(HttpStatus.CREATED) .body(response); } /** * GET /v1/orders/{orderId} */ @GetMapping("/{orderId}") public ResponseEntity<OrderResponse> getOrder(@PathVariable String orderId) { OrderResponse response = orderService.getOrder(orderId) .orElseThrow(() -> new ResourceNotFoundException("Order not found: " + orderId)); return ResponseEntity.ok(response); } } Critical implementation details: DTOs match contract schemas exactly: CreateOrderRequest and OrderResponse are Java records generated from or validated against the OpenAPI specHTTP status codes match contract: 201 for creation, 404 for not found, 409 for idempotency conflictsValidation is enforced: @Valid annotation ensures request validation matches OpenAPI constraintsError responses are standardized: All errors return ErrorResponse with consistent structure Consumer Parallel Development Here's where contract-first shines. While your team implements the provider, the consumer team can: Generate a Java client from the OpenAPI spec using openapi-generatorRun a mock server that returns valid responses based on the contractWrite integration tests against the mockSwitch to the real service when it's ready (no code changes needed) The consumer doesn't wait for you to finish. They develop in parallel. Contract Type 2: Event Contracts With Kafka and Avro Event-driven systems need two contract layers: topic semantics (human-readable) and schema definitions (machine-validated). Topic Semantics Contract Document the operational contract for each topic: Markdown ## Topic: orders.order-created.v1 - **Purpose**: Emitted when an order is created successfully - **Key**: orderId (partition affinity per order) - **Delivery**: At-least-once (consumers must be idempotent) - **Consumer requirement**: Deduplicate by eventId - **Retry policy**: Consumer retries transient errors - **DLQ**: orders.order-created.v1.dlq for poison messages - **Compatibility**: Backward compatible schema evolution required Avro Schema Contract The Avro schema is your machine-validated contract: JSON { "type": "record", "name": "OrderCreated", "namespace": "com.acme.events", "doc": "Event emitted when an order is successfully created", "fields": [ { "name": "eventId", "type": "string", "doc": "Unique event ID for idempotent processing" }, { "name": "occurredAt", "type": "string", "doc": "ISO 8601 timestamp" }, { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "source", "type": ["null", "string"], "default": null, "doc": "Order source (WEB, MOBILE, API). Nullable for backward compatibility." }, { "name": "items", "type": { "type": "array", "items": { "type": "record", "name": "OrderItem", "fields": [ {"name": "sku", "type": "string"}, {"name": "quantity", "type": "int"} ] } } } ] } Key pattern: The source field is nullable with a default value. This supports backward-compatible evolution; old consumers can read new events, and new consumers can read old events. Kafka Producer Implementation Java @Component @RequiredArgsConstructor @Slf4j public class OrderEventPublisher { private final KafkaTemplate<String, Object> kafkaTemplate; public void publishOrderCreated(OrderCreated event) { String key = event.getOrderId(); log.debug("Publishing OrderCreated: orderId={}, eventId={}", key, event.getEventId()); CompletableFuture<SendResult<String, Object>> future = kafkaTemplate.send("orders.order-created.v1", key, event); future.whenComplete((result, ex) -> { if (ex == null) { log.info("Published OrderCreated: orderId={}, partition={}, offset={}", key, result.getRecordMetadata().partition(), result.getRecordMetadata().offset()); } else { log.error("Failed to publish OrderCreated: orderId={}", key, ex); } }); } } Architecture Diagram: Contract-First Flow Figure 1: Contract-first development flow showing parallel team development Note: Solid arrows represent generation or implementation flow. Dashed arrows represent validation relationships. Contract Type 3: Database Contracts With Flyway Contract Type 3: Database Contracts With Flyway Database schemas are contracts, too. Flyway migrations let you version and evolve schemas with the same contract-first approach. Base Schema Migration File: V1__create_orders.sql SQL CREATE TABLE orders ( id VARCHAR(32) PRIMARY KEY, customer_id VARCHAR(32) NOT NULL, status VARCHAR(16) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now() ); CREATE TABLE order_items ( order_id VARCHAR(32) NOT NULL REFERENCES orders(id), sku VARCHAR(64) NOT NULL, quantity INT NOT NULL CHECK (quantity > 0), PRIMARY KEY (order_id, sku) ); CREATE INDEX idx_orders_customer_id ON orders(customer_id); Schema Evolution: Expand/Migrate/Contract When you need to add a field without breaking old code, use the expand/migrate/contract pattern: Expand (V2__add_order_source.sql): SQL ALTER TABLE orders ADD COLUMN source VARCHAR(32); Migrate (application code): SQL -- Backfill for existing rows UPDATE orders SET source = 'UNKNOWN' WHERE source IS NULL; Contract (V3__enforce_source_not_null.sql): SQL -- After all systems produce source ALTER TABLE orders ALTER COLUMN source SET NOT NULL; This three-phase approach lets you deploy schema changes without downtime. CI/CD Enforcement: Making Contracts Real Contracts only work if they're enforced. Here are the CI gates that prevent breaking changes: REST API: OpenAPI Breaking Change Detection Run openapi-diff in CI to catch breaking changes: YAML # .github/workflows/api-contract-check.yml - name: Check for API breaking changes run: | npx openapi-diff \ main:contracts/openapi/orders-api.v1.yaml \ HEAD:contracts/openapi/orders-api.v1.yaml \ --fail-on-breaking This fails the build if you: Remove required fieldsChange field typesRemove endpointsChange HTTP status codes Kafka: Schema Registry Compatibility Check Configure Schema Registry to enforce backward compatibility: YAML # Schema Registry configuration confluent.schema.registry.url=http://schema-registry:8081 spring.kafka.properties.schema.registry.url=http://schema-registry:8081 # Enforce backward compatibility spring.kafka.producer.properties.auto.register.schemas=true spring.kafka.producer.properties.use.latest.version=true When you try to publish an incompatible schema, the producer fails at startup, before reaching production. Database: Flyway Validation Flyway validates migrations on startup: Properties files spring.flyway.validate-on-migrate=true spring.flyway.baseline-on-migrate=false If someone manually modified the database schema, Flyway detects the mismatch and fails the deployment. Integration Flow Diagram Figure 2: End-to-end flow showing REST → DB → Kafka integration with contracts enforced at each boundary Real-World Benefits: Production Implementation Experience After implementing contract-first across multiple services (orders, billing, inventory), teams consistently observe significant improvements: Integration quality: Substantial reduction in integration bugs reaching productionMost remaining bugs are business logic issues, not contract mismatchesClear contracts eliminate ambiguity in integration expectations Development velocity: Integration cycles measured in weeks rather than monthsConsumer teams start integration work immediately instead of waiting for provider completionMock servers enable realistic integration testing without coordination delays Operational reliability: CI-enforced contract validation prevents breaking changes from reaching productionBreaking changes caught during PR review, not after deploymentAutomated validation ensures compatibility before merge Documentation accuracy: Documentation generated from OpenAPI specs stays synchronized with implementation by designSwagger UI always reflects actual API behavior as both derive from the same contractNo manual documentation maintenance required Common Pitfalls and How to Avoid Them Pitfall 1: Treating Contracts as Documentation Wrong approach: Write code first, generate OpenAPI from annotations later.Right approach: Write OpenAPI first, generate server stubs, or validate implementation against it. Pitfall 2: Skipping CI Validation Wrong approach: Trust developers to manually check compatibility.Right approach: Automate breaking change detection in CI. Make it a required check. Pitfall 3: No Schema Evolution Strategy Wrong approach: Add fields without defaults, breaking old consumers.Right approach: All new fields must be optional or have defaults. Test with multiple schema versions. Pitfall 4: Ignoring Idempotency Wrong approach: Assume exactly-once delivery in Kafka.Right approach: Design consumers to deduplicate by eventId. Assume at-least-once. Pitfall 5: Coupling Contracts to Implementation Wrong approach: Expose internal database schema directly in API contracts.Right approach: Design contracts for consumers, not internal convenience. Use DTOs that map between the external contract and the internal model. When Contract-First Makes Sense Contract-first isn't always the answer. Use it when: ✅ Multiple teams integrate: The coordination cost justifies the upfront contract design effort✅ Public APIs or partner integrations: External consumers need stability and clear documentation✅ Microservices architecture: Services must evolve independently without breaking dependents✅ High change frequency: CI validation catches breaking changes early when changes are frequent Skip contract-first when: ❌ Single small team: Coordination overhead is low, so formal contracts add friction❌ Prototyping: You're exploring the problem space and expect major pivots❌ Internal tools with one consumer: The provider and consumer are maintained by the same person Key Takeaways Contracts enable parallel development: Provider and consumer teams work simultaneously instead of seriallyCI validation prevents breaking changes: Automate OpenAPI diffs, schema compatibility checks, and migration validationIdempotency is not optional: At-least-once delivery in Kafka requires consumers to deduplicate by event IDSchema evolution requires backward compatibility: Use nullable fields with defaults to support gradual rolloutsContracts are design specifications, not afterthoughts: Define contracts first, generate code from them What's Next? If you're implementing contract-first integration: Start with one integration, pick your most painful cross-team dependencyWrite the OpenAPI spec or Avro schema first, before any implementationSet up CI validation for that contract (openapi-diff or Schema Registry)Measure reduction in integration bugs and coordination overheadExpand to other integrations once you've proven the pattern Contract-first isn't a silver bullet, but it transforms integration from a coordination nightmare into a governance problem that CI can solve. When you're coordinating three teams across two time zones, that shift makes the difference between shipping in weeks versus months. Full source code: github.com/wallaceespindola/contract-first-integrations Related reading: OpenAPI 3.0 Specification: spec.openapis.orgConfluent Schema Registry: docs.confluent.io/platform/current/schema-registryFlyway Documentation: flywaydb.org/documentationMastering Contract-First API Development: moesif.com/blogResearch on Microservices Issues: arxiv.org Need more tech insights? Check out my GitHub, LinkedIn, and Speaker Deck. Happy coding!

By Wallace Espindola

The Latest Coding Topics

article thumbnail
Liquid Glass, Material 3, and a Lot of Plumbing
New iOS Modern (liquid glass) and Android Material 3 native themes, how they work in the Playground, in the simulator, and on devices, plus a week of performance and look
June 4, 2026
by Shai Almog DZone Core CORE
· 425 Views · 1 Like
article thumbnail
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2
Build a Slack bot using AWS Bedrock and MCP to answer GitHub questions. Learn setup, architecture, and how to extend it with new tools and data sources.
June 4, 2026
by Sangharsh Agarwal
· 585 Views
article thumbnail
Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
How AI-native tooling is finally closing the loop between compliance personas and OSCAL artifacts with an MCP-standardized, AI-agent-ready interface.
June 4, 2026
by Yuji Watanabe
· 600 Views
article thumbnail
Building Threat Intelligence Pipelines Using Python, APIs, and Elasticsearch
STIX/TAXII in, ECS normalized, provenance preserved deterministic IDs, correct bulk writes, ingest pipelines keep threat indicator data reliable and queryable under load.
June 3, 2026
by Krishnaveni Musku
· 1,056 Views
article thumbnail
Getting Started With Agentic Workflows in Java and Quarkus
A step-by-step tutorial on how to add agentic workflows to Quarkus applications with the Agentican framework via YAML and annotations.
June 3, 2026
by Shane Johnson
· 1,078 Views · 2 Likes
article thumbnail
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
Building a Slack bot with traditional APIs led to 400 lines of code. Using MCP and AWS Bedrock reduced complexity, enabling scalable, tool-driven automation.
June 3, 2026
by Sangharsh Agarwal
· 964 Views · 1 Like
article thumbnail
Building AI-Powered Java Applications With Jakarta EE and LangChain4j
Integrate AI into Java apps with Jakarta EE, CDI, MicroProfile Config, and LangChain4j. Build AI services from simple prompts to type-safe domain-driven interactions.
June 3, 2026
by Otavio Santana DZone Core CORE
· 1,025 Views
article thumbnail
MuleSoft MCP and A2A in Production: What 17 Recipes Reveal
MuleSoft MCP and A2A shipped in 2025. Zero practitioner guides exist beyond basic setup. 17 recipes reveal the implementation ladder teams are missing.
June 3, 2026
by Balachandra Shakar Bisetty
· 697 Views
article thumbnail
Migrate a Hardcoded LangGraph Agent to LaunchDarkly AI Configs in 20 Minutes
Moving a hardcoded LangGraph React agent into LaunchDarkly AI Configs so prompts, models, tools, tracking, and rollout testing can be changed without redeploying.
June 2, 2026
by Scarlett Attensil
· 1,193 Views
article thumbnail
Alternative Structured Concurrency
My goal here is to experiment with an alternative approach leveraging Java's tried-and-tested, robust functionalities that have been available since JDK 1.5.
June 2, 2026
by Valery Silaev
· 1,359 Views
article thumbnail
Optimizing Databricks Spark Pipelines Using Declarative Patterns
This article explains why hand-tuning Spark is becoming the slow path — and what the declarative alternatives actually look like in production.
June 1, 2026
by Seshendranath Balla Venkata
· 999 Views
article thumbnail
MuleSoft IDP: Enhancing Efficiency and Accuracy in Data Extraction
MuleSoft IDP uses AI to extract and structure data from documents like invoices and PDFs, helping automate workflows, reduce errors, and improve processing speed.
June 1, 2026
by Jitendra Bafna
· 1,088 Views
article thumbnail
Jakarta EE 12: Entering the Data Age of Enterprise Java
Jakarta EE 12 introduces the Data Age of Enterprise Java with Jakarta Query, improved data access, and a unified model for cloud-native and polyglot systems.
June 1, 2026
by Otavio Santana DZone Core CORE
· 8,722 Views
article thumbnail
A Hands-On ABAP RESTful Programming Model Guide
BAPIs are legacy; replace them with RAP-based APIs and EML in S/4HANA 2022 for cleaner, cloud-ready, upgrade-safe ABAP that SAP actually maintains.
June 1, 2026
by Deepika Paturu
· 665 Views
article thumbnail
Event-Driven Pipelines With Apache Pulsar and Go
Build scalable, real-time pipelines with Apache Pulsar and Go using event-driven producers and consumers that communicate via Pulsar topics.
May 29, 2026
by Shivi Kashyap
· 2,364 Views
article thumbnail
Zero-Downtime Deployments for Java Apps on Kubernetes
Achieve zero-downtime deployments for Java applications on Kubernetes using rolling updates, readiness/liveness probes, and graceful shutdown strategies.
May 29, 2026
by Ramya vani Rayala
· 3,178 Views
article thumbnail
Pragmatica Aether: Let Java Be Java
A modern, distributed, fault-tolerant runtime environment for the language that was intentionally designed for managed environments.
May 29, 2026
by Sergiy Yevtushenko
· 3,384 Views · 1 Like
article thumbnail
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
Define API, event, and DB contracts upfront to enable parallel development, catch breaking changes in CI, and maintain consistent, reliable integrations.
May 29, 2026
by Wallace Espindola
· 2,023 Views
article thumbnail
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Learn how to build an ETL pipeline with human-in-the-loop approval that costs nothing while waiting — and see real cost data from processing 1,000 documents.
May 28, 2026
by Harpreet Siddhu
· 3,450 Views
article thumbnail
AI Paradigm Shift: Analytics Without SQL
An AI-native analytics agent sits between users and the data warehouse, translating natural-language questions into governed SQL or Python workflows and dashboards.
May 28, 2026
by Haricharan Shivram Suresh Chandra Kumar
· 1,879 Views
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×