The MovieManager project has been updated to use JDK 25 and the AOT cache from Project Leyden. Project Leyden is part of the OpenJDK project and provides cached linking and cached performance statistics. That means the time spent linking at startup is moved to build time, and the statistics are created during a training run at build time as well. Because of that, the JVM loads the needed classes already linked and starts compiling the hot code paths immediately. The MovieManager application starts in less than half the time with these optimizations, without any code changes.

All these advantages come with preconditions:

Exactly the same JVM version at build time, training time, and run time
The same OS (Linux is used here) and libc at all steps (no Alpine-based Docker images)
The same CPU architecture, for example, AMD64 or ARM64

The steps to use Project Leyden:

Build the Spring Boot application
Extract the Spring Boot application
Do a training run with the extracted application to create the AOT cache
Create the Docker image with the extracted application and the AOT cache

Building and Training the Application

The first step is to build the Spring Boot JAR. The MovieManager project has an integrated build that builds the Angular frontend and the Spring Boot backend with this Maven command:

Shell
./mvnw clean install -Ddocker=true -Dnpm.test.script=test-chromium

Project Leyden does not support Spring Boot fat JARs. The JAR has to be extracted so that Project Leyden can find the library JARs the project uses. To do that, use this command:

Shell
java -Djarmode=tools -jar backend/target/moviemanager-backend-0.0.1-SNAPSHOT.jar extract --destination extracted

The result is the directory ‘extracted’ with the application JAR and a subdirectory ‘lib’ that contains the libraries used.

The second step is to create the AOT cache. To do that, the application has to run under production conditions. That means using a real PostgreSQL database with the database driver.
That enables the JDK to record all the needed classes of the project and to create realistic performance statistics for the code compilation. To do this, a PostgreSQL database has to be started (done here in a Docker container), and the application has to do the full startup. These commands are needed:

Shell
docker pull postgres:13
docker run --name local-postgres -e POSTGRES_PASSWORD=sven1 -e POSTGRES_USER=sven1 -e POSTGRES_DB=movies -p 5432:5432 -d postgres:13
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:+UseCompressedOops -XX:+UseCompactObjectHeaders -XX:+ExitOnOutOfMemoryError -XX:MaxDirectMemorySize=64m -XX:+UseStringDeduplication -Xlog:aot -XX:AOTCacheOutput=app.aot -Dspring.context.exit=onRefresh -Djava.security.egd=file:/dev/./urandom -jar extracted/moviemanager-backend-0.0.1-SNAPSHOT.jar --spring.profiles.active=prod

The Java command runs the application with the parameter ‘-Dspring.context.exit=onRefresh’, which makes Spring Boot do the full startup and then exit. The parameters ‘-Xlog:aot -XX:AOTCacheOutput=app.aot’ enable logging of the AOT process and create ‘app.aot’, the AOT cache. The AOT cache contains everything that is needed for a fast startup of the application. For the AOT cache to also contain information that improves production performance, the training run would have to start up and process realistic production requests. That is beyond the scope of this article.

The third step is to test the new application setup:

Shell
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:+UseCompressedOops -XX:+UseCompactObjectHeaders -XX:+ExitOnOutOfMemoryError -XX:MaxDirectMemorySize=64m -XX:+UseStringDeduplication -Xlog:class+path=info -XX:AOTCache=app.aot -Xlog:aot -Djava.security.egd=file:/dev/./urandom -jar extracted/moviemanager-backend-0.0.1-SNAPSHOT.jar --spring.profiles.active=prod

The start-up time of the new setup with the AOT cache can be compared to the start-up time of the Spring Boot JAR.
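To compare the two start-up times without reading the logs by eye, the "Started ... in N seconds" line that Spring Boot writes when startup completes can be parsed. The following is a minimal Python sketch under that assumption; the helper names `startup_seconds` and `speedup`, the application class name, and the "process running for" values in the sample lines are hypothetical.

```python
import re

# Spring Boot logs a line such as
#   "Started SomeApplication in 3.5 seconds (process running for 4.0)"
# once startup completes. Only the first number is of interest here.
STARTED = re.compile(r"Started \S+ in ([0-9.]+) seconds")


def startup_seconds(log_text: str) -> float:
    """Return the startup time reported in a Spring Boot log, in seconds."""
    match = STARTED.search(log_text)
    if match is None:
        raise ValueError("no 'Started ... in N seconds' line found")
    return float(match.group(1))


def speedup(baseline_log: str, aot_log: str) -> float:
    """How many times faster the AOT-cache start is than the baseline."""
    return startup_seconds(baseline_log) / startup_seconds(aot_log)


if __name__ == "__main__":
    # Illustrative log lines; the class name is a placeholder.
    fat_jar = "Started MovieManagerApplication in 9.0 seconds (process running for 9.6)"
    with_aot = "Started MovieManagerApplication in 3.5 seconds (process running for 4.0)"
    print(f"speedup: {speedup(fat_jar, with_aot):.2f}x")
```

Feeding it the captured console output of both runs gives a single ratio that can be tracked across builds.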
On a medium-powered laptop, the times are:

9 seconds for the Spring Boot JAR
3.5 seconds for the new setup with the AOT cache

Creating a Docker Image

To use the application in production, it needs to be packaged into a Docker image. The Docker image needs to contain the extracted application setup and the AOT cache. The base image needs to have exactly the same JDK version, OS, and libc. That means small base images like Alpine cannot be used. The resulting image cannot be small because it contains 180 MB of AOT cache and a larger base image. This can be done with this Dockerfile:

Dockerfile
FROM eclipse-temurin:25.0.3_9-jdk-jammy
WORKDIR /application
ARG JAR_FILE=extracted/*.jar
COPY ${JAR_FILE} moviemanager-backend-0.0.1-SNAPSHOT.jar
COPY extracted/ ./
COPY app.aot app.aot
ENV JAVA_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=50 \
  -XX:+UseCompressedOops \
  -XX:+UseCompactObjectHeaders \
  -XX:+ExitOnOutOfMemoryError \
  -XX:MaxDirectMemorySize=64m \
  -XX:+UseStringDeduplication"
ENTRYPOINT exec java $JAVA_OPTS -XX:+AOTClassLinking \
  -XX:AOTCache=app.aot \
  -Xlog:class+path=info \
  -Djava.security.egd=file:/dev/./urandom \
  -jar moviemanager-backend-0.0.1-SNAPSHOT.jar

It copies the new application setup into the image and adds the AOT cache. The name of the application JAR is stored in the AOT cache and has to be exactly the same as during the creation of the AOT cache. The ‘JAVA_OPTS’ also have to be the same. If the JDK version in the build environment changes, the version of the base image has to be adjusted accordingly. The parameter ‘-Xlog:class+path=info’ makes analyzing AOT problems much easier. The Docker image size is 705 MB. That makes it about double the size of an image with a Spring Boot JAR and an Alpine-based JDK base image.

Creating a Build Pipeline

Creating Docker images for an application by hand is unsustainable in a production environment. A build pipeline is needed.
The MovieManager project is hosted on GitHub; because of that, the project uses a GitHub workflow as its build pipeline. The complete code for the build pipeline is in the script. The steps of the GitHub pipeline can be recreated in other environments too.

The first step is to set up the PostgreSQL database service to be used in this build:

YAML
jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    env:
      POSTGRES_URL: jdbc:postgresql://localhost:5432/movies
    services:
      postgres:
        image: postgres:latest
        env:
          POSTGRES_USER: sven1
          POSTGRES_PASSWORD: sven1
          POSTGRES_DB: movies
        ports:
          - 5432:5432
        options: >-
          --health-cmd="pg_isready -U sven1 -d movies"
          --health-interval=10s
          --health-timeout=5s
          --health-retries=5

The commands set up the PostgreSQL service in the build pipeline with user, password, database name, and port. The ‘POSTGRES_URL’ is set to access the database later.

The second step is to check out the project:

YAML
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

It checks out the contents of the master branch.

The third step is to provide the JDK:

YAML
      - name: Setup Java JDK
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: 25

JDK 25 is the minimum version for using Project Leyden with cached linking and performance statistics.

The fourth step builds the Spring Boot JAR:

YAML
      - name: Build with Maven
        if: matrix.language == 'java'
        run: |
          ./mvnw clean install -Ddocker=true

That is the Maven command to build the project.

The fifth step is to find the Spring Boot JAR:

YAML
      - name: Find fat jar
        if: matrix.language == 'java'
        id: jar
        run: |
          JAR_PATH=$(find ./backend/target -type f -name "*SNAPSHOT.jar" | head -n 1)
          echo "Found JAR: $JAR_PATH"
          echo "jar=$JAR_PATH" >> $GITHUB_OUTPUT

The sixth step is to extract the Spring Boot JAR:

YAML
      - name: Unpack fat jar
        if: matrix.language == 'java'
        id: UNPACK
        run: |
          java -Djarmode=tools -jar ${{ steps.jar.outputs.jar }} extract --destination extracted
          EXTRACTED_PATH=$(find . -type d -name "extracted" | head -n 1)
          echo "Found directory: $EXTRACTED_PATH"
          echo "extracted=$EXTRACTED_PATH" >> $GITHUB_OUTPUT

The seventh step is to get the name of the extracted application JAR:

YAML
      - name: find extracted jar
        if: matrix.language == 'java'
        id: EXTRACT
        run: |
          EXTRACTED_JAR=$(find "${{ steps.UNPACK.outputs.extracted }}" -type f -name "*.jar" | head -n 1)
          EXTRACTED_JAR=${EXTRACTED_JAR#./}
          echo "Found extracted JAR: $EXTRACTED_JAR"
          echo "extracted=$EXTRACTED_JAR" >> $GITHUB_OUTPUT

The eighth step is to create the AOT cache:

YAML
      - name: Create AOT cache
        if: matrix.language == 'java'
        id: AOT
        env:
          JAVA_TOOL_OPTIONS: ""
          _JAVA_OPTIONS: ""
          JDK_JAVA_OPTIONS: ""
        run: |
          EXTRACTED_JAR="${{ steps.EXTRACT.outputs.extracted }}"
          echo "jar=$EXTRACTED_JAR"
          echo "JAVA_TOOL_OPTIONS=$JAVA_TOOL_OPTIONS"
          echo "_JAVA_OPTIONS=$_JAVA_OPTIONS"
          echo "JDK_JAVA_OPTIONS=$JDK_JAVA_OPTIONS"
          JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:+UseCompressedOops -XX:+UseCompactObjectHeaders -XX:+ExitOnOutOfMemoryError -XX:MaxDirectMemorySize=64m -XX:+UseStringDeduplication"
          java $JAVA_OPTS \
            -XX:+AOTClassLinking \
            -XX:AOTCacheOutput=app.aot \
            -Xlog:aot \
            -Dspring.context.exit=onRefresh \
            -Dspring.datasource.url="${{ env.POSTGRES_URL }}" \
            -Dspring.profiles.active=prod \
            -jar "$EXTRACTED_JAR" || echo "AOT Training finished with exit code $?"

This runs the application startup with the PostgreSQL database to create the AOT cache.

The ninth step shows the exact JDK version used in the AOT cache generation:

YAML
      - name: Show Jdk version
        if: matrix.language == 'java'
        id: JDK
        run: |
          JDK_VERSION=$(java -version 2>&1)
          VERSION=$(echo "$JDK_VERSION" | sed -n 's/.*build \([^[:space:]]*\)-LTS.*/\1/p')
          echo "JDK_VERSION=$JDK_VERSION"
          echo "VERSION=$VERSION"
          MY_VERSION="jdk=$VERSION"

In case of problems with using the AOT cache, the first check is to compare the version shown here against the JDK version in the Docker base image.
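The version comparison described above can also be scripted. The following Python sketch applies the same "text between 'build ' and '-LTS'" rule as the sed expression in the ninth step to two `java -version` outputs, one from the build environment and one from the Docker base image. The function names `jdk_build` and `same_jdk` are hypothetical, and the sample output is an illustrative assumption about the format, not output captured from this pipeline.

```python
import re
from typing import Optional

# Same idea as the sed pattern in the workflow: capture the non-space
# characters between "build " and "-LTS".
BUILD = re.compile(r"build (\S*?)-LTS")


def jdk_build(java_version_output: str) -> Optional[str]:
    """Extract the build string (for example '25+36') or None if absent."""
    match = BUILD.search(java_version_output)
    return match.group(1) if match else None


def same_jdk(build_env_output: str, image_output: str) -> bool:
    """True only if both outputs contain a build string and they are equal."""
    build_env = jdk_build(build_env_output)
    image = jdk_build(image_output)
    return build_env is not None and build_env == image


if __name__ == "__main__":
    # Illustrative text in the general shape of `java -version` output.
    sample = (
        'openjdk version "25" 2025-09-16 LTS\n'
        "OpenJDK Runtime Environment (build 25+36-LTS)\n"
        "OpenJDK 64-Bit Server VM (build 25+36-LTS, mixed mode, sharing)\n"
    )
    print(jdk_build(sample))
    print(same_jdk(sample, sample))
```

A check like this could run as a pipeline step and fail the build early instead of producing an image whose AOT cache is silently ignored.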
The tenth step creates the Docker image:

YAML
      - name: Build and push
        uses: docker/build-push-action@v6
        if: matrix.language == 'java'
        with:
          context: .
          file: ./Dockerfile
          build-args: |
            JAR_PATH=${{ steps.EXTRACT.outputs.extracted }}
            LIB_PATH=${{ steps.aot.outputs.extracted }}
          push: false
          tags: angular2guy/moviemanager:latest

This step can also push the Docker image to an image repository (‘push’ is set to false here).

Conclusion

The results of using the AOT cache of Project Leyden are impressive. Cutting the startup time in half without any code change is amazing. The effort to create the AOT cache and set up the new application is a one-time investment. The impact of the larger Docker images is low. That makes scaling application instances in Kubernetes clusters up and down much more flexible because the time until a new application instance is available is much lower. In Kubernetes environments with scaling of application instances, the AOT cache is a significant step forward and should be used.

For serverless applications, a startup time of 3.5 seconds is too slow. There, Project CrAC or Native Image would be needed. Project CrAC needs code changes and testing. Native Image has the closed-world assumption, which makes it hard to prove that larger applications work correctly. Alternatives are Node.js with Nest.js and TypeScript, or Go with its libraries.

Project Leyden is not finished in JDK 25. There are plans to add compiled code to the AOT cache in the future. The JVM is an impressive piece of technology that is still improving further.
Picture a simple scenario. An AI agent is wired into your Case page in Salesforce. A customer sends a reply that sounds like the issue is resolved. The agent reads the conversation, decides the case can be closed, and updates the status to "Closed."

A week later, the customer calls in frustrated. "Why did you close my case? My issue wasn't resolved."

You go to the Case in Salesforce to investigate. The audit trail tells you almost nothing useful. CreatedBy says "Integration User." LastModifiedBy says the same. Field history tracking shows the status changed from "Open" to "Closed" at 3:37 pm last Tuesday. None of it tells you the one thing you actually need to know: why did the agent do that?

This is the gap that shows up the minute AI agents start taking real action inside Salesforce. The fix is small. One custom object, a handful of fields, and a habit of asking the agent to explain itself.

Where the Problem Begins

Standard Salesforce auditing was built for two kinds of actors: human users and deterministic automation. When a human closes a case, you can ask them why. When a Flow closes a case, the criteria are sitting right there in the metadata. When an Apex trigger closes a case, the logic is in the code.

AI agents are neither of those. The decision to close the case lived inside a language model's response, and the moment the action committed, that reasoning was gone. Unless you captured it.

The Idea

Drop a single custom object into the org. Every time the agent does something, it writes one row. That row captures three things:

What the agent did
What context it had when deciding
Why it made the call

That's the whole framework. One object, one row per action, three buckets of data.

The Core Object: Agent_Action_Log__c

Here are the fields that matter. Everything else is optional polish.
Triggering User – Lookup to the User whose interaction caused the agent to run.
Related Case – Lookup to the Case the agent was working on.
Action Type – Picklist with values like Create Record, Update Record, Send Email, Call External API.
Tool Called – The specific Apex method, Flow, or API the agent invoked.
Inputs – The arguments the agent passed to the tool (long text).
Context Snapshot – The relevant context the agent had at decision time, such as record state and recent activity (long text).
Reasoning – The agent's stated rationale, captured verbatim from the LLM response (long text).
Confidence – A number between 0 and 1 if the agent reports one. Optional but useful for reporting.
Status – Success or Failure.
Error – If anything went wrong, the error message lands here.

Ten fields. One object. Keep it small until you need it to be bigger. Because Related Case is a lookup field, every Case page can show a related list of Agent Action Logs.

How the Reasoning Actually Gets Captured

This is the part most teams get wrong, so it's worth slowing down on. The LLM doesn't volunteer its reasoning. You have to ask for it as part of the response itself. The way you do that is by defining every tool the agent can call with a JSON schema, and adding a reasoning field to that schema right next to the actual arguments.

Back to the case-closing example. Normally the update_case_status tool would accept two arguments: case_id and new_status. With this framework, add a third required field:

JSON
{
  "type": "function",
  "function": {
    "name": "update_case_status",
    "description": "Updates the status of a Salesforce Case.",
    "parameters": {
      "type": "object",
      "properties": {
        "case_id": {
          "type": "string",
          "description": "The 18-character Salesforce Case Id."
        },
        "new_status": {
          "type": "string",
          "description": "The new status value to apply."
        },
        "reasoning": {
          "type": "string",
          "description": "Why this status change is appropriate given the context."
        }
      },
      "required": ["case_id", "new_status", "reasoning"]
    }
  }
}

Here's the important part. The LLM is not calling Salesforce. Your Apex code calls the LLM, the LLM sends back a JSON payload describing the tool call (function name plus arguments), and your Apex code is the one that actually executes the action. So when the response comes back with case_id, new_status, and reasoning, your code reads all three out of the JSON. The first two drive the update on the Case. The third gets written into the Reasoning field on the new Agent_Action_Log__c record. The agent's words land in your audit log verbatim.

If you turn on strict mode in your tool definition (OpenAI's strict: true), the model is forced to return every required field, including reasoning. That makes empty reasoning rare. Even so, defensive code should reject any tool call missing the field and ask the model to retry, in case strict enforcement isn't available on the path you're using.

The reasoning field is the difference between "the agent closed this case" and "the agent closed this case because the customer's last message said 'Thanks, that solved it, you can close this out.'" One of those is an audit log entry. The other is actually useful.

How It All Works

Walking through the case-closing scenario end to end:

1. A customer reply lands on a Case, triggering the agent.
2. The framework gathers context about the Case: the Triggering User, the Related Case, and a Context Snapshot showing the last few messages on the case.
3. The agent calls update_case_status with case_id, new_status set to "Closed", and a reasoning string. The framework now has everything it needs: Action Type, Tool Called, Inputs, Reasoning, and Confidence.
4. The framework executes the actual status update on the Case.
5. The framework writes one Agent_Action_Log__c row capturing everything: the context, the reasoning, and the outcome (Status set to "Success", or Error if the update failed).
6. If the agent takes a follow-up action (say, sending the customer a notification email), that's a new row on the same Case, with its own reasoning.

From any Case page, the related list shows every agent action that ever touched that record. From a Salesforce report, compliance can pull every "Close Case" action where Confidence was below 0.7 in the last quarter. From a debugger's view, you can find out exactly what the agent thought it was doing the moment it did it.

Why This Approach Works

One object, one place to look. No traversing relationships, no joining across logs.
The "why" lives next to the "what." Reasoning is captured at decision time, not reconstructed after the fact.

Final Thoughts

The case-closing example is intentionally small. But the same pattern handles every other agent action, whether that's sending an email, escalating to a specialist, updating a field, or posting a notification. Each one becomes another row on the same object, with the same fields, captured the same way.

That's the whole point. You don't need an elaborate observability stack to make AI agents auditable in Salesforce. You need one custom object, a habit of forcing the model to explain itself, and a wrapper around your tool calls that writes a row every time.

The next time someone asks why the agent did something, the answer is already sitting in a record on the page they're looking at.
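The wrapper around tool calls described above can be sketched in a few lines. The article's implementation lives in Apex; the Python below only illustrates the logic and is not the author's code. All names are hypothetical: `handle_tool_call`, `MissingReasoning`, the `execute` callback that stands in for the real Case update, and the dictionary keys that stand in for Agent_Action_Log__c fields. It also assumes the tool-call payload has already been reduced to a JSON object with `name` and `arguments` keys.

```python
import json
from typing import Callable


class MissingReasoning(ValueError):
    """Raised when a tool call arrives without a usable reasoning string."""


def handle_tool_call(
    tool_call_json: str,
    context_snapshot: str,
    execute: Callable[..., None],
) -> dict:
    """Validate a tool call, run it, and return the audit row for it."""
    call = json.loads(tool_call_json)
    args = call["arguments"]

    reasoning = (args.get("reasoning") or "").strip()
    if not reasoning:
        # Defensive check: the caller should ask the model to retry.
        raise MissingReasoning(f"{call['name']} called without reasoning")

    # Everything except the reasoning is an input to the tool itself.
    inputs = {k: v for k, v in args.items() if k != "reasoning"}
    row = {
        "Related_Case": args.get("case_id"),
        "Tool_Called": call["name"],
        "Inputs": json.dumps(inputs, sort_keys=True),
        "Context_Snapshot": context_snapshot,
        "Reasoning": reasoning,  # stored verbatim
    }
    try:
        execute(**inputs)  # the action itself, e.g. the Case update
        row["Status"] = "Success"
    except Exception as exc:
        row["Status"] = "Failure"
        row["Error"] = str(exc)
    return row
```

The row is built before the action runs and completed afterwards, so a failed update still produces a log entry with the reasoning attached.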
The Silent Killer No One Mentions Until the Bill Comes

Most papers about "agentic AI in production" stop where the problem starts: price. Interacting with Claude or GPT in chat has a natural pace setter: you, reading the generated text. Take the same agent out of chat mode and drop it into CI/CD, a nightly batch, or webhook handling, and the pacing goes away: you're running the thing purely on computer time, at computer prices.

The numbers look even scarier when you dig deeper. ReAct-style looped execution prepends all previous outputs to each following prompt, so total token consumption grows at roughly O(n²) in the number of steps. A PR review agent in a loop of three steps, which costs $0.04 in local development, can rack up a charge of $0.40+ once it gets stuck in a loop. With hundreds of PRs a week, that could easily amount to five-figure surprises in your bill.

And that's not all: with that kind of bill, you don't know which particular agent, repo, user, or prompt the money was spent on. The invoice from your vendor simply says "API usage." Since you can't attribute the charges properly, the only recourse is a blunt measure such as "ban Sonnet in CI." Such a measure also takes out all the valuable agents you had.

This paper proposes an attribution-based architecture: a thin layer between you and your LLM vendor that adds deployment-aware attribution to each API call, enforces budget limits at the right levels, and cuts runaway loops off before they start racking up charges. The source code is written in Python and is framework-agnostic. It uses OpenTelemetry's gen_ai.* semantic conventions, so you can send the telemetry anywhere.

Architecture

The design has four layers. Each one fails closed by default.

"Why not use observability as a gateway?"

Because LangSmith, Helicone, Langfuse, and Arize Phoenix observe costs retroactively. None of them intercepts in the execution path. They alert you about the $437 weekend, but they don't prevent it.
Prevention involves enforcing costs synchronously, in the execution path, before the inference API call leaves the VPC.

The Cost Attribution Context

Every outbound API call from an agent must carry a context object. This is the only decision that matters, and yet it's the one most often missed by teams.

Python
# attribution.py
from dataclasses import dataclass, field, asdict
from typing import Optional
import contextvars
import uuid


@dataclass(frozen=True)
class CostContext:
    """Travels with every LLM call. Used for budgets, traces, and chargeback."""
    tenant_id: str        # business unit / team / customer
    agent_id: str         # logical agent name, e.g. "pr-reviewer"
    agent_version: str    # prompt+code hash, lets you A/B prompts
    run_id: str           # one full agent invocation (a "task")
    step_id: str          # individual reasoning step inside the run
    parent_run_id: Optional[str] = None   # for hierarchical agents
    repo: Optional[str] = None            # CI metadata
    pr_number: Optional[int] = None
    triggered_by: Optional[str] = None    # user, scheduler, webhook, ...
    labels: dict = field(default_factory=dict)


_current_context: contextvars.ContextVar[Optional[CostContext]] = \
    contextvars.ContextVar("cost_context", default=None)


def new_run(tenant_id: str, agent_id: str, agent_version: str, **kw) -> CostContext:
    # Start a fresh run with a new run_id and the root step "0".
    return CostContext(
        tenant_id=tenant_id,
        agent_id=agent_id,
        agent_version=agent_version,
        run_id=str(uuid.uuid4()),
        step_id="0",
        **kw,
    )


def child_step(ctx: CostContext, step_label: str) -> CostContext:
    # Copy the context and append the label to the dotted step_id.
    return CostContext(**{**asdict(ctx), "step_id": f"{ctx.step_id}.{step_label}"})


def set_current(ctx: CostContext):
    _current_context.set(ctx)


def get_current() -> Optional[CostContext]:
    return _current_context.get()

Because the 'step_id' uses dot notation, as in '0.plan.2.tool.web_search', a single tracing query can compute the cost of a certain step ("what was the planning cost last week for all the PR reviewers?") without having to process free-text tags.
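To make the benefit of the dotted 'step_id' concrete, here is a small sketch of the kind of roll-up it allows. It works on an in-memory list of (step_id, cost) pairs rather than on a real tracing backend; the function name `cost_by_phase` and the sample figures are illustrative assumptions.

```python
from collections import defaultdict


def cost_by_phase(spans: list[tuple[str, float]]) -> dict[str, float]:
    """Sum cost per phase, where the phase is the segment after the root.

    A step_id such as "0.review.2" has root "0" and phase "review".
    A bare root ("0") is counted under "root".
    """
    totals: dict[str, float] = defaultdict(float)
    for step_id, usd in spans:
        parts = step_id.split(".")
        phase = parts[1] if len(parts) > 1 else "root"
        totals[phase] += usd
    return dict(totals)


if __name__ == "__main__":
    # Made-up per-step costs for one agent run.
    spans = [
        ("0.plan", 0.04),
        ("0.tool.diff", 0.01),
        ("0.review.1", 0.18),
        ("0.review.2", 0.18),
    ]
    for phase, usd in sorted(cost_by_phase(spans).items()):
        print(f"{phase}: ${usd:.2f}")
```

The same split-on-dot grouping translates directly into a prefix filter or a string-split expression in whatever query language the tracing backend offers.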
The Attribution Interceptor

The interceptor wraps the LLM client and emits a span for each request, following the OpenTelemetry conventions for GenAI services ('gen_ai.system', 'gen_ai.request.model', 'gen_ai.usage.input_tokens', etc.). What really makes the difference is the additional 'cost.*' namespace.

Python
# interceptor.py
import time
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from attribution import get_current

tracer = trace.get_tracer("agentic-cost-gateway")

# Provider price tables in USD per 1M tokens (input, output).
# Keep this in config: prices move quarterly.
PRICE_TABLE = {
    "claude-sonnet-4-7": (3.00, 15.00),
    "claude-haiku-4-5": (0.80, 4.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def estimated_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Note: a model missing from PRICE_TABLE is priced at zero.
    p_in, p_out = PRICE_TABLE.get(model, (0.0, 0.0))
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000


class AttributedLLMClient:
    """Wraps a raw provider SDK and emits attributed OTel spans."""

    def __init__(self, raw_client, system_name: str):
        self._raw = raw_client
        self._system = system_name  # "anthropic", "openai", ...

    def chat(self, *, model: str, messages: list, **kw) -> dict:
        ctx = get_current()
        if ctx is None:
            raise RuntimeError(
                "No CostContext set. Wrap your agent loop in `set_current(new_run(...))`."
            )
        with tracer.start_as_current_span("gen_ai.chat") as span:
            span.set_attribute("gen_ai.system", self._system)
            span.set_attribute("gen_ai.request.model", model)
            span.set_attribute("gen_ai.operation.name", "chat")
            # Business attribution.
            span.set_attribute("cost.tenant_id", ctx.tenant_id)
            span.set_attribute("cost.agent_id", ctx.agent_id)
            span.set_attribute("cost.agent_version", ctx.agent_version)
            span.set_attribute("cost.run_id", ctx.run_id)
            span.set_attribute("cost.step_id", ctx.step_id)
            if ctx.repo:
                span.set_attribute("cost.repo", ctx.repo)
            if ctx.pr_number is not None:
                span.set_attribute("cost.pr_number", ctx.pr_number)

            t0 = time.monotonic()
            try:
                resp = self._raw.chat(model=model, messages=messages, **kw)
            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.set_attribute("cost.outcome", "provider_error")
                raise
            latency_ms = (time.monotonic() - t0) * 1000

            usage = resp.get("usage", {})
            in_tok = usage.get("input_tokens", 0)
            out_tok = usage.get("output_tokens", 0)
            cost = estimated_cost_usd(model, in_tok, out_tok)

            span.set_attribute("gen_ai.usage.input_tokens", in_tok)
            span.set_attribute("gen_ai.usage.output_tokens", out_tok)
            # The convention defines finish_reasons as a list of strings.
            span.set_attribute("gen_ai.response.finish_reasons",
                               [resp.get("finish_reason", "stop")])
            span.set_attribute("cost.usd", cost)
            span.set_attribute("cost.latency_ms", latency_ms)
            span.set_attribute("cost.outcome", "ok")
            return resp

One trace from a PR review run would then look like the following in any backend that supports OTel tracing (Jaeger, Tempo, Honeycomb, Datadog, ClickStack, etc.):

Plain Text
agent.run cost.run_id=abc123 cost.usd=0.41
├── gen_ai.chat step_id=0.plan model=claude-sonnet-4-7 cost.usd=0.04
├── gen_ai.chat step_id=0.tool.diff model=claude-haiku-4-5 cost.usd=0.01
├── gen_ai.chat step_id=0.review.1 model=claude-sonnet-4-7 cost.usd=0.18
└── gen_ai.chat step_id=0.review.2 model=claude-sonnet-4-7 cost.usd=0.18

You will be able to ask "which agent or which repo cost us the most last week?" with one query, and you'll get a structured response, not a regular expression search through log entries.

The Budget Broker

This broker makes sure that hard limits are enforced before the call leaves your network.
The challenge is doing the check and the increment atomically across multiple worker agents; race conditions translate into dollars lost here. Redis and a Lua script do the job in one shot.

Python
# budget_broker.py
import redis
from dataclasses import dataclass

# Atomic: read current spend, compare to limit, increment if room.
# Returns 1 if approved, 0 if over budget. The increment uses a tentative
# token estimate; the interceptor reconciles to actuals after the call.
RESERVE_LUA = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local request = tonumber(ARGV[2])
local ttl = tonumber(ARGV[3])
local current = tonumber(redis.call('GET', key) or '0')
if current + request > limit then
  return 0
end
redis.call('INCRBY', key, request)
redis.call('EXPIRE', key, ttl)
return 1
"""

RECONCILE_LUA = """
local key = KEYS[1]
local delta = tonumber(ARGV[1])
local new = redis.call('INCRBY', key, delta)
if new < 0 then
  redis.call('SET', key, '0')
end
return new
"""


@dataclass
class Budget:
    scope_key: str      # e.g. "tenant:platform-team:run:abc123"
    limit_tokens: int
    ttl_seconds: int = 3600


class BudgetBroker:
    def __init__(self, redis_url: str):
        self.r = redis.Redis.from_url(redis_url)
        self._reserve = self.r.register_script(RESERVE_LUA)
        self._reconcile = self.r.register_script(RECONCILE_LUA)

    def reserve(self, b: Budget, estimated_tokens: int) -> bool:
        ok = self._reserve(
            keys=[b.scope_key],
            args=[b.limit_tokens, estimated_tokens, b.ttl_seconds],
        )
        return bool(ok)

    def reconcile(self, scope_key: str, actual_minus_estimate: int):
        """Called after the response. Positive delta = we under-estimated."""
        self._reconcile(keys=[scope_key], args=[actual_minus_estimate])

Wire the broker into the interceptor:

Python
# in AttributedLLMClient.chat, before calling self._raw.chat:
# (needs `from datetime import date`; estimate_tokens, BudgetExceeded,
#  RUN_LIMIT, and TENANT_DAILY_LIMIT are assumed to be defined elsewhere)
estimated_in = estimate_tokens(messages)  # tiktoken or anthropic-tokenizer
estimated_out = kw.get("max_tokens", 1024)
estimated_total = estimated_in + estimated_out

# Two-level budget: per-run AND per-tenant-per-day.
run_key = f"tenant:{ctx.tenant_id}:run:{ctx.run_id}"
day_key = f"tenant:{ctx.tenant_id}:day:{date.today().isoformat()}"

if not broker.reserve(Budget(run_key, RUN_LIMIT, ttl_seconds=3600), estimated_total):
    span.set_attribute("cost.outcome", "run_budget_exceeded")
    raise BudgetExceeded(f"Run {ctx.run_id} hit per-run ceiling")

if not broker.reserve(Budget(day_key, TENANT_DAILY_LIMIT, ttl_seconds=86400), estimated_total):
    # Refund the run reservation we just took.
    broker.reconcile(run_key, -estimated_total)
    span.set_attribute("cost.outcome", "tenant_daily_exceeded")
    raise BudgetExceeded(f"Tenant {ctx.tenant_id} over daily limit")

# ... call provider ...

actual_total = in_tok + out_tok
broker.reconcile(run_key, actual_total - estimated_total)
broker.reconcile(day_key, actual_total - estimated_total)

Budgets are enforced at different levels, and each level catches a different failure:

Budget | Catches | Failure mode
Per run | One agent is hung up in an endless loop | Terminate the run
Tenant per day | One bug found in *all* agents from one team | Stop new runs, notify
Agent per hour (optional) | One rogue agent with a buggy version of a prompt | Roll back, notify

The third tier is optional but cheap to add: it is the same broker with another key, and it will help you spot a regression caused by a prompt modification during deployment.

The Loop Guard

Per-run limits stop an expensive run eventually. The loop guard catches fast-growing runs earlier: for example, the ReAct loop that started reading every file in the repository because the planner misunderstood the diff. The indicator to track is token velocity per step: if the number of input tokens at step n exceeds the number at step n-1 by more than a configured factor, we break the loop.
Python
# loop_guard.py
from collections import defaultdict


class CircuitOpen(Exception):
    """Raised when the loop guard stops a run."""


class LoopGuard:
    """Detects pathological context growth across steps within a run."""

    def __init__(self, growth_factor_limit: float = 1.6, absolute_step_limit: int = 50):
        self.growth_factor_limit = growth_factor_limit
        self.absolute_step_limit = absolute_step_limit
        self._last_input_tokens: dict[str, int] = defaultdict(int)
        self._step_count: dict[str, int] = defaultdict(int)

    def check(self, run_id: str, current_input_tokens: int) -> None:
        self._step_count[run_id] += 1
        if self._step_count[run_id] > self.absolute_step_limit:
            raise CircuitOpen(
                f"Run {run_id} exceeded {self.absolute_step_limit} steps"
            )
        last = self._last_input_tokens[run_id]
        # The first step of a run has no predecessor, so it is never flagged.
        if last > 0 and current_input_tokens > last * self.growth_factor_limit:
            raise CircuitOpen(
                f"Run {run_id}: context grew {current_input_tokens / last:.2f}x "
                f"(limit {self.growth_factor_limit}x). Likely loop."
            )
        self._last_input_tokens[run_id] = current_input_tokens

    def end_run(self, run_id: str):
        self._last_input_tokens.pop(run_id, None)
        self._step_count.pop(run_id, None)

A factor of 1.6× is not magic. It comes from observing well-functioning agents: in a healthy plan-and-execute loop, each iteration appends tool output and reasoning to the prompt, so the input grows from step to step but stays below this factor. Growth of 2× or more between steps almost certainly means that the agent has read the same file twice or has loaded the full RAG corpus. In a multi-process setup, replace the in-memory dicts with a Redis hash, using the same logic and the same keys.
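The effect of the growth-factor threshold is easier to see with numbers. The sketch below computes step-over-step input-token ratios for two made-up runs: one where each step appends a fixed amount of tool output to the prompt, and one where a single step suddenly pulls in a large blob. The token counts and the helper names `growth_ratios` and `first_violation` are illustrative assumptions, not measurements from a real agent.

```python
from typing import Optional


def growth_ratios(input_tokens: list[int]) -> list[float]:
    """Ratio of each step's input tokens to the previous step's.

    Assumes every count is positive.
    """
    return [cur / prev for prev, cur in zip(input_tokens, input_tokens[1:])]


def first_violation(input_tokens: list[int], limit: float = 1.6) -> Optional[int]:
    """Index of the first step whose input grew by more than `limit`x, if any."""
    for step, ratio in enumerate(growth_ratios(input_tokens), start=1):
        if ratio > limit:
            return step
    return None


if __name__ == "__main__":
    healthy = [2000, 2500, 3000, 3500, 4000]  # +500 tokens per step
    runaway = [2000, 2500, 3000, 9000, 9500]  # step 3 loads a large file

    print([round(r, 2) for r in growth_ratios(healthy)])
    print(first_violation(healthy))
    print(first_violation(runaway))
    # Total input billed across the run: every step re-sends the history,
    # which is where the roughly quadratic growth comes from.
    print(sum(healthy))
```

In the steady run the ratio shrinks toward 1 as the prompt gets longer, so a fixed multiplicative limit mostly fires on sudden jumps; the absolute step limit covers slow but endless loops.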
Putting It All Together: A Wrapped Agent Loop

```python
# pr_review_agent.py
from attribution import new_run, child_step, set_current
from interceptor import AttributedLLMClient
from budget_broker import BudgetBroker
from loop_guard import LoopGuard, CircuitOpen

# raw_anthropic_client, plan_prompt, and review_prompt are assumed to be
# defined elsewhere in the project.
broker = BudgetBroker("redis://budget-broker:6379")
guard = LoopGuard(growth_factor_limit=1.6, absolute_step_limit=20)
llm = AttributedLLMClient(raw_anthropic_client, system_name="anthropic")


def review_pr(repo: str, pr_number: int, triggered_by: str) -> dict:
    run_ctx = new_run(
        tenant_id="platform-team",
        agent_id="pr-reviewer",
        agent_version="prompt-v7-2026-05-09",
        repo=repo,
        pr_number=pr_number,
        triggered_by=triggered_by,
    )
    set_current(run_ctx)
    try:
        # Step 1: plan
        set_current(child_step(run_ctx, "plan"))
        plan = llm.chat(
            model="claude-haiku-4-5",  # cheap model for planning
            messages=plan_prompt(repo, pr_number),
        )
        guard.check(run_ctx.run_id, plan["usage"]["input_tokens"])

        # Step 2..N: execute reviewers per file
        results = []
        for i, file_path in enumerate(plan["files_to_review"]):
            set_current(child_step(run_ctx, f"review.{i}"))
            r = llm.chat(
                model="claude-sonnet-4-7",  # premium for actual review
                messages=review_prompt(file_path),
            )
            guard.check(run_ctx.run_id, r["usage"]["input_tokens"])
            results.append(r)
        return {"plan": plan, "reviews": results}
    except CircuitOpen as e:
        # Loop guard fired: record it, but don't crash the pipeline.
        return {"status": "circuit_open", "reason": str(e)}
    finally:
        guard.end_run(run_ctx.run_id)
```

Three big takeaways:

1. Model routing happens at the call site, not the gateway. The difference between a cheap planner on Haiku and an expensive reviewer on Sonnet is the number one opportunity for cost savings. Without per-call attribution, it will always be invisible in daily totals.
2. Failure events get logged, not bubbled up to a panic alert. Circuit open is an outcome, not an exception. The CI bot writes a comment ("Review skipped — agent circuit opened, run abc123"), and the on-call engineer and the pipeline are notified.
3. Full cost attribution survives failures. The OTel span representing a failed task carries a 'cost.outcome="run_budget_exceeded"' attribute in addition to all the regular call-site attributes, so expensive failures show up in your dashboard too.

Querying the Data

Once every span is attributed, the dashboard queries become short. Three examples, using ClickHouse (or any columnar database that stores your OTel data):

Cost per agent, last 7 days, with cost per call:

```sql
SELECT
    SpanAttributes['cost.agent_id'] AS agent,
    SpanAttributes['cost.agent_version'] AS version,
    sum(toFloat64OrZero(SpanAttributes['cost.usd'])) AS usd_total,
    count() AS calls,
    round(usd_total / calls, 4) AS usd_per_call
FROM otel_traces
WHERE Timestamp > now() - INTERVAL 7 DAY
  AND SpanName = 'gen_ai.chat'
GROUP BY agent, version
ORDER BY usd_total DESC;
```

Top 10 most expensive runs, with the step that did the damage:

```sql
WITH run_costs AS (
    SELECT
        SpanAttributes['cost.run_id'] AS run_id,
        SpanAttributes['cost.agent_id'] AS agent,
        SpanAttributes['cost.repo'] AS repo,
        sum(toFloat64OrZero(SpanAttributes['cost.usd'])) AS usd
    FROM otel_traces
    WHERE Timestamp > now() - INTERVAL 1 DAY
    GROUP BY run_id, agent, repo
    ORDER BY usd DESC
    LIMIT 10
)
SELECT
    r.run_id,
    r.agent,
    r.repo,
    r.usd,
    -- step_id of the single most expensive span in the run
    argMax(t.SpanAttributes['cost.step_id'],
           toFloat64OrZero(t.SpanAttributes['cost.usd'])) AS top_step
FROM run_costs r
JOIN otel_traces t ON t.SpanAttributes['cost.run_id'] = r.run_id
GROUP BY r.run_id, r.agent, r.repo, r.usd
ORDER BY r.usd DESC;
```

Budget rejection rate by tenant (the canary):

```sql
SELECT
    SpanAttributes['cost.tenant_id'] AS tenant,
    countIf(SpanAttributes['cost.outcome'] = 'run_budget_exceeded') AS run_kills,
    countIf(SpanAttributes['cost.outcome'] = 'tenant_daily_exceeded') AS tenant_kills,
    count() AS total_calls
FROM otel_traces
WHERE Timestamp > now() - INTERVAL 1 DAY
GROUP BY tenant
HAVING run_kills + tenant_kills > 0
ORDER BY run_kills DESC;
```

Rejections are evidence that the budgets are doing their job, so a nonzero rejection rate is healthy. A rejection rate above ~5% probably means either that your budgets are too tight or that your agents have a bug; both are worth investigating.

What This Gives You

- A clear answer to "what's causing such a high bill?" The queries above get you there within seconds.
- Per-version A/B testing on cost. Bumping 'agent_version' with every prompt update lets you compare cost per task between v6 and v7.
- Separation of failure modes. Provider unavailability, budget exceedance, and loop guard trips are each represented as their own 'cost.outcome'.
- Vendor independence. Your infrastructure uses one gateway, through which all providers are accessed, instrumented via OpenTelemetry.

What This Doesn't Fix

- Semantic caching (which reduces costs by reusing near-duplicate prompts). This should live alongside the rest of the instrumentation, although it requires its own design.
- Prompt-version A/B for quality, which requires evaluations against ground truth outside of the system. Attributing cost per prompt is not enough.
- Cross-region fallback. The architecture outlined here has a single point of failure in the Redis budget store. Actual implementations may opt for Redis Cluster or regional brokers for each data plane.

The Takeaway

The lesson from the real-world incidents that motivate this architecture is that agents will spend whatever budget they can reach when nothing stops them, and their costs will be attributed in whatever way takes the least effort. Both are choices of system architecture, not monitoring capabilities.
"2025 was meant to be the year agents transformed the enterprise, but the hype turned out to be mostly premature. It wasn't a failure of effort. It was a failure of approach." — Kate Jensen, Head of Americas, Anthropic, TechCrunch, February 2026 Jensen's diagnosis is precise, and it matters that she made it in February 2026 — twelve months after the agent deployment wave crested. The teams that struggled in 2025 weren't short on ambition or resources. They were short on a coherent architecture for deciding what to build, what to buy, and how to govern the seam between the two. The consequences of getting that decision wrong are not hypothetical. Gartner projects that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear value, and inadequate risk controls.1 The same firms that generated a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025 are now facing hard questions from CFOs about what any of it is actually delivering. This piece is not a framework for deciding whether to pursue AI agents. That decision is largely made: Gartner separately forecasts that 40% of enterprise applications will incorporate AI agents by the end of 2026, up from less than 5% today. The question is not if, but how to architect the decision intelligently. 47%40%+1,445%of enterprises already run a hybrid build + buy modelof agentic AI projects forecast to be canceled by end 2027surge in multi-agent system inquiries, Q1 2024–Q2 2025Anthropic State of AI Agents, 2026Gartner, Jun 2025Gartner, Dec 2025 The False Binary That's Costing You Time and Money The dominant framing in every vendor deck and analyst report structures the decision as a binary: buy a platform-native agent (Salesforce Agentforce, Microsoft Copilot, ServiceNow AI) or build custom via APIs and open-source orchestration frameworks like LangGraph or AutoGen. Consulting firms have built entire practices around helping enterprises resolve this choice. 
The Anthropic 2026 State of AI Agents report reveals that framing is empirically obsolete. The plurality of enterprises — 47% — are already combining off-the-shelf agents with custom-built ones. Only 21% rely entirely on pre-built agents; 20% are fully custom via APIs or open-source; the remainder are in various stages of transition. The market has already voted for hybrid. The problem is that almost no one entered that state deliberately. They arrived there by accident: a vendor bought for one use case, a custom build launched for another, and now the two are running in parallel with no shared observability, no governance model for the seam between them, and no principled framework for which future capabilities belong where. The goal of this piece is to give engineering leaders the architecture for converting accidental hybrid into deliberate hybrid — with a layer-by-layer decision framework, a data readiness gate, and a governance model for what happens at the seam. Why Pure-Build Fails at Enterprise Scale Building everything custom is the position most attractive to engineering-led organizations, and the one most likely to produce a Gartner cancellation statistic. The failure mode is not technical capability — it's the three compounding traps that emerge between proof-of-concept and production. Trap 1: The AI Skills Debt Spiral Building and maintaining production-grade AI agents requires a stack of capabilities that most enterprise engineering teams do not have in steady-state: prompt engineers who understand evaluation and regression testing, ML platform engineers who can build and operate inference infrastructure, and reliability engineers with experience in non-deterministic failure modes. The first custom agent typically ships on borrowed talent. The second and third require either significant hiring or the uncomfortable acknowledgment that the build velocity is unsustainable. 
Trap 2: MLOps Debt Accumulation The pattern that should worry engineering leaders most is the one that arrives silently. A team builds a custom agent that performs well in testing — reliable tool calls, clean outputs, low latency. Three months into production, support tickets start arriving: the agent is hallucinating in edge cases, producing contradictory outputs for similar queries, or failing silently when its context window fills up with tool call responses the team didn't account for in their capacity model. By the time this surfaces, the fix requires rearchitecting the memory management layer. Custom agent infrastructure accretes technical debt faster than traditional software. Organizations that build their own orchestration layer instead of adopting an existing framework often discover — six to twelve months post-launch — that a disproportionate share of their AI engineering capacity is consumed by infrastructure maintenance: model versioning conflicts, context window edge cases, tool call logging gaps, fallback chain failures. None of this work surfaces as features. The engineering time it consumes is real; the competitive advantage it generates is not. Trap 3: The Undifferentiated Infrastructure Trap The most insidious failure mode. Organizations pour engineering effort into building capabilities — document parsing, web browsing, code execution — that are already commoditized in the market. The insight that should govern all custom build decisions: only build what gives you a durable competitive advantage. If your competitors can buy the same capability for $50/month per seat, you should probably buy it too and allocate your engineers to the 10% of your agent architecture that actually encodes your differentiation. Why Pure-Buy Fails at Enterprise Scale The buy-everything position is more defensible in initial economics, and more dangerous in a three-year strategy. The failure modes here are structural, not technical. 
| Failure mode | Why it compounds over time |
|---|---|
| Agent washing: The 2024–25 market saw dozens of existing SaaS products relabeled as "AI agents" with minimal underlying capability change. Gartner research identified only approximately 130 genuine agentic AI vendors out of thousands claiming the label.[6] We consistently see this in enterprise evaluations: vendor products described as "agentic" that, on technical review, execute a fixed multi-step workflow with no planning layer, no dynamic tool selection, and no state persistence across sessions. The product is a chatbot with an API. Purchasing decisions made on the basis of vendor demos frequently encounter a production experience that is closer to a dressed-up chatbot. | Vendor roadmap becomes your capability ceiling. When the agent cannot do what your process requires, you either adapt your process to the tool or you build around it. Both options erode the initial ROI case. |
| Vendor lock-in at the orchestration layer: Platform-native agents (Salesforce, ServiceNow, Microsoft) deliver high initial velocity within their ecosystem. Cross-system orchestration (the case that generates most enterprise value) requires either expensive integration work or accepting that your agents cannot coordinate across your full stack. | As multi-agent architectures become table stakes, organizations locked into a single vendor's orchestration model face rebuild costs that were not in the original business case. |
| Data sensitivity constraints: Many enterprise workflows involve data that cannot traverse a vendor's inference infrastructure due to regulatory requirements (GDPR, HIPAA, SOC 2 commitments) or contractual confidentiality obligations. Pre-built agents that require cloud-side processing create compliance exposure that procurement teams discover after, not before, deployment. | The compliance remediation path for a deployed agent that is processing data it shouldn't be is expensive and slow. Prevention requires capability mapping before vendor selection. |
| Customization ceiling: Pre-built agents are optimized for the modal enterprise use case. Organizations with non-standard processes, proprietary data models, or domain-specific reasoning requirements will hit the customization ceiling: the point at which no amount of prompt configuration or workflow configuration can make the agent behave the way the process requires. | Discovering the customization ceiling after a 12-month deployment produces the worst possible outcome: switching costs are now embedded, and the build alternative that was available at the start of the project is now a rebuild. |

The Five-Layer Decision Framework

The insight that resolves the build/buy binary is architectural decomposition. An AI agent is not a monolithic thing; it is a stack of five distinct layers, each with its own differentiation economics, and each warranting a separate build/buy analysis.

What follows is Kellton's layer-by-layer framework. It emerged from observing how enterprises have adopted previous infrastructure technologies (cloud migration, microservices decomposition, API platformization) and applying those adoption patterns to the specific economics of AI agent architecture. In our experience, the recurring failure pattern is not at any single layer but at the junction between layers 2 and 4, where orchestration assumptions collide with domain logic requirements.

For each layer, the verdict reflects the economics most enterprises will encounter. Your specific situation (data sensitivity, engineering capacity, competitive context) may shift any individual recommendation.

| Layer | What it is & why it matters | Default verdict |
|---|---|---|
| 1. Foundation model | The underlying LLM (GPT-4o, Claude, Gemini, Llama). This layer determines reasoning quality, context window, cost per token, and data residency options. Fine-tuning is increasingly rare; prompt engineering and RAG handle most customization needs. | Buy: via API |
| 2. Orchestration | The framework managing agent execution: task decomposition, tool routing, multi-agent coordination, retry logic, and state management. Options range from LangGraph and AutoGen (open-source) to vendor-native (Salesforce Agentforce runtime). This is where lock-in risk is highest. | Hybrid: open-source + config |
| 3. Tool integrations | The connectors exposing external systems (CRM, ERP, databases, APIs, web) to the agent. Generic integrations (Salesforce, Jira, Slack) are commoditized. Custom integrations (proprietary internal systems, legacy databases) require build effort proportional to integration complexity. | Hybrid: buy standard, build custom |
| 4. Domain logic | The business rules, decision heuristics, and domain knowledge encoded into the agent's behavior: underwriting criteria, compliance checks, pricing logic, escalation thresholds. This is your differentiation. This is almost always a build: it is the layer competitors cannot replicate by buying the same vendor. | Build: own your moat |
| 5. Observability | Logging, tracing, evaluation, and monitoring for agent behavior. This includes latency tracking, tool call audits, output quality scoring, and anomaly detection. Mature platforms (LangSmith, Weights & Biases, custom dashboards) exist. Building from scratch here is rarely justified. | Buy: via platform |

The Data Readiness Gate

Before any build or buy commitment, there is a prior question that most organizations skip: Is your data infrastructure ready to support an AI agent at production quality? The majority of agentic AI project failures that Gartner identifies trace back to this omission: teams discover data problems six months into deployment, not six weeks before launch.

The following checklist is not exhaustive, but completing it before committing to architecture will eliminate the most common failure modes. Each item maps to a class of production incidents we have observed in enterprise deployments.
Data Readiness Gate: Complete Before Architecture Commitment

- Data lineage is documented for all agent-accessible sources. The agent must be able to reason about data provenance. Undocumented sources create hallucination risk and compliance exposure.
- Data classification is complete for all workflows the agent will touch. PII, PHI, and contractually confidential data must be identified before tool integration design, not after vendor selection.
- Retrieval quality has been benchmarked, not assumed. RAG-based agents are only as good as their retrieval pipeline. In our experience, teams that skip this step and build agent logic on top of an untested retrieval layer discover precision problems in production that require rearchitecting the pipeline under time pressure. Test precision and recall against representative queries before building agent logic on top.
- Data freshness requirements are mapped to agent decision types. An agent making time-sensitive operational decisions (inventory, pricing, routing) has different freshness requirements than one doing analytical summarization. Mismatches produce silent errors, not loud failures.
- An evaluation dataset exists for the target use case. You cannot assess agent quality without a ground truth dataset. Building one before deployment is non-optional for production readiness.
- Data access controls have been reviewed for the agent's identity. Agents act with the permissions of whatever identity they run as. Ensure least-privilege access is enforced: agents should not have broader data access than the task requires.

Governing the Hybrid: The Seam Is Where Projects Fail

Hybrid AI agent architectures do not fail at the build layer or the buy layer in isolation. They fail at the seam: the interface between custom-built and off-the-shelf components. Governing that seam requires explicit decisions in three areas that most teams leave implicit.
Orchestration governance: define a single orchestration authority. In a hybrid architecture, every agent, whether custom-built or vendor-provided, must register with a single orchestration layer. This is typically an open-source framework (LangGraph, AutoGen) or a custom orchestration service. Vendor-native orchestration that cannot be subordinated to this layer is an integration liability. Establish this constraint before vendor evaluation, not after.

Observability governance: unified trace context across all agents. In a hybrid system, a single user request may traverse both a custom agent and a vendor agent. If these agents emit traces to different observability systems, debugging a production incident requires stitching together logs from two or more platforms, a process that doubles incident response time in our experience. Require all agents to propagate a shared trace context. For vendor agents, this may require wrapping the vendor's API in a thin observability proxy.

Team topology: name a seam owner. The team that built the custom agent does not own the vendor agent, and the vendor agent is not owned by anyone internal. This organizational gap is the most reliable predictor of operational incidents that persist. Name an explicit seam owner (typically a platform engineering team) with responsibility for integration health, shared observability, and vendor relationship escalation. Without a named owner, the seam is ungoverned by default.

The Decision Scorecard: Three Axes, One Hybrid Zone

For each capability you are evaluating, now and as new use cases emerge, score it against three axes. The combination determines whether it belongs in the buy, build, or hybrid zone of your architecture.
| Capability type | Uniqueness | Data sensitivity | Recommended |
|---|---|---|---|
| Generic workflow automation (email triage, meeting summaries, ticket routing) | Low: identical across competitors | Low–Medium | Buy |
| Standard system integrations (Salesforce, Jira, ServiceNow connectors) | Low: commoditized connectors | Medium | Buy |
| Cross-system orchestration (multi-agent coordination across owned and vendor agents) | Medium: architecture is proprietary, tooling is not | Medium | Hybrid |
| Proprietary data retrieval and reasoning (internal knowledge, historical records) | High: data is unique, retrieval logic is unique | High | Hybrid |
| Domain logic and decision rules (underwriting, pricing, compliance, clinical protocols) | High: this is your product | High | Build |
| Regulated data workflows (HIPAA, GDPR-sensitive, contractually confidential processing) | Varies | Critical: cannot leave boundary | Build |

The hybrid zone is not a compromise; it is a deliberate architectural position. Capabilities in the hybrid zone typically use open-source or vendor orchestration frameworks but run on infrastructure you control, with domain-specific configuration and prompting that encodes your proprietary knowledge. The vendor provides the chassis; you provide the engine.

What 2026 Asks of Engineering Leaders Specifically

The strategic question for CTOs and VPs of Engineering is not which agent platform to choose. It is how to build an organizational capability for ongoing hybrid architecture decisions as the agent market continues to evolve, and it will evolve faster in 2026 and 2027 than it did in 2024 and 2025.

The practical work of becoming an organization that can execute on hybrid AI agents is not primarily technical. It is architectural and organizational.
Engineering leaders who succeed in 2026 and 2027 will be the ones who built two things alongside their agents: a repeatable decision process for where new capabilities belong in the stack, and a platform function that owns the seam — the integration layer between what you built and what you bought — with the same rigor they apply to production infrastructure. The Gartner cancellation wave is not going to claim the enterprises that found the best vendor or built the cleverest custom system. It will claim the ones who accumulated technical and governance debt at the seam, accrued shadow AI spend outside any architectural review, and discovered their vendor's customization ceiling twelve months after the decision was irreversible. You now have the framework to avoid that trajectory. The five layers give you a decision surface. The data readiness gate gives you a pre-commitment discipline. The governance model gives you the seam. What you do with it is an execution problem — and execution problems are solvable. A note on Kellton's AI practice: The framework described in this piece is vendor-neutral by design — the layer decomposition and governance model apply regardless of which stack you are running. For organizations where the orchestration and tool integration layer is a bottleneck, Kellton's AI practice has built production hybrid architectures across financial services, healthcare, and logistics environments. KAI, Kellton's enterprise-grade Agentic AI platform launched in 2025, is designed to accelerate work at the orchestration and integration layers while preserving build flexibility where it matters most. The Deliberate Hybrid Is a Choice, Not a Destination Forty-seven percent of enterprises are already hybrid. The question is whether that state was arrived at through deliberate architectural decisions or through accumulated vendor purchases and one-off custom builds that are now running alongside each other without shared governance. 
The failure Kate Jensen described at Anthropic was not a failure of technology. It was a failure of approach. The approach, it turns out, is architecture.
When optimizing Spring Boot integration tests, developers often focus on obvious metrics: total build time, test execution time, CPU usage, memory consumption, or the number of failed tests. These metrics are useful, but they do not always explain why an integration test suite is slow.

One of the most important hidden metrics in Spring Boot integration testing is the number of distinct ApplicationContext instances created during the test run, a topic I cover in another article. Spring’s TestContext framework can cache and reuse an ApplicationContext between test classes, but only if the effective test configuration is the same. If the configuration differs, Spring has to create another context. In large enterprise applications, this can become expensive very quickly.

How should the number of contexts be interpreted?

- If a test suite creates two contexts, is that good?
- If it creates six contexts, is that acceptable?
- If it creates twenty contexts, is that already a design smell?
- And most importantly: where should such a judgment come from?

Spring itself does not define a universal threshold for a “good” or “bad” number of cached ApplicationContext instances. However, the official documentation explicitly points out that a large number of loaded contexts can make a test suite unnecessarily slow. This means the number of contexts is not just an implementation detail. It is a relevant diagnostic signal.

This article explains how I derived a practical interpretation table for a real-world Spring Boot integration test suite and why such a table should be understood as a case-study heuristic, not as a universal Spring Framework rule.

Test Grouping Is a Valid Concept

General testing research supports the idea that tests can be grouped by similarity, cost, coverage, or runtime behavior. This is highly relevant for Spring Boot integration tests.
In Spring Boot integration testing, MergedContextConfiguration may be interpreted as one practical grouping dimension: tests with the same effective Spring configuration belong to the same context group. In this case, similarity means shared Spring test configuration. That does not mean all tests should use the same context. It means that tests should not accidentally create different contexts when they are actually testing under the same architectural conditions.

Spring’s Context Cache as a Framework-Specific Grouping Mechanism

Spring Boot integration tests are not plain unit tests. They often require infrastructure such as dependency injection, database configuration, security configuration, web layer configuration, mock infrastructure, external API clients, messaging components, or tenant-specific setup.

Spring’s TestContext framework handles this through the ApplicationContext. The framework can reuse a context if the effective configuration is the same. The cache key is based on configuration parameters such as configuration classes, active profiles, property sources, context customizers, initializers, and other test context settings. Spring’s documentation describes this context caching mechanism and explains that contexts can be reused when the same unique context configuration is encountered again.

In practice, two tests may look similar to a developer but still produce different contexts if they use different profiles, properties, mocks, or imported configuration classes. Tests with genuinely different infrastructure requirements should normally produce separate context groups. For example, a database-focused test and a test involving an external OData destination may have different infrastructure requirements. In that case, a separate context is not a problem. It reflects a real test configuration group.

The problem starts when every test class introduces a slightly different property, mock, or configuration import without a strong technical reason. Then the number of contexts grows not because the architecture requires it, but because the test suite has configuration drift.

Why Multiple Contexts Can Be Legitimate in Enterprise Applications

Spring Boot itself supports different testing styles. The documentation describes @SpringBootTest for loading the application context through SpringApplication, and it also provides more focused test annotations for specific slices of an application. Spring Boot’s test slices include annotations such as @WebMvcTest, @DataJpaTest, @JsonTest, and others. These annotations intentionally load only selected parts of the application and import different auto-configurations depending on the target slice.

Besides the Spring documentation, many community blogs report that enterprise systems may have separate integration test groups, such as database-focused tests, web/controller tests, security-related tests, and so on. So the goal should be to minimize unnecessary context fragmentation while preserving justified test configuration groups, instead of forcing the entire integration test suite into one ApplicationContext.

From Test Grouping to a Context-Count Heuristic

Based on this reasoning, I used the following interpretation in a case study:

- 1-3 application contexts: excellent context reuse
- 4-8: acceptable if justified
- 10+: should be investigated, as a signal of a fragmented test configuration

Let's discuss the numbers.

1-3: Most integration tests share the same effective configuration. For example:

```text
Context 1: default integration test context
Context 2: database-specific context
Context 3: external-system-specific context
```

Such a structure is usually easy to understand. It suggests that the team has standardized its test profiles, properties, and infrastructure setup.

4-8: This range is still acceptable when each context corresponds to a real configuration group. That is consistent with broader software-testing research, where test suites are not treated as one homogeneous block. They are often optimized, selected, prioritized, or clustered according to meaningful technical criteria such as coverage, execution cost, change relevance, or runtime behavior. For example:

```text
Context 1: default SpringBootTest context
Context 2: database-heavy context
Context 3: external API integration context
Context 4: security-specific context
Context 5: multi-tenant context
Context 6: messaging context
Context 7: no-external-destination context
Context 8: migration-specific context
```

10+: Once the number of contexts reaches double digits, investigation becomes worthwhile. This does not automatically mean the test suite is badly designed. Community articles on Spring test optimization show that a very large enterprise platform with many modules, tenant variants, data stores, messaging systems, and external integrations may legitimately require more contexts. So the 10+ threshold is not firm, but it suggests that the risk of accidental fragmentation becomes higher.

Conclusion

Test grouping is a recognized concept in software-testing research. Large test suites are often optimized through minimization, selection, prioritization, and clustering. These techniques are based on the idea that tests have different costs, purposes, coverage, runtime behavior, and relevance.

For Spring Boot integration tests, context reuse is a framework-specific grouping criterion. In other words, the test-grouping method can be applied to decide which Spring application contexts to create. Tests with the same effective MergedContextConfiguration belong to the same context group and can share the same cached ApplicationContext. Tests with genuinely different infrastructure needs may require different contexts.

Therefore, the goal is not to reduce every enterprise test suite to a single context. The goal is to distinguish between justified test configuration groups and accidental configuration fragmentation. The numbers shown are a practical case-study heuristic, not a universal rule.
But the underlying principle is robust: A small number of well-defined context groups is healthy, but a growing number of slightly different contexts is a performance smell. That principle connects Spring’s TestContext cache mechanism with a broader idea from software-testing research: large test suites should be structured intentionally, not allowed to fragment accidentally.
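The grouping idea can be made concrete with a small model. The sketch below is written in Python purely for illustration and is not Spring's implementation: the test descriptors, `context_key`, and `classify` are hypothetical, and the key uses only three of the inputs the article lists (configuration classes, active profiles, inline properties). It treats ordering as irrelevant for simplicity, which is an assumption of this toy model rather than a statement about Spring.

```python
from collections import defaultdict

# Hypothetical test descriptors:
# (test class, configuration classes, active profiles, inline properties)
TESTS = [
    ("OrderServiceIT", ("AppConfig",), ("test",), ()),
    ("InvoiceServiceIT", ("AppConfig",), ("test",), ()),
    ("OrderRepositoryIT", ("AppConfig", "DbConfig"), ("test", "db"), ()),
    ("ODataClientIT", ("AppConfig",), ("test",), ("odata.enabled=true",)),
    ("PricingIT", ("AppConfig",), ("test",), ("pricing.mode=strict",)),
]


def context_key(config_classes, profiles, properties):
    """Toy stand-in for the effective configuration that identifies a context."""
    return (
        tuple(sorted(config_classes)),
        tuple(sorted(profiles)),
        tuple(sorted(properties)),
    )


def group_by_context(tests):
    """Map each distinct configuration key to the tests that would share it."""
    groups = defaultdict(list)
    for name, config_classes, profiles, properties in tests:
        groups[context_key(config_classes, profiles, properties)].append(name)
    return dict(groups)


def classify(context_count):
    """Apply the case-study heuristic from this article to a context count."""
    if context_count <= 0:
        return "no contexts"
    if context_count <= 3:
        return "excellent reuse"
    if context_count <= 8:
        return "acceptable if justified"
    if context_count == 9:
        return "borderline (not covered by the heuristic)"
    return "investigate"


groups = group_by_context(TESTS)
verdict = classify(len(groups))
```

In this made-up suite, the two service tests share one group, while a single inline property on `PricingIT` is enough to create a group of its own. That is the kind of entry worth questioning: does the test really need that property, or is it configuration drift?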
Most AI agent frameworks treat the model as a black box: you register tools, the model picks one, the tool runs, and the cycle repeats. This pattern is perfect for demos, but a production system needs more machinery. We need to manage context windows, cache API calls, filter sensitive tools by role, and compact the message history to stay within token limits.

I landed on middleware while reviewing issues for deepagents and working through its codebase. That is when I started to wonder what middleware really is in the context of AI agents, and why it matters. This got me thinking: how do other frameworks handle this problem? So I went ahead and installed Pydantic AI, read the CrewAI source, and checked LangChain and AutoGen.

This article compares two frameworks that implement middleware as a primitive, Deep Agents (from LangChain) and Pydantic AI. It also examines the difference between middleware and callbacks and explains why this difference matters when running agents at scale.

What You Will Learn

By the end of this article, you will be able to:

- Distinguish middleware from tool callbacks and event callbacks, and explain why this matters
- Read working code for deepagents' AgentMiddleware and Pydantic AI's AbstractCapability
- Understand the differences between the two frameworks: cross-turn AgentState access, production middleware, and config-driven profiles via HarnessProfile
- Understand why frameworks built on callbacks cannot support patterns that middleware enables

What Is Middleware?

The term "middleware" is often overloaded. In the context of AI agents, it means code that runs before or after every model call, with the ability to read and rewrite the request or response.
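To make that definition concrete, here is a minimal framework-agnostic sketch. It does not use the deepagents or Pydantic AI APIs: `Request`, `Response`, `fake_model`, `compose`, and the two middleware functions are hypothetical names invented for this illustration. It shows one middleware rewriting the request before the model call, another rewriting the response after it, and an event callback that can only observe.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Request:
    system_prompt: str
    messages: tuple[str, ...]


@dataclass(frozen=True)
class Response:
    text: str


Handler = Callable[[Request], Response]


def fake_model(request: Request) -> Response:
    """Stand-in for an LLM call: echoes the prompt and the last message."""
    return Response(text=f"{request.system_prompt} | echo: {request.messages[-1]}")


def add_policy(request: Request, handler: Handler) -> Response:
    """Middleware that rewrites the request before the model sees it."""
    patched = replace(request, system_prompt=request.system_prompt + " Be concise.")
    return handler(patched)


def redact_output(request: Request, handler: Handler) -> Response:
    """Middleware that rewrites the response after the model returns."""
    response = handler(request)
    return replace(response, text=response.text.replace("secret", "[redacted]"))


def _bind(middleware, next_handler: Handler) -> Handler:
    def wrapped(request: Request) -> Response:
        return middleware(request, next_handler)
    return wrapped


def compose(middlewares, model: Handler) -> Handler:
    """Nest middlewares so the first in the list is outermost."""
    handler = model
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


observed = []


def on_model_end(response: Response) -> None:
    """Event callback: it can observe the result but cannot change it."""
    observed.append(response.text)


pipeline = compose([add_policy, redact_output], fake_model)
result = pipeline(Request("You are a reviewer.", ("the secret plan",)))
on_model_end(result)
```

The callback receives the finished response and has no handle on the request or on the `handler`, so it cannot change what the model sees or skip the call. A middleware receives both, which is what makes rewriting, filtering, and cancelling possible.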
What Differentiates Middleware From the Rest

Middleware is different from:

- Tool callbacks – fired when a tool is called, not when the model is called.
- Event callbacks – fire-and-forget notifications that can be observed but not changed.
- Post-processing – wrapping the final output after the agent loop ends.

Middleware sits inside the request/response cycle of every LLM call, which gives it unique capabilities.

Where the Middleware Sits in the Agent Loop

It's the only layer with access to the request before it reaches the model and the response before it reaches the tool executor.

| Capability | Middleware | Tool callback | Event callback |
|---|---|---|---|
| Modify system prompt per call | ✓ | ✗ | ✗ |
| Filter tool list dynamically | ✓ | ✗ | ✗ |
| Transform message history | ✓ | ✗ | ✗ |
| Cancel the model call | ✓ | ✗ | ✗ |
| Track state across turns | ✓ | Partial | ✗ |
| Observe output | ✓ | ✓ | ✓ |

Deep Agents: Middleware as a Composable Hook

Installation:

Shell
pip install deepagents
# Requires Python >=3.10
# Docs: https://docs.langchain.com/oss/python/deepagents/overview

deepagents uses AgentMiddleware, a base class from langchain.agents.middleware.types. Every middleware subclass can override these key hooks (each has an async variant):

Python
class AgentMiddleware:
    def wrap_model_call(
        self,
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelCallResult:
        # Intercept before AND after the model call. Call handler() to execute it.
        return handler(request)

    def before_model(self, state: AgentState, runtime: Runtime) -> dict | None:
        # Runs before the model is called. Can update agent state.
        return None

    def after_model(
        self, state: AgentState, runtime: Runtime
    ) -> dict | None:
        # Runs after the model responds. Can inject new messages into state.
        return None

    def wrap_tool_call(
        self,
        request: ToolCallRequest,
        handler: Callable[[ToolCallRequest], ToolMessage],
    ) -> ToolMessage:
        # Intercept individual tool calls for retry logic, monitoring, or modification.
        return handler(request)

    # async def awrap_model_call(...): ...
    # async versions of each hook also available

The key insight: wrap_model_call receives the full request (messages, tools, settings) and can modify it before passing it to the handler, which is either the next middleware in the stack or the model itself. Multiple middleware compose like nested functions:

Request -> Middleware A -> Middleware B -> Model
Response <- Middleware A <- Middleware B <- Model

Deep Agents middleware composition (innermost = closest to model)

Built-In Middleware Deep Agents Ships

Deep Agents includes several production-grade middleware out of the box:

Python
from deepagents.middleware import (
    FilesystemMiddleware,         # Filesystem read/write tools + permission enforcement
    MemoryMiddleware,             # Injects relevant memories into system prompt each turn
    SkillsMiddleware,             # Injects SKILL.md definitions into system prompt
    SubAgentMiddleware,           # Spawns synchronous subagents as tools
    AsyncSubAgentMiddleware,      # Spawns async background subagents
    SummarizationMiddleware,      # Auto-compacts history when token budget fills
    SummarizationToolMiddleware,  # Exposes compact_conversation as an explicit tool
)

Writing a Custom Middleware

Here is a practical example: a tool-budget middleware that counts the tool messages in the request and appends a warning to the system message when the agent is being "chatty":

Python
from collections.abc import Callable

from langchain.agents.middleware.types import (
    AgentMiddleware, ModelRequest, ModelResponse, ModelCallResult
)
from langchain_core.messages import SystemMessage


class ToolBudgetMiddleware(AgentMiddleware):
    """Warn the model when it has used many tools in a single turn."""

    def __init__(self, budget: int = 5) -> None:
        self.budget = budget

    def wrap_model_call(
        self,
        request: ModelRequest,
        handler: Callable[[ModelRequest], ModelResponse],
    ) -> ModelCallResult:
        # Count tool messages in the conversation (each = one tool call made)
        tool_calls_this_turn = sum(
            1 for m in request.messages if hasattr(m, "tool_call_id")
        )
        if tool_calls_this_turn >= self.budget:
            warning = (
                f"\n\n[Budget notice: you have called {tool_calls_this_turn} tools "
                f"this turn. Prefer to synthesize results rather than calling more tools.]"
            )
            system = request.system_message
            # The warning is only appended when a system message already exists.
            if system:
                new_content = str(system.content) + warning
                request = request.override(
                    system_message=SystemMessage(content=new_content)
                )
        return handler(request)

You can wire this custom middleware alongside built-ins:

Python
from deepagents import create_deep_agent
from deepagents.middleware import FilesystemMiddleware, SummarizationMiddleware
from deepagents.backends import FilesystemBackend

backend = FilesystemBackend(root_dir="/workspace")
summarizer = SummarizationMiddleware(
    model="anthropic:claude-haiku-4-5",
    backend=backend,
    trigger=("fraction", 0.85),
    keep=("fraction", 0.10),
)
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    middleware=[
        FilesystemMiddleware(backend=backend),
        summarizer,
        ToolBudgetMiddleware(budget=5),  # custom
    ],
)

Middleware runs in list order: FilesystemMiddleware wraps first, then SummarizationMiddleware, then your custom one. The first entry is the outermost wrapper; the last entry is innermost, closest to the model.

The Profiles API: Middleware Configuration Without Code

deepagents v0.5.4 added HarnessProfile, which lets you declare middleware changes declaratively: add extra middleware, exclude middleware, or override tool descriptions without touching create_deep_agent call sites.

HarnessProfile merge semantics (additive, model-specific overrides, provider-level):

Python
from deepagents.profiles import HarnessProfile, register_harness_profile

register_harness_profile(
    "anthropic:claude-haiku-4-5",
    HarnessProfile(
        system_prompt_suffix="Be concise. Prefer short answers.",
        excluded_middleware={SummarizationMiddleware},  # Haiku has small context, skip
        extra_middleware=[ToolBudgetMiddleware(budget=3)],
    ),
)

# Now any agent using claude-haiku-4-5 automatically gets this profile applied
agent = create_deep_agent(model="anthropic:claude-haiku-4-5")

You can also load a profile from a YAML file for a config-file-driven deployment:

YAML
# haiku-profile.yaml
system_prompt_suffix: "Be concise. Prefer short answers."
excluded_middleware:
  - SummarizationMiddleware

Python
import yaml
from deepagents.profiles import HarnessProfileConfig, register_harness_profile

with open("haiku-profile.yaml") as f:
    register_harness_profile(
        "anthropic:claude-haiku-4-5",
        HarnessProfileConfig.from_dict(yaml.safe_load(f)),
    )

Pydantic AI: Capabilities as the Closest Parallel

Installation:

Shell
pip install pydantic-ai
# Docs: https://ai.pydantic.dev

Pydantic AI's AbstractCapability is the closest architectural equivalent to the middleware that deepagents uses. Import it from pydantic_ai.capabilities, subclass it, and override any of these lifecycle hooks:

Python
from pydantic_ai.capabilities import AbstractCapability

class MyCapability(AbstractCapability):
    # Run-level hooks
    async def before_run(self, ctx, ...): ...               # Before run starts
    async def after_run(self, ctx, *, result): ...          # Observe/modify result
    async def wrap_run(self, ctx, *, handler): ...          # Full wrap: intercept + resume
    async def on_run_error(self, ctx, *, error): ...        # Handle run-level errors

    # Graph-node hooks
    async def before_node_run(self, ctx, *, node): ...      # Before each graph node
    async def wrap_node_run(self, ctx, *, node, handler): ...
    async def on_node_run_error(self, ctx, *, node, error): ...

    # Model-request hooks: intercept the raw LLM call
    async def before_model_request(self, ctx, request_context): ...  # Modify messages/tools
    async def wrap_model_request(self, ctx, *, request_context, handler): ...
    async def after_model_request(self, ctx, *, request_context, response): ...
    async def on_model_request_error(self, ctx, *, request_context, error): ...

Note on granularity: Pydantic AI's before_model_request hook receives a ModelRequestContext containing messages, model_settings, and model_request_parameters (which includes the tool list). You can return a modified ModelRequestContext to rewrite what gets sent to the model, which is similar to what wrap_model_call allows in deepagents. The key remaining difference is state persistence: these hooks operate within a single run's context, not across agent turns via a shared graph state.

A practical example, wrapping a run to add timing and error context:

Python
import time

from pydantic_ai import Agent
from pydantic_ai.capabilities import AbstractCapability


class TimingCapability(AbstractCapability):
    async def wrap_run(self, ctx, *, handler):
        start = time.monotonic()
        try:
            result = await handler()
            elapsed = time.monotonic() - start
            print(f"Run completed in {elapsed:.2f}s")
            return result
        except Exception as e:
            elapsed = time.monotonic() - start
            print(f"Run failed after {elapsed:.2f}s: {e}")
            raise


agent = Agent(
    "anthropic:claude-sonnet-4-6",
    capabilities=[TimingCapability()],
)

For injecting dynamic content into system prompts, you can use before_model_request to return a modified ModelRequestContext with updated instruction_parts, or use the instructions field and a callable system_prompt at agent construction time.

Pydantic AI vs. Deep Agents Middleware: The Key Differences

| Dimension | deepagents | Pydantic AI |
|---|---|---|
| Hook class | AgentMiddleware | AbstractCapability |
| Hook granularity | Per LLM request, tool call, node, run | Per LLM request, node, and run |
| System prompt injection | via ModelRequest in wrap_model_call | via ModelRequestContext in before_model_request |
| Error hooks | No dedicated hook | on_run_error, on_node_run_error, on_model_request_error |
| State persistence across turns | AgentState dict shared with LangGraph | Per-run context only |
| Tool list access & filtering | ModelRequest.tools in wrap_model_call | via ModelRequestContext.model_request_parameters |
| Cross-framework portability | deepagents / LangGraph only | Pydantic AI only |
| Config-driven (no code) | Yes - HarnessProfile + YAML | No |
| Built-ins included | 7 production middleware | None - user-defined |

The biggest practical difference is that Deep Agents middleware has access to AgentState (the full LangGraph graph state across turns) through after_model, which means middleware can read message history, inject summary nodes, and write back to the state. Pydantic AI capabilities are scoped to a single run's context, so there is no shared graph state across agent turns.

What Other Frameworks Do Instead

LangChain Callbacks (v0.1 Style)

Python
from langchain_core.callbacks.base import BaseCallbackHandler

class MyCallback(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs): ...
    def on_llm_end(self, response, **kwargs): ...

A callback handler cannot modify or cancel the request, and handlers do not compose. This is useful for logging, but not for request transformation.

CrewAI Step Callbacks

Python
from crewai import Crew

def my_step_callback(output):
    print(f"Step completed: {output}")

crew = Crew(agents=[...], tasks=[...], step_callback=my_step_callback)

Step callbacks are called after each task step completes. They have no access to the request, so you cannot modify the list of tools or the system prompt. The limitations are similar to those of LangChain callbacks.
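To make the contrast with callbacks concrete, here is a minimal, framework-free sketch. Everything in it (Request, fake_model, role_filter, add_suffix, compose) is a hypothetical stand-in invented for illustration, not part of LangChain, CrewAI, deepagents, or Pydantic AI. It assumes a request is a small immutable record and a handler is any function that takes a request and returns text.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a model request; not a framework type."""
    system_prompt: str
    tools: tuple[str, ...]
    user_role: str


Handler = Callable[[Request], str]
Middleware = Callable[[Request, Handler], str]


def fake_model(request: Request) -> str:
    """Pretend model: reports which tools it was allowed to see."""
    return f"tools seen: {', '.join(request.tools) or 'none'}"


def on_llm_start(request: Request) -> None:
    """Callback style: it can observe the request, but nothing it does changes it."""
    print(f"[callback] calling model with {len(request.tools)} tools")


def run_with_callback(request: Request) -> str:
    on_llm_start(request)          # fire and forget
    return fake_model(request)     # the model still sees the original request


def role_filter(request: Request, handler: Handler) -> str:
    """Middleware style: rewrite the request before the model sees it."""
    if request.user_role != "admin":
        allowed = tuple(t for t in request.tools if t != "delete_records")
        request = replace(request, tools=allowed)
    return handler(request)


def add_suffix(request: Request, handler: Handler) -> str:
    """A second middleware, to show that the layers nest."""
    request = replace(request, system_prompt=request.system_prompt + " Be concise.")
    return handler(request)


def compose(middleware: list[Middleware], model: Handler) -> Handler:
    """Wrap the model so the first middleware in the list is the outermost layer."""
    handler = model
    for mw in reversed(middleware):
        handler = (lambda req, mw=mw, nxt=handler: mw(req, nxt))
    return handler


request = Request("You are helpful.", ("search", "delete_records"), "viewer")
print(run_with_callback(request))
print(compose([role_filter, add_suffix], fake_model)(request))
```

With the callback, the viewer's request reaches the model with both tools still attached. With the composed middleware, delete_records is removed before the model is called, so there is no failed tool call to pay for afterwards.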
AutoGen v0.4 Message Middleware

AutoGen's message-passing model means you can inject agents into the conversation (e.g., a logging proxy agent), but there's no formal pre- or post-hook around model calls. The closest equivalent is a UserProxy agent that intercepts messages, but it's a peer agent and not a transparent middleware layer.

What the Middleware Gap Can Actually Cost You

- Token budget. When a conversation is approaching the model's limit, you want to summarize old tool outputs before the model call, not after. A callback fires too late to help, so you either run out of tokens or overshoot your token budget.
- Per-user tool filtering. In any organization, different users have different roles and access permissions. Without middleware, it's hard to filter out tools that certain users cannot run. If nothing filters the tool list, the LLM is called, it picks a tool, and the tool call then fails on access permissions. That wastes tokens and an LLM call that could easily have been avoided.
- Prompt caching across providers. Anthropic's prompt caching requires cache_control in the request. AnthropicPromptCachingMiddleware rewrites the messages and tool definitions of every model call to apply cache breakpoints in the right places. Without middleware, this would require changes at every call site.

Conclusion

The middleware gap is why some production agent patterns are trivially simple in Deep Agents and Pydantic AI, but not possible in other frameworks. Summarizing message history before the model call, filtering tools based on roles, and injecting cache-control blocks in the right position are all possible with middleware, but not with a callback that fires after the call completes.
For teams choosing a framework today: if you need to transform what the model sees on every call rather than just observe it, the choice narrows to Deep Agents or Pydantic AI. If you want that transformation to reference or rewrite history spanning multiple turns, deepagents with LangGraph is the only framework that supports this today. Middleware is not the most visible feature of an agent framework, but it is a primitive that sets the ceiling for everything else.
It has been one of those weeks where the diff is bigger than the headline. The headline is short — Codename One now ships modern native themes: an iOS "liquid glass" look and an Android Material 3 look, bundled into the iOS and Android ports, on by default in the Playground, and selectable from a brand new menu in the simulator. The diff behind that headline is several thousand lines across the platform ports, the simulator, the GUI plumbing, and a small army of screenshot tests. What is Codename One? Codename One is an open-source framework for building native iOS, Android, desktop, and web apps from a single Java or Kotlin codebase. Learn more at codenameone.com. The theme behind the work is simple: Codename One should look modern out of the box on every platform we ship to, and it should feel fast. Almost everything in the past week of commits is in service of one of those two goals. Try It Right Now in the Playground The easiest way to see any of this is the Playground. The Playground now defaults to iOS Modern when the device toggle is set to iPhone and Android Material 3 when it is set to Android, in both light and dark mode. No setup, no pom.xml, no build hints — just open the page, drop in any of the standard components, and the modern look is what you get. If the past releases of Codename One looked dated to you, the Playground is where to start. The simulator is the second-easiest place. We will get to that. The New Native Themes For most of Codename One's life, the iOS native theme has been the venerable iOS 7 flat theme, and the Android native theme has been Holo Light. Both still ship — backward compatibility has always been one of our most important goals — but they are no longer where we want a brand new app to start. 
We spent the bulk of this week building two new themes that target current platform aesthetics:

- iOS Modern – Apple system colors (accent #007aff light / #0a84ff dark, grouped-form surfaces, the system separator palette), pill borders for tabs, an iOS-Settings-style MultiButton, CHECK_CIRCLE-style checkbox glyphs, and translucent surfaces for Dialog and TabsContainer so they read as glass-frosted on top of whatever is behind them. It is not a real UIVisualEffectView backdrop (that is a port-side primitive we have not built yet), but the look is much closer to the iOS 26 vibe than anything we have shipped before.
- Android Material 3 – the Material 3 baseline tonal palette (primary #6750a4 light / #d0bcff dark, surface-container tiers, elevated containers approximated tonally because real elevation drop-shadows are still on the to-do list), plus all the Material density and padding choices: Roboto-ish proportions, a top-tab bar with the underline-by-color treatment, the standard square checkbox glyph.

Each theme covers the usual ~25 UIIDs: base (Component, Form, ContentPane, Container), typography (Label, SecondaryLabel, TertiaryLabel, SpanLabel*), buttons (Button, RaisedButton, FlatButton with .pressed and .disabled), text input, selection controls, toolbar, tabs, side menu, list, MultiButton, dialog/sheet, FAB, and all the supporting separator and popup pieces. Both themes have full light and dark coverage. The shipping CSS sources sit in the repo at native-themes/ios-modern/theme.css and native-themes/android-material/theme.css for anyone who wants to read what each UIID is doing.

iOS Modern

This is the ShowcaseTheme capture from the new screenshot suite, run on iOS in light and dark. Same Form, same components, swap Display.setDarkMode(...) and re-resolve.
The form is built like this:

Java
Container row = new Container(BoxLayout.x());
row.add(new Button("Default"));
Button raised = new Button("Raised");
raised.setUIID("RaisedButton");
row.add(raised);
form.add(row);

TextField tf = new TextField("[email protected]");
form.add(tf);

Container toggles = new Container(BoxLayout.x());
CheckBox cb = new CheckBox("Remember me");
cb.setSelected(true);
toggles.add(cb);
RadioButton rb = new RadioButton("Agree");
rb.setSelected(true);
toggles.add(rb);
form.add(toggles);

SpanLabel body = new SpanLabel("Body copy …");

That gives you the full picture on one screen:

- The Default button uses the stock Button UIID. The Raised button uses RaisedButton, which cn1-derives from Button and adds a tinted pill on top of the iOS system blue; that is the iOS Modern accent in both modes.
- The TextField is a single rounded-rect surface with the iOS system gray fill, the same shape Apple uses in Settings.
- CheckBox and RadioButton use the new optional @checkBoxCheckedIconInt / @radioCheckedIconInt theme constants to swap to CHECK_CIRCLE / CHECK_CIRCLE_OUTLINE glyphs: a Reminders-app aesthetic on iOS, while Android keeps the standard square check.
- The SpanLabel body uses the theme's base font and inherits transparent backgrounds, so it never paints over a translucent parent.

The full-screen source is DarkLightShowcaseThemeScreenshotTest.java.

Android Material 3

Same ShowcaseTheme source on Android. The Material 3 baseline palette gives Default the primary container color and Raised the elevated-surface tone, with the dark variant flipping the relationship correctly via the dark color-role mapping. Padding and font sizing follow Material density, which you can see in how compact the same Form lays out compared to iOS.

Translucent Surfaces

This is the DialogTheme capture against the screenshot suite's textured diagonal-stripe backdrop. The backdrop is intentional: it lets reviewers see whether anything that is supposed to be translucent actually is.
The iOS Modern Dialog uses an rgba surface fill (0.78 alpha in light, 0.95 in dark; dark needs more opacity because bright stripes bleed through), and its DialogBody, DialogTitle, ContentPane, and CommandArea sub-UIIDs are transparent, so the rounded corners read cleanly. The same trick is applied to TabsContainer and the iOS MultiButton.

Runtime Palette Overrides

The native theme is meant to be a starting point: you can layer your own palette on top without forking the theme. Above is the PaletteOverrideTheme capture: the base is iOS Modern, but the test layers a magenta palette on top at runtime via UIManager.addThemeProps(...). RaisedButton, FlatButton, the disabled tone, and the body-copy span all pick up the override in both light and dark. The override seam works at the resource-bundle layer, exactly the same mechanism a user theme uses to override the native theme on a real app.

In the Simulator

Three pieces, all live:

- Themes are bundled. The simulator jar-with-dependencies includes both modern themes alongside the four legacy themes (iPhoneTheme, iOS7Theme, androidTheme, android_holo_light) at the root of the jar. The simulator can pick any one of them at runtime without touching the skin repo.
- A new "Native Theme" menu. Right next to the Skins menu, there is now a Native Theme menu with a radio group for the six themes, plus "Auto" and "Use skin's embedded theme". Selecting one writes the simulatorNativeTheme Preference, flips the simulator-reload flag, and disposes the current window so the skin reloader kicks in with the new theme. You can sit on a single skin and flip through every native theme in seconds.
- Build hints know about it. The new nativeTheme, ios.themeMode, and and.themeMode build hints are registered with the simulator's Build Hints UI on launch: labels, types, value lists, descriptions, the lot. (The legacy keys cn1.nativeTheme and cn1.androidTheme are still honored for back-compat.)
Set them in the Build Hints dialog, in codenameone_settings.properties, or via -D system properties; they flow through to the device build and the simulator, both. The "Auto" choice in the Native Theme menu defers to those build hints — set ios.themeMode=modern in your project's settings and "Auto" previews iOS Modern; flip the same project to ios.themeMode=ios7 and "Auto" previews iOS 7. The explicit menu entries (iOS Modern, iOS 7, etc.) override the hints regardless. -Dcn1.forceSimulatorTheme is still honored as the highest-priority override; pick "Use skin's embedded theme" to bypass the framework theme entirely and get whatever the skin shipped with. On Devices The opt-in is the same on iOS and Android. The platform knobs follow a single naming pattern — ios.themeMode and and.themeMode — and accept modern / liquid / auto / ios7 / flat on iOS, modern / material / auto / hololight / legacy on Android. There is a single cross-platform shortcut, nativeTheme=modern, which the iOS builder consults when ios.themeMode is unset and which the Android port reads at runtime as a default for and.themeMode. The legacy aliases cn1.androidTheme and cn1.nativeTheme are still honored for back-compat, as is and.hololight=true. The default for an existing app stays on legacy on every platform. We do not flip a 15-year-old app's look without an opt-in. New apps generated from the initializr ship with nativeTheme=modern, ios.themeMode=modern, and and.themeMode=modern already set in codenameone_settings.properties, so a brand new project starts with the modern themes preselected. The Playground does the same, and Playground project downloads carry the same defaults into the generated codenameone_settings.properties. The HTML5 port has the runtime support for the modern themes, but does not bundle them with user apps yet — that is one of the loose ends we want to close in the next round. 
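The precedence among the simulator's Native Theme menu, the build hints, and the force override can be sketched as a small lookup. This is only a model of the rules as described in this post, written in Python for brevity. The function name, the "skin" and "auto" menu tokens, and the returned labels are hypothetical; it is not the simulator's actual code.

```python
from typing import Mapping, Optional


def resolve_ios_preview_theme(
    menu_choice: str,
    hints: Mapping[str, str],
    forced: Optional[str] = None,
) -> str:
    """Model of the described precedence for the iOS preview theme."""
    # 1. -Dcn1.forceSimulatorTheme is described as the highest-priority override.
    if forced:
        return forced
    # 2. "Use skin's embedded theme" bypasses the framework theme entirely.
    if menu_choice == "skin":
        return "skin-embedded"
    # 3. An explicit menu entry overrides the build hints.
    if menu_choice != "auto":
        return menu_choice
    # 4. "Auto" defers to ios.themeMode, then to the nativeTheme shortcut.
    mode = hints.get("ios.themeMode") or hints.get("nativeTheme")
    # 5. With no hint set, an existing app stays on the legacy look.
    return mode or "legacy"


print(resolve_ios_preview_theme("auto", {"ios.themeMode": "modern"}))
print(resolve_ios_preview_theme("auto", {"nativeTheme": "modern"}))
print(resolve_ios_preview_theme("auto", {}))
```

The sketch returns the hint value untouched. How the accepted values (modern, liquid, auto, ios7, flat) map onto concrete theme files is not spelled out in this post, so it is left out.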
Sticky Headers The other piece of look-and-feel that we want to highlight is StickyHeaderContainer, which finally has a proper home in the framework. It is the iOS-contacts-list / sectioned-material-list component: scroll past a section boundary, and the previous header is replaced by the next one. New this week, the swap is animated. A directional slide moves the outgoing header up on a forward scroll and down on a reverse scroll, or you can pick a cross-fade. Above is a six-frame sweep from the screenshot test — the user scrolls through sections A, B, C, D, E, and the pinned header recolors to whichever section is currently active at the top of the viewport. The API is small. Build the container, register sections with addSection(header, content), configure the transition style and duration, and add it to a Form: Java StickyHeaderContainer sticky = new StickyHeaderContainer(); sticky.setTransitionStyle(StickyHeaderContainer.TRANSITION_SLIDE); sticky.setTransitionDurationMillis(250); for (char c = 'A'; c <= 'Z'; c++) { Label header = new Label("" + c, "StickyHeader"); Container items = new Container(BoxLayout.y()); for (int i = 0; i < 5; i++) { items.add(new Label(c + " entry " + i)); } sticky.addSection(header, items); } TRANSITION_SLIDE is the default. TRANSITION_FADE cross-fades the outgoing header on top of the incoming one. TRANSITION_NONE keeps the prior instantaneous swap if you want it. Issue #4807 for the original request. How We Test This Every screenshot in this post is captured by a test that runs the app on a real iOS device, an Android emulator, and headless Chrome, then diffs each capture against a stored golden image. The diff is the test — if the rendered pixels drift, the run fails. For animations, the test grabs a series of frames over a fixed-duration transition, then composites them into a single index image. 
That is how the dual-appearance shots end up as one side-by-side picture per test, and how the sticky-header animation ends up as a six-frame strip stitched into a GIF.

If you want to read the source, the suite lives at scripts/hellocodenameone/common/src/main/java/com/codenameone/examples/hellocodenameone/tests/.

Bugs and Misc Features From This Week

The theme work was the loudest thing this week, but plenty of other commits landed alongside it:

- SIMD large-allocation fallback. The SIMD path on iOS allocates its working buffers on the stack via alloca for speed. Past a certain buffer size, the stack allocation simply fails: there is not enough stack to give, and the request crashes the process. The fix detects that case and falls back to a regular heap allocation when the request is too large to live on the stack. Small SIMD ops keep the fast alloca path; large ones no longer crash.
- Pluggable AnimationTime clock. Motion, Timeline, MorphAnimation, Image.animate, and Label tickers now all route through a new AnimationTime class that defaults to System.currentTimeMillis() but can be overridden. Tests can drive animations deterministically frame by frame; demos can run in slow motion or fast forward; Motion.slowMotion is no longer the only lever.
- POSIX character classes for non-ASCII letters. [[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed to match anything outside the basic ASCII range: Greek, Cyrillic, CJK ideographs, accented letters, vulgar fractions, currency symbols. They now match the way you would expect, with five regression tests covering the failing cases from the issue.
- Fail-fast on JDK < 11. The simulator and "Run as desktop app" goals fork the JVM with --add-exports=java.desktop/com.apple.eawt=ALL-UNNAMED, which JDK 8 rejects with the unhelpful "Could not create the Java Virtual Machine". Now the Maven plugin checks the runtime JDK version on entry to cn1:run and cn1:debug and aborts with a friendly message naming the detected version, JAVA_HOME, and a pointer to Adoptium. JDK 11 through 25 is the supported runtime range for the simulator, JDK 8 stays the build-time requirement for the core framework, and JDK 8 is still fully supported at runtime for shipped desktop apps; only the simulator / "Run as desktop app" Maven goals require JDK 11+.
- Sheet scrolling, swipe, and animation. Sheet finally drags from the bottom with a real animation instead of snapping in. Issue #4825.
- Picker positioning. Picker got additional button-positioning options and a small batch of coverage tests.
- Playground polish. The Playground moved every Dialog.show(...) to InteractionDialog mode so user code calling Dialog.show does not blow away the editor chrome; it renders into the layered pane instead. Error messages got a substantial overhaul. The preview-resolution syntax expanded so the Playground can pick previews from a much wider set of expressions, with a new harness keeping it honest in CI.
- Deeper refreshTheme(). Form.refreshTheme() has been around forever; it re-resolves the styles on a single Form. The new thing this week is UIManager.getInstance().refreshTheme(), which snapshots the current theme props and theme constants, clears the resolved-style caches, and re-applies the lot. This is what lets the screenshot suite flip dark mode mid-suite and see fresh styles, and what lets a runtime palette override take effect immediately. Most apps will never need to call it directly, since palettes typically don't change at runtime and a Display.setDarkMode(...) call already triggers the right invalidation. It is there if you do change the palette and want the change to stick on the next paint without reloading the theme from disk.
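The pluggable-clock idea from the list above is easy to show in isolation. The following is a language-neutral sketch in Python, not the Codename One AnimationTime API: AnimationClock, ManualTime, and LinearMotion are hypothetical names. It only assumes that an animation asks a replaceable clock for the current time instead of reading the system clock directly.

```python
import time
from typing import Callable


class AnimationClock:
    """Hypothetical pluggable clock; names and methods are illustrative only."""

    def __init__(self) -> None:
        # Default source: wall-clock milliseconds.
        self._source: Callable[[], int] = lambda: int(time.time() * 1000)

    def set_source(self, source: Callable[[], int]) -> None:
        self._source = source

    def now_millis(self) -> int:
        return self._source()


class ManualTime:
    """Test clock that only moves when the test advances it."""

    def __init__(self) -> None:
        self.millis = 0

    def advance(self, delta_ms: int) -> None:
        self.millis += delta_ms

    def __call__(self) -> int:
        return self.millis


class LinearMotion:
    """Interpolates from start to end over duration_ms, using the injected clock."""

    def __init__(self, clock: AnimationClock, start: float, end: float, duration_ms: int) -> None:
        self.clock = clock
        self.start = start
        self.end = end
        self.duration_ms = duration_ms
        self.started_at = clock.now_millis()

    def value(self) -> float:
        elapsed = self.clock.now_millis() - self.started_at
        progress = min(max(elapsed / self.duration_ms, 0.0), 1.0)
        return self.start + (self.end - self.start) * progress


clock = AnimationClock()
manual = ManualTime()
clock.set_source(manual)  # swap the wall clock for the test clock

motion = LinearMotion(clock, start=0.0, end=100.0, duration_ms=250)
frames = []
for _ in range(3):
    manual.advance(125)  # step exactly one "frame" at a time
    frames.append(motion.value())
print(frames)
```

Because the test owns the clock, every frame lands at a known point in the transition, which is what makes a frame-by-frame screenshot comparison repeatable.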
Where This Is Going — and a Thank-You Last week's post was about Codename One feeling faster: corrected pixel densities, principled scroll physics, SIMD on iOS, and accessibility text scaling. This week is the symbiotic other half — Codename One, looking like it belongs on a 2026 phone. Both halves are the same project. There is not much point in shipping a SIMD-accelerated Base64 if the surrounding UI looks like a 2014 app, and there is not much point in shipping a glass-frosted Dialog if the scroll underneath it judders. Neither half is finished. They are both ongoing, and they both depend on community help — bug reports, RFEs, the patient back-and-forth on issue threads where somebody describes a layout problem on an iPhone you do not own. A specific thank you to the people who drove the issues that turned into this week's commits: Thomas (@ThomasH99) filed #4781 (the original "build a liquid glass example" RFE that started this whole effort), #4807 (sticky headers), #4838 (sideways tab swipe), #4841 (the POSIX regex fix), #4819 (picker buttons), and several others; Francesco Galgani (@jsfan3) filed #4825 (sheet swipe animation) and #4824 (light + dark theme by default in initializr); @ddyer0 caught #4811 (the EDT stack overflow) and #4767 (iPad restart Form size); Lucca Biagi (@LuccaPrado) filed #4817 (form creation in IntelliJ). Several of those are RFEs you would not file unless you actually use the framework day-to-day, and that is the kind of feedback that turns into shippable work. We are sitting at 496 open issues as of this post. That is slow but steady progress — the number is moving in the right direction week over week, and the issues that close tend to ship as features or fixes you can see, not as silent triage. If you have a problem, file it. If you have an RFE, file that too. The themes you saw above started as an RFE. 
You can try the new themes today by opening the Playground, by setting nativeTheme=modern (or ios.themeMode=modern / and.themeMode=modern for finer control) in your project's codenameone_settings.properties, or by picking them from the simulator's new Native Theme menu. New projects from the initializr already have them on. The shipping resources are bundled in the iOS and Android ports as of this week.
Threat intelligence becomes operationally valuable when indicator data can be collected continuously, normalized into a consistent schema, and queried fast enough to support enrichment and detection workflows. Standardized exchange formats such as STIX and transport protocols such as TAXII exist specifically to make machine-readable cyber threat intelligence easier to distribute at scale, while preserving enough structure for downstream correlation and context. Operational Requirements That Shape Intelligence Pipelines A threat intelligence pipeline is best treated as data engineering with security-specific constraints: provenance must remain intact, updates and revocations must be representable, and “freshness” should be measurable rather than assumed. STIX is explicitly designed to model cyber threat intelligence using typed objects with attributes, and it supports building richer context by linking objects through relationships rather than shipping flat indicator lists. A practical pipeline design often separates raw ingestion from normalized storage. Raw ingestion preserves the original feed payload for auditability and reversibility, while normalized storage produces documents that are easy to match against telemetry. This split aligns with STIX’s modeling approach, where producers may publish Indicators expressed as STIX patterns and connect them to other objects through relationship constructs, enabling consumers to choose between lightweight atom extraction for matching and graph-style context for analysis. Pulling From TAXII and Other APIs Without Losing Provenance TAXII 2.1, published by OASIS Open, defines a RESTful API and related requirements for TAXII clients and servers to exchange cyber threat information in a scalable manner, with STIX 2.1 support described as mandatory to implement in the TAXII context. 
The IANA media type registration for application/taxii+json also documents that the older application/vnd.oasis.taxii+json name is a deprecated alias, which matters in real integrations because content negotiation and strict header validation vary by server implementation.

TAXII 2.1 also formalized mechanics that directly affect pipeline correctness under load. The CTI documentation notes that TAXII 2.1 added limit and next URL parameters and updated content negotiation and media types, reflecting a move toward pagination patterns that can handle large or rapidly changing datasets more safely than item-based offset pagination. A Python pipeline can either implement paging logic directly or delegate it to a client library; the taxii2client project documents a TAXII 2.1 client API that uses application/taxii+json;version=2.1 for Accept handling and provides an as_pages helper for TAXII 2.1 endpoints that support pagination, including “Get Objects” and “Get Manifest.”

Python
from taxii2client.v21 import as_pages

def iter_taxii_objects(collection, cursor, page_size=2000):
    """Yield STIX objects added to a TAXII 2.1 collection after `cursor`."""
    accept = "application/taxii+json;version=2.1"
    # as_pages follows the server's paging markers, so no token handling lives here.
    for page in as_pages(collection.get_objects, per_request=page_size,
                         added_after=cursor, accept=accept):
        # Pages normally arrive as parsed envelopes; fall back to .json() for raw responses.
        envelope = page if isinstance(page, dict) else page.json()
        for obj in envelope.get("objects", []):
            yield obj

This pattern avoids embedding server-specific pagination tokens into pipeline logic while still enabling incremental collection reads. The cursor argument can be persisted as an ISO-8601 timestamp when the upstream provides a timestamp filter, a model commonly used by TAXII-feed vendors; for example, ESET documents STIX 2.1 feeds delivered via TAXII 2.1 collections and describes an added_after filter parameter for retrieving objects added after a specified timestamp, alongside retention constraints that make incremental pulls operationally necessary.

Not all threat intelligence sources are TAXII-first.
MISP Project documentation describes a REST-accessible STIX export capability and explicitly notes that STIX XML export can be slow and lead to timeouts with large events or collections, while STIX JSON avoids that regime, making JSON a more stable transport choice for high-volume automation. The same ecosystem provides a published OpenAPI specification and a dedicated converter library, misp-stix, which supports bidirectional conversion across STIX versions, including STIX 2.1, and includes features such as pattern parsing and indicator-observable fingerprinting, reducing the cost of maintaining bespoke mapping logic for every upstream source. Normalization Into ECS and STIX-Aware Semantics Normalization is where a pipeline either becomes queryable or becomes another archive. The Elastic Common Schema (ECS) threat field guidance explicitly frames threat.* as the mapping layer that normalizes threat intelligence indicators from many structures into consistent fields, and it links that normalization to detection and enrichment workflows such as indicator match rules. In particular, the guidance calls out normalizing indicators into threat.indicator.* so that disparate feeds can be queried consistently and used to build indicator matching logic without treating every provider as a special case. Atomic indicators benefit from being stored both as “typed value” and as vendor identifiers. ECS defines threat.indicator.type values aligned with cyber observable types and documents threat.indicator.id as a place to store indicator IDs, noting that a STIX 2.x indicator ID is a common approach and that the field can hold multiple values to represent the same indicator across systems. The practical implication is that a pipeline can preserve the upstream STIX identifier, attach a stable provider-local identifier when necessary, and still normalize the matchable indicator value into fields such as threat.indicator.ip or other threat.indicator.* subfields. 
Python
def stix_confidence_to_nlmh(value):
    """Map a STIX 0-100 confidence score onto the None/Low/Med/High scale."""
    if value is None:
        return "Not Specified"
    v = int(value)
    if v == 0:
        return "None"
    if 1 <= v <= 29:
        return "Low"
    if 30 <= v <= 69:
        return "Medium"
    if 70 <= v <= 100:
        return "High"
    return "Not Specified"

def extract_atomic_from_pattern(pattern):
    """Return (type, value) for simple single-value patterns, else (None, None)."""
    p = (pattern or "").strip()
    # Only the first quoted literal is read; composite patterns are not interpreted.
    if "ipv4-addr:value" in p and "'" in p:
        return ("ipv4-addr", p.split("'")[1])
    if "domain-name:value" in p and "'" in p:
        return ("domain-name", p.split("'")[1])
    if "url:value" in p and "'" in p:
        return ("url", p.split("'")[1])
    return (None, None)

def stix_indicator_to_ecs(indicator_obj, provider, fetched_at_iso):
    """Build an ECS threat.indicator.* document, or None if the pattern is not atomic."""
    itype, ivalue = extract_atomic_from_pattern(indicator_obj.get("pattern"))
    if not itype:
        return None
    doc = {
        "@timestamp": fetched_at_iso,
        "event": {"kind": "enrichment", "category": ["threat"], "type": ["indicator"]},
        "threat": {
            "indicator": {
                "type": itype,
                "provider": provider,
                "name": indicator_obj.get("name") or ivalue,
                "description": indicator_obj.get("description"),
                "confidence": stix_confidence_to_nlmh(indicator_obj.get("confidence")),
                "reference": indicator_obj.get("id"),
                "id": [indicator_obj.get("id")],  # upstream STIX identifier preserved
            }
        },
        "labels": {"feed": provider},
    }
    if itype in {"ipv4-addr", "ipv6-addr"}:
        doc["threat"]["indicator"]["ip"] = ivalue
    return doc

The extraction logic deliberately scopes itself to common “atomic” patterns to keep parsing deterministic and to minimize the risk of silently incorrect field derivation. This constraint matches the operational intent of ECS indicator guidance, which emphasizes consistent querying and reuse for indicator match rules after normalization, rather than attempting to fully interpret every possible composite STIX pattern in real time.

Indexing Strategy in Elasticsearch That Avoids Accidental Cost Explosion

Elasticsearch storage decisions are not purely operational preferences because they alter what update patterns are safe.
Data streams consist of one or more hidden backing indices and require a matching index template; every document indexed into a data stream must include an @timestamp field mapped as a date-type (or date_nanos). Data streams are described as a good fit for most time-series use cases, while the documentation explicitly flags that frequent reuse of the same _id expecting last-write-wins can indicate a better fit for an index alias with a write index rather than a data stream. Threat intelligence pipelines often straddle that boundary: indicator state changes and revocations benefit from upsert semantics, while ingestion audits benefit from append-only history. Retention should be tied to query strategy. Elastic Security documentation warns that indicator match rules can consume significant resources and recommends limiting the indicator index query time range to the minimum necessary for coverage, with a default example query of the past 30 days. Even outside an alerting engine, a time-bounded indicator set tends to be operationally safer: it reduces scan cost, makes cache behavior more predictable, and avoids matching against long-expired infrastructure that is no longer relevant. When vendor retention is narrower, such as the 14-day retention window described for some TAXII feeds, the pipeline should persist that constraint as a policy and avoid relying on “full historical replay” as a recovery mechanism. Ingestion-Time Guardrails With Python, Ingest Pipelines, and Bulk Writes Ingest pipelines provide an explicit place to enforce normalization rules at ingest time. Elastic documentation describes ingest pipelines as a sequence of processors that run sequentially to transform data before it is indexed into a data stream or index, supporting operations such as removal, extraction, and enrichment. 
In addition, ingest processors can access ingest metadata under the _ingest key, and Elasticsearch notes that pipelines create _ingest.timestamp by default and that indexing ingest metadata requires explicitly setting it via a processor.

JSON
PUT /_ingest/pipeline/ti_normalize
{
  "description": "Normalize threat intel indicators into ECS threat.indicator.*",
  "processors": [
    { "set": { "field": "event.kind", "value": "enrichment" } },
    { "set": { "field": "event.category", "value": ["threat"] } },
    { "set": { "field": "event.type", "value": ["indicator"] } },
    { "set": { "field": "event.ingested", "value": "{{{_ingest.timestamp}}}" } },
    {
      "fingerprint": {
        "fields": ["threat.indicator.provider", "threat.indicator.type", "threat.indicator.ip"],
        "target_field": "threat.indicator.fingerprint",
        "method": "SHA-256",
        "ignore_missing": true
      }
    }
  ]
}

Bulk ingestion should align with Elasticsearch’s wire format rules. The bulk API documentation describes NDJSON requirements, including that the final line must end with a newline character and that JSON actions and sources should not be pretty printed because newlines are literal delimiters. A Python producer can serialize documents into bulk batches, assign a deterministic _id derived from provider and atomic indicator value to make writes idempotent, and optionally route documents through the normalization pipeline configured above.
Python
import json

def build_indicator_id(provider, itype, ivalue):
    # Deterministic _id: re-sending the same indicator overwrites instead of duplicating.
    return (provider + ":" + itype + ":" + ivalue).lower()

def bulk_index_indicators(es_http, index_name, docs):
    # es_http is any HTTP client wrapper exposing post(path, body=..., headers=...).
    lines = []
    for d in docs:
        ti = d.get("threat", {}).get("indicator", {})
        doc_id = build_indicator_id(ti.get("provider", "unknown"),
                                    ti.get("type", "unknown"),
                                    ti.get("ip", ti.get("name", "unknown")))
        action = {"index": {"_index": index_name, "_id": doc_id, "pipeline": "ti_normalize"}}
        # json.dumps emits single-line JSON by default, as NDJSON requires.
        lines.append(json.dumps(action))
        lines.append(json.dumps(d))
    payload = "\n".join(lines) + "\n"  # the bulk body must end with a newline
    return es_http.post("/_bulk", body=payload,
                        headers={"Content-Type": "application/x-ndjson"})

The NDJSON newline termination is not optional, so building the payload in a way that always emits a trailing newline avoids a class of rejected bulk requests that are hard to diagnose under load. For enrichment use cases, ingest-time join behavior should be applied cautiously: Elastic warns that the enrich processor can impact ingest speed, recommends benchmarking, and explicitly states that it is not recommended for appending real-time data, instead working best with reference data that does not change frequently. This guidance aligns with threat intelligence practice: fast-changing indicators typically work better as a queried dataset, joined at search or detection time, rather than as an ingest-time enrichment applied to every event.

Conclusion

A threat intelligence pipeline built on Python, APIs, and Elasticsearch becomes reliable when it treats schemas, media types, and update semantics as core engineering concerns instead of integration details. STIX and TAXII provide standard object modeling and transport expectations, including content negotiation and pagination mechanics, while ECS provides a target schema that makes indicators consistently queryable and directly usable by matching workflows such as indicator match rules.
High-quality implementations preserve provenance, normalize into threat.indicator.* with STIX-aligned confidence semantics, choose an indexing strategy that matches expected update patterns, and enforce ingestion guardrails through ingest pipelines, simulation, and NDJSON-correct bulk writes.
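One guardrail worth adding around the bulk writes described above: the bulk API reports results per item, so a request can return HTTP 200 while individual documents were rejected. The sketch below is a minimal illustration, not part of the original pipeline; summarize_bulk_response is a hypothetical helper name, and the sample response is hand-written to mirror the documented response shape (an errors flag plus an items array keyed by action type).

```python
def summarize_bulk_response(response):
    """Split a parsed _bulk response body into a success count and a failure list."""
    failures = []
    succeeded = 0
    for item in response.get("items", []):
        # Each item is a single-key dict such as {"index": {...}} or {"create": {...}}.
        action, result = next(iter(item.items()))
        if "error" in result:
            error = result["error"]
            failures.append({
                "action": action,
                "id": result.get("_id"),
                "status": result.get("status"),
                "type": error.get("type"),
                "reason": error.get("reason"),
            })
        else:
            succeeded += 1
    return succeeded, failures


# Hand-written sample (illustrative values only).
sample_response = {
    "took": 12,
    "errors": True,
    "items": [
        {"index": {"_index": "ti-indicators", "_id": "feed-a:ipv4-addr:198.51.100.7",
                   "status": 201}},
        {"index": {"_index": "ti-indicators", "_id": "feed-a:ipv4-addr:not-an-ip",
                   "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [threat.indicator.ip]"}}},
    ],
}

succeeded, failures = summarize_bulk_response(sample_response)
for failure in failures:
    print(failure["id"], failure["status"], failure["type"])
```

Because the document IDs are deterministic, failed items collected this way can be corrected and re-sent without creating duplicates.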
If you've ever inherited a Spark job that runs in 35 minutes and someone asks you to make it faster, you know the routine. You start by checking partition counts, then file sizes, then shuffle stages, then broadcast hints. You find a handwritten OPTIMIZE schedule from 2022, a Z-ORDER on the wrong column, and a cluster sized for last year's data volume. By the time you've made the job fast, you've absorbed three new things to maintain. The next person to inherit it will absorb four. This pattern — call it the hand-tuning treadmill — is what the declarative optimization story on Databricks is trying to break. It's not a single feature; it's a cluster of capabilities that collectively let teams describe what a table should look like and let the engine handle the physical optimizations. What follows is the practical view of those patterns: where they fit, what they replace, and how to migrate without a rewrite weekend. 1. The Hand-Tuning Treadmill: Why Imperative Optimization Doesn't Scale Before getting into the declarative side, it's worth being concrete about what "imperative Spark optimization" actually means in production. The shape is consistent across teams I've audited: Layout decisions frozen on day one. Somebody picks a partition column when the table is created. The data shape changes a year later. Nobody re-partitions because the migration is scary. Query plans drift toward full scans.Maintenance jobs that nobody owns. An OPTIMIZE / Z-ORDER / VACUUM script lives in a notebook scheduled at 3 AM. It runs on a cluster that's slightly mis-sized. When data volume grows, the job runs into the morning workload, and people complain about latency.Cluster sizing as a guess. Worker count is a heuristic from a senior engineer's memory of last year's spike. Half the time it's too big, half the time it's too small, and the cost discussion gets emotional.Hint-driven plans. 
Broadcast hints, repartition hints, coalesce (N) — sprinkled through pipelines to fix yesterday's problem, kept indefinitely because removing them feels risky. None of these are bugs. They're symptoms of the imperative model: the team owns the layout, the maintenance, the sizing, and the plan tuning. In small pipelines, ownership is fine. At scale, it becomes the bottleneck that the team can't outsource. 2. What "Declarative" Means in the Spark Optimization Context Declarative is a word that gets used in two different ways here, and it's worth pulling them apart. Within Lakeflow pipelines (formerly DLT), it means "describe the tables, not the steps" — the engine builds the DAG and runs it. But in the broader optimization story, declarative also means "describe the desired property of the table or workload, not the operations to maintain it": Layout: I want this table clustered by these columns; figure out when and how to re-cluster.Maintenance: I want this table optimized and vacuumed; figure out the schedule.Ingestion: I want all new files in this path picked up exactly once; figure out checkpointing and listing.Quality: These rows must satisfy these expectations; enforce them and report what gets dropped.Compute: I want this query fast and not wasteful; size and scale appropriately. Each one of those bullets corresponds to a piece of the declarative stack. Used together, they replace a remarkable amount of the boilerplate that has historically lived in Spark pipelines. The mental shift: You stop writing operations against the table and start writing properties of the table. The engine becomes the actor; you become the editor. 3. The Declarative Optimization Stack on Databricks The chart below maps each thing the team declares to the engine capability that handles it, ending at the physical Delta table. It's the picture I draw on whiteboards when teams ask, "What's the order to adopt these in?" Figure 1. 
The declarative optimization stack: each user-facing intent at the top maps to a continuous engine behavior, which keeps the underlying Delta tables well-clustered, compacted, and statistically up-to-date — without human intervention. Two things are worth highlighting in this picture. First, every box in the engine row is something that runs continuously, not on a cron — there is no daily "optimization window" anymore. Second, the bottom layer is identical to what you'd get from any well-tuned imperative pipeline: 256 MB Parquet files with current statistics. The declarative path doesn't change what good looks like; it changes who does the work to keep things looking good. 4. Layout: Liquid Clustering Replaces Hand-Maintained Z-ORDER Liquid Clustering is the change with the largest practical impact, because partition-key choices are where most lakehouse pipelines accumulate the most technical debt. The declarative version: you specify the columns the data is most often filtered or joined by, and the engine maintains a layout that supports those access patterns — incrementally, as new data arrives, without a full rewrite. When access patterns change, you change the cluster columns, and the engine re-clusters in the background. Defining Liquid-Clustered Tables SQL -- New table, clustered by the columns most commonly filtered on. -- No more PARTITIONED BY, no more guessing at partition cardinality. CREATE TABLE prod.gold.daily_totals ( account_id STRING, region STRING, ingest_date DATE, daily_total DECIMAL(18,2), txn_count BIGINT ) USING DELTA CLUSTER BY (region, ingest_date, account_id); -- Even better: let the engine pick the clustering columns by -- observing real query patterns over time. CREATE TABLE prod.gold.events_clustered USING DELTA CLUSTER BY AUTO AS SELECT * FROM prod.silver.events; Migrating an Existing Partitioned/Z-ORDER Table SQL -- Convert a legacy partitioned table to liquid clustering. 
-- Existing data files are not rewritten immediately; the engine
-- rebalances incrementally on subsequent writes + maintenance.
ALTER TABLE prod.silver.transactions CLUSTER BY (account_id, ingest_date);

-- Force the first clustering pass for a freshly converted table
OPTIMIZE prod.silver.transactions FULL;

Why this matters: the recurring 2 AM Slack thread of "can we re-partition this table?" goes away. Layout becomes a property you change with one DDL statement, not a multi-week rewrite project.

5. Maintenance: Predictive Optimization Replaces Cron-Driven OPTIMIZE/VACUUM

Predictive optimization is the part that retired the most legacy code in the pipelines I've migrated. Once enabled at the catalog or schema level, the engine monitors each table's read and write patterns and decides on its own when to compact files, re-cluster, vacuum, and refresh statistics. The big win isn't the operations themselves (the imperative pipeline could already run those); it's that the timing is observation-driven, not schedule-driven. Tables that get heavy ingestion get more frequent maintenance. Cold tables get left alone.

SQL
-- Turn it on at the catalog level once; new tables inherit.
ALTER CATALOG prod ENABLE PREDICTIVE OPTIMIZATION;

-- Or at the schema level for a phased rollout
ALTER SCHEMA prod.gold ENABLE PREDICTIVE OPTIMIZATION;

-- Inspect recent maintenance operations on a given table
SELECT operation,
       operationMetrics['numAddedFiles'] AS files_added,
       operationMetrics['numRemovedFiles'] AS files_removed,
       operationMetrics['numAddedBytes'] AS added_bytes,
       timestamp
FROM (DESCRIBE HISTORY prod.gold.daily_totals)
WHERE userMetadata IS NULL -- no user-supplied commit metadata; a rough proxy for engine-driven runs
  AND operation IN ('OPTIMIZE', 'VACUUM START', 'VACUUM END')
  AND timestamp >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY timestamp DESC;

What you should delete after enabling this: the nightly notebook that runs OPTIMIZE on every table in a schema, the VACUUM cron job, the ANALYZE TABLE wrapper, and the alerting that wakes someone up when those jobs run long. None of them are needed anymore, and leaving them on creates duplicate work that the engine and the cron will fight over.

6. Ingestion: Auto Loader Replaces Listing-Based File Detection

Auto Loader is the declarative answer to the perennial "which files have we processed already?" problem. Instead of listing a directory, comparing it to a state file, and figuring out the new bits, you describe the source location and the format and let the engine maintain its own incremental state. It can use cloud-native event notifications (S3 events, ADLS notifications) or efficient directory listing, and the checkpoint is just another piece of state the engine owns.

Python
from pyspark.sql.functions import current_timestamp

# Streaming ingest from S3 with schema inference + evolution.
# Replaces hand-maintained checkpointing, listing logic, and
# whatever file-tracking table the team built two years ago.
(spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaLocation", "s3://acme-checkpoints/txns_schema") .option("cloudFiles.schemaEvolutionMode", "addNewColumns") .load("s3://landing/txns/") .withColumn("_ingest_ts", current_timestamp()) .writeStream .format("delta") .option("checkpointLocation", "s3://acme-checkpoints/txns_writer") .trigger(availableNow=True) # batch-style; runs to completion .toTable("prod.bronze.txns")) Two notes from production. First, schemaEvolutionMode is the option that prevents the silent-data-loss class of bugs when partner schemas change; pick the policy explicitly rather than letting it default. Second, trigger(availableNow=True) gives you batch ergonomics on a streaming source — the job runs until it has consumed everything and exits, which is what most teams actually want for daily ingestion. 7. Transforms and Quality: Declarative Pipelines Replace Bare Spark + External DQ The final piece is the transformation layer. Lakeflow pipelines (the rebrand of Delta Live Tables) let you declare each table as a Python or SQL definition, and add expectations as a first-class concept. The engine derives the DAG from the dependencies and enforces the expectations on every write — the data quality framework, the lineage layer, and the orchestration glue collapse into a single artifact. 
Python import dlt from pyspark.sql.functions import sum as _sum, col @dlt.table( name="silver_txns", table_properties={ "delta.enableChangeDataFeed": "true", "delta.tuneFileSizesForRewrites": "true", }, cluster_by=["account_id", "ingest_date"], ) @dlt.expect_or_drop("non_null_amount", "amount IS NOT NULL") @dlt.expect_or_fail("valid_currency", "currency IN ('USD','EUR','GBP')") @dlt.expect("unique_txn", "txn_id IS NOT NULL") def silver_txns(): return (dlt.read_stream("bronze_txns") .dropDuplicates(["txn_id"])) @dlt.table(name="gold_daily_totals") def gold_daily_totals(): return (dlt.read("silver_txns") .groupBy("ingest_date", "account_id", "region") .agg(_sum("amount").alias("daily_total"))) The decorators do four things at once: define the table, declare its layout (cluster_by), declare its quality rules, and let the engine infer that gold_daily_totals depends on silver_txns from the dlt.read call. There is no DAG file. There is no separate Great Expectations suite. Lineage is generated for free in Unity Catalog, including column-level edges. If you want to query how the expectations have been performing — useful for SLO dashboards or alerting — the event log surfaces it directly: SQL -- Pass / fail / drop counts per expectation, last 24 hours SELECT flow_name, details:flow_progress.data_quality.expectations[0].name AS exp_name, details:flow_progress.data_quality.expectations[0].passed_records AS passed, details:flow_progress.data_quality.expectations[0].failed_records AS failed, details:flow_progress.data_quality.expectations[0].dropped_records AS dropped, timestamp FROM event_log("<pipeline-id>") WHERE event_type = 'flow_progress' AND timestamp >= current_timestamp() - INTERVAL 1 DAY ORDER BY timestamp DESC; 8. Putting It Together: Where to Start, What to Measure Adopting all of this at once is a recipe for pain. 
The order I've seen work, and a small set of metrics to verify the change is paying off:

| Step | Adopt | Retire | Verify with |
| 1 | Predictive optimization at schema level | Nightly OPTIMIZE / VACUUM jobs | Reduction in maintenance-cluster cost |
| 2 | Liquid clustering on top 5 tables | Static partitioning + Z-ORDER | p95 query latency on the same workloads |
| 3 | Auto Loader for 1-2 ingestion pipelines | Custom file-tracking + listing logic | End-to-end data freshness |
| 4 | Lakeflow pipelines for new pipelines only | External DQ + DAG glue (for new work) | Lines of pipeline code per table |
| 5 | Serverless compute for SQL warehouses + DLT | Hand-sized job clusters | Cost-per-query, scale-up time |

What you do not need to migrate: imperative pipelines that already work and aren't growing. Declarative patterns are about new work and high-pain hot spots, not a heroic rewrite of every notebook ever shipped.

9. Honest Limitations and Where Imperative Still Wins

Four places where the declarative model still bites, worth knowing before you commit:

Procedural logic still belongs in Jobs. If your pipeline is really a sequence of API calls with branching error handling, that's a Lakeflow Job (or external code), not a declarative table. Don't try to bend dlt around it.
Predictive optimization needs observation time. On a table that's a week old, the engine hasn't seen enough patterns to make great decisions. For tables under heavy initial load, an explicit OPTIMIZE FULL after the first big ingest still helps.
Cluster-by-column choice still matters. CLUSTER BY AUTO is great for stable workloads with predictable filters. For tables whose access pattern is genuinely heterogeneous across teams, an explicit cluster-by based on the dominant query is usually faster.
Hint-driven escapes are still allowed. If a particular query benefits from a /*+ BROADCAST(t) */ hint and AQE isn't catching it, the hint is fine. Just keep them rare and document why.
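The "verify with" metrics in the adoption table from section 8 are only useful if they are computed the same way before and after a change. As a minimal sketch, the snippet below compares p95 query latency for the same workload across two periods using the nearest-rank method. The function names and the sample latencies are hypothetical; in practice the samples would come from whatever query-history source the team already trusts.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_change(before, after, pct=95):
    """Compare one percentile across two sample sets; a negative change_pct means faster."""
    b = percentile(before, pct)
    a = percentile(after, pct)
    return {"before": b, "after": a, "change_pct": round((a - b) / b * 100, 1)}


# Hypothetical query durations in seconds for the same dashboard workload.
before_s = [12.0, 14.5, 13.2, 40.1, 15.0, 13.8, 14.1, 38.7, 12.9, 13.5]
after_s = [9.1, 9.8, 10.2, 11.5, 9.4, 9.9, 10.8, 12.3, 9.6, 10.1]

result = latency_change(before_s, after_s)
print(result)
```

Comparing a tail percentile rather than the mean keeps the check sensitive to the slow full-scan queries that layout changes are meant to remove.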
Conclusion The declarative optimization story isn't a single feature you toggle — it's a quiet shift in who owns the boring parts of a Spark pipeline. Layout, maintenance, ingestion bookkeeping, plan tuning, cluster sizing, data quality enforcement: every one of those was traditionally a thing the team owned and paid for in toil. The current Databricks stack lets you express each as an intent and let the engine handle the operations underneath. Adopt them in order, retire what they replace, and the optimization treadmill slows from a daily concern to a quarterly review. That's the actual win, and it's the reason the declarative paradigm has gone from a Lakeflow detail to the default mental model for new pipelines on Databricks.
This article provides a comprehensive guide to achieving zero-downtime deployments for Java-based applications on Kubernetes. We cover deployment strategies, Kubernetes primitives, Java-specific considerations, session state handling, database migrations, traffic shifting techniques, CI/CD pipelines, GitHub Actions, Jenkins with automated rollbacks, observability (Prometheus, Grafana, Jaeger), Helm/ArgoCD examples, testing strategies (canary analysis, chaos, smoke tests), and troubleshooting.

Deployment Strategies

Kubernetes offers several strategies for deploying new versions without downtime:

Rolling Update
Incrementally replace old pods with new ones, maintaining availability. The Kubernetes Deployment object uses rolling updates by default. You can control maxUnavailable and maxSurge to tune the rollout.

Blue-Green Deployment
Run two separate environments: Blue is the current version and Green is the new one. Only one serves live traffic at a time. Once the Green version is verified, switch the Service or Ingress to point at Green, then scale down Blue. This allows instant rollback by redirecting traffic back to Blue. Argo Rollouts defines a blue/green strategy with an active and preview Service. Traffic flows only to the active version until promotion.

Canary Deployment
Gradually shift a small percentage of traffic to the new version. Start with a few pods of v2, monitor, then incrementally increase. Tools like Istio or Argo Rollouts can control traffic weights. For instance, sending 10% of traffic to v2 can be done by running 9 v1 pods and 1 v2 pod behind the same Service. Argo defines a canary rollout with setWeight steps and pauses for analysis.

Shadow/Mirroring
The new version receives a copy of live requests for testing under real load, but its responses are not returned to users. This is low risk but does not assist in rollback decisions since users don’t see the new behavior.

Kubernetes Primitives for Zero Downtime

Deployment
A Deployment naturally performs rolling updates.
By default, it creates a new ReplicaSet and scales it up while scaling down the old one, at a pace controlled by maxUnavailable and maxSurge. This ensures some pods always serve traffic. To use blue/green, you would deploy two separate Deployments (e.g., app-blue, app-green) and switch Services.

Service and Ingress
A Service fronts pods. For blue/green, you can point a single Service at either the blue or green pods. Ingress can also switch between backend services. For example, label selectors can be adjusted to redirect traffic from the blue pods to the green pods.

PodDisruptionBudget
Ensures a minimum number of pods stay running during voluntary disruptions such as node drains and cluster upgrades. For instance, setting minAvailable: 1 ensures at least one pod remains while nodes are drained, which avoids complete downtime during maintenance. Deployment rolling updates are not governed by the PodDisruptionBudget; their pace is set by maxUnavailable and maxSurge.

Horizontal Pod Autoscaler (HPA)
Scales pods based on CPU/memory or custom metrics, automatically adjusting the replica count of a workload to match demand. An HPA can be attached to the Deployment so that if traffic spikes during a rollout, new pods will be created to handle the load. Example:

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

Liveness and Readiness Probes
Critical for zero downtime. A liveness probe checks if the app is alive; if it fails, Kubernetes restarts the container. A readiness probe tells Kubernetes whether the app is ready to serve traffic. During startup or shutdown, the readiness probe should fail, causing the pod to be removed from the Service's endpoints. Spring Boot Actuator provides /actuator/health for this.
In the Kubernetes manifest:

YAML
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Spring Boot exposes the health/liveness and health/readiness groups automatically when it detects that it is running on Kubernetes; in other environments they can be enabled with management.endpoint.health.probes.enabled=true. Quarkus and Micronaut have similar health endpoints.

Spring Boot supports graceful shutdown by setting server.shutdown=graceful and tuning spring.lifecycle.timeout-per-shutdown-phase. This causes the embedded server (Tomcat, Jetty, or Undertow) to stop accepting new requests and wait up to the timeout for active requests to complete.

Java
import org.springframework.context.SmartLifecycle;
import org.springframework.stereotype.Component;

@Component
public class ShutdownListener implements SmartLifecycle {
    private volatile boolean running = false;

    @Override
    public void start() {
        running = true;
    }

    @Override
    public void stop() {
        // Release resources here; called while the application context shuts down.
        running = false;
    }

    @Override
    public boolean isRunning() {
        return running;
    }
}

Quarkus provides graceful shutdown configuration. By setting quarkus.shutdown.timeout=10s, Quarkus will wait up to 10 seconds for current requests to finish before exiting. You can annotate a bean method with @Shutdown to run cleanup code. Micronaut has @EventListener for ShutdownEvent:

Java
import io.micronaut.context.event.ShutdownEvent;
import io.micronaut.runtime.event.annotation.EventListener;
import jakarta.inject.Singleton;

@Singleton
public class ShutdownBean {
    @EventListener
    void onShutdown(ShutdownEvent event) {
        // Cleanup code goes here.
    }
}

Kubernetes Hooks
You can use a preStop hook in the Deployment spec to run a command before the container receives SIGTERM.

YAML
# Container level
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
# Pod level (spec.terminationGracePeriodSeconds)
terminationGracePeriodSeconds: 30

The grace period (default 30s) should be tuned to let the app finish. The Kubernetes documentation describes the sequence: the pod enters Terminating, the preStop hook runs, the container receives SIGTERM, and if it is still running when terminationGracePeriodSeconds expires (a countdown that includes the time spent in preStop), it receives SIGKILL.

JVM Tuning
Set -XX:+ExitOnOutOfMemoryError so the JVM exits instead of hanging after an OutOfMemoryError. Tune thread pools so they drain quickly. Monitor GC pause times, and consider a low-latency collector such as ZGC or Shenandoah to minimize pauses before shutdown.
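The termination sequence above implies a simple time budget: the preStop delay and the application's own shutdown timeout both have to fit inside terminationGracePeriodSeconds, or the container is killed mid-drain. The sketch below is a rough check of that arithmetic; the function name and the two-second safety margin are assumptions made for illustration, while the 30-second values are the Kubernetes grace-period default and the Spring Boot default for spring.lifecycle.timeout-per-shutdown-phase.

```python
def shutdown_budget(grace_period_s, pre_stop_s, app_shutdown_timeout_s, safety_margin_s=2):
    """Return the spare seconds left in the grace period; negative means SIGKILL risk."""
    return grace_period_s - (pre_stop_s + app_shutdown_timeout_s + safety_margin_s)


# Kubernetes default grace period (30s), the 5s preStop sleep from the example,
# and Spring Boot's default per-phase shutdown timeout (30s).
spring_defaults = shutdown_budget(grace_period_s=30, pre_stop_s=5, app_shutdown_timeout_s=30)

# Same pod settings with a 10s application timeout, as in the Quarkus example.
shorter_timeout = shutdown_budget(grace_period_s=30, pre_stop_s=5, app_shutdown_timeout_s=10)

for label, spare in [("spring defaults", spring_defaults), ("10s timeout", shorter_timeout)]:
    verdict = "ok" if spare >= 0 else "raise the grace period or shorten the timeout"
    print(f"{label}: {spare}s spare -> {verdict}")
```

With the defaults left in place the budget is negative, which is why the grace period and the application timeout should be tuned together rather than independently.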
Session and State Handling

To maintain zero downtime when pods switch:

Stateless services: Best practice is to keep services stateless. Store session state or user data in an external store, such as Redis or a database. This way, any pod can handle any request, and pods can be replaced without losing the user session.
Sticky sessions: If an app uses in-memory sessions, you can enforce sticky sessions in one of two ways.
  Service affinity: Set sessionAffinity: ClientIP on the Service. Kubernetes routes requests from the same client IP to the same pod.
  Ingress affinity: Use Ingress annotations to bind a user’s requests to one pod.
  However, sticky sessions introduce risk (sessions are lost when their pod is replaced) and are not well suited to autoscaling.
StatefulSets: For true stateful workloads, use a StatefulSet with stable identities. StatefulSets pair pods with PersistentVolumes but do not provide zero downtime by themselves.

GitHub Actions CI/CD Pipeline

A zero-downtime deployment workflow:

YAML
name: Deploy
on:
  push:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # lets GITHUB_TOKEN push to ghcr.io
    outputs:
      image: ${{ steps.tag.outputs.image }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v3
        with: { distribution: 'temurin', java-version: '17' }
      - name: Build
        run: mvn clean package -DskipTests
      - name: Docker Build & Push
        run: |
          docker build -t ghcr.io/myorg/myapp:${{ github.sha }} .
          echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
          docker push ghcr.io/myorg/myapp:${{ github.sha }}
      - name: Set image tag
        id: tag
        run: echo "image=ghcr.io/myorg/myapp:${{ github.sha }}" >> "$GITHUB_OUTPUT"
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with: { path: manifests }
      - name: Install kubectl
        uses: azure/setup-kubectl@v3
      # Cluster credentials (kubeconfig) are assumed to be configured for this job.
      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/myapp-deployment myapp=${{ needs.build.outputs.image }}
          kubectl rollout status deployment/myapp-deployment

This workflow builds the image, pushes it, and updates the deployment. The rollout status command waits for all new pods to become ready.
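If the new pods never pass their readiness checks, the rollout stalls: the old pods that are still running keep serving traffic, kubectl rollout status exits with a non-zero code once the Deployment's progress deadline is exceeded, and the job fails. Kubernetes does not revert a stalled Deployment on its own, so an automated rollback needs an explicit kubectl rollout undo.

Conclusion

Zero-downtime deployment on Kubernetes combines careful architecture with automation: using rolling updates and progressive strategies, ensuring graceful shutdown and health checks in your Java apps, externalizing state, managing database changes, and orchestrating everything with CI/CD pipelines. Kubernetes primitives like Deployments, Services, Probes, and HPA, along with tools like Istio or Argo Rollouts, provide the building blocks.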
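The automated rollback mentioned in the introduction can be sketched as a thin wrapper around kubectl: update the image, wait for the rollout with a timeout, and undo it if the wait fails. This is a minimal illustration rather than the article's own pipeline; deploy_with_rollback and its parameters are hypothetical names, and the run argument exists only so the logic can be exercised without a cluster.

```python
import subprocess


def deploy_with_rollback(deployment, container, image, timeout_s=120, run=subprocess.run):
    """Update a Deployment's image and revert it if the rollout does not finish in time."""
    target = f"deployment/{deployment}"
    # Fail fast if the image update itself is rejected.
    run(["kubectl", "set", "image", target, f"{container}={image}"], check=True)
    # rollout status exits non-zero when the rollout fails or the timeout elapses.
    status = run(["kubectl", "rollout", "status", target, f"--timeout={timeout_s}s"])
    if status.returncode == 0:
        return "deployed"
    # Return to the previous revision; old pods keep serving in the meantime.
    run(["kubectl", "rollout", "undo", target], check=True)
    return "rolled back"
```

A CI step could call deploy_with_rollback("myapp-deployment", "myapp", image_ref) and fail the job when the result is "rolled back", so a bad release is both reverted and visible.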
Justin Albano
Software Engineer,
IBM