Just as humans have always preferred co-existing and communicating ideas, looking for and providing pieces of advice from and to their fellow humans, applications nowadays find themselves in the same situation, where they need to exchange data in order to collaborate and fulfill their purposes. At a very high level, applications’ interactions are carried out either conversationally (the case of REST APIs), where the information is exchanged synchronously by asking and responding, or asynchronously via notifications (the case of event-driven APIs), where data is sent by producers and picked up by consumers as it becomes available and they are ready. This article is an analysis of the synchronous communication between a client and a server via REST, with a focus on the client part. Its main purpose is to present how a Spring REST API client can be implemented, first using the RestTemplate, then the newer RestClient and seamlessly accomplish the same interaction. A Brief History RestTemplate was introduced in Spring Framework version 3.0, and according to the API reference, it’s a “synchronous client to perform HTTP requests, exposing a simple, template method API over underlying HTTP client libraries.” Flexible and highly configurable, it’s been for a long time the best choice when a fully-fledged synchronous but blocking HTTP client was implemented as part of a Spring application. As time has passed, its lack of non-blocking capabilities, the use of the old-fashioned template pattern, and the pretty cumbersome API significantly contributed to the emergence of a new, more modern HTTP client library, one that may also handle non-blocking and asynchronous calls. Spring Framework version 5.0 introduced WebClient, “a fluent, reactive API, over underlying HTTP client libraries.” It was especially designed for the WebFlux stack and by following the modern and functional API style, it was much cleaner and easier to use by developers. Nevertheless, for blocking scenarios, WebClient‘s ease of use benefit came with an extra cost – the need to add an additional library dependency into the project. Starting with Spring Framework version 6.1 and Spring Boot version 3.2, a new component is available — RestClient — which “offers a more modern API for synchronous HTTP access.” The evolution has been quite significant, developers nowadays may choose among these three options, RestTemplate, WebClient and RestClient, depending on the application needs and particularities. Implementation As stated above, the proof of concept in this article experiments with both RestTemplate and RestClient, leaving the WebClient aside as the communication here is conversational, that is synchronous. There are two simple actors involved, two applications: figure-service – the server that exposes the REST API and allows managing Figuresfigure-client – the client that consumes the REST API and actually manages the Figures Both are custom-made and use Java 21, Spring Boot version 3.5.3, and Maven version 3.9.9. A Figure is a generic entity that could denote a fictional character, a superhero, or a Lego mini-figure, for instance. The Server figure-service is a small service that allows performing common CRUD operations on simple entities that represent figures. As the focus in this article is on the client, server characteristics are only highlighted. The implementation is done in a standard, straightforward manner in accordance with the common best practices. 
The service exposes a REST API to manage figures: read all – GET /api/v1/figuresread one – GET /api/v1/figures/{id}read a random one – GET /api/v1/figures/randomcreate one – POST /api/v1/figuresupdate one – PUT /api/v1/figures/{id}delete one – DELETE /api/v1/figures/{id} The operations are secured at a minimum with an API key that shall be available as a request header Plain Text "x-api-key": the api key Figure entities are stored in an in-memory H2 database, described by a unique identifier, a name, and a code, and modelled as below: Java @Entity @Table(name = "figures") public class Figure { @Id @GeneratedValue private Long id; @Column(name = "name", unique = true, nullable = false) private String name; @Column(name = "code", nullable = false) private String code; ... } While the id and the name are visible to the outside world, the code is considered business domain information and kept private. Thus, the used DTOs look as below: Java public record FigureRequest(String name) {} public record FigureResponse(long id, String name) {} All server exceptions are handled generically in a single ResponseEntityExceptionHandler and sent back to the client in the following form, with the corresponding HTTP status: JSON { "title": "Bad Request", "status": 400, "detail": "Figure not found.", "instance": "/api/v1/figures/100" } Java public record ErrorResponse(String title, int status, String detail, String instance) {} Basically, in this service implementation, a client receives either the aimed response (if any) or one highlighting the error (detailed at point 4), in case there is a service one. This resource contains the figure-service source code, it may be browsed for additional details. The Client Let’s assume figure-client is an application that uses Figure entities as part of its business operations. As these are managed and exposed by the figure-service, the client needs to communicate via REST with the server, but also to comply with the contract and the requirements of the service provider. In this direction, a few considerations are needed prior to the actual implementation. Contract Since the synchronous communication is first implemented using a RestTemplate, then modified to use the RestClient, the client operations are outlined in the interface below. Java public interface FigureClient { List<Figure> allFigures(); Optional<Figure> oneFigure(long id); Figure createFigure(FigureRequest figure); Figure updateFigure(long id, FigureRequest figureRequest); void deleteFigure(long id); Figure randomFigure(); } In this manner, the implementation change is isolated and does not impact other application parts. Authentication As the server access is secured, a valid API key is needed. Once available, it is stored as an environment variable and used via the application.properties into a ClientHttpRequestInterceptor. According to the API reference, such a component defines the contract to intercept client-side HTTP requests and allows implementers to modify the outgoing request and/or incoming response. For this use case, all requests are intercepted, and the configured API key is set as the x-api-key header, then the execution is resumed. 
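Although the article does not show it, the externalized configuration backing these pieces might look like the sketch below. The property keys match the @Value placeholders used by the client code later on, while the environment variable names and the default URL (taken from the request logs shown further down) are assumptions for illustration only.

Properties
# figure-client application.properties (sketch; environment variable names are assumptions)
figure.service.url=${FIGURE_SERVICE_URL:http://localhost:8082/api/v1/figures}
figure.service.api.key=${FIGURE_SERVICE_API_KEY}

The AuthInterceptor below reads figure.service.api.key via @Value and attaches it as the x-api-key header to every outgoing request.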
Java @Component public class AuthInterceptor implements ClientHttpRequestInterceptor { private final String apiKey; public AuthInterceptor(@Value("${figure.service.api.key}") String apiKey) { this.apiKey = apiKey; } @Override public ClientHttpResponse intercept(HttpRequest request, byte[] body, ClientHttpRequestExecution execution) throws IOException { request.getHeaders() .add("x-api-key", apiKey); return execution.execute(request, body); } } The AuthInterceptor is used in the RestTemplate configuration. Data Transfer Objects (DTOs) Particularly in this POC, as the Figure entities are trivial in terms of the attributes that describe them, the DTOs used in the operations of interest and by the RestTemplate are simple as well. Java public record FigureRequest(String name) {} public record Figure(long id, String name) {} Since once read, Figure objects might be further used, their name was simplified, although they denote response DTOs. Exception Handling RestTemplate (and then RestClient) allows setting a ResponseErrorHandler implementation during its configuration, a strategy interface used to determine whether a particular response has errors or not, and permits custom handling. In this POC, as the figure-service sends all errors in the same form, it is very convenient and easy to adopt a generic handling manner. Java @Component public class CustomResponseErrorHandler implements ResponseErrorHandler { private static final Logger log = LoggerFactory.getLogger(CustomResponseErrorHandler.class); private final ObjectMapper objectMapper; public CustomResponseErrorHandler() { objectMapper = new ObjectMapper(); } @Override public boolean hasError(ClientHttpResponse response) throws IOException { return response.getStatusCode().isError(); } @Override public void handleError(URI url, HttpMethod method, ClientHttpResponse response) throws IOException { HttpStatusCode statusCode = response.getStatusCode(); String body = new String(response.getBody().readAllBytes()); if (statusCode.is4xxClientError()) { throw new CustomException("Client error.", statusCode, body); } String message = null; try { message = objectMapper.readValue(body, ErrorResponse.class).detail(); } catch (JsonProcessingException e) { log.error("Failed to parse response body: {}", e.getMessage(), e); } throw new CustomException(message, statusCode, body); } @JsonIgnoreProperties(ignoreUnknown = true) private record ErrorResponse(String detail) {} } The logic here is the following: Both client and server errors are considered and handled — see hasError() method.All errors result in a custom RuntimeException decorated with an HTTP status code and a detail, the default being the general Internal Server Error and the raw response body, respectively. Java public class CustomException extends RuntimeException { private final HttpStatusCode statusCode; private final String detail; public CustomException(String message) { super(message); this.statusCode = HttpStatusCode.valueOf(HttpStatus.INTERNAL_SERVER_ERROR.value()); this.detail = null; } public CustomException(String message, HttpStatusCode statusCode, String detail) { super(message); this.statusCode = statusCode; this.detail = detail; } public HttpStatusCode getStatusCode() { return statusCode; } public String getDetail() { return detail; } } In case of recoverable errors, all methods declared in FigureClient are throwing CustomExceptions, thus providing a simple exception handling mechanism. 
An extraction of the detail provided by the figure-service in the response body is first attempted and if possible, included in the CustomException, otherwise, the body is set as such Useful to Have Although not required, especially during development, but not only, it proves very useful to be able to see the requests and the responses exchanged in the logs of the client application. In order to accomplish this, a LoggingInterceptor is added to the RestTemplate configuration. Java @Component public class LoggingInterceptor implements ClientHttpRequestInterceptor { private static final Logger log = LoggerFactory.getLogger(LoggingInterceptor.class); @Override public ClientHttpResponse intercept(HttpRequest request, byte[] body, ClientHttpRequestExecution execution) throws IOException { logRequest(body); ClientHttpResponse response = execution.execute(request, body); logResponse(response); return response; } private void logRequest(byte[] body) { var bodyContent = new String(body); log.debug("Request body : {}", bodyContent); } private void logResponse(ClientHttpResponse response) throws IOException { var bodyContent = StreamUtils.copyToString(response.getBody(), Charset.defaultCharset()); log.debug("Response body: {}", bodyContent); } } Here, only the request and response bodies are logged, although other items might be of interest as well (headers, response statuses, etc.). Although useful, there is a gotcha worth explaining that needs to be taken into account. As it can be depicted, when the response is logged in the above interceptor, it is basically read and the stream “consumed”, which determines the client to eventually end up with an empty body. To prevent this, a BufferingClientHttpRequestFactory component shall be used, a component that allows buffering the stream content into memory and thus be able to read the response twice. The response availability is now resolved, but buffering the entire response body into memory might not be a good idea when its size is significant. Before jumping into blindly using it out of the box, developers should analyze the possible performance impact and adapt, particularly for each application. Configuration Having clarified the figure-service contract and requirements, and moreover, having already implemented certain “pieces,” the RestTemplate can now be configured. Java @Bean public RestOperations restTemplate(LoggingInterceptor loggingInterceptor, AuthInterceptor authInterceptor, CustomResponseErrorHandler customResponseErrorHandler) { RestTemplateCustomizer customizer = restTemplate -> restTemplate.getInterceptors() .addAll(List.of(loggingInterceptor, authInterceptor)); return new RestTemplateBuilder(customizer) .requestFactory(() -> new BufferingClientHttpRequestFactory(new SimpleClientHttpRequestFactory())) .errorHandler(customResponseErrorHandler) .build(); } A RestTemplateBuilder is used, the LoggingInterceptor, AuthInterceptor are added via a RestTemplateCustomizer, while the error handler is set to a CustomResponseErrorHandler instance. RestTemplate Implementation Once the RestTemplate instance is constructed, it can be injected into the actual FigureClient implementation and used to communicate with the figure-service. 
Java @Service public class FigureRestTemplateClient implements FigureClient { private final String url; private final RestOperations restOperations; public FigureRestTemplateClient(@Value("${figure.service.url}") String url, RestOperations restOperations) { this.url = url; this.restOperations = restOperations; } @Override public List<Figure> allFigures() { ResponseEntity<Figure[]> response = restOperations.exchange(url, HttpMethod.GET, null, Figure[].class); Figure[] figures = response.getBody(); if (figures == null) { throw new CustomException("Could not get the figures."); } return List.of(figures); } @Override public Optional<Figure> oneFigure(long id) { ResponseEntity<Figure> response = restOperations.exchange(url + "/{id}", HttpMethod.GET, null, Figure.class, id); Figure figure = response.getBody(); if (figure == null) { return Optional.empty(); } return Optional.of(figure); } @Override public Figure createFigure(FigureRequest figureRequest) { HttpEntity<FigureRequest> request = new HttpEntity<>(figureRequest); ResponseEntity<Figure> response = restOperations.exchange(url, HttpMethod.POST, request, Figure.class); Figure figure = response.getBody(); if (figure == null) { throw new CustomException("Could not create figure."); } return figure; } @Override public Figure updateFigure(long id, FigureRequest figureRequest) { HttpEntity<FigureRequest> request = new HttpEntity<>(figureRequest); ResponseEntity<Figure> response = restOperations.exchange(url + "/{id}", HttpMethod.PUT, request, Figure.class, id); Figure figure = response.getBody(); if (figure == null) { throw new CustomException("Could not update figure."); } return figure; } @Override public void deleteFigure(long id) { restOperations.exchange(url + "/{id}", HttpMethod.DELETE, null, Void.class, id); } @Override public Figure randomFigure() { ResponseEntity<Figure> response = restOperations.exchange(url + "/random", HttpMethod.GET, null, Figure.class); Figure figure = response.getBody(); if (figure == null) { throw new CustomException("Could not get a random figure."); } return figure; } } In order to observe how this solution works end-to-end, first, the figure-service is started. A CommandLineRunner is configured there, so that a few Figure entities are persisted into the database. Java @Bean public CommandLineRunner initDatabase(FigureService figureService) { return args -> { log.info("Loading data..."); figureService.create(new Figure("Lloyd")); figureService.create(new Figure("Jay")); figureService.create(new Figure("Kay")); figureService.create(new Figure("Cole")); figureService.create(new Figure("Zane")); log.info("Available figures:"); figureService.findAll() .forEach(figure -> log.info("{}", figure)); }; } Then, as part of the figure-client application, a FigureRestTemplateClient instance is injected into the following integration test. 
Java @SpringBootTest class FigureClientTest { @Autowired private FigureRestTemplateClient figureClient; @Test void allFigures() { List<Figure> figures = figureClient.allFigures(); Assertions.assertFalse(figures.isEmpty()); } @Test void oneFigure() { long id = figureClient.allFigures().stream() .findFirst() .orElseThrow(() -> new RuntimeException("No figures found")) .id(); Optional<Figure> figure = figureClient.oneFigure(id); Assertions.assertTrue(figure.isPresent()); } @Test void createFigure() { var request = new FigureRequest( "Fig " + UUID.randomUUID()); Figure figure = figureClient.createFigure(request); Assertions.assertNotNull(figure); Assertions.assertTrue(figure.id() > 0L); Assertions.assertEquals(request.name(), figure.name()); CustomException ex = Assertions.assertThrows(CustomException.class, () -> figureClient.createFigure(request)); Assertions.assertEquals("A Figure with the same 'name' already exists.", ex.getMessage()); Assertions.assertEquals(HttpStatus.BAD_REQUEST.value(), ex.getStatusCode().value()); Assertions.assertEquals(""" {"title":"Bad Request","status":400,"detail":"A Figure with the same 'name' already exists.","instance":"/api/v1/figures"}""", ex.getDetail()); } @Test void updateFigure() { List<Figure> figures = figureClient.allFigures(); long id = figures.stream() .findFirst() .orElseThrow(() -> new RuntimeException("No figures found")) .id(); var updatedRequest = new FigureRequest("Updated Fig " + UUID.randomUUID()); Figure updatedFigure = figureClient.updateFigure(id, updatedRequest); Assertions.assertNotNull(updatedFigure); Assertions.assertEquals(id, updatedFigure.id()); Assertions.assertEquals(updatedRequest.name(), updatedFigure.name()); Figure otherExistingFigure = figures.stream() .filter(f -> f.id() != id) .findFirst() .orElseThrow(() -> new RuntimeException("Not enough figures")); var updateExistingRequest = new FigureRequest(otherExistingFigure.name()); CustomException ex = Assertions.assertThrows(CustomException.class, () -> figureClient.updateFigure(id, updateExistingRequest)); Assertions.assertEquals(HttpStatus.INTERNAL_SERVER_ERROR.value(), ex.getStatusCode().value()); } @Test void deleteFigure() { long id = figureClient.allFigures().stream() .findFirst() .orElseThrow(() -> new RuntimeException("No figures found")) .id(); figureClient.deleteFigure(id); CustomException ex = Assertions.assertThrows(CustomException.class, () -> figureClient.deleteFigure(id)); Assertions.assertEquals(HttpStatus.BAD_REQUEST.value(), ex.getStatusCode().value()); Assertions.assertEquals("Figure not found.", ex.getMessage()); } @Test void randomFigure() { CustomException ex = Assertions.assertThrows(CustomException.class, () -> figureClient.randomFigure()); Assertions.assertEquals(HttpStatus.INTERNAL_SERVER_ERROR.value(), ex.getStatusCode().value()); Assertions.assertEquals("Not implemented yet.", ex.getMessage()); } } When running, for instance, the above createFigure() test, the RestTemplate and the LoggingInterceptor contribute to clearly describe what’s happening and display it in the client log: Plain Text [main] DEBUG RestTemplate#HTTP POST http://localhost:8082/api/v1/figures [main] DEBUG InternalLoggerFactory#Using SLF4J as the default logging framework [main] DEBUG RestTemplate#Accept=[application/json, application/*+json] [main] DEBUG RestTemplate#Writing [FigureRequest[name=Fig 6aa854a5-ba7a-4bbf-8160-70adf7d3e59b]] with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter [main] DEBUG LoggingInterceptor#Request body : {"name":"Fig 
6aa854a5-ba7a-4bbf-8160-70adf7d3e59b"} [main] DEBUG LoggingInterceptor#Response body: {"id":8,"name":"Fig 6aa854a5-ba7a-4bbf-8160-70adf7d3e59b"} [main] DEBUG RestTemplate#Response 201 CREATED [main] DEBUG RestTemplate#Reading to [com.hcd.figureclient.service.dto.Figure] [main] DEBUG RestTemplate#HTTP POST http://localhost:8082/api/v1/figures [main] DEBUG RestTemplate#Accept=[application/json, application/*+json] [main] DEBUG RestTemplate#Writing [FigureRequest[name=Fig 6aa854a5-ba7a-4bbf-8160-70adf7d3e59b]] with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter [main] DEBUG LoggingInterceptor#Request body : {"name":"Fig 6aa854a5-ba7a-4bbf-8160-70adf7d3e59b"} [main] DEBUG LoggingInterceptor#Response body: {"title":"Bad Request","status":400,"detail":"A Figure with the same 'name' already exists.","instance":"/api/v1/figures"} [main] DEBUG RestTemplate#Response 400 BAD_REQUEST And with that, a client implementation using RestTemplate is complete. RestClient Implementation The aim here, as stated from the beginning, is to be able to accomplish the same, but instead of using RestTemplate, to use a RestClient instance. As the LoggingInterceptor, AuthInterceptor and the CustomResponseErrorHandler can be reused, they are not changed, and the RestClient configured as below. Java @Bean public RestClient restClient(@Value("${figure.service.url}") String url, LoggingInterceptor loggingInterceptor, AuthInterceptor authInterceptor, CustomResponseErrorHandler customResponseErrorHandler) { return RestClient.builder() .baseUrl(url) .requestFactory(new BufferingClientHttpRequestFactory(new SimpleClientHttpRequestFactory())) .requestInterceptor(loggingInterceptor) .requestInterceptor(authInterceptor) .defaultStatusHandler(customResponseErrorHandler) .build(); } Then, the instance is injected into a new FigureClient implementation. Java @Service public class FigureRestClient implements FigureClient { private final RestClient restClient; public FigureRestClient(RestClient restClient) { this.restClient = restClient; } @Override public List<Figure> allFigures() { var figures = restClient.get() .retrieve() .body(Figure[].class); if (figures == null) { throw new CustomException("Could not get the figures."); } return List.of(figures); } @Override public Optional<Figure> oneFigure(long id) { var figure = restClient.get() .uri("/{id}", id) .retrieve() .body(Figure.class); return Optional.ofNullable(figure); } @Override public Figure createFigure(FigureRequest figureRequest) { var figure = restClient.post() .contentType(MediaType.APPLICATION_JSON) .body(figureRequest) .retrieve() .body(Figure.class); if (figure == null) { throw new CustomException("Could not create figure."); } return figure; } @Override public Figure updateFigure(long id, FigureRequest figureRequest) { var figure = restClient.put() .uri("/{id}", id) .contentType(MediaType.APPLICATION_JSON) .body(figureRequest) .retrieve() .body(Figure.class); if (figure == null) { throw new CustomException("Could not update figure."); } return figure; } @Override public void deleteFigure(long id) { restClient.delete() .uri("/{id}", id) .retrieve() .toBodilessEntity(); } @Override public Figure randomFigure() { var figure = restClient.get() .uri("/random") .retrieve() .body(Figure.class); if (figure == null) { throw new CustomException("Could not get a random figure."); } return figure; } } In addition to these, there is only one important step left: to test the client-server integration. 
In order to fulfill that, it is enough to replace the FigureRestTemplateClient instance with the FigureRestClient one above in the previous FigureClientTest.

Java
@SpringBootTest
class FigureClientTest {

    @Autowired
    private FigureRestClient figureClient;

    ...
}

If running, for instance, the same createFigure() test, the client output is similar. Apparently, RestClient is not as generous (or verbose) as RestTemplate when it comes to logging, but there is room for improvement as part of the custom LoggingInterceptor.

Plain Text
[main] DEBUG DefaultRestClient#Writing [FigureRequest[name=Fig 1155fd2c-91fe-486d-aaa3-35bf682629d4]] as "application/json" with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter
[main] DEBUG LoggingInterceptor#Request body : {"name":"Fig 1155fd2c-91fe-486d-aaa3-35bf682629d4"}
[main] DEBUG LoggingInterceptor#Response body: {"id":9,"name":"Fig 1155fd2c-91fe-486d-aaa3-35bf682629d4"}
[main] DEBUG DefaultRestClient#Reading to [com.hcd.figureclient.service.dto.Figure]
[main] DEBUG DefaultRestClient#Writing [FigureRequest[name=Fig 1155fd2c-91fe-486d-aaa3-35bf682629d4]] as "application/json" with org.springframework.http.converter.json.MappingJackson2HttpMessageConverter
[main] DEBUG LoggingInterceptor#Request body : {"name":"Fig 1155fd2c-91fe-486d-aaa3-35bf682629d4"}
[main] DEBUG LoggingInterceptor#Response body: {"title":"Bad Request","status":400,"detail":"A Figure with the same 'name' already exists.","instance":"/api/v1/figures"}

That's it, the migration from RestTemplate to RestClient is now complete.

Conclusions

When it comes to new synchronous API client implementations, I find RestClient the best choice, mostly for its functional and fluent API style. For older projects that were started before Spring Framework version 6.1 (Spring Boot 3.2, respectively) introduced RestClient, and that most probably still use RestTemplate, I consider the migration worth planning and doing (more details in [Resource 4]). Moreover, the possibility of reusing existing components (ClientHttpRequestInterceptors, ResponseErrorHandlers, etc.) is another incentive for such a migration. Ultimately, as a last resort, it is even possible to create a RestClient instance using the already configured RestTemplate and go from there, although I find this solution pretty tangled.

Resources

1. RestTemplate Spring Framework API Reference
2. WebClient Spring Framework API Reference
3. RestClient Spring Framework API Reference
4. Migrating from RestTemplate to RestClient
5. figure-service source code
6. figure-client source code
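As a footnote to the conclusions above, the "pretty tangled" last-resort option of building a RestClient on top of the already configured RestTemplate might look like the following sketch. It is not part of the original POC; the configuration class and bean method names are illustrative.

Java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestClient;
import org.springframework.web.client.RestOperations;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestClientBridgeConfig {

    @Bean
    public RestClient restClientFromTemplate(RestOperations restOperations) {
        // The POC exposes the template as RestOperations, hence the cast;
        // RestClient.create(RestTemplate) reuses the template's interceptors,
        // request factory, and message converters.
        return RestClient.create((RestTemplate) restOperations);
    }
}

RestClient.builder(restTemplate) is the variant to reach for when further customization is needed before calling build().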
When I first tried to implement AI in a test automation framework, I expected it to be helpful only for a few basic use cases. A few experiments later, I noticed several areas where the ChatGPT API actually saved me time and gave the test automation framework more power: producing realistic test data, analyzing logs in white-box tests, and handling flaky tests in CI/CD. Getting Started With the ChatGPT API ChatGPT API is a programming interface by OpenAI that operates on top of the HTTP(s) protocol. It allows sending requests and retrieving outputs from a pre-selected model as raw text, JSON, XML, or any other format you prefer to work with. The API documentation is clear enough to get started, with examples of request/response bodies that made the first call straightforward. In my case, I just generated an API key in the OpenAI developer platform and plugged it into the framework properties to authenticate requests. Building a Client for Integration With the API I built the integration in both Java and Python, and the pattern is the same: Send a POST with JSON and read the response, so it can be applied in almost any programming language. Since I prefer to use Java in automation, here is an example of what a client might look like: Java import java.net.http.*; import java.net.URI; import java.time.Duration; public class OpenAIClient { private final HttpClient http = HttpClient.newBuilder() .connectTimeout(Duration.ofSeconds(20)).build(); private final String apiKey; public OpenAIClient(String apiKey) { this.apiKey = apiKey; } public String chat(String userPrompt) throws Exception { String body = """ { "model": "gpt-5-mini", "messages": [ {"role":"system","content":"You are a helpful assistant for test automation..."}, {"role":"user","content": %s} ] } """.formatted(json(userPrompt)); HttpRequest req = HttpRequest.newBuilder() .uri(URI.create("https://api.openai.com/v1/chat/completions")) .timeout(Duration.ofSeconds(60)) .header("Authorization", "Bearer " + apiKey) .header("Content-Type", "application/json") .POST(HttpRequest.BodyPublishers.ofString(body)) .build(); HttpResponse<String> res = http.send(req, HttpResponse.BodyHandlers.ofString()); if (res.statusCode() >= 300) throw new RuntimeException(res.body()); return res.body(); } } As you probably have already noticed, one of the query parameters in the request body is the GPT model. Models differ in speed, cost, and capabilities: some are faster, while others are slower; some are expensive, while others are cheap, and some support multimodality, while others do not. Therefore, before integrating with the ChatGPT API, I recommend that you determine which model is best suited for performing tasks and set limits for it. On the OpenAI website, you can find a page where you can select several models and compare them to make a better choice. It will also probably be good to know that custom client implementation can also be extended to support server-sent streaming events to show results as they’re generated, and the Realtime API for multimodal purposes. This is what you can use for processing logs and errors in real time and identifying anomalies on the fly. Integration Architecture In my experience, integration with the ChatGPT API only makes sense in testing when applied to the correct problems. In my practice, I found three real-world scenarios I mentioned earlier, and now let’s take a closer look at them. Use Case 1: Test Data Generation The first use case I tried was a test data generation for automation tests. 
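Before getting into it, one practical detail about the client above: the Chat Completions endpoint wraps the model output in a choices array, so the framework usually needs a small helper to pull out choices[0].message.content before treating the output as domain JSON. Below is a minimal Jackson-based sketch; the class and method names are mine, not part of the original client.

Java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public final class ChatResponses {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private ChatResponses() {}

    /** Extracts choices[0].message.content from a raw Chat Completions response body. */
    public static String messageContent(String rawResponseBody) throws Exception {
        JsonNode root = MAPPER.readTree(rawResponseBody);
        JsonNode content = root.path("choices").path(0).path("message").path("content");
        if (content.isMissingNode() || content.isNull()) {
            throw new IllegalStateException("No message content in response: " + rawResponseBody);
        }
        return content.asText();
    }
}

With the content isolated this way, the raw model output can be handed straight to a JSON parser, which is what the data-generation code below relies on.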
Instead of relying on hardcoded values, ChatGPT can provide strong and realistic data sets, ranging from user profiles with household information to unique data used in exact sciences. In my experience, this variety of data helped uncover issues that fixed or hardcoded data would never catch, especially around boundary values and rare edge cases. The diagram below illustrates how this integration with the ChatGPT API for generating test data works. At the initial stage, the TestNG Runner launches the suite, and before running the test, it goes to the ChatGPT API and requests test data for the automation tests. This test data will later be processed at the data provider level, and automated tests will be run against it with newly generated data and expected assertions. Java class TestUser { public String firstName, lastName, email, phone; public Address address; } class Address { public String street, city, state, zip; } public List<TestUser> generateUsers(OpenAIClient client, int count) throws Exception { String prompt = """ You generate test users as STRICT JSON only. Schema: {"users":[{"firstName":"","lastName":"","email":"","phone":"", "address":{"street":"","city":"","state":"","zip":""}]} Count = %d. Output JSON only, no prose. """.formatted(count); String content = client.chat(prompt); JsonNode root = new ObjectMapper().readTree(content); ArrayNode arr = (ArrayNode) root.path("users"); List<TestUser> out = new ArrayList<>(); ObjectMapper m = new ObjectMapper(); arr.forEach(n -> out.add(m.convertValue(n, TestUser.class))); return out; } This solved the problem of repetitive test data and helped to detect errors and anomalies earlier. The main challenge was prompt reliability, and if the prompt wasn’t strict enough, the model would add extra text that broke the JSON parser. In my case, the versioning of prompts was the best way to keep improvements under control. Use Case 2: Log Analysis In some recent open-source projects I came across, automated tests also validated system behavior by analyzing logs. In most of these tests, there is an expectation that a specific message should appear in the application console or in DataDog or Loggly, for example, after calling one of the REST endpoints. Such tests are needed when the team conducts white-box testing. But what if we take it a step further and try to send logs to ChatGPT, asking it to check the sequence of messages and identify potential anomalies that may be critical for the service? Such an integration might look like this: When an automated test pulls service logs (e.g., via the Datadog API), it groups them and sends a sanitized slice to the ChatGPT API for analysis. The ChatGPT API has to return a structured verdict with a confidence score. In case anomalies are flagged, the test fails and displays the reasons from the response; otherwise, it passes. This should keep assertions focused while catching unexpected patterns you didn’t explicitly code for. 
The Java code for this use case might look like this: Java //Redaction middleware (keep it simple and fast) public final class LogSanitizer { private LogSanitizer() {} public static String sanitize(String log) { if (log == null) return ""; log = log.replaceAll("(?i)(api[_-]?key\\s*[:=]\\s*)([a-z0-9-_]{8,})", "$1[REDACTED]"); log = log.replaceAll("([A-Za-z0-9-_]{20,}\\.[A-Za-z0-9-_]+\\.[A-Za-z0-9-_]+)", "[REDACTED_JWT]"); log = log.replaceAll("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+", "[REDACTED_EMAIL]"); return log; } } //Ask for a structured verdict record Verdict(String verdict, double confidence, List<String> reasons) {} public Verdict analyzeLogs(OpenAIClient client, String rawLogs) throws Exception { String safeLogs = LogSanitizer.sanitize(rawLogs); String prompt = """ You are a log-analysis assistant. Given logs, detect anomalies (errors, timeouts, stack traces, inconsistent sequences). Respond ONLY as JSON with this exact schema: {"verdict":"PASS|FAIL","confidence":0.0-1.0,"reasons":["...","..."]} Logs (UTC): ---------------- %s ---------------- """.formatted(safeLogs); // Chat with the model and parse the JSON content field String content = client.chat(prompt); ObjectMapper mapper = new ObjectMapper(); JsonNode jNode = mapper.readTree(content); String verdict = jNode.path("verdict").asText("PASS"); double confidence = jNode.path("confidence").asDouble(0.0); List<String> reasons = mapper.convertValue( jNode.path("reasons").isMissingNode() ? List.of() : jNode.path("reasons"), new com.fasterxml.jackson.core.type.TypeReference<List<String>>() {} ); return new Verdict(verdict, confidence, reasons); } Before implementing such an integration, it is important to remember that logs often contain sensitive information, which may include API keys, JWT tokens, or user email addresses. Therefore, sending raw logs to the cloud API is a security risk, and in this case, the data sanitization must be performed. That is why, in my example, I added a simple LogSanitizer middleware to sanitize sensitive data before sending these logs to the ChatGPT API. It is also important to understand that this approach does not replace traditional assertions, but complements them. You can use them instead of dozens of complex checks, allowing the model to detect abnormal behavior. The most important thing is to treat the ChatGPT API verdict as a recommendation and leave the final decision to the automated framework itself based on the specified threshold values. For example, consider a test a failure only if the confidence is higher than 0.8. Use Case 3: Test Stabilization One of the most common problems in test automation is the occurrence of flaky tests. Tests can fail for various reasons, including changes to the API contract or interface. However, the worst scenario is when tests fail due to an unstable testing environment. Typically, for such unstable tests, the teams usually enable retries, and the test is run multiple times until it passes or, conversely, fails after three unsuccessful attempts in a row. But what if we give artificial intelligence the opportunity to decide whether a test needs to be restarted or whether it can be immediately marked as failed or vice versa? Here’s how this idea can be applied in a testing framework: When a test fails, the first step is to gather as much context as possible, including the stack trace, service logs, environment configuration, and, if applicable, a code diff. 
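A small holder for that failure context might look like the sketch below. It is a hypothetical illustration: the record, the prompt wording, and the FLAKY/REAL_FAILURE labels are mine, and the service logs are passed through the same LogSanitizer as in the previous use case.

Java
final class StabilizationContext {

    record FailureContext(String testName, String stackTrace, String serviceLogs,
                          String envConfig, String codeDiff) {}

    static String buildPrompt(FailureContext ctx) {
        // Sanitize logs before anything leaves the testing environment.
        return """
            You analyze a failed automated test and classify the failure.
            Respond ONLY as JSON:
            {"verdict":"FLAKY|REAL_FAILURE","confidence":0.0-1.0,"reasons":["..."]}
            Test: %s
            Stack trace:
            %s
            Service logs (sanitized):
            %s
            Environment:
            %s
            Code diff (may be empty):
            %s
            """.formatted(ctx.testName(), ctx.stackTrace(),
                          LogSanitizer.sanitize(ctx.serviceLogs()),
                          ctx.envConfig(), ctx.codeDiff());
    }
}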
All this data should be sent to the ChatGPT API for analysis to obtain a verdict, which is then passed to the AiPolicy. It is essential not to let ChatGPT make decisions independently. If the confidence level is high enough, AiPolicy can quarantine the test to prevent the pipeline from being blocked, and when the confidence level is below a specific value, the test can be re-tried or immediately marked as failed. I believe it is always necessary to leave the decision logic to the automation framework to maintain control over the test results, while still using AI-based integration. The main goal for this idea is to save time on analyzing unstable tests and reduce their number. Reports after processing data by ChatGPT become more informative and provide clearer insights into the root causes of failures. Conclusion I believe that integrating the ChatGPT API into a test automation framework can be an effective way to extend its capabilities, but there are compromises to this integration that need to be carefully weighed. One of the most important factors is cost. For example, in a set of 1,000 automated tests, of which about 20 fail per run, sending logs, stack traces, and environment metadata to the API can consume over half a million input tokens per run. Adding test data generation to this quickly increases token consumption. In my opinion, the key point is that the cost is directly proportional to the amount of data: the more you send, the more you pay. Another major issue I noticed is the security and privacy concerns. Logs and test data often contain sensitive information such as API keys, JWT tokens, or users' data, and sending raw data to the cloud is rarely acceptable in production. In practice, this means either using open-source LLMs like LLaMA deployed locally or providing a redaction/anonymization layer between your framework and the API so that sensitive fields are removed or replaced before anything leaves your testing environment. Model selection also plays a role. I've found that in many cases the best strategy is to combine them: using smaller models for routine tasks, and larger ones only where higher accuracy really matters. With these considerations in mind, the ChatGPT API can bring real value to testing. It helps generate realistic test data, analyze logs more intelligently, and makes it easier to manage flaky tests. The integration also makes reporting more informative, adding context and analytics that testers would otherwise have to research manually. As I have observed in practice, utilizing AI effectively requires controlling costs, protecting sensitive data, and maintaining decision-making logic within an automation framework to enable effective regulation of AI decisions. It reminds me of the early days of automation, when teams were beginning to weigh the benefits against the limitations to determine where the real value lay.
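To make Use Case 3 more concrete, here is one possible shape for the AiPolicy decision step described earlier. The class is only named, never shown, in the article, so the thresholds, enum, and method names below are illustrative assumptions; the Verdict record is the one from the log-analysis example.

Java
enum TestAction { QUARANTINE, RETRY, FAIL }

final class AiPolicy {

    private final double quarantineThreshold; // e.g., 0.8
    private final int maxRetries;             // e.g., 2

    AiPolicy(double quarantineThreshold, int maxRetries) {
        this.quarantineThreshold = quarantineThreshold;
        this.maxRetries = maxRetries;
    }

    /** The framework, not the model, makes the final call based on the verdict. */
    TestAction decide(Verdict verdict, int attemptsSoFar) {
        boolean looksFlaky = "FLAKY".equalsIgnoreCase(verdict.verdict());
        if (looksFlaky && verdict.confidence() >= quarantineThreshold) {
            return TestAction.QUARANTINE; // keep the pipeline green, track the test separately
        }
        if (looksFlaky && attemptsSoFar < maxRetries) {
            return TestAction.RETRY;      // low or medium confidence: give it another run
        }
        return TestAction.FAIL;           // treat it as a real failure
    }
}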
Done well, knowledge base integrations enable AI agents to deliver specific, context-rich answers without forcing employees to dig through endless folders. Done poorly, they introduce security gaps and permissioning mistakes that erode trust. The challenge for software developers building these integrations is that no two knowledge bases handle permissions the same way. One might gate content at the space level, another at the page level, and a third at the attachment level. Adding to these challenges, permissions aren't static. They change when people join or leave teams, switch roles, or when content owners update visibility rules. If your integration doesn't mirror these controls accurately and in real time, you risk exposing the wrong data to the wrong person. In building these knowledge base integrations ourselves, we've learned lots of practical tips for how to build secure, maintainable connectors that shorten the time to deployment without cutting corners on data security. 1. Treat Permissions as a First-Class Data Type Too many integration projects prioritize syncing content over permissions. This approach is backwards. Before your AI agent processes a single page, it should understand the permission model of the source system and be able to represent it internally. This means: Mapping every relevant permission scope in the source system (space, folder, page, attachment, comment).Representing permissions in your data model so your AI agent can enforce them before returning a result.Designing for exceptions. For example, if an article is generally public within a department but contains one restricted attachment, your connector should respect that partial restriction. For example, in a Confluence integration, you should check both space-level and page-level rules for each request. If you cache content to speed up retrieval, you must also cache the permissions and invalidate them promptly when they change. 2. Sync Permissions as Often as Content Permissions drift quickly. Someone might be promoted, transferred, or removed from a sensitive project, and the content they previously accessed is suddenly off-limits. Your AI agent should never rely on a stale permission snapshot. A practical approach is to tie permission updates to the same sync cadence as content updates. If you're fetching new or updated articles every five minutes, refresh the associated access control lists (ACLs) on the same schedule. If the source system supports webhooks or event subscriptions for permission changes, use them to trigger targeted re-syncs. 3. Respect the Principle of Least Privilege in Responses Enforcing permissions also shapes what your AI agent returns. For example, say your AI agent receives the query, "What are the latest results from our employee engagement survey?" The underlying knowledge base contains a page with survey results visible only to HR and executives. Even if the query perfectly matches the page's content, the agent should respond with either no result or a message indicating that the content is restricted. This means filtering retrieved documents at query time based on the current user's identity and permissions, not just when content is first synced. Retrieval-augmented generation (RAG) pipelines need this filter stage before passing context to the LLM. 4. Normalize Data Without Flattening Security Every knowledge base stores content differently, whether that's nested pages in Confluence, blocks in Notion, or articles in Zendesk. 
Normalizing these formats makes it easier for your AI agent to handle multiple systems. But normalization should never strip away the original permission structures. For instance, when creating a unified search index, store both the normalized text and the original system's permission metadata. Your query service can then enforce the correct rules regardless of which source system the content came from. 5. Handle Hierarchies and Inheritance Carefully Most systems allow permission inheritance, where you grant access to a top-level space, and then all child pages inherit those rights unless overridden. Your connector must understand and replicate this logic. For example, with an internal help desk AI agent, a "VPN Troubleshooting" article may inherit view rights from its parent "Network Resources" space. But if someone restricts that one article to a smaller group, your integration must override the inherited rule and enforce the more restrictive setting. 6. Test With Realistic, Complex Scenarios Permission bugs often hide in edge cases: Mixed inheritance and explicit restrictionsUsers with multiple, overlapping rolesAttachments with different permissions than their parent page Developers should build a test harness that mirrors these conditions using anonymized or synthetic data. Validate not only that your AI agent can fetch the right content, but that it never exposes restricted data, even when queried indirectly ("What did the survey results say about the marketing team?"). 7. Build for Ongoing Maintenance A secure, reliable knowledge base integration isn't a "set it and forget it" feature. It's an active part of your AI agent's architecture. Once deployed, knowledge base integrations require constant upkeep: API version changes, evolving permission models, and shifts in organizational structure. Assign ownership for monitoring and updating each connector, and automate regression tests for permission enforcement. Document your mapping between source-system roles and internal permission groups so that changes can be made confidently when needed. By giving permissions the same engineering rigor as content retrieval, you protect sensitive data and preserve trust in the system. That trust is what ultimately allows these AI agents to be embedded into the real workflows where they deliver the most value. You may be looking at the steps involved in building knowledge base connectors and wonder why they matter. When implemented well, they can transform workflows: Enterprise AI search: By integrating with a company's wiki, CRM, and file storage, a search agent can answer multi-step queries like, "What's the status of the Acme deal?" pulling from sales notes, internal strategy docs, and shared project plans. Permissions ensure that deal details remain visible only to the account team.IT help desk agent: When connected to a knowledge base, the agent can deliver precise, step-by-step troubleshooting guides to employees. If a VPN setup page is restricted to IT staff, the agent won't surface it to non-IT users.New hire onboarding bot: Integrated with the company wiki and messaging platform, an agent can answer questions about policies, teams, and tools. Each answer is filtered through the same rules that would apply if the employee searched manually. These examples work not because the AI agent "knows everything," but because it knows how to retrieve the right things for the right person at the right time. 
As knowledge base integrations become a standard part of AI agent products, it's critical to build and manage them in a way that prioritizes data security and trust.
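As a concrete illustration of the query-time enforcement described in points 3 through 5, a minimal filter over a normalized index entry might look like the sketch below. Every type, field, and rule here is hypothetical and stands in for whatever your source systems actually provide.

Java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Normalized document: the text plus the source system's permission metadata (point 4).
record IndexedDoc(String id, String text,
                  Optional<Set<String>> explicitAllowedGroups, // overrides inheritance (point 5)
                  Set<String> inheritedAllowedGroups) {}

record User(String id, Set<String> groups) {}

final class PermissionFilter {

    /** Effective ACL: an explicit restriction always wins over the inherited one. */
    private static Set<String> effectiveAcl(IndexedDoc doc) {
        return doc.explicitAllowedGroups().orElse(doc.inheritedAllowedGroups());
    }

    private static boolean canRead(User user, IndexedDoc doc) {
        Set<String> acl = effectiveAcl(doc);
        return user.groups().stream().anyMatch(acl::contains);
    }

    /** Applied to retrieved candidates before any context reaches the LLM (point 3). */
    static List<IndexedDoc> filterForUser(User user, List<IndexedDoc> retrieved) {
        return retrieved.stream()
                .filter(doc -> canRead(user, doc))
                .toList();
    }
}

A real connector also needs the permission caching and invalidation from points 1 and 2, since the ACLs read here are only as fresh as the last sync or webhook event.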
Many outdated or imprecise claims about transaction isolation levels in MongoDB persist. These claims are outdated because they are often based on MongoDB 4.0, the version where multi-document transactions were introduced (the old Jepsen report, for example), and the issues found back then have since been fixed. They are also imprecise because people attempt to map MongoDB's transaction isolation to SQL isolation levels, which is inappropriate, as the SQL Standard definitions ignore Multi-Version Concurrency Control (MVCC), utilized by most databases, including MongoDB. Martin Kleppmann has discussed this issue and provided tests to assess transaction isolation and potential anomalies. I will conduct these tests on MongoDB to explain how multi-document transactions work and avoid anomalies. I followed the structure of Martin Kleppmann's tests on PostgreSQL and ported them to MongoDB. The read isolation level in MongoDB is controlled by the Read Concern; the "snapshot" read concern is the only one comparable to other MVCC SQL databases and maps to Snapshot Isolation, often improperly called Repeatable Read when using the closest SQL standard term. As I test on a single-node lab, I use "majority" to show that it does more than Read Committed. The write concern should also be set to "majority" to ensure that at least one node is common between the read and write quorums.

Recap on Isolation Levels in MongoDB

Let me quickly explain the other isolation levels and why they cannot be mapped to the SQL standard:

readConcern: { level: "local" } is sometimes compared to Uncommitted Reads because it may show a state that can later be rolled back in case of failure. However, some SQL databases may show the same behavior in some rare conditions (example here) and still call that Read Committed.

readConcern: { level: "majority" } is sometimes compared to Read Committed because it avoids uncommitted reads. However, Read Committed was defined for wait-on-conflict databases to reduce the lock duration in two-phase locking, whereas MongoDB multi-document transactions use fail-on-conflict to avoid waits. Some databases consider that Read Committed can allow reads from multiple states (example here), while others consider it must be a statement-level snapshot isolation (examples here). In a multi-shard transaction, majority may show a result from multiple states, as snapshot is the only one that is timeline consistent.

readConcern: { level: "snapshot" } is the real equivalent to Snapshot Isolation and prevents more anomalies than Read Committed. Some databases even call that "serializable" (example here) because the SQL standard ignores the write-skew anomaly.

readConcern: { level: "linearizable" } is comparable to Serializable, but only for a single document, and is not available for multi-document transactions, similar to many SQL databases that do not provide serializable because it reintroduces the scalability problems of read locks that MVCC avoids.

Read Committed Basic Requirements (G0, G1a, G1b, G1c)

Here are some tests for anomalies typically prevented in Read Committed. I'll run them with readConcern: { level: "majority" }, but keep in mind that readConcern: { level: "snapshot" } may be better if you want a consistent snapshot across multiple shards.
MongoDB Prevents Write Cycles (G0) With Conflict Error JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); In a two-phase locking database, with wait-on-conflict behavior, the second transaction would wait for the first one to avoid anomalies. However, MongoDB with transactions is fail-on-conflict and raises a retriable error to avoid the anomaly. Each transaction touched only one document, but it was declared explicitly with a session and startTransaction(), to allow multi-document transactions, and this is why we observed the fail-on-conflict behavior to let the application apply its retry logic for complex transactions. If the conflicting update was run as a single-document transaction, equivalent to an auto-commit statement, it would have used a wait-on-conflict behavior. I can test it by immediately running this while the t1 transaction is still active: JavaScript const db = db.getMongo().getDB("test_db"); print(`Elapsed time: ${ ((startTime = new Date()) && db.test.updateOne({ _id: 1 }, { $set: { value: 12 } })) && (new Date() - startTime) } ms`); Elapsed time: 72548 ms I've run the updateOne({ _id: 1 }) without an implicit transaction. It waited for the other transaction to terminate, which happened after a 60-second timeout, and then the update was successful. The first transaction that timed out is aborted: JavaScript session1.commitTransaction(); MongoServerError[NoSuchTransaction]: Transaction with { txnNumber: 2 } has been aborted. The behavior of conflict in transactions differs: wait-on-conflict for implicit single-document transactionsfail-on-conflict for explicit multiple-document transactions immediately, resulting in a transient error, without waiting, to let the application rollback and retry. MongoDB Prevents Aborted Reads (G1a) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session1.abortTransaction(); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] session2.commitTransaction(); MongoDB prevents reading an aborted transaction by reading only the committed value when Read Concern is 'majority' or 'snapshot.' 
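The G0 test above, and several tests below, end with a retriable WriteConflict, so it is worth sketching the application-side retry logic mentioned earlier. Since that retry normally lives in driver code rather than in the shell, here is a minimal sketch with the MongoDB Java sync driver; the collection and values mirror the mongosh examples, but none of this code is from the original tests. The withTransaction() helper re-runs the callback when the server returns an error labeled TransientTransactionError (such as WriteConflict) and retries the commit on UnknownTransactionCommitResult, until a time limit is reached.

Java
import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class RetryableTransactionSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> test =
                    client.getDatabase("test_db").getCollection("test");

            // Same read/write concerns as the mongosh examples.
            TransactionOptions options = TransactionOptions.builder()
                    .readConcern(ReadConcern.MAJORITY)
                    .writeConcern(WriteConcern.MAJORITY)
                    .build();

            try (ClientSession session = client.startSession()) {
                // withTransaction() retries the whole callback on transient errors
                // such as the WriteConflict raised by fail-on-conflict above.
                session.withTransaction(() -> {
                    test.updateOne(session, Filters.eq("_id", 1), Updates.set("value", 11));
                    test.updateOne(session, Filters.eq("_id", 2), Updates.set("value", 19));
                    return null;
                }, options);
            }
        }
    }
}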
MongoDB Prevents Intermediate Reads (G1b) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 101 } }); T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The non-committed change from T1 is not visible to T2. JavaScript T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); session1.commitTransaction(); // T1 commits T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] The committed change from T1 is still not visible to T2 because it happened after T2 started. This is different from the majority of Multi-Version Concurrency Control SQL databases. To minimize the performance impact of wait-on-conflict, they reset the read time before each statement in Read Committed, as phantom reads are allowed. They would have displayed the newly committed value with this example. MongoDB never does that; the read time is always the start of the transaction, and no phantom read anomaly happens. However, it doesn't wait to see if the conflict is resolved or must fail with a deadlock, and fails immediately to let the application retry it. MongoDB Prevents Circular Information Flow (G1c) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 22 } }); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] session1.commitTransaction(); session2.commitTransaction(); In both transactions, the uncommitted changes are not visible to others. MongoDB Prevents Observed Transaction Vanishes (OTV) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T3 const session3 = db.getMongo().startSession(); const T3 = session3.getDatabase("test_db"); session3.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T1.test.updateOne({ _id: 2 }, { $set: { value: 19 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. 
:: Please retry your operation or multi-document transaction. This anomaly is prevented by fail-on-conflict with an explicit transaction. With an implicit single-document transaction, it would have to wait for the conflicting transaction to end. MongoDB Prevents Predicate-Many-Preceders (PMP) With a SQL database, this anomaly would require the Snapshot Isolation level because Read Committed uses different read times per statement. However, I can show it in MongoDB with 'majority' read concern, 'snapshot' being required only to get cross-shard snapshot consistency. JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ value: 30 }).toArray(); [] T2.test.insertOne( { _id: 3, value: 30 } ); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The newly inserted row is not visible because it was committed by T2 after the start of T1. Martin Kleppmann's tests include some variations with a delete statement and a write predicate: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.updateMany({}, { $inc: { value: 10 } }); T2.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. As it is an explicit transaction, rather than blocking, the delete detects the conflict and raises a retriable exception to prevent the anomaly. Compared to PostgreSQL, which prevents that in Repeatable Read, it saves the waiting time before failure, but requires the application to implement a retry logic. MongoDB Prevents Lost Update (P4) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T1.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. 
As it is an explicit transaction, the update doesn't wait but raises a retriable exception, making it impossible to overwrite the other update without first waiting for its completion. MongoDB Prevents Read Skew (G-single) JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.find({ _id: 2 }); [ { _id: 2, value: 20 } ] In SQL databases with Read Committed isolation, a read skew anomaly could display the value 18. However, MongoDB avoids this issue by reading the same value of 20 consistently throughout the transaction, as it reads data as of the start of the transaction. Martin Kleppmann's tests include a variation with predicate dependency: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.findOne({ value: { $mod: [5, 0] } }); { _id: 1, value: 10 } T2.test.updateOne({ value: 10 }, { $set: { value: 12 } }); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] The value 12, which is a multiple of 3 and was committed by T2, is not visible to T1 because T1 started before that commit. Another test includes a variation with a write predicate in a delete statement: JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: 1 }); [ { _id: 1, value: 10 } ] T2.test.find(); [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 12 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 18 } }); session2.commitTransaction(); T1.test.deleteMany({ value: 20 }); MongoServerError[WriteConflict]: Caused by :: Write conflict during plan execution and yielding is disabled. :: Please retry your operation or multi-document transaction. This read skew anomaly is prevented by the fail-on-conflict behavior when a transaction writes a document that another transaction has changed since its snapshot was taken.
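Handling these transient WriteConflict errors is the application's job. As a minimal illustration, not part of the original tests, the MongoDB Java sync driver's withTransaction helper re-runs the transaction body when the server raises an error labeled TransientTransactionError; the connection string, database name, and values below are assumptions for this sketch. Java
import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class TransferWithRetry {
    public static void main(String[] args) {
        // Assumed connection string; point it at your own deployment
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> test = client.getDatabase("test_db").getCollection("test");

            TransactionOptions options = TransactionOptions.builder()
                    .readConcern(ReadConcern.MAJORITY)
                    .writeConcern(WriteConcern.MAJORITY)
                    .build();

            try (ClientSession session = client.startSession()) {
                // withTransaction retries the whole body when the server reports a
                // TransientTransactionError, e.g. the WriteConflict shown above
                session.withTransaction(() -> {
                    test.updateOne(session, eq("_id", 1), set("value", 11));
                    test.updateOne(session, eq("_id", 2), set("value", 19));
                    return "committed";
                }, options);
            }
        }
    }
}
The same pattern applies to any of the explicit multi-document transactions in these tests: on a transient error, re-run the whole transaction body rather than the single failing statement.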
Write Skew (G2-item) Must Be Managed by the Application JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "majority" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.find({ _id: { $in: [1, 2] } }) [ { _id: 1, value: 10 }, { _id: 2, value: 20 } ] T2.test.updateOne({ _id: 1 }, { $set: { value: 11 } }); T2.test.updateOne({ _id: 2 }, { $set: { value: 21 } }); session1.commitTransaction(); session2.commitTransaction(); MongoDB doesn't detect the read/write conflict when one transaction has read a value updated by the other, and then writes something that may have depended on this value. The read concern doesn't provide a Serializable guarantee. Such isolation requires acquiring range or predicate locks during reads, and doing it prematurely would hinder the performance of a database designed to scale. For the transactions that need to avoid this, the application can transform the read/write conflict into a write/write conflict by updating a field in the document that was read, to be sure that other transactions do not modify it, or by re-checking the value when updating. Anti-Dependency Cycles (G2) Must Be Managed by the Application JavaScript // init use test_db; db.test.drop(); db.test.insertMany([ { _id: 1, value: 10 }, { _id: 2, value: 20 } ]); // T1 const session1 = db.getMongo().startSession(); const T1 = session1.getDatabase("test_db"); session1.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); // T2 const session2 = db.getMongo().startSession(); const T2 = session2.getDatabase("test_db"); session2.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } }); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T2.test.find({ value: { $mod: [3, 0] } }).toArray(); [] T1.test.insertOne( { _id: 3, value: 30 } ); T2.test.insertOne( { _id: 4, value: 42 } ); session1.commitTransaction(); session2.commitTransaction(); T1.test.find({ value: { $mod: [3, 0] } }).toArray(); [ { _id: 3, value: 30 }, { _id: 4, value: 42 } ] The read/write conflict was not detected, and both transactions were able to write, even if they may have depended on a previous read that had been modified by the other transaction. MongoDB does not acquire locks across read and write calls. If you run a multi-document transaction where the writes depend on the reads, the application must explicitly write to the read set in order to detect the write conflict and avoid the anomaly. All those tests were based on https://github.com/ept/hermitage. There's a lot of information about MongoDB transactions in the MongoDB Multi-Document ACID Transactions whitepaper from 2020. While the document model offers simplicity and performance when a single document matches the business transaction, MongoDB supports multi-statement transactions with Snapshot Isolation, similar to many SQL databases using Multi-Version Concurrency Control (MVCC), but favoring fail-on-conflict rather than wait.
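As a minimal sketch of the mitigation described above for write skew and anti-dependency cycles (not something the original tests contain), the transaction can touch the documents it only read, so a concurrent writer produces a detectable write/write conflict. It reuses the client, collection, session, and options from the previous sketch; the business rule, field names, and version counter are assumptions. Java
// requires: import static com.mongodb.client.model.Updates.inc;
session.withTransaction(() -> {
    Document doc1 = test.find(session, eq("_id", 1)).first();
    Document doc2 = test.find(session, eq("_id", 2)).first();

    // Hypothetical invariant that depends on both reads, e.g. keep the sum >= 0
    int sum = doc1.getInteger("value") + doc2.getInteger("value");
    if (sum - 10 >= 0) {
        // Touch the read set: bump a version counter on the document we only read,
        // so a concurrent transaction writing it raises a retriable WriteConflict
        test.updateOne(session, eq("_id", 2), inc("version", 1));
        test.updateOne(session, eq("_id", 1), set("value", doc1.getInteger("value") - 10));
    }
    return "committed";
}, options);
Alternatively, the update filter can re-check the value that was read (for example, adding eq("value", doc1.getInteger("value")) to the filter) and treat a zero-document match as a conflict.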
Despite outdated myths about NoSQL, often based on early versions, MongoDB's transaction implementation is robust and effectively prevents common transactional anomalies.
As you may have already guessed from the title, the topic for today will be Spring Boot WebSockets. Some time ago, I provided an example of WebSocket chat based on Akka toolkit libraries. However, this chat will have somewhat more features and a quite different design. I will skip some parts so as not to duplicate too much content from the previous article. Here you can find a more in-depth intro to WebSockets. Please note that all the code that’s used in this article is also available in the GitHub repository. Spring Boot WebSocket: Tools Used Let’s start the technical part of this text with a description of the tools that will be further used to implement the whole application. As I cannot fully grasp how to build a real WebSocket API with the classic Spring STOMP overlay, I decided to go for Spring WebFlux and make everything reactive. Spring Boot – No modern Java app based on Spring can exist without Spring Boot; all the autoconfiguration is priceless. Spring WebFlux – A reactive version of classic Spring that provides quite a nice and descriptive toolkit for handling both WebSockets and REST. I would dare to say that it is the only way to actually get WebSocket support in Spring. Mongo – One of the most popular NoSQL databases; I am using it for storing message history. Spring Reactive Mongo – A Spring Boot starter for handling Mongo access in a reactive fashion. Using reactive in one place but not the other is not the best idea, so I decided to make DB access reactive as well. Let’s start the implementation! Spring Boot WebSocket: Implementation Dependencies and Config pom.xml XML <dependencies> <!--Compile--> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-webflux</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-mongodb-reactive</artifactId> </dependency> </dependencies> application.properties Properties files spring.data.mongodb.uri=mongodb://chats-admin:admin@localhost:27017/chats I prefer .properties over .yml; in my honest opinion, YAML is less readable and harder to maintain on a larger scale. WebSocketConfig Java @Configuration class WebSocketConfig { @Bean ChatStore chatStore(MessagesStore messagesStore) { return new DefaultChatStore(Clock.systemUTC(), messagesStore); } @Bean WebSocketHandler chatsHandler(ChatStore chatStore) { return new ChatsHandler(chatStore); } @Bean SimpleUrlHandlerMapping handlerMapping(WebSocketHandler wsh) { Map<String, WebSocketHandler> paths = Map.of("/chats/{id}", wsh); return new SimpleUrlHandlerMapping(paths, 1); } @Bean WebSocketHandlerAdapter webSocketHandlerAdapter() { return new WebSocketHandlerAdapter(); } } And surprise, all four beans defined here are very important. ChatStore – A custom bean for operating on chats; I will go into more detail on this bean in the following steps. WebSocketHandler – The bean that will hold all the logic related to handling WebSocket sessions. SimpleUrlHandlerMapping – Responsible for mapping URLs to the correct handler; the full URL for this one will look more or less like ws://localhost:8080/chats/{id}. WebSocketHandlerAdapter – A kind of capability bean; it adds WebSocket handling support to Spring WebFlux's DispatcherHandler.
ChatsHandler Java class ChatsHandler implements WebSocketHandler { private final Logger log = LoggerFactory.getLogger(ChatsHandler.class); private final ChatStore store; ChatsHandler(ChatStore store) { this.store = store; } @Override public Mono<Void> handle(WebSocketSession session) { String[] split = session.getHandshakeInfo() .getUri() .getPath() .split("/"); String chatIdStr = split[split.length - 1]; int chatId = Integer.parseInt(chatIdStr); ChatMeta chatMeta = store.get(chatId); if (chatMeta == null) { return session.close(CloseStatus.GOING_AWAY); } if (!chatMeta.canAddUser()) { return session.close(CloseStatus.NOT_ACCEPTABLE); } String sessionId = session.getId(); store.addNewUser(chatId, session); log.info("New user {} joined the chat {}", sessionId, chatId); return session .receive() .map(WebSocketMessage::getPayloadAsText) .flatMap(message -> store.addNewMessage(chatId, sessionId, message)) .flatMap(message -> broadcastToSessions(sessionId, message, store.get(chatId).sessions())) .doFinally(sig -> store.removeSession(chatId, session.getId())) .then(); } private Mono<Void> broadcastToSessions(String sessionId, String message, List<WebSocketSession> sessions) { return sessions .stream() .filter(session -> !session.getId().equals(sessionId)) .map(session -> session.send(Mono.just(session.textMessage(message)))) .reduce(Mono.empty(), Mono::then); } } As I mentioned above, here you can find all the logic related to handling WebSocket sessions. First, we parse the chat ID from the URL to get the target chat and respond with different close statuses depending on the state of that particular chat. Additionally, I am broadcasting each message to all the other sessions related to a particular chat, so users can actually exchange messages. I have also added a doFinally trigger that clears closed sessions from the ChatStore to reduce redundant communication. As a whole, this code is reactive, so there are some restrictions I need to follow. I have tried to make it as simple and readable as possible; if you have any idea how to improve it, I am open to suggestions. ChatRouter Java @Configuration(proxyBeanMethods = false) class ChatRouter { private final ChatStore chatStore; ChatRouter(ChatStore chatStore) { this.chatStore = chatStore; } @Bean RouterFunction<ServerResponse> routes() { return RouterFunctions .route(POST("api/v1/chats/create"), e -> create(false)) .andRoute(POST("api/v1/chats/create-f2f"), e -> create(true)) .andRoute(GET("api/v1/chats/{id}"), this::get) .andRoute(DELETE("api/v1/chats/{id}"), this::delete); } } WebFlux's approach to defining REST endpoints is quite different from classic Spring. Above, you can see the definition of four endpoints for managing chats. Similar to the Akka implementation, I want to have a REST API for managing chats and a WebSocket API for actually handling them. I will skip the function implementations as they are pretty trivial; you can see them on GitHub.
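For reference, here is a hypothetical sketch of how the omitted create and get handlers inside ChatRouter might look; the real implementations are in the GitHub repository, and the response shape and hard-coded host are assumptions. Java
// Inside ChatRouter; assumed imports:
// org.springframework.web.reactive.function.server.ServerRequest / ServerResponse,
// reactor.core.publisher.Mono, java.util.Map
private Mono<ServerResponse> create(boolean isF2F) {
    int chatId = chatStore.create(isF2F);
    // Respond with the WebSocket URL clients should connect to
    return ServerResponse.ok()
            .bodyValue(Map.of("chatId", chatId, "url", "ws://localhost:8080/chats/" + chatId));
}

private Mono<ServerResponse> get(ServerRequest request) {
    int chatId = Integer.parseInt(request.pathVariable("id"));
    ChatMeta chatMeta = chatStore.get(chatId);
    return chatMeta == null
            ? ServerResponse.notFound().build()
            : ServerResponse.ok().bodyValue(chatMeta);
}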
ChatStore First, the interface: Java public interface ChatStore { int create(boolean isF2F); void addNewUser(int id, WebSocketSession session); Mono<String> addNewMessage(int id, String userId, String message); void removeSession(int id, String session); ChatMeta get(int id); ChatMeta delete(int id); } Then the implementation: Java public class DefaultChatStore implements ChatStore { private final Map<Integer, ChatMeta> chats; private final AtomicInteger idGen; private final MessagesStore messagesStore; private final Clock clock; public DefaultChatStore(Clock clock, MessagesStore store) { this.chats = new ConcurrentHashMap<>(); this.idGen = new AtomicInteger(0); this.clock = clock; this.messagesStore = store; } @Override public int create(boolean isF2F) { int newId = idGen.incrementAndGet(); ChatMeta chatMeta = chats.computeIfAbsent(newId, id -> { if (isF2F) { return ChatMeta.ofIdF2F(id); } return ChatMeta.ofId(id); }); return chatMeta.id; } @Override public void addNewUser(int id, WebSocketSession session) { chats.computeIfPresent(id, (k, v) -> v.addUser(session)); } @Override public void removeSession(int id, String sessionId) { chats.computeIfPresent(id, (k, v) -> v.removeUser(sessionId)); } @Override public Mono<String> addNewMessage(int id, String userId, String message) { ChatMeta meta = chats.getOrDefault(id, null); if (meta != null) { Message messageDoc = new Message(id, userId, meta.offset.getAndIncrement(), clock.instant(), message); return messagesStore.save(messageDoc) .map(Message::getContent); } return Mono.empty(); } // omitted The base of ChatStore is the ConcurrentHashMap that holds the metadata of all open chats. Most of the methods from the interface are self-explanatory, and there is nothing special behind them. create – Creates a new chat, with a boolean attribute denoting whether the chat is f2f or group. addNewUser – Adds a new user to an existing chat. removeSession – Removes a user's session from an existing chat. get – Gets the metadata of the chat with a given ID. delete – Deletes the chat from the ConcurrentHashMap. The only complex method here is addNewMessage. It increments the message counter within the chat and persists the message content in MongoDB for durability. MongoDB Message Entity Java public class Message { @Id private String id; private int chatId; private String owner; private long offset; private Instant timestamp; private String content; } A model for message content stored in the database; there are three important fields here: chatId – Represents the chat in which a particular message was sent. owner – The userId of the message sender. offset – The ordinal number of the message within the chat, used for retrieval ordering. MessagesStore Java public interface MessagesStore extends ReactiveMongoRepository<Message, String> {} Nothing special: a classic Spring repository, but in a reactive fashion, providing a similar set of features to JpaRepository. It is used directly in ChatStore. Additionally, in the main application class, WebsocketsChatApplication, I am activating reactive repositories by using @EnableReactiveMongoRepositories. Without this annotation, the MessagesStore from above would not work. And here we go, we have the whole chat implemented. Let’s test it! Spring Boot WebSocket: Testing For tests, I’m using Postman and Simple WebSocket Client. First, I create a new chat using Postman; in the response body, I get a WebSocket URL to the newly created chat. Now it is time to use it and check whether users can communicate with one another. This is where Simple WebSocket Client comes into play: I connect to the newly created chat with it.
Here we are, everything is working, and users can communicate with each other. There is one last thing to do. Let’s spend a moment looking at things that can be done better. What Can Be Done Better As what I have just built is the most basic chat app, there are a few (or, in fact, quite a lot of) things that may be done better. Below, I have listed the things I find worthy of improvement: Authentication and rejoining support – Right now, everything is based on the sessionId. It is not an optimal approach. It would be better to have some authentication in place and actual rejoining based on user data. Sending attachments – For now, the chat only supports simple text messages. While texting is the basic function of a chat, users enjoy exchanging images and audio files, too. Tests – There are no tests for now, but why leave it like this? Tests are always a good idea. Overflow in offset – Currently, it is a simple int. If we were to track the offset for a very long time, it would overflow sooner or later. Summary Et voilà! The Spring Boot WebSocket chat is implemented, and the main task is done. You now have some ideas about what to develop next. Please keep in mind that this chat case is very simple, and it will require lots of changes and development for any type of commercial project. Anyway, I hope that you learned something new while reading this article. Thank you for your time. These other resources might interest you: Lock-Free Programming in Java and 7 API Integration Patterns
You think you know your SDLC like the back of your carpal-tunnel-riddled hand: You've got your gates, your reviews, your carefully orchestrated dance of code commits and deployment pipelines. But here's a plot twist straight out of your auntie's favorite daytime soap: there's an evil twin lurking in your organization (cue the dramatic organ music). It looks identical to your SDLC — same commits, same repos, the same shiny outputs flowing into production. But this fake-goatee-wearing doppelgänger plays by its own rules, ignoring your security governance and standards. Welcome to the shadow SDLC — the one your team built with AI when you weren't looking: It generates code, dependencies, configs, and even tests at machine speed, but without any of your governance, review processes, or security guardrails. Checkmarx’s August Future of Application Security report, based on a survey of 1,500 CISOs, AppSec managers, and developers worldwide, just pulled back the curtain on this digital twin drama: 34% of developers say more than 60% of their code is now AI-generated. Only 18% of organizations have policies governing AI use in development. 26% of developers admit AI tools are being used without permission. It’s not just about insecure code sneaking into production, but rather about losing ownership of the very processes you’ve worked to streamline. Your “evil twin” SDLC comes with: Unknown provenance → You can’t always trace where AI-generated code or dependencies came from. Inconsistent reliability → AI may generate tests or configs that look fine but fail in production. Invisible vulnerabilities → Flaws that never hit a backlog because they bypass reviews entirely. This isn’t a story about AI being “bad”, but about AI moving faster than your controls — and the risk that your SDLC’s evil twin becomes the one in charge. The rest of this article is about how to prevent that. Specifically: How the shadow SDLC forms (and why it’s more than just code). The unique risks it introduces to security, reliability, and governance. What you can do today to take back ownership — without slowing down your team. How the Evil Twin SDLC Emerges The evil twin isn’t malicious by design — it’s a byproduct of AI’s infiltration into nearly every stage of development: Code creation – AI writes large portions of your codebase at scale. Dependencies – AI pulls in open-source packages without vetting versions or provenance. Testing – AI generates unit tests or approves changes that may lack rigor. Configs and infra – AI auto-generates Kubernetes YAMLs, Dockerfiles, Terraform templates. Remediation – AI suggests fixes that may patch symptoms while leaving root causes. The result is a pipeline that resembles your own — but lacks the data integrity, reliability, and governance you’ve spent years building. Sure, It’s a Problem. But Is It Really That Bad? You love the velocity that AI provides, but this parallel SDLC compounds risk by its very nature. Unlike human-created debt, AI can replicate insecure patterns across dozens of repos in hours. And the stats from the FOA report speak for themselves: 81% of orgs knowingly ship vulnerable code — often to meet deadlines. 33% of developers admit they “hope vulnerabilities won’t be discovered” before release. 98% of organizations experienced at least one breach from vulnerable code in the past year — up from 91% in 2024 and 78% in 2023. The share of orgs reporting 4+ breaches jumped from 16% in 2024 to 27% in 2025. That surge isn’t random.
It correlates with the explosive rise of AI use in development. As more teams hand over larger portions of code creation to AI without governance, the result is clear: risk is scaling at machine speed, too. Taking Back Control From the Evil Twin You can’t stop AI from reshaping your SDLC. But you can stop it from running rogue. Here’s how: 1. Establish Robust Governance for AI in Development Whitelist approved AI tools with built-in scanning and keep a lightweight approval workflow so devs don’t default to Shadow AI. Enforce provenance standards like SLSA or SBOMs for AI-generated code. Audit usage & tag AI contributions — use CodeQL to detect AI-generated code patterns and require devs to mark AI commits for transparency. This builds reliability and integrity into the audit trail. 2. Strengthen Supply Chain Oversight AI assistants are now pulling in OSS dependencies you didn’t choose — sometimes outdated, sometimes insecure, sometimes flat-out malicious. While your team already uses hygiene tools like Dependabot or Renovate, they’re only table stakes that don’t provide governance. They won’t tell you if AI just pulled in a transitive package with a critical vulnerability, or if your dependency chain is riddled with license risks. That’s why modern SCA is essential in the AI era. It goes beyond auto-bumping versions to: Generate SBOMs for visibility into everything AI adds to your repos. Analyze transitive dependencies several layers deep. Provide exploitable-path analysis so you prioritize what’s actually risky. Auto-updaters are hygiene. SCA is resilience. 3. Measure and Manage Debt Velocity Track debt velocity — measure how fast vulnerabilities are introduced and fixed across repos. Set sprint-based SLAs — if issues linger, AI will replicate them across projects before you’ve logged the ticket. Flag AI-generated commits for extra review to stop insecure patterns from multiplying. Adopt Agentic AI AppSec Assistants — The FOA report highlights that traditional remediation cycles can’t keep pace with machine-speed risk, making autonomous prevention and real-time remediation a necessity, not a luxury. 4. Foster a Culture of Reliable AI Use Train on AI risks like data poisoning and prompt injection. Make secure AI adoption part of the “definition of done.” Align incentives with delivery, not just speed. Create a reliable feedback loop — encourage devs to challenge governance rules that hurt productivity. Collaboration beats resistance. 5. Build Resilience for Legacy Systems Legacy apps are where your evil twin SDLC hides best. With years of accumulated debt and brittle architectures, AI-generated code can slip in undetected. These systems were built when cyber threats were far less sophisticated, lacking modern security features like multi-factor authentication, advanced encryption, and proper access controls. When AI is bolted onto these antiquated platforms, it doesn't just inherit the existing vulnerabilities, but can rapidly propagate insecure patterns across interconnected systems that were never designed to handle AI-generated code. The result is a cascade effect where a single compromised AI interaction can spread through poorly-secured legacy infrastructure faster than your security team can detect it. Here’s what’s often missed: Manual before automatic: Running full automation on legacy repos without a baseline can drown teams in false positives and noise. Start with manual SBOMs on the most critical apps to establish trust and accuracy, then scale automation. 
Triage by risk, not by age: Not every legacy system deserves equal attention. Prioritize repos with heavy AI use, repeated vulnerability patterns, or high business impact. Hybrid skills are mandatory: Devs need to learn how to validate AI-generated changes in legacy contexts, because AI doesn’t “understand” old frameworks. A dependency bump that looks harmless in 2025 might silently break a 2012-era API. Conclusion: Bring the ‘Evil Twin’ Back into the Family The “evil twin” of your SDLC isn’t going away. It’s already here, writing code, pulling dependencies, and shaping workflows. The question is whether you’ll treat it as an uncontrolled shadow pipeline — or bring it under the same governance and accountability as your human-led one. Because in today’s environment, you don’t just own the SDLC you designed. You also own the one AI is building — whether you control it or not. Interested to learn more about SDLC challenges in 2025 and beyond? More stats and insights are available in the Future of Appsec report mentioned above.
GitHub Copilot agent mode had several enhancements in VS Code as part of its July 2025 release, further bolstering its capabilities. The supported LLMs are getting better iteratively; however, both personal experience and academic research remain divided on future capabilities and gaps. I've had my own learnings exploring agent mode for the last few months, ever since it was released, and had the best possible outcomes with Claude Sonnet Models. After 18 years of building enterprise systems — ranging from integrating siloed COTS to making clouds talk, architecting IoT telemetry data ingestions and eCommerce platforms — I've seen plenty of "revolutionary" tools come and go. I've watched us transition from monoliths to microservices, from on-premises to cloud, from waterfall to agile. I've learned Java 1.4, .NET 9, and multiple flavors of JavaScript. Each transition revealed fundamental flaws in how we think about software construction. The integration of generative AI into software engineering is dominated by pattern matching and reasoning by analogy to past solutions. This approach is philosophically and practically flawed. There's active academic research that surfaces this problem, primarily the "Architectures of Error" framework that systematically differentiates the failure modes of human and AI-generated code. At the moment, I'm neither convinced by Copilot's capability nor have I found reasons to hate it. My focus in this article is more on the human-side errors that Agent Mode helps us recognize. Why This Isn't Just Another AI Tool Copilot's Agent Mode isn't just influencing how we build software — it's revealing why our current approaches are fundamentally flawed. The uncomfortable reality: Much of our architectural complexity exists because we've never had effective ways to encode and enforce design intent. We write architectural decision records that few read. We create coding standards that get violated under pressure. We design patterns that work beautifully when implemented correctly but fail catastrophically when they're not. Agent Mode surfaces this gap between architectural intent and implementation reality in ways we haven't experienced before. The Constraint Problem We've Been Avoiding Here's something I've learned from working on dozens of enterprise projects: Most architectural failures aren't technical failures — they're communication failures. We design a beautiful hexagonal architecture, document it thoroughly, and then watch as business pressure, tight deadlines, and human misunderstanding gradually erode it. By year three, what we have bears little resemblance to what we designed. C# // What we designed public class CustomerService : IDomainService<Customer> { // Clean separation, proper dependencies } // What we often end up with after several iterations public class CustomerService { // Direct database calls mixed with business logic // Scattered validation, unclear responsibilities // Works, but violates every architectural principle } Agent Mode forces us to confront this differently. AI can't read between the lines or make intuitive leaps. If our architectural constraints aren't explicit enough for an AI to follow, they probably aren't explicit enough for humans either. The Evolution from Documentation to Constraints In my experience, the most successful architectural approaches have moved progressively toward making correct usage easy and incorrect usage difficult. Early in my career, I relied heavily on documentation and code reviews. 
Later, I discovered the power of types, interfaces, and frameworks that guide developers toward correct implementations. Now, I'm exploring how to encode architectural knowledge directly into development tooling (and Copilot). C# // Evolution 1: Documentation-based (fragile) // "Please ensure all controllers inherit from BaseApiController" // Evolution 2: Framework-based (better) public abstract class BaseApiController : ControllerBase { // Common functionality, but still optional } // Evolution 3: Constraint-based (AI-compatible) public interface IApiEndpoint<TRequest, TResponse> where TRequest : IValidated where TResponse : IResult { // Impossible to create endpoints that bypass validation } The key insight: Each evolution makes architectural intent more explicit and mechanical. Agent Mode simply pushes us further along this path. We can work around most AI problems like the "AI 90/10 problem" arising from hallucinated APIs, non-existent libraries, context-window myopia, systematic pattern propagation, and model drift. LLM responses are probabilistic by nature, but their output can be made far more predictable by specifying constraints. Practical Implications Working with Agent Mode on real projects has revealed several practical patterns: 1. Requirement Specification Vague prompts produce (architecturally) inconsistent results. This isn't a limitation — it's feedback about the clarity of our thinking in any role across the SDLC, including the architect's. We struggled with the same problems with the advent of the outsourcing era, too. SaaS inherits this problem through its extensibility and flexibility. Markdown [BAD] Inviting infinite possibilities: "Create a service for managing customers relationship" [GOOD] More effective: "Create a CustomerService implementing IDomainService<Customer> with validation using FluentValidation and error handling via Result<T> pattern" 2. The Composability Test If AI struggles to combine your architectural patterns correctly, human developers probably do too. LLMs excel at pattern matching but fail at: Systematicity: Applying rules consistently across contexts. Productivity: Scaling to larger, more complex compositions. Substitutivity: Swapping components while maintaining correctness. Localism: Understanding global vs. local scope implications. This also helps to identify architectural complexity. 3. The Constraint Discovery Process Working with AI has helped me identify implicit assumptions in existing architectures that weren't previously explicit. These discoveries often lead to better human-to-human communication as well. The Skills That Remain Valuable Based on my experience so far, certain architectural skills have become more important now: Domain understanding: AI can generate technically correct code, but understanding business context and constraints remains fundamentally human. Pattern recognition: Identifying when existing patterns apply and when new ones are needed becomes crucial for defining AI constraints. System thinking: Understanding emergent behaviors and system-level properties remains beyond current AI capabilities. Trade-off analysis: Evaluating architectural decisions based on business context, team capabilities, and long-term maintainability. What's Actually Changing The shift isn't as dramatic as "AI replacing architects or developers."
It's more subtle: From implementation to intent: Less time writing boilerplate, more time clarifying what we actually want the system to do. From review to prevention: Instead of catching architectural violations in code review, we encode constraints that prevent them upfront. From documentation to automation: Architectural knowledge becomes executable rather than just descriptive. These changes feel significant to me, but they're evolutionary rather than revolutionary. Challenges I'm Still Working Through The learning curve: Developing fluency with constraint-driven development requires rethinking established habits. Team adoption: Not everyone is comfortable with AI-assisted development yet, and that's understandable. Tool maturity: Current AI tools are impressive but still have limitations around context understanding and complex reasoning. Validation strategies: Traditional testing approaches may not catch all AI-generated issues, so we're developing new validation patterns. A Measured Prediction Based on what I'm seeing, I expect a gradual shift over the next 3–5 years toward: More explicit architectural constraints in codebases. Increased automation of pattern enforcement. Enhanced focus on domain modeling and business rule specification. Evolution of code review practices to emphasize architectural composition over implementation details. This won't happen overnight, and it won't replace fundamental architectural thinking. But it will change how we express and enforce architectural decisions. What I'm Experimenting With Currently, I'm exploring: 1. Machine-readable architecture definitions that can guide both AI and human developers. JSON { "architecture": { "layers": ["Api", "Application", "Domain", "Infrastructure"], "dependencies": { "Api": ["Application"], "Application": ["Domain"], "Infrastructure": ["Domain"] }, "patterns": { "cqrs": { "commands": "Application/Commands", "queries": "Application/Queries", "handlers": "required" } } } } 2. Architectural testing frameworks that validate system composition automatically. C# [Test] public void Architecture_Should_Enforce_Layer_Dependencies() { var result = Types.InCurrentDomain() .That().ResideInNamespace("Api") .ShouldNot().HaveDependencyOn("Infrastructure") .GetResult(); Assert.That(result.IsSuccessful, result.FailingTypes); } [Test] public void AI_Generated_Services_Should_Follow_Naming_Conventions() { var services = Types.InCurrentDomain() .That().AreClasses() .And().ImplementInterface(typeof(IDomainService)) .Should().HaveNameEndingWith("Service") .GetResult(); Assert.That(services.IsSuccessful); } 3. Constraint libraries that make common patterns easy to apply correctly, starting with domain primitives. C# // Instead of generic controllers, define domain-specific primitives public abstract class DomainApiController<TEntity, TDto> : ControllerBase where TEntity : class, IEntity where TDto : class, IDto { // Constrained template that AI can safely compose } // Service registration primitive public static class ServiceCollectionExtensions { public static IServiceCollection AddDomainService<TService, TImplementation>( this IServiceCollection services) where TService : class where TImplementation : class, TService { // Validated, standard registration pattern return services.AddScoped<TService, TImplementation>(); } } 4. Documentation approaches that work well with AI-assisted development. An example is documenting the architecture using the Arc42 template in Markdown, with diagrams in Mermaid embedded in the Markdown.
Early results are promising, but there's still much to learn and explore. Looking Forward After 18 years in this field, I've learned to be both optimistic about new possibilities and realistic about the pace of change. VS Code Agent Mode represents an interesting step forward in human-AI collaboration for software development. It's not a silver bullet, but it is a useful tool that can help us build better systems — if we approach it thoughtfully. The architectures that thrive in an AI-assisted world won't necessarily be the most sophisticated ones. They'll be the ones that most clearly encode human insight in ways that both AI and human developers can understand and extend. That's a worthy goal, regardless of the tools we use to achieve it. Final Thoughts The most valuable architectural skill has always been clarity of thought about complex systems. AI tools like Agent Mode don't change this fundamental requirement — they just give us new ways to express and validate that clarity. As we navigate this transition, the architects who succeed will be those who remain focused on the essential questions: What are we trying to build? Why does it matter? How can we make success more likely than failure? The tools continue to evolve, but these questions remain constant. I'm curious about your experiences with AI-assisted development. What patterns are you seeing? What challenges are you facing? The best insights come from comparing experiences across different contexts and domains.
It is not uncommon for back-end software to have a configuration file to start up with. These are generally YAML or JSON files, which are loaded by the system while starting up and are then used to set up the initial configuration for the system. Values included here may affect business logic or infrastructure. Let us create a new service called DumplingSale (because I love dumplings, or as we call them, momos). This service is used for managing the sales of dumplings. As an example, take a look at the YAML file below, used to start the service. YAML # Production configuration for DumplingSale Java web application # prod.yaml redis: host: redis-prod.example.com port: 6379 password: ${REDIS_PASSWORD} timeout: 3000ms logging: level: com.example.dumplingsale: INFO org.springframework.web: WARN file: name: /var/log/dumpling-sale/application.log max-size: 100MB max-history: 30 # Section we might want to change dynamically dumpling-sale-config: max-orders-per-minute: 100 order-timeout-minutes: 30 enable-analytics: true payment-provider-config: active-provider: "your-payment-provider.com" transaction-timeout-seconds: 10 retry-attempts: 3 default-currency: "USD" api-version: "2024-05-26" enable-3ds-secure: true webhook-verification-enabled: true In the above file, let’s say we wanted to change our dumpling sale section dynamically. This could involve changing the order timeout minutes to 15 in times of increased sales, or reducing max orders per minute if the kitchen is backed up. With static configuration, the configuration has to be changed by making a code change and then deploying it to our servers again. This would likely involve a restart of your servers. If there is a separation of code and config, you could possibly keep the code the same, but the servers would still need to be restarted. However, in a dynamic configuration system, we could change the config in one place and have it changed across all our servers. Configuration Use Cases Allowlists and blocklists: Configs can allow you to manage allowlists or blocklists, and dynamically update them as your service runs. Performance tuning: You can change the number of threads, timeouts, workers, endpoints, etc., without having to restart your application. Flags: Think of any flags you pass to your application, and you could change them dynamically. In this article, we will follow the above DumplingSale example and modify the payment provider and the dumpling sale config dynamically. Types of Config Delivery There are broadly two types of config delivery: push or pull. Push config delivery: In this system, the config mechanism pushes the configuration out to all applications. Pull config delivery: In this system, the config mechanism waits and responds with the configuration when polled by your application. In this example, we will be using a pull delivery system. Data Structures We will have a parent data structure called dynamic configuration, but we will have child data structures for each config type you wish to support. I will be using Java to explain the example here, but feel free to use a language of your choice.
Java import lombok.Data; import lombok.Builder; import java.util.List; import java.util.ArrayList; // Lombok Annotations @Data @Builder public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } Creating the Cache Next, you will need to create a cache to fetch these configs. We are using the Guava Cache. Java public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } Loading Contents for the Cache As you can see above, we created a cache that returns data of the type AppConfigData. However, the cache needs to fetch this data from somewhere as well, right? So, we need a data source from which the dynamic configuration data can be loaded. Here are your options: Remote file: A remote file pulled by the servers. Could be stored in AWS S3, GCS, or any other object or file storage system you may have access to. Pros: Fast and easy deployment. Your object/file system may offer version history and audit logs. Cons: Not great tracking of versions or comparison of config across versions. Remote database: Pros: All databases come with a great set of libraries and tools to integrate easily. Cons: Not great tracking of versions or comparison of config across versions. Depending on the database, unlikely to have auditing. Unless a custom version-based solution is created, no versioning. [Recommended] Cloud config management system: Such as AWS AppConfig, Azure App Configuration, or GCP Firebase Remote Config. Pros: Fast, standardized rollout mechanisms: can use both push and pull methods with slow/fast rollout across your service. Deploy changes across a variety of targets together, including compute, containers, docs, mobile applications, and serverless applications. Cons: You need to read the rest of this article to know how to use these. Creating Configuration Using AWS AppConfig Let’s use AWS AppConfig for this example. All the cloud solutions are great services, and we need just one to learn how to create this config. AWS AppConfig walks you through a series of steps designed to keep config deployments safe and stable. 1. Choose config type: AWS allows you to use feature flags or freeform configs. We will choose freeform configs in this example to simplify config creation. 2. Choose a config name: my-config (I am keeping it simple.) 3. Choose config source: YAML dumplingSaleConfig: maxOrdersPerMinute: 100 orderTimeoutMinutes: 30 enableAnalytics: true paymentProviderConfig: activeProvider: "your-payment-provider" transactionTimeoutSeconds: 10 retryAttempts: 3 defaultCurrency: "USD" apiVersion: "2024-05-26" enable3dsSecure: true Note that the keys match the field names of the Java classes, so the YAML can be mapped onto them when it is parsed. Simply save this to an application (to keep it simple here: my-application).
Now, let's save and deploy. I made the following choices: Environment: I created an environment called prod. Feel free to create as many as you need. Hosted config version: Right now, we will have only one version. In the future, you can choose to change the ‘latest’ version and deploy whichever version you would like. Deployment strategy: This is crucial. For simplicity, I have chosen ‘All at once.’ However, that is not always the best strategy, as you may want to roll out slowly, observe how your service is performing, and roll back if necessary. You can read about other strategies here. Once your deployment is complete, the configuration will be available for your application to fetch. Fetching Configuration Using AWS AppConfig Java private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Convert ByteBuffer to String String fatYaml = IOUtils.toString(configurationContent.asInputStream(), StandardCharsets.UTF_8); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } Bringing It All Together Java import lombok.AllArgsConstructor; import lombok.Builder; import lombok.Data; import lombok.NoArgsConstructor; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; import org.yaml.snakeyaml.Yaml; import com.amazonaws.services.appconfig.AWSAppConfig; import com.amazonaws.services.appconfig.model.GetLatestConfigurationRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionRequest; import com.amazonaws.services.appconfig.model.StartConfigurationSessionResult; import com.amazonaws.services.appconfig.model.GetLatestConfigurationResult; import com.amazonaws.util.IOUtils; // For converting ByteBuffer to String import java.nio.ByteBuffer; import java.util.concurrent.TimeUnit; import java.io.IOException; import java.nio.charset.StandardCharsets; @Data @Builder @NoArgsConstructor @AllArgsConstructor public class PaymentProviderConfig { private String activeProvider; private int transactionTimeoutSeconds; private int retryAttempts; private String defaultCurrency; private String apiVersion; private boolean enable3dsSecure; private boolean webhookVerificationEnabled; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class DumplingSaleConfig { private int maxOrdersPerMinute; private int orderTimeoutMinutes; private boolean enableAnalytics; } @Data @Builder @NoArgsConstructor @AllArgsConstructor public class AppConfigData { private DumplingSaleConfig dumplingSaleConfig; private PaymentProviderConfig paymentProviderConfig; } public class AppConfigManager { private final LoadingCache<String, AppConfigData> appConfigCache; private final Yaml yaml; private final AWSAppConfig awsAppConfig; private String configurationToken; // To store the session token for subsequent fetches public AppConfigManager(AWSAppConfig awsAppConfig) { this.yaml = new Yaml(); this.awsAppConfig = awsAppConfig; this.appConfigCache = CacheBuilder.newBuilder() .refreshAfterWrite(5, TimeUnit.MINUTES) .build(new CacheLoader<String, AppConfigData>() { @Override public AppConfigData load(String key) throws Exception { return fetchConfigFromAppConfig(); } }); } private AppConfigData fetchConfigFromAppConfig() throws Exception { // 1. Start a configuration session to get a token StartConfigurationSessionRequest sessionRequest = new StartConfigurationSessionRequest() .withApplicationIdentifier("my-application") .withConfigurationProfileIdentifier("my-config") .withEnvironmentIdentifier("prod") .withRequiredMinimumPollIntervalInSeconds(30); // Recommended to set a minimum poll interval StartConfigurationSessionResult sessionResult = awsAppConfig.startConfigurationSession(sessionRequest); this.configurationToken = sessionResult.getInitialConfigurationToken(); // 2. Get the latest configuration using the token GetLatestConfigurationRequest configRequest = new GetLatestConfigurationRequest() .withConfigurationToken(configurationToken); GetLatestConfigurationResult configResult = awsAppConfig.getLatestConfiguration(configRequest); ByteBuffer configurationContent = configResult.getConfiguration(); if (configurationContent == null) { throw new IOException("No configuration content received from AWS AppConfig."); } // 3. Convert ByteBuffer to String String fatYaml = IOUtils.toString(configurationContent.asInputStream(), StandardCharsets.UTF_8); // 4. 
Parse YAML into the AppConfigData class return yaml.loadAs(fatYaml, AppConfigData.class); } public AppConfigData getAppConfig() { try { return appConfigCache.get("appConfig"); } catch (Exception e) { System.err.println("Error loading app config: " + e.getMessage()); return null; } } } Using the Config Now, let’s say you had to check whether the orders-per-minute limit was breached and make a decision based on that; you could simply use this config manager to read the current limit. Java import lombok.AllArgsConstructor; @AllArgsConstructor public class OrderRateLimiter { private final AppConfigManager appConfigManager; public boolean isOrderLimitExceeded(int currentOrdersThisMinute) { AppConfigData appConfig = appConfigManager.getAppConfig(); if (appConfig == null || appConfig.getDumplingSaleConfig() == null) { System.err.println("DumplingSaleConfig not available from AppConfigManager."); return false; } DumplingSaleConfig config = appConfig.getDumplingSaleConfig(); return currentOrdersThisMinute > config.getMaxOrdersPerMinute(); } } As you see above, you can simply fetch the config you want, without worrying about where it is coming from (the cache or AWS AppConfig), and make decisions based on it. Key Takeaways Using the LoadingCache allows for: Faster retrieval. Thread safety, since the cache handles its own refresh logic, and any number of calls to the cache can be easily handled. Hands-off management of value refresh. Low cost even with very high retrieval rates: as an example, let’s say you have 100 servers running the application, each needing the config 500 times per second; you will still only be billed for 100 * 12 = 1,200 requests per hour, since you refresh the cache every 5 minutes, as opposed to 100 * 500 * 3600 = 180 million requests per hour if you didn’t have a cache. Low network utilization, since requests are served locally. Higher availability, in case the config service is down. While using cloud-based config management systems allows for: Easier management of the config lifecycle. Better rollout strategies. Centralized management. Now, you are ready to create your own distributed, cloud-based dynamic configurations.
Keeping track of AWS spend is very important, especially since it’s so easy to create resources: you might forget to turn off an EC2 instance or container you started, or to remove a CDK stack for a specific experiment. Costs can creep up fast if you don’t put guardrails in place. Recently, I had to set up budgets across multiple AWS accounts for my team. Along the way, I learned a few gotchas (especially around SNS and KMS policies) that weren’t immediately clear to me as I started out writing AWS CDK code. In this post, we’ll go through how to: Create AWS Budgets with AWS CDK. Send notifications via email and SNS. Handle cases like encrypted topics and configuring resource policies. If you’re setting up AWS Budgets for the first time, I hope this post will save you some trial and error. What Are AWS Budgets? AWS Budgets is part of AWS Billing and Cost Management. It lets you set guardrails for spend and usage limits. You can define a budget around cost, usage, or even commitment plans (like Reserved Instances and Savings Plans) and trigger alerts when you cross a threshold. You can think of Budgets as your planned spend tracker. Budgets are great for: Alerting when costs hit predefined thresholds (e.g., 80% of your budgeted spend). Driving team accountability by tying alerts to product or account owners. Enforcing a cap on monthly spend, triggering an action, and shutting down compute (EC2) if you go over budget (be careful with this). Keep in mind that budgets and their notifications are not instant. AWS billing data is processed multiple times a day, but you might trigger your budget a couple of hours after you’ve passed your threshold. This is clearly stated in the AWS documentation as: AWS billing data, which Budgets uses to monitor resources, is updated at least once per day. Keep in mind that budget information and associated alerts are updated and sent according to this data refresh cadence. Defining Budgets With AWS CDK You can create different kinds of budgets, depending on your requirements. Some examples are: Fixed budgets: Set one amount to monitor every budget period. Planned budgets: Set different amounts to monitor each budget period. Auto-adjusting budgets: Set a budget amount to be adjusted automatically based on the spending pattern over a time range that you specify. We’ll start with a simple example of how you can create a budget in the CDK. We’ll go for a fixed budget of $100. The AWS CDK currently only has Level 1 constructs available for budgets, which means that the classes in the CDK are a 1-to-1 mapping to the CloudFormation resources. Because of this, you will have to explicitly define all required properties (constructs, IAM policies, resource policies, etc.), which could otherwise be taken care of by a CDK L2 construct. It also means your CDK code will be a bit more verbose. We’ll start by using the CfnBudget construct. TypeScript new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', { budget: { budgetType: 'COST', budgetLimit: {amount: 100, unit: 'USD'}, budgetName: 'Monthly Costs Budget', timeUnit: 'MONTHLY' } }); In the above example, we’ve created a budget with a limit of $100 per month. A budget alone isn’t very useful. You’d still have to check the AWS console manually to see what your spend is compared to your budget. The important thing is that we want to get notified when our actual or forecasted spend reaches our threshold, so let’s add a notification and a subscriber.
TypeScript

new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', {
  budget: {
    budgetType: 'COST',
    budgetLimit: { amount: 100, unit: 'USD' },
    budgetName: 'Monthly Costs Budget',
    timeUnit: 'MONTHLY'
  },
  notificationsWithSubscribers: [{
    notification: {
      comparisonOperator: 'GREATER_THAN',
      notificationType: 'FORECASTED',
      threshold: 100,
      thresholdType: 'PERCENTAGE'
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: '<your-email-address>'
    }]
  }]
});

Based on these notification settings, interested parties are notified when spend is forecasted to exceed 100% of our defined budget limit; you can base a notification on either forecasted or actual percentages. When that happens, an email is sent to the designated email address. Subscribers, at the time of writing, can be either email recipients or a Simple Notification Service (SNS) topic. In the code example above, we use email subscribers, for which you can add up to 10 recipients.

Depending on your team or organization, it might be beneficial to switch to using an SNS topic. The advantage of an SNS topic over a set of email subscribers is that you can add different kinds of subscribers (email, chat, custom Lambda functions) to the topic. With an SNS topic, you have a single place to configure subscribers, and if you change your mind, you can do so in one place instead of updating all budgets. Using an SNS topic also allows you to push budget notifications to, for instance, a chat client like MS Teams or Slack. In this case, we will use SNS in combination with email subscribers. Let's start by defining an SNS topic with the AWS CDK.

TypeScript

// Create a topic for email notifications
let topic = new Topic(this, 'budget-notifications-topic', {
  topicName: 'budget-notifications-topic'
});

Now, let's add an email subscriber, as this is the simplest way to receive budget notifications.

TypeScript

// Add email subscription
topic.addSubscription(new EmailSubscription("your-email-address"));

This looks pretty straightforward, and you might think you're done, but there is one important step to take next, which I initially forgot. The AWS Budgets service needs to be granted permission to publish messages to the topic. To do this, we add a resource policy to the topic that allows the Budgets service to call the SNS:Publish action for our topic.

TypeScript

// Add resource policy to allow the Budgets service to publish to the SNS topic
topic.addToResourcePolicy(new PolicyStatement({
  actions: ["SNS:Publish"],
  effect: Effect.ALLOW,
  principals: [new ServicePrincipal("budgets.amazonaws.com")],
  resources: [topic.topicArn],
  conditions: {
    ArnEquals: {
      'aws:SourceArn': `arn:aws:budgets::${Stack.of(this).account}:*`,
    },
    StringEquals: {
      'aws:SourceAccount': Stack.of(this).account,
    },
  },
}));

Now, let's assign the SNS topic as a subscriber in our CDK code.

TypeScript

// Define a fixed budget with SNS as subscriber
new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', {
  budget: {
    budgetType: 'COST',
    budgetLimit: { amount: 100, unit: 'USD' },
    budgetName: 'Monthly Costs Budget',
    timeUnit: 'MONTHLY'
  },
  notificationsWithSubscribers: [{
    notification: {
      comparisonOperator: 'GREATER_THAN',
      notificationType: 'FORECASTED',
      threshold: 100,
      thresholdType: 'PERCENTAGE'
    },
    subscribers: [{
      subscriptionType: 'SNS',
      address: topic.topicArn
    }]
  }]
});
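Since the budget now publishes to the topic, you could optionally fan notifications out to the other subscriber types mentioned above, for example a Lambda function that forwards them to a chat webhook (MS Teams, Slack). The snippet below is only a sketch of that idea and is not part of the original setup: the forwarder function and its asset path lambda/forward-to-chat are hypothetical, and the webhook call itself is left to you.

TypeScript

import * as lambda from 'aws-cdk-lib/aws-lambda';
import { LambdaSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';

// Hypothetical Lambda that receives budget notifications from the topic
// and posts them to a chat webhook (handler implementation not shown).
const chatForwarder = new lambda.Function(this, 'budget-chat-forwarder', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda/forward-to-chat'),
});

// Subscribe the function to the same topic the budget publishes to;
// the construct also grants SNS permission to invoke the Lambda.
topic.addSubscription(new LambdaSubscription(chatForwarder));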
Working With Encrypted Topics

If you have an SNS topic with encryption enabled (via KMS), you will need to make sure that the Budgets service also has access to the KMS key. If you don't, you will not get any messages, and as far as I could tell, you will see no errors (at least I could find none in CloudTrail). I actually wasted a couple of hours trying to figure this part out. I should have read the documentation, which states this explicitly. I guess I should start with the docs instead of diving right into the AWS CDK code.

TypeScript

// Create KMS key used for encryption
let key = new Key(this, 'sns-kms-key', {
  alias: 'sns-kms-key',
  enabled: true,
  description: 'Key used for SNS topic encryption'
});

// Create topic and assign the KMS key
let topic = new Topic(this, 'budget-notifications-topic', {
  topicName: 'budget-notifications-topic',
  masterKey: key
});

Now, let's add the resource policy to the key and try to trim down the permissions as much as possible.

TypeScript

// Allow access from the Budgets service
key.addToResourcePolicy(new PolicyStatement({
  effect: Effect.ALLOW,
  actions: ["kms:GenerateDataKey*", "kms:Decrypt"],
  principals: [new ServicePrincipal("budgets.amazonaws.com")],
  resources: ["*"],
  conditions: {
    StringEquals: {
      'aws:SourceAccount': Stack.of(this).account,
    },
    ArnLike: {
      "aws:SourceArn": "arn:aws:budgets::" + Stack.of(this).account + ":*"
    }
  }
}));

Putting It All Together

If you've configured everything correctly and deployed your stack to your target account, you should be good to go. Once you cross a threshold, you should receive an email notifying you that your budget has exceeded it (depending on the threshold you set).

Summary

In this post, we explored how to create AWS Budgets with AWS CDK and send notifications through email or SNS. Along the way, we covered some important points:

Budgets alone aren't useful until you add notifications.
SNS topics need a resource policy so the Budgets service can publish.
Encrypted topics require KMS permissions for the Budgets service.

With these pieces in place, you'll have a setup that alerts your team when costs exceed thresholds via email, chat, or custom integrations. A fully working CDK application with the code mentioned in this blog post can be found in the following GitHub repo.
Building on what we started in an earlier article, here we're going to learn how to extend our platform and create a platform abstraction for provisioning an AWS EKS cluster. EKS is AWS's managed Kubernetes offering.

Quick Refresher

Crossplane is a Kubernetes CRD-based add-on that abstracts cloud implementations and lets us manage infrastructure as code.

Prerequisites

Set up Docker Kubernetes.
Follow the Crossplane installation from the previous article.
Follow the provider configuration from the previous article.
Apply all the network YAMLs from the previous article (including the updated network composition discussed later). This will create the necessary network resources for the EKS cluster.

Some Plumbing

When creating an EKS cluster, AWS needs to:

Spin up the control plane (managed by AWS)
Attach security groups
Configure networking (ENIs, etc.)
Access the VPC and subnets
Manage API endpoints
Interact with other AWS services (e.g., CloudWatch for logging, Route53)

To do this securely, AWS requires an IAM role that it can assume. We create that role here and reference it during cluster creation; details are provided below. Without this role, you'll get errors like "access denied" when creating the cluster.

Steps to Create the AWS IAM Role

Log in to the AWS Console and go to the IAM page.
In the left sidebar, click Roles.
Click Create Role.
Choose AWS service as the trusted entity type.
Select the EKS use case and choose EKS Cluster.
Attach the following policies:
AmazonEKSClusterPolicy
AmazonEKSServicePolicy
AmazonEC2FullAccess
AmazonEKSWorkerNodePolicy
AmazonEC2ContainerRegistryReadOnly
AmazonEKS_CNI_Policy
Provide the name eks-crossplane-cluster and optionally add tags.

Since we'll also create NodeGroups, which require additional permissions, for simplicity I'm granting the Crossplane user (created in the previous article) permission to PassRole for the Crossplane cluster role. This permission allows the user to tell AWS services (EKS) to assume the Crossplane cluster role on its behalf. Basically, this user can say, "Hey, EKS service, create a node group and use this role when doing it." To accomplish this, add the following inline policy to the Crossplane user:

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::914797696655:role/eks-crossplane-cluster"
    }
  ]
}

Note: Typically, to follow the principle of least privilege, you should separate roles with dedicated policies:

A control plane role with EKS admin permissions
A node role with permissions for node group creation

In the previous article, I had created only one subnet in the network composition, but the EKS control plane requires at least two AZs, with one subnet per AZ. You should therefore modify the network composition from the previous article to add another subnet. To do so, just add the following to the network composition YAML, and don't forget to apply the composition and claim to re-create the network.
YAML

- name: subnet-b
  base:
    apiVersion: ec2.aws.upbound.io/v1beta1
    kind: Subnet
    spec:
      forProvider:
        cidrBlock: 10.0.2.0/24
        availabilityZone: us-east-1b
        mapPublicIpOnLaunch: true
        region: us-east-1
      providerConfigRef:
        name: default
  patches:
    - fromFieldPath: status.vpcId
      toFieldPath: spec.forProvider.vpcId
      type: FromCompositeFieldPath
    - fromFieldPath: spec.claimRef.name
      toFieldPath: spec.forProvider.tags.Name
      type: FromCompositeFieldPath
      transforms:
        - type: string
          string:
            fmt: "%s-subnet-b"
    - fromFieldPath: status.atProvider.id
      toFieldPath: status.subnetIds[1]
      type: ToCompositeFieldPath

We will also need a provider that supports EKS resource creation. To create the necessary provider, save the following content into a .yaml file:

YAML

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2
  controllerConfigRef:
    name: default

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

Crossplane Composite Resource Definition (XRD)

Below, we're going to build a Composite Resource Definition for the EKS cluster. Before diving in, one thing to note: if you've already created the network resources using the previous article, you may have noticed that the network composition includes a patch that places the subnet ID into the composite resource's status, specifically under status.subnetIds[0]. This value comes from the cloud provider's Subnet resource and is needed by other compositions, such as the XCluster composition we build here. By placing it in the status field, the network composition makes it possible for other Crossplane compositions to reference and use it.

Similar to what we did for network creation in the previous article, we're going to create a Crossplane XRD, a Crossplane Composition, and finally a Claim that will result in the creation of an EKS cluster. At the end, I've included a table that serves as an analogy to help illustrate the relationship between the Composite Resource Definition (XRD), Composite Resource (XR), Composition, and Claim. To create an EKS XRD, save the following content into a .yaml file:

YAML

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xclusters.aws.platformref.crossplane.io
spec:
  group: aws.platformref.crossplane.io
  names:
    kind: XCluster
    plural: xclusters
  claimNames:
    kind: Cluster
    plural: clusters
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - parameters
              properties:
                parameters:
                  type: object
                  required:
                    - region
                    - roleArn
                    - networkRef
                  properties:
                    region:
                      type: string
                      description: AWS region to deploy the EKS cluster in.
                    roleArn:
                      type: string
                      description: IAM role ARN for the EKS control plane.
                    networkRef:
                      type: object
                      description: Reference to a pre-created XNetwork.
                      required:
                        - name
                      properties:
                        name:
                          type: string
            status:
              type: object
              properties:
                network:
                  type: object
                  required:
                    - subnetIds
                  properties:
                    subnetIds:
                      type: array
                      items:
                        type: string

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

Crossplane Composition

The Composition is the implementation; it tells Crossplane how to build all the underlying resources (control plane, NodeGroup).
To create an EKS composition, save the content below into a .yaml file:

YAML

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: cluster.aws.platformref.crossplane.io
spec:
  compositeTypeRef:
    apiVersion: aws.platformref.crossplane.io/v1alpha1
    kind: XCluster
  resources:
    - name: network
      base:
        apiVersion: aws.platformref.crossplane.io/v1alpha1
        kind: XNetwork
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.networkRef.name
          toFieldPath: metadata.name
        - type: ToCompositeFieldPath
          fromFieldPath: status.subnetIds
          toFieldPath: status.network.subnetIds
        - type: ToCompositeFieldPath
          fromFieldPath: status.subnetIds[0]
          toFieldPath: status.network.subnetIds[0]
      readinessChecks:
        - type: None
    - name: eks
      base:
        apiVersion: eks.aws.crossplane.io/v1beta1
        kind: Cluster
        spec:
          forProvider:
            region: us-east-1
            roleArn: ""
            resourcesVpcConfig:
              subnetIds: []
              endpointPrivateAccess: true
              endpointPublicAccess: true
          providerConfigRef:
            name: default
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.region
          toFieldPath: spec.forProvider.region
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.roleArn
          toFieldPath: spec.forProvider.roleArn
        - type: FromCompositeFieldPath
          fromFieldPath: status.network.subnetIds
          toFieldPath: spec.forProvider.resourcesVpcConfig.subnetIds
    - name: nodegroup
      base:
        apiVersion: eks.aws.crossplane.io/v1alpha1
        kind: NodeGroup
        spec:
          forProvider:
            region: us-east-1
            clusterNameSelector:
              matchControllerRef: true
            nodeRole: ""
            subnets: []
            scalingConfig:
              desiredSize: 2
              maxSize: 3
              minSize: 1
            instanceTypes:
              - t3.medium
            amiType: AL2_x86_64
            diskSize: 20
          providerConfigRef:
            name: default
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.region
          toFieldPath: spec.forProvider.region
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.roleArn
          toFieldPath: spec.forProvider.nodeRole
        - type: FromCompositeFieldPath
          fromFieldPath: status.network.subnetIds
          toFieldPath: spec.forProvider.subnets

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

Claim

I'm taking the liberty of explaining the claim in more detail here. First, it's important to note that a claim is an entirely optional entity in Crossplane. It is essentially a Kubernetes Custom Resource Definition (CRD) that the platform team can expose to application developers as a self-service interface for requesting infrastructure, such as an EKS cluster. Think of it as an API payload: a lightweight, developer-friendly abstraction layer. In the earlier CompositeResourceDefinition (XRD), we created the kind XCluster. But by using a claim, application developers can interact with a much simpler and more intuitive kind like Cluster instead of XCluster.

For simplicity, I have referenced the XNetwork composite resource name directly instead of the Network claim resource name. When Crossplane creates the XNetwork resource, it appends random characters to the claim name. As an additional step, you'll need to retrieve the actual XNetwork name from the Kubernetes API and use it here. While there are ways to automate this process, I'm keeping it simple here; let me know in the comments if there is interest, and I'll write more about how to automate it.

To create a claim, save the content below into a .yaml file. Please note the roleArn referenced in it; that is the role mentioned earlier, which AWS uses to create other resources.
YAML

apiVersion: aws.platformref.crossplane.io/v1alpha1
kind: Cluster
metadata:
  name: demo-cluster
  namespace: default
spec:
  parameters:
    region: us-east-1
    roleArn: arn:aws:iam::914797696655:role/eks-crossplane-cluster
    networkRef:
      name: crossplane-demo-network-jpv49 # <important> this is how the EKS composition refers to the network created earlier; note the random characters "jpv49" from the XNetwork name

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

After this, you should see an EKS cluster in your AWS console (make sure you are looking in the correct region). If there are any issues, look for errors in the composite and managed resources. You can inspect them using:

Shell

# Get XCluster details; look for reconciliation errors or messages.
# You will also find references to the managed resources.
kubectl get XCluster demo-cluster -o yaml

# Check the status of a managed resource, for example:
kubectl get Cluster.eks.aws.crossplane.io

As I mentioned before, below is a table where I attempt to provide another analogy for the various components used in Crossplane:

Component: Analogy
XRD: The interface, or blueprint, for a product; it defines what knobs users can turn.
XR (XCluster): A specific product instance with user-provided values.
Composition: The function that implements all the details of the product.
Claim: A customer-friendly interface for ordering the product, or an API payload.

Patch

I also want to explain an important concept we've used in our Composition: patching. You may have noticed the patches field in the .yaml blocks. In Crossplane, a composite resource is the high-level abstraction we define; in our case, that's XCluster. Managed resources are the actual cloud resources Crossplane provisions on our behalf, for example, the AWS EKS Cluster and NodeGroup. A patch in a Crossplane Composition is a way to copy or transform data between the composite resource (XCluster) and the managed resources (Cluster, NodeGroup, etc.). Patching allows us to map values like region, roleArn, and names from the high-level composite to the actual underlying infrastructure, ensuring that developer inputs (or platform-defined parameters) flow all the way down to the cloud resources.

Conclusion

Using Crossplane, you can build powerful abstractions that shield developers from the complexities of infrastructure, allowing them to focus on writing application code. These abstractions can also be made cloud-agnostic, enabling benefits like portability, cost optimization, resilience and redundancy, and greater standardization.