The Missing Infrastructure Layer: Why AI's Next Evolution Requires Distributed Systems Thinking

By John Vester
The recent announcement of KubeMQ-Aiway caught my attention not as another AI platform launch, but as validation of a trend I've been tracking across the industry. After spending the last two decades building distributed systems and the past three years deep in AI infrastructure consulting, the patterns are becoming unmistakable: we're at the same inflection point that microservices faced a decade ago.

The Distributed Systems Crisis in AI

We've been here before. In the early 2010s, as monolithic architectures crumbled under scale pressures, we frantically cobbled together microservices with HTTP calls and prayed our systems wouldn't collapse. It took years to develop proper service meshes, message brokers, and orchestration layers that made distributed systems reliable rather than just functional. The same crisis is unfolding with AI systems, but the timeline is compressed. Organizations that started with single-purpose AI models are rapidly discovering they need multiple specialized agents working in concert, and their existing infrastructure simply wasn't designed for this level of coordination complexity.

Why Traditional Infrastructure Fails AI Agents

Across my consulting engagements, I'm seeing consistent patterns of infrastructure failure when organizations try to scale AI beyond proofs of concept:

HTTP communication breaks down: Traditional request-response patterns work for stateless operations but fail when AI agents need to maintain context across extended workflows, coordinate parallel processing, or handle operations that take minutes rather than milliseconds. The synchronous nature of HTTP creates cascading failures that bring down entire AI workflows.

Context fragmentation destroys intelligence: AI agents aren't just processing data — they're maintaining conversational state and building accumulated knowledge. When that context gets lost at service boundaries or fragmented across sessions, the system's collective intelligence degrades dramatically.

Security models are fundamentally flawed: Most AI implementations share credentials through environment variables or configuration files. This creates lateral movement risks and privilege escalation vulnerabilities that traditional security models weren't designed to handle.

Architectural constraints force bad decisions: Tool limitations in current AI systems force teams into anti-patterns, such as building meta-tools, fragmenting capabilities, or implementing complex dynamic loading mechanisms. Each workaround introduces new failure modes and operational complexity.

Evaluating the KubeMQ-Aiway Technical Solution

KubeMQ-Aiway is “the industry’s first purpose-built connectivity hub for AI agents and Model-Context-Protocol (MCP) servers. It enables seamless routing, security, and scaling of all interactions — whether synchronous RPC calls or asynchronous streaming — through a unified, multi-tenant-ready infrastructure layer.” In other words, it’s the hub that manages and routes messages between systems, services, and AI agents. Through their early access program, I recently explored KubeMQ-Aiway's architecture. Several aspects stood out as particularly well-designed for these challenges:

Unified aggregation layer: Rather than forcing point-to-point connections between agents, they've created a single integration hub that all agents and MCP servers connect through. This is architecturally sound — it eliminates the N-squared connection problem that kills system reliability at scale. More importantly, it provides a single point of control for monitoring, security, and operational management.

Multi-pattern communication architecture: The platform supports both synchronous and asynchronous messaging natively, with pub/sub patterns and message queuing built in. This is crucial because AI workflows aren't purely request-response — they're event-driven processes that need fire-and-forget capabilities, parallel processing, and long-running operations. The architecture includes automatic retry mechanisms, load balancing, and connection pooling that are essential for production reliability.

Virtual MCP implementation: This is particularly clever — instead of trying to increase tool limits within existing LLM constraints, they've abstracted tool organization at the infrastructure layer. Virtual MCPs allow logical grouping of tools by domain or function while presenting a unified interface to the AI system. It's the same abstraction pattern that made container orchestration successful.

Role-based security model: The built-in moderation system implements proper separation of concerns with consumer and administrator roles. More importantly, it handles credential management at the infrastructure level rather than forcing applications to manage secrets. This includes end-to-end encryption, certificate-based authentication, and comprehensive audit logging — security patterns that are proven in distributed systems but rarely implemented correctly in AI platforms.

Technical Architecture Deep Dive

What also impresses me is their attention to distributed systems fundamentals:

Event sourcing and message durability: The platform maintains a complete audit trail of agent interactions, which is essential for debugging complex multi-agent workflows. Unlike HTTP-based systems, where you lose interaction history, this enables replay and analysis capabilities that are crucial for production systems.

Circuit breaker and backpressure patterns: Built-in failure isolation prevents cascade failures when individual agents malfunction or become overloaded. The backpressure mechanisms ensure that fast-producing agents don't overwhelm slower downstream systems — a critical capability when dealing with AI agents that can generate work at unpredictable rates.

Service discovery and health checking: Agents can discover and connect to other agents dynamically without hardcoded endpoints. The health checking ensures that failed agents are automatically removed from routing tables, maintaining system reliability.

Context preservation architecture: Perhaps most importantly, they've solved the context management problem that plagues most AI orchestration attempts. The platform maintains conversational state and working memory across agent interactions, ensuring that the collective intelligence of the system doesn't degrade due to infrastructure limitations.

Production Readiness Indicators

From an operational perspective, KubeMQ-Aiway demonstrates several characteristics that distinguish production-ready infrastructure from experimental tooling:

Observability: Comprehensive monitoring, metrics, and distributed tracing for multi-agent workflows. This is essential for operating AI systems at scale, where debugging requires understanding complex interaction patterns.

Scalability design: The architecture supports horizontal scaling of both the infrastructure layer and individual agents without requiring system redesign. This is crucial as AI workloads are inherently unpredictable and bursty.

Operational simplicity: Despite the sophisticated capabilities, the operational model is straightforward — agents connect to a single aggregation point rather than requiring complex service mesh configurations.

Market Timing and Competitive Analysis

The timing of this launch is significant. Most organizations are hitting the infrastructure wall with their AI implementations right now, but existing solutions are either too simplistic (basic HTTP APIs) or too complex (trying to adapt traditional service meshes for AI workloads). KubeMQ-Aiway appears to have found the right abstraction level — sophisticated enough to handle complex AI orchestration requirements, but simple enough for development teams to adopt without becoming distributed systems experts. Building similar capabilities internally would require substantial engineering effort. The distributed systems expertise required, combined with AI-specific requirements, represents months or years of infrastructure engineering work that most organizations can't justify when production AI solutions are available.

Strategic Implications

For technology leaders, the emergence of production-ready AI infrastructure platforms changes the strategic calculation around AI implementation. The question shifts from "should we build AI infrastructure?" to "which platform enables our AI strategy most effectively?" Early adopters of proper AI infrastructure are successfully running complex multi-agent systems at production scale while their competitors struggle with basic agent coordination. This gap will only widen as AI implementations become more sophisticated. The distributed systems problems in AI won't solve themselves, and application-layer workarounds don't scale. Infrastructure solutions like KubeMQ-Aiway represent how AI transitions from experimental projects to production systems that deliver sustainable business value. Organizations that recognize this pattern and invest in proven AI infrastructure will maintain a competitive advantage over those that continue trying to solve infrastructure problems at the application layer. Have a really great day!
Memory Leak Due to Uncleared ThreadLocal Variables

By Ram Lakshmanan
In Java, we commonly use static, instance (member), and local variables. Occasionally, we use ThreadLocal variables. When a variable is declared as ThreadLocal, its value is only visible to the thread that set it. ThreadLocal variables are extensively used in frameworks such as Log4J and Hibernate. If these ThreadLocal variables aren’t removed after their use, they will accumulate in memory and have the potential to trigger an OutOfMemoryError. In this post, let’s learn how to troubleshoot memory leaks that are caused by ThreadLocal variables.

ThreadLocal Memory Leak

Here is a sample program that simulates a ThreadLocal memory leak.

Java
01: public class ThreadLocalOOMDemo {
02:
03:    private static final ThreadLocal<String> threadString = new ThreadLocal<>();
04:
05:    private static final String text = generateLargeString();
06:
07:    private static int count = 0;
08:
09:    public static void main(String[] args) throws Exception {
10:       while (true) {
11:
12:          Thread thread = new Thread(() -> {
13:             threadString.set("String-" + count + text);
14:             try {
15:                Thread.sleep(Long.MAX_VALUE); // Keep thread alive
16:             } catch (InterruptedException e) {
17:                Thread.currentThread().interrupt();
18:             }
19:          });
20:
21:          thread.start();
22:          count++;
23:          System.out.println("Started thread #" + count);
24:       }
25:    }
26:
27:    private static String generateLargeString() {
28:       StringBuilder sb = new StringBuilder(5 * 1024 * 1024);
29:       while (sb.length() < 5 * 1024 * 1024) {
30:          sb.append("X");
31:       }
32:       return sb.toString();
33:    }
34: }

Before continuing to read, please take a moment to review the above program closely.

In the above program, at line #3, ‘threadString’ is declared as a ‘ThreadLocal’ variable. At line #10, the program creates new threads infinitely (i.e., the ‘while (true)’ condition). At line #13, each created thread is given a large string (i.e., ‘String-1XXXXXXXXXXXXXXXXXXXXXXX…’) as its ThreadLocal value. The program never removes the ThreadLocal value once it’s set. So, in a nutshell, the program keeps creating new threads, slaps each new thread with a large string as its ThreadLocal value, and never removes it. Thus, when the program is executed, ThreadLocal values will continuously accumulate in memory and finally result in ‘java.lang.OutOfMemoryError: Java heap space’.

How to Diagnose a ThreadLocal Memory Leak

You want to follow the steps highlighted in this post to diagnose the OutOfMemoryError: Java heap space. In a nutshell, you need to:

1. Capture a Heap Dump

You need to capture a heap dump from the application right before the JVM throws an OutOfMemoryError. In this post, eight options for capturing a heap dump are discussed. You may choose the option that best suits your needs. My favorite option is to pass the ‘-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<FILE_PATH_LOCATION>’ JVM arguments to your application at startup. Example:

Shell
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/tmp/heapdump.hprof

When you pass the above arguments, the JVM will generate a heap dump and write it to the ‘/opt/tmp/heapdump.hprof’ file whenever an OutOfMemoryError is thrown.

2. Analyze the Heap Dump

Once a heap dump is captured, you need to analyze it. In the next section, we will discuss how to do heap dump analysis.

Heap Dump Analysis: ThreadLocal Memory Leak

Heap dumps can be analyzed using various heap dump analysis tools, such as HeapHero, JHat, and JVisualVM.
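As a side note, you don't have to wait for the OutOfMemoryError to get a dump to analyze: while the demo program above is running, you can also capture a heap dump on demand. Here is a minimal sketch using the JDK's jcmd utility (the PID 12345 and the output path are placeholders for your environment):

Shell
# List the running JVMs to find the PID of ThreadLocalOOMDemo
jcmd -l

# Request a heap dump from that JVM (PID and file path are placeholders)
jcmd 12345 GC.heap_dump /opt/tmp/heapdump-ondemand.hprof

The resulting .hprof file can then be opened in any of the analysis tools mentioned above.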
Here, let’s analyze the heap dump captured from this program using the HeapHero tool.

Figure: HeapHero flags a memory leak using its ML algorithm

The HeapHero tool utilizes machine learning algorithms internally to detect whether any memory leak patterns are present in the heap dump. Above is a screenshot from the heap dump analysis report, flagging a warning that there are 66 instances of ‘java.lang.Thread’ objects, which together occupy 97.13% of the overall memory. It’s a strong indication that the application is suffering from a memory leak originating from the ‘java.lang.Thread’ objects.

Figure: The ‘Largest Objects’ section highlights threads consuming the majority of heap space

The ‘Largest Objects’ section in the HeapHero analysis report shows the top memory-consuming objects, as shown in the above figure. Here you can clearly see that all of these objects are of type ‘java.lang.Thread’ and that each of them occupies ~10 MB of memory. This clearly identifies the culprit objects responsible for the memory leak.

Figure: The ‘Outgoing References’ section shows the ThreadLocal strings

The tool also gives you the capability to drill down into an object to investigate its contents. When you drill down into any one of the threads reported in the ‘Largest Objects’ section, you can see all of its child objects. In the above figure, you can see the actual ThreadLocal string ‘String-1XXXXXXXXXXXXXXXXXXXXXXX…’ being reported. This is the string that was set at line #13 of the above program. Thus, the tool helps you pinpoint the memory-leaking object and its source with ease.

How to Prevent a ThreadLocal Memory Leak

Once you are done with a ThreadLocal variable, always call:

Java
threadString.remove();

This clears the ThreadLocal value from the current thread and avoids potential memory leaks. A fuller sketch of where this call belongs in the demo program is shown right after the conclusion below.

Conclusion

Uncleared ThreadLocal variables are a subtle issue; however, when left unnoticed, they can accumulate over a period of time and have the potential to bring down the entire application. By being disciplined about removing ThreadLocal variables after use, and by using tools like HeapHero for faster root cause analysis, you can protect your applications from hard-to-detect outages.
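As promised above, here is a minimal sketch of where the remove() call belongs in the worker threads of the demo program. The try/finally placement is the key point (the rest mirrors the original example), and it matters even more for pooled threads, which never die and therefore never release their ThreadLocal values on their own:

Java
public class ThreadLocalCleanupDemo {

    private static final ThreadLocal<String> threadString = new ThreadLocal<>();

    public static void main(String[] args) {
        Thread thread = new Thread(() -> {
            threadString.set("String-" + Thread.currentThread().getName());
            try {
                // ... do the actual work that needs the ThreadLocal value ...
                System.out.println("Using: " + threadString.get());
            } finally {
                // Always clear the value, even if the work above throws an exception.
                threadString.remove();
            }
        });
        thread.start();
    }
}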

Trend Report

Generative AI

AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry exponentially, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide. Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more. We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.

Refcard #158

Machine Learning Patterns and Anti-Patterns

By Tuhin Chattopadhyay

Refcard #269

Getting Started With Data Quality

By Miguel Garcia

More Articles

Mastering Fluent Bit: Controlling Logs With Fluent Bit on Kubernetes (Part 4)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit to get control of logs on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to use a Fluent Bit telemetry pipeline on a Kubernetes cluster to take control of all the logs being generated. What Is Fluent Bit? Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The backstory is that the project began as a log parsing project, Fluentd, which joined the CNCF in 2016 and achieved Graduated status in 2019. Once it became apparent that the world was heading towards cloud-native Kubernetes environments, Fluentd was not designed to meet the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need to have a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is nearly a decade of history, culminating in the release of v4 last week. Fluent Bit has become so much more than a flexible and lightweight log pipeline solution; it is now able to process metrics and traces as well, and it is becoming the telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves! Why Control Logs on a Kubernetes Cluster? When you dive into the cloud native world, it means you are deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things are generating telemetry data, and Fluent Bit is a wonderfully simple way to take control of it across a Kubernetes cluster. It provides a way of collecting everything through a central telemetry pipeline as you go, while providing the ability to parse, filter, and route all your telemetry data. For developers, this article will demonstrate using Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload. Finally, all examples in this article have been done on OSX, and the reader is assumed to be able to convert the actions shown here to their own local machine. Where to Get Started To ensure you are ready to start controlling your Kubernetes cluster logs, the rest of this article assumes you have completed the previous article.
This ensures you are running a two-node Kubernetes cluster with a workload running in the form of Ghost CMS, and Fluent Bit is installed to collect all container logs. If you did not work through the previous article, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to spin up the Kubernetes cluster with the above setup on your local machine. Using either path, once set up, you are able to see the logs from Fluent Bit containing everything generated on this running cluster. This would be the logs across three namespaces: kube-system, ghost, and logging. You can verify that they are up and running by browsing those namespaces, shown here on my local machine: Go $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace kube-system NAME READY STATUS RESTARTS AGE coredns-668d6bf9bc-jrvrx 1/1 Running 0 69m coredns-668d6bf9bc-wbqjk 1/1 Running 0 69m etcd-2node-control-plane 1/1 Running 0 69m kindnet-fmf8l 1/1 Running 0 69m kindnet-rhlp6 1/1 Running 0 69m kube-apiserver-2node-control-plane 1/1 Running 0 69m kube-controller-manager-2node-control-plane 1/1 Running 0 69m kube-proxy-b5vjr 1/1 Running 0 69m kube-proxy-jxpqc 1/1 Running 0 69m kube-scheduler-2node-control-plane 1/1 Running 0 69m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace ghost NAME READY STATUS RESTARTS AGE ghost-dep-8d59966f4-87jsf 1/1 Running 0 77m ghost-dep-mysql-0 1/1 Running 0 77m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace logging NAME READY STATUS RESTARTS AGE fluent-bit-7qjmx 1/1 Running 0 41m The initial configuration for the Fluent Bit instance is to collect all container logs, from all namespaces, shown in the fluent-bit-helm.yaml configuration file used in our setup, highlighted in bold below: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri outputs: - name: stdout match: '*' To see all the logs collected, we can dump the Fluent Bit log file as follows, using the pod name we found above: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-7qjmx --nanmespace logging [OUTPUT-CUT-DUE-TO-LOG-VOLUME] ... You will notice if you browse that you have error messages, info messages, if you look hard enough, some logs from Ghost's MySQL workload, the Ghost CMS workload, and even your Fluent Bit instance. As a developer working on your cluster, how can you find anything useful in this flood of logging? The good thing is you do have a single place to look for them! Another point to mention is that by using the Fluent Bit tail input plugin and setting it to read from the beginning of each log file, we have ensured that our log telemetry data is taken from all our logs. If we didn't set this to collect from the beginning of the log file, our telemetry pipeline would miss everything that was generated before the Fluent Bit instance started. This ensures we have the workload startup messages and can test on standard log telemetry events each time we modify our pipeline configuration. Let's start taking control of our logs and see how we, as developers, can make some use of the log data we want to see during our local development testing. 
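One more optional check before we start changing the pipeline: the service section above enables Fluent Bit's built-in HTTP server on port 2020, so you can port-forward to the pod and confirm the instance is up and reporting metrics. Here is a sketch, assuming the pod name found above (the exact endpoints available can vary slightly by Fluent Bit version):

Shell
# Forward the Fluent Bit monitoring port to your local machine
$ kubectl --kubeconfig target/2nodeconfig.yaml port-forward fluent-bit-7qjmx 2020:2020 --namespace logging

# In another terminal, query the built-in monitoring API
$ curl -s http://localhost:2020/api/v1/uptime
$ curl -s http://localhost:2020/api/v1/metrics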
Taking Back Control The first thing we can do is to focus our log collection efforts on just the workload we are interested in, and in this example, we are looking to find problems with our Ghost CMS deployment. As you are not interested in the logs from anything happening in the kube-system namespace, you can narrow the focus of your Fluent Bit input plugin to only examine Ghost log files. This can be done by making a new configuration file called myfluent-bit-heml.yaml file and changing the default path as follows in bold: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' The next step is to update the Fluent Bit instance with a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-mzktk 1/1 Running 0 28s Now, explore the logs being collected by Fluent Bit and notice that all the kube-system namespace logs are no longer there, and we can focus on our deployed workload. Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-mzktk --nanmespace logging ... [11] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.278137067, {}], {"time"=>"2025-05-18T15:51:26.278137067Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.27 INFO ==> Configuring database"}] [12] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.318427288, {}], {"time"=>"2025-05-18T15:51:26.318427288Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.31 INFO ==> Setting up Ghost"}] [13] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.211337893, {}], {"time"=>"2025-05-18T15:51:31.211337893Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.21 INFO ==> Configuring Ghost URL to http://127.0.0.1:2368"}] [14] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.234609188, {}], {"time"=>"2025-05-18T15:51:31.234609188Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.23 INFO ==> Passing admin user creation wizard"}] [15] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.243222300, {}], {"time"=>"2025-05-18T15:51:31.2432223Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.24 INFO ==> Starting Ghost in background"}] [16] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583519.424206501, {}], {"time"=>"2025-05-18T15:51:59.424206501Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:59.42 INFO ==> Stopping Ghost"}] [17] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583520.921096963, {}], 
{"time"=>"2025-05-18T15:52:00.921096963Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:00.92 INFO ==> Persisting Ghost installation"}] [18] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583521.008567054, {}], {"time"=>"2025-05-18T15:52:01.008567054Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:01.00 INFO ==> ** Ghost setup finished! **"}] ... This is just a selection of log lines from the total output. If you look closer, you see these logs have their own sort of format, so let's standardize them so that JSON is the output format and make the various timestamps a bit more readable by changing your Fluent Bit output plugin configuration as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-gqsc8 1/1 Running 0 42s Now, explore the logs being collected by Fluent Bit and notice the output changes: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-gqsc8 --nanmespace logging ... {"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"} {"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"} {"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"} {"date":"2025-06-05 13:49:59.387736","time":"2025-06-05T13:49:59.387736981Z","stream":"stdout","_p":"F","log":""} {"date":"2025-06-05 13:49:59.451176","time":"2025-06-05T13:49:59.451176821Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.45 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Starting Ghost **"} {"date":"2025-06-05 13:50:00.171207","time":"2025-06-05T13:50:00.171207812Z","stream":"stdout","_p":"F","log":""} ... Now, if we look closer at the array of messages and being the developer we are, we've noticed a mix of stderr and stdout log lines. Let's take control and trim out all the lines that do not contain stderr, as we are only interested in what is broken. 
We need to add a filter section to our Fluent Bit configuration using the grep filter and targeting a regular expression to select the keys stream or stderr as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri filters: - name: grep match: '*' regex: stream stderr outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-npn8n 1/1 Running 0 12s Now, explore the logs being collected by Fluent Bit and notice the output changes: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-npn8n --nanmespace logging ... {"date":"2025-06-05 13:49:34.807524","time":"2025-06-05T13:49:34.807524266Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.80 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring database"} {"date":"2025-06-05 13:49:34.860722","time":"2025-06-05T13:49:34.860722188Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.86 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Setting up Ghost"} {"date":"2025-06-05 13:49:36.289847","time":"2025-06-05T13:49:36.289847086Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.28 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring Ghost URL to http://127.0.0.1:2368"} {"date":"2025-06-05 13:49:36.373376","time":"2025-06-05T13:49:36.373376803Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Passing admin user creation wizard"} {"date":"2025-06-05 13:49:36.379461","time":"2025-06-05T13:49:36.379461971Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Starting Ghost in background"} {"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"} {"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"} {"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"} ... We are no longer seeing standard output log events, as our telemetry pipeline is now filtering to only show standard error-tagged logs! This exercise has shown how to format and prune our logs using our Fluent Bit telemetry pipeline on a Kubernetes cluster. Now let's look at how to enrich our log telemetry data. We are going to add tags to every standard error line pointing the on-call developer to the SRE they need to contact. 
To do this, we expand our filter section of the Fluent Bit configuration using the modify filter and targeting the keys stream or stderr to remove those keys and add two new keys, STATUS and ACTION, as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri filters: - name: grep match: '*' regex: stream stderr - name: modify match: '*' condition: Key_Value_Equals stream stderr remove: stream add: - STATUS REALLY_BAD - ACTION CALL_SRE outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-ftfs4 1/1 Running 0 32s Now, explore the logs being collected by Fluent Bit and notice the output changes where the stream key is missing and two new ones have been added at the end of each error log event: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-ftfs4 --nanmespace logging ... [CUT-LINE-FOR-VIEWING] Configuring database"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Setting up Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Configuring Ghost URL to http://127.0.0.1:2368"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Passing admin user creation wizard"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Starting Ghost in background"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Stopping Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Persisting Ghost installation"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] ** Ghost setup finished! **"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} ... Now we have a running Kubernetes cluster, with two nodes generating logs, a workload in the form of a Ghost CMS generating logs, and using a Fluent Bit telemetry pipeline to gather and take control of our log telemetry data. Initially, we found that gathering all log telemetry data was flooding too much information to be able to sift out the important events for our development needs. We then started taking control of our log telemetry data by narrowing our collection strategy, by filtering, and finally by enriching our telemetry data. More in the Series In this article, you learned how to use Fluent Bit on a Kubernetes cluster to take control of your telemetry data. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, integrating Fluent Bit telemetry pipelines with OpenTelemetry.
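As a small teaser for that next article, the sketch below shows roughly what swapping the stdout output for Fluent Bit's opentelemetry output plugin could look like. The collector address otel-collector.observability:4318 is a placeholder, and the exact option names should be checked against the Fluent Bit version you are running:

YAML
pipeline:
  outputs:
    - name: opentelemetry                  # OTLP/HTTP output plugin
      match: '*'
      host: otel-collector.observability   # placeholder collector service
      port: 4318                           # default OTLP/HTTP port
      logs_uri: /v1/logs
      tls: off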

By Eric D. Schabell
HTAP Using a Star Query on MongoDB Atlas Search Index

MongoDB is often chosen for online transaction processing (OLTP) due to its flexible document model, which can align with domain-specific data structures and access patterns. Beyond basic transactional workloads, MongoDB also supports search capabilities through Atlas Search, built on Apache Lucene. When combined with the aggregation pipeline, this enables limited online analytical processing (OLAP) functionality suitable for near-real-time analytics. Because MongoDB uses a unified document model, these analytical queries can run without restructuring the data, allowing for certain hybrid transactional and analytical (HTAP) workloads. This article explores such a use case in the context of healthcare. Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, to support efficient ad-hoc queries. This typically requires a dimensional schema, or star schema, which is distinct from the normalized operational schema used for transactional updates. MongoDB can support a similar querying approach using its document schema, which is often designed for operational use. By adding an Atlas Search index to the collection storing transactional data, certain analytical queries can be supported without restructuring the schema. To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilized the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning. Import the Dataset The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records. I downloaded the CSV files and uncompressed them: curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw" unzip medsynora-dw.zip I imported each file into a collection, using mongoimport from the MongoDB Database Tools: for i in "MedSynora DW"/*.csv do mongoimport -d "MedSynoraDW" --file="$i" --type=csv --headerline -c "$(basename "$i" .csv)" -j 8 done For this demo, I'm interested in two fact tables: FactEncounter and FactLabTest. 
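Before moving on, it can be worth a quick sanity check in mongosh that the import produced the collections used below (the collection names follow the CSV file names; the exact counts will depend on the dataset version):

// Run in mongosh while connected to the MedSynoraDW database
db.FactEncounter.countDocuments()
db.FactLabTests.countDocuments()
db.FactLabTests.findOne()   // spot-check one imported lab test record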
Here are the fields described in the file headers: # head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv ==> MedSynora DW/FactEncounter.csv <== Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent ==> MedSynora DW/FactLabTests.csv <== Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue The fact tables referenced the following dimensions: # head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv ==> MedSynora DW/DimDisease.csv <== Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit ==> MedSynora DW/DimDoctor.csv <== Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count ==> MedSynora DW/DimInsurance.csv <== InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments ==> MedSynora DW/DimPatient.csv <== Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type ==> MedSynora DW/DimRoom.csv <== RoomKey,Care_Level,Room Type Here is the dimensional model, often referred to as a "star schema" because the fact tables are located at the center, referencing the dimensions. Because of normalization, when facts contain a one-to-many composition, it is described in two CSV files to fit into two SQL tables: Star schema with facts and dimensions. The facts are stored in two tables in CSV files or a SQL database, but on a single collection in MongoDB. It holds the fact measures and dimension keys, which reference the key of the dimension collections. MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship. 
Since this represents the appropriate data model for an operational database with such data, I created a new collection using an aggregation pipeline to replace the two imported from the normalized CSV: db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 }); db.FactEncounter.aggregate([ { $lookup: { from: "FactLabTests", localField: "Encounter_ID", foreignField: "Encounter_ID", as: "LabTests" } }, { $addFields: { LabTests: { $map: { input: "$LabTests", as: "test", in: { Phase: "$$test.Phase", LabType: "$$test.LabType", TestName: "$$test.TestName", TestValue: "$$test.TestValue" } } } } }, { $out: "FactEncounterLabTests" } ]); Here is how one document looks: AtlasLocalDev atlas [direct: primary] MedSynoraDW> db.FactEncounterLabTests.find().limit(1) [ { _id: ObjectId('67fc3d2f40d2b3c843949c97'), Encounter_ID: 2158, Patient_ID: 'TR479', Disease_ID: 1632, ResponsibleDoctorID: 905, InsuranceKey: 82, RoomKey: 203, CheckinDate: '2024-01-23 11:09:00', CheckoutDate: '2024-03-29 17:00:00', CheckinDateKey: 20240123, CheckoutDateKey: 20240329, Patient_Severity_Score: 63.2, RadiologyType: 'None', RadiologyProcedureCount: 0, EndoscopyType: 'None', EndoscopyProcedureCount: 0, CompanionPresent: 'True', LabTests: [ { Phase: 'Admission', LabType: 'CBC', TestName: 'Lymphocytes_abs (10^3/µl)', TestValue: 1.34 }, { Phase: 'Admission', LabType: 'Chem', TestName: 'ALT (U/l)', TestValue: 20.5 }, { Phase: 'Admission', LabType: 'Lipids', TestName: 'Triglycerides (mg/dl)', TestValue: 129.1 }, { Phase: 'Discharge', LabType: 'CBC', TestName: 'RBC (10^6/µl)', TestValue: 4.08 }, ... In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections. Atlas Search Index Search indexes are distinct from regular indexes, which rely on a single composite key, as they can index multiple fields without requiring a specific order to establish a key. This feature makes them perfect for ad-hoc queries, where the filtering dimensions are not predetermined. I created a single Atlas Search index encompassing all dimensions and measures I intended to use in predicates, including those in embedded documents. db.FactEncounterLabTests.createSearchIndex( "SearchFactEncounterLabTests", { mappings: { dynamic: false, fields: { "Encounter_ID": { "type": "number" }, "Patient_ID": { "type": "token" }, "Disease_ID": { "type": "number" }, "InsuranceKey": { "type": "number" }, "RoomKey": { "type": "number" }, "ResponsibleDoctorID": { "type": "number" }, "CheckinDate": { "type": "token" }, "CheckoutDate": { "type": "token" }, "LabTests": { "type": "document" , fields: { "Phase": { "type": "token" }, "LabType": { "type": "token" }, "TestName": { "type": "token" }, "TestValue": { "type": "number" } } } } } } ); Since I don't need extra text searching on the keys, I designated the character string ones as token. I labeled the integer keys as number. Generally, the keys are utilized for equality predicates. However, some can be employed for ranges when the format permits, such as check-in and check-out dates formatted as YYYY-MM-DD. In relational databases, the star schema approach involves limiting the number of columns in fact tables due to their typically large number of rows. 
Dimension tables, which are generally smaller, can include more columns and are often denormalized in SQL databases, making the star schema more common than the snowflake schema. Similarly, in document modeling, embedding all dimension fields can increase the size of fact documents unnecessarily, so referencing dimension collections is often preferred. MongoDB’s data modeling principles allow it to be queried similarly to a star schema without additional complexity, as its design aligns with common application access patterns. Star Query A star schema allows processing queries which filter fields within dimension collections in several stages: In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection. This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted. For an example query, I aimed to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions. Search Aggregation Pipeline To optimize the fact collection process and apply all filters, I began with a simple aggregation pipeline that started with a search on the search index. This enabled filters to be applied directly to the fields in the fact collection, while additional filters were incorporated in the first stage of the star query. I used a local variable with a compound operator to facilitate adding more filters for each dimension during this stage. Before proceeding through the star query stages to add filters on dimensions, my query included a filter on the lab type, which was part of the fact collection and indexed. const search = { "$search": { "index": "SearchFactEncounterLabTests", "compound": { "must": [ { "in": { "path": "LabTests.LabType" , "value": "Lipids" } }, ] }, "sort": { CheckoutDate: -1 } } } I added a sort operation to order the results by check-out date in descending order. This illustrated the advantage of sorting during the index search rather than in later stages of the aggregation pipeline, especially when a limit was applied. I used this local variable to add more filters in Stage 1 of the star query, so that it could be executed for Stage 2 and collect documents for Stage 3. Stage 1: Query the Dimension Collections In the first phase of the star query, I obtained the dimension keys from the dimension collections. For every dimension with a filter, I retrieved the dimension keys using a find() on the dimension collection and appended a must condition to the compound of the fact index search. 
The following added the conditions on the Patient (female patients over 170 cm): search["$search"]["compound"]["must"].push( { in: { path: "Patient_ID", // Foreign Key in Fact value: db.DimPatient.find( // Dimension collection {Gender: "Female", Height: { "$gt": 170 } // filter on Dimension ).map(doc => doc["Patient_ID"]).toArray() } // Primary Key in Dimension }) The following added the conditions on the Doctor (Japanese): search["$search"]["compound"]["must"].push( { in: { path: "ResponsibleDoctorID", // Foreign Key in Fact value: db.DimDoctor.find( // Dimension collection {"Doctor Nationality": "Japanese" } // filter on Dimension ).map(doc => doc["Doctor_ID"]).toArray() } // Primary Key in Dimension }) The following added the condition on the Room (Deluxe): search["$search"]["compound"]["must"].push( { in: { path: "RoomKey", // Foreign Key in Fact value: db.DimRoom.find( // Dimension collection {"Room Type": "Deluxe" } // filter on Dimension ).map(doc => doc["RoomKey"]).toArray() } // Primary Key in Dimension }) The following added the condition on the Disease (Hematology): search["$search"]["compound"]["must"].push( { in: { path: "Disease_ID", // Foreign Key in Fact value: db.DimDisease.find( // Dimension collection {"Disease Type": "Hematology" } // filter on Dimension ).map(doc => doc["Disease_ID"]).toArray() } // Primary Key in Dimension }) Finally, here's the condition on the Insurance coverage (greater than 80%): search["$search"]["compound"]["must"].push( { in: { path: "InsuranceKey", // Foreign Key in Fact value: db.DimInsurance.find( // Dimension collection {"Coverage Limit": { "$gt": 0.8 } } // filter on Dimension ).map(doc => doc["InsuranceKey"]).toArray() } // Primary Key in Dimension }) All these search criteria had the same structure: a find() on the dimension collection with the filters from the query, resulting in an array of dimension keys (similar to primary keys in a dimension table) that were used to search the fact documents by referencing them (like foreign keys in a fact table). Each of these steps queried the dimension collection to obtain a simple array of dimension keys, which were then added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions were pushed down into the query on the fact collection. 
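Since each of the five snippets above follows the same pattern, they could also be wrapped in a small helper. The addDimensionFilter() function below is a hypothetical convenience (not part of the original code), shown only to make the repeated structure explicit:

// Hypothetical helper: push one dimension filter into the $search compound stage.
// searchStage: the { $search: ... } object built above
// factPath:    reference field in the fact collection (e.g., "RoomKey")
// dimension:   dimension collection (e.g., db.DimRoom)
// filter:      query on the dimension (e.g., { "Room Type": "Deluxe" })
// dimKey:      key field in the dimension (e.g., "RoomKey")
function addDimensionFilter(searchStage, factPath, dimension, filter, dimKey) {
  const keys = dimension.find(filter).map(doc => doc[dimKey]).toArray();
  searchStage["$search"]["compound"]["must"].push({ in: { path: factPath, value: keys } });
}

// The same conditions as above, expressed through the helper:
addDimensionFilter(search, "RoomKey", db.DimRoom, { "Room Type": "Deluxe" }, "RoomKey");
addDimensionFilter(search, "Disease_ID", db.DimDisease, { "Disease Type": "Hematology" }, "Disease_ID");
addDimensionFilter(search, "InsuranceKey", db.DimInsurance, { "Coverage Limit": { "$gt": 0.8 } }, "InsuranceKey");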
Stage 2: Query the Fact Search Index Using the results from the dimension queries, I built the following pipeline search step: AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search) { '$search': { index: 'SearchFactEncounterLabTests', compound: { must: [ { in: { path: 'LabTests.LabType', value: 'Lipids' } }, { in: { path: 'Patient_ID', value: [ 'TR551', 'TR751', 'TR897', 'TRGT201', 'TRJB261', 'TRQG448', 'TRSQ510', 'TRTP535', 'TRUC548', 'TRVT591', 'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896', 'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672', 'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003', 'TRE004', 'TRMN351', 'TRRY492', 'TRTI528', 'TRAKA962', 'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384', 'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419', 'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079', 'TR663', 'TR808', 'TR849', 'TRKA286', 'TRLC314', 'TRMG344', 'TRPT435', 'TRVZ597', 'TRXC626', 'TRACT773', 'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167', 'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514', 'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990', 'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374', 'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026', 'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354', 'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589', 'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987', 'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121', ... 6184 more items ] } }, { in: { path: 'ResponsibleDoctorID', value: [ 830, 844, 862, 921 ] } }, { in: { path: 'RoomKey', value: [ 203 ] } }, { in: { path: 'Disease_ID', value: [ 1519, 1506, 1504, 1510, 1515, 1507, 1503, 1502, 1518, 1517, 1508, 1513, 1509, 1512, 1516, 1511, 1505, 1514 ] } }, { in: { path: 'InsuranceKey', value: [ 83, 84 ] } } ] }, sort: { CheckoutDate: -1 } } MongoDB Atlas Search indexes, which are built on Apache Lucene, handle queries with multiple conditions and long arrays of values. In this example, a search operation uses the compound operator with the must clause to apply filters across attributes. This approach applies filters after resolving complex conditions into lists of dimension keys. Using the search operation defined above, I ran an aggregation pipeline to retrieve the document of interest: db.FactEncounterLabTests.aggregate([ search, ]) With my example, nine documents were returned in 50 milliseconds. Estimate the Count This approach works well for queries with multiple filters where individual conditions are not very selective but their combination is. Querying dimensions and using a search index on facts helps avoid scanning unnecessary documents. However, depending on additional operations in the aggregation pipeline, it is advisable to estimate the number of records returned by the search index to prevent expensive queries. In applications that allow multi-criteria queries, it is common to set a threshold and return an error or warning if the estimated number of documents exceeds it, prompting users to add more filters. To support this, you can run a $searchMeta operation on the index before a $search. 
For example, the following checks that the number of documents returned by the filter is less than 10,000: MedSynoraDW> db.FactEncounterLabTests.aggregate([ { "$searchMeta": { index: search["$search"].index, compound: search["$search"].compound, count: { "type": "lowerBound" , threshold: 10000 } } } ]) [ { count: { lowerBound: Long('9') } } ] In my case, with nine documents, I can add more operations to the aggregation pipeline without expecting a long response time. If there are more documents than expected, additional steps in the aggregation pipeline may take longer. If tens or hundreds of thousands of documents are expected as input to a complex aggregation pipeline, the application may warn the user that the query execution will not be instantaneous, and may offer the choice to run it as a background job with a notification when done. With such a warning, the user may decide to add more filters, or a limit to work on a Top-n result, which will be added to the aggregation pipeline after a sorted search. Stage 3: Join Back to Dimensions for Projection The first step of the aggregation pipeline fetches all the documents needed for the result, and only those documents, using efficient access through the search index. Once filtering is complete, the smaller set of documents is used for aggregation or projection in the later stages of the aggregation pipeline. In the third stage of the star query, the pipeline performs lookups on the dimensions to retrieve additional attributes needed for aggregation or projection. It might re-examine some collections used for filtering, which is not a problem since the dimensions remain small. For larger dimensions, the initial stage could save this information in a temporary array to avoid extra lookups, although this is often unnecessary. For example, when I wanted to display additional information about the patient and the doctor, I added two lookup stages to my aggregation pipeline: { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$lookup": { "from": "DimPatient", "localField": "Patient_ID", "foreignField": "Patient_ID", "as": "Patient" } }, For simplicity in this demo, I imported the dimensions directly from the CSV files. In a well-designed database, the primary key for dimensions should be the document's _id field, and the collection ought to be established as a clustered collection. This design ensures efficient joins from fact documents. Most of the dimensions are compact and stay in memory. I added a final projection to fetch only the fields I needed.
The full aggregation pipeline, using the search defined above with filters and arrays of dimension keys, is: db.FactEncounterLabTests.aggregate([ search, { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$lookup": { "from": "DimPatient", "localField": "Patient_ID", "foreignField": "Patient_ID", "as": "Patient" } }, { "$project": { "Patient_Severity_Score": 1, "CheckinDate": 1, "CheckoutDate": 1, "Patient.name": { "$concat": [ { "$arrayElemAt": ["$Patient.First Name", 0] }, " ", { "$arrayElemAt": ["$Patient.Last Name", 0] } ] }, "ResponsibleDoctor.name": { "$concat": [ { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Name", 0] }, " ", { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Surname", 0] } ] } } } ]) On a small instance, it returned the following result in 50 milliseconds: [ { _id: ObjectId('67fc3d2f40d2b3c843949a97'), CheckinDate: '2024-02-12 17:00:00', CheckoutDate: '2024-03-30 13:04:00', Patient_Severity_Score: 61.4, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Niina Johanson' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f5c'), CheckinDate: '2024-04-29 06:44:00', CheckoutDate: '2024-05-30 19:53:00', Patient_Severity_Score: 57.7, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Cindy Wibisono' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f0e'), CheckinDate: '2024-10-06 13:43:00', CheckoutDate: '2024-11-29 09:37:00', Patient_Severity_Score: 55.1, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Asta Koch' } ] }, { _id: ObjectId('67fc3d2f40d2b3c8439523de'), CheckinDate: '2024-08-24 22:40:00', CheckoutDate: '2024-10-09 12:18:00', Patient_Severity_Score: 66, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Paloma Aguero' } ] }, { _id: ObjectId('67fc3d3040d2b3c843956f7e'), CheckinDate: '2024-11-04 14:50:00', CheckoutDate: '2024-12-31 22:59:59', Patient_Severity_Score: 51.5, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Aulikki Johansson' } ] }, { _id: ObjectId('67fc3d3040d2b3c84395e0ff'), CheckinDate: '2024-01-14 19:09:00', CheckoutDate: '2024-02-07 15:43:00', Patient_Severity_Score: 47.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Laura Potter' } ] }, { _id: ObjectId('67fc3d3140d2b3c843965ed2'), CheckinDate: '2024-01-03 09:39:00', CheckoutDate: '2024-02-09 12:55:00', Patient_Severity_Score: 57.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Gabriela Cassiano' } ] }, { _id: ObjectId('67fc3d3140d2b3c843966ba1'), CheckinDate: '2024-07-03 13:38:00', CheckoutDate: '2024-07-17 07:46:00', Patient_Severity_Score: 60.3, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Monica Zuniga' } ] }, { _id: ObjectId('67fc3d3140d2b3c843969226'), CheckinDate: '2024-04-06 11:36:00', CheckoutDate: '2024-04-26 07:02:00', Patient_Severity_Score: 62.9, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Stanislava Beranova' } ] } ] The star query approach focuses solely on filtering to obtain the input for further processing, while retaining the full power of aggregation pipelines. Additional Aggregation after Filtering When I have the set of documents efficiently filtered upfront, I can apply some aggregations before the projection. 
For example, the following grouped per doctor and counted the number of patients and the range of severity score: db.FactEncounterLabTests.aggregate([ search, { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$unwind": "$ResponsibleDoctor" }, { "$group": { "_id": { "doctor_id": "$ResponsibleDoctor.Doctor_ID", "doctor_name": { "$concat": [ "$ResponsibleDoctor.Doctor Name", " ", "$ResponsibleDoctor.Doctor Surname" ] } }, "min_severity_score": { "$min": "$Patient_Severity_Score" }, "max_severity_score": { "$max": "$Patient_Severity_Score" }, "patient_count": { "$sum": 1 } // Count the number of patients } }, { "$project": { "doctor_name": "$_id.doctor_name", "min_severity_score": 1, "max_severity_score": 1, "patient_count": 1 } } ]) My filters got documents from only one doctor and nine patients: [ { _id: { doctor_id: 862, doctor_name: 'Sayuri Shan Kou' }, min_severity_score: 47.6, max_severity_score: 66, patient_count: 9, doctor_name: 'Sayuri Shan Kou' } ] Using a MongoDB document model, this method enables direct analytical queries on the operational database, removing the need for a separate analytical database. The search index operates as the analytical component for the operational database and works with the MongoDB aggregation pipeline. Since the search index runs as a separate process, it can be deployed on a dedicated search node to isolate resource usage. When running analytics on an operational database, queries should be designed to minimize impact on the operational workload. Conclusion MongoDB’s document model with Atlas Search indexes supports managing and querying data following a star schema approach. By using a single search index on the fact collection and querying dimension collections for filters, it is possible to perform ad-hoc queries without replicating data into a separate analytical schema as typically done in relational databases. This method resembles the approach used in SQL databases, where a star schema data mart is maintained apart from the normalized operational database. In MongoDB, the document model uses embedding and referencing patterns similar to a star schema and is structured for operational transactions. Search indexes provide similar functionality without moving data to a separate system. The method, implemented as a three-stage star query, can be integrated into client applications to optimize query execution and enable near-real-time analytics on complex data. This approach supports hybrid transactional and analytical processing (HTAP) workloads.

By Franck Pachot
AI-Native Platforms: The Unstoppable Alliance of GenAI and Platform Engineering

Let's be honest. Building developer platforms, especially for AI-native teams, is a complex art, a constant challenge. It's about finding a delicate balance: granting maximum autonomy to development teams without spiraling into chaos, and providing incredibly powerful, cutting-edge tools without adding superfluous complexity to their already dense workload. Our objective as Platform Engineers has always been to pave the way, remove obstacles, and accelerate innovation. But what if the next, inevitable phase of platform evolution wasn't just about what we build and provide, but what Generative AI can help us co-build, co-design, and co-manage? We're not talking about a mere incremental improvement, a minor optimization, or a marginal new feature. We're facing a genuine paradigm shift, a conceptual earthquake where artificial intelligence is no longer merely the final product of our efforts, the result of our development toils, but becomes the silent partner, the tireless ally that is already reimagining, rewriting, and redefining our entire development experience. This is the real gamble, the challenge that awaits us: transforming our platforms from simple toolsets, however sophisticated, into intelligent, dynamic, and self-optimizing ecosystems. A place where productivity isn't just high, but exceptionally high, and innovation flows frictionlessly. What if We Unlock 100% of Our Platform’s Potential? Your primary goal, like that of any good Platform Engineer, is already to make developers' lives simpler, faster, and, let's admit it, significantly more enjoyable. Now, imagine endowing your platform with genuine intelligence, with the ability to understand, anticipate, and even generate. GenAI, in this context, isn't just an additional feature that layers onto existing ones; it's the catalyst that is already fundamentally redefining the Developer Experience (DevEx), exponentially accelerating the entire software development lifecycle, and, even more fascinating, creating new, intuitive, and natural interfaces for interacting with the platform's intrinsic capabilities. Let's momentarily consider the most common and frustrating pain points that still afflict the average developer: the exhaustive and often fruitless hunt through infinite and fragmented documentation, the obligation to memorize dozens, if not hundreds, of specific and often cryptic CLI commands, or the tedious and repetitive generation of boilerplate code. With the intelligent integration of GenAI, your platform magically evolves into a true intelligent co-pilot. Imagine a developer who can simply express a request in natural language, as if speaking to an expert colleague: "Provision a new staging environment for my authentication microservice, complete with a PostgreSQL database, a dedicated Kafka topic, and integration with our monitoring system." The GenAI-powered platform not only understands the deep meaning and context of the request, not only translates the intention into a series of technical actions, but executes the operation autonomously, providing immediate feedback and magically configuring everything needed. This isn't mere automation, which we already know; it's a conversational interaction, deep and contextual, that almost completely zeroes out the developer's cognitive load, freeing their mind and creative energies to focus on innovation, not on the complex and often tedious infrastructural "plumbing". But the impact extends far beyond simple commands. 
GenAI can act as an omnipresent expert, an always-available and incredibly informed figure, providing real-time, contextual assistance. Imagine being stuck on a dependency error, a hard-to-diagnose configuration problem, or a security vulnerability. Instead of spending hours searching forums or asking colleagues, you can ask the platform directly. And it, magically, suggests practical solutions, directs you to relevant internal best practices (perhaps your own guides, finally usable in an intelligent way!), or even proposes complete code patches to solve the problem. It can proactively identify potential security vulnerabilities in the code you've just generated or modified, suggest intelligent refactorings to improve performance, or even scaffold entire new modules or microservices based on high-level descriptions. This drastically accelerates the entire software development lifecycle, making best practices inherent to the process and transforming bottlenecks into opportunities for automation. Your platform is no longer a mere collection of passive tools, but an intelligent and proactive partner at every single stage of the developer's workflow, from conception to implementation, from testing to deployment. Crucially, for this to work, the GenAI model must be fed with the right platform context. By ingesting all platform documentation, internal APIs, service catalogs, and architectural patterns, the AI becomes an unparalleled tool for discoverability of platform items. Developers can now query in natural language to find the right component, service, or golden path for their needs. Furthermore, this contextual understanding allows the AI to interrogate and access all data and assets within the platform itself, as well as from the applications being developed on it, providing insights and recommendations in real-time. This elevates the concept of a composable architecture, already enabled by your platform, to an entirely new level. With an AI co-pilot that not only knows all available platform items but also understands how to use them optimally and how others have used them effectively, the development of new composable applications or rapid Proofs of Concept (PoCs) becomes faster than ever before. The new interfaces enabled by GenAI go beyond mere suggestion. Think of natural language chatbot interfaces for giving commands, where the platform responds like a virtual assistant. Crucially, thanks to advancements like Model Context Protocol (MCP) or similar tool-use capabilities, the GenAI-powered platform can move beyond just "suggesting" and actively "doing". It can execute complex workflows, interact with external APIs, and trigger actions within your infrastructure. This fosters a true cognitive architecture where the model isn't just generating text but is an active participant in your operations, capable of generating architectural diagrams, provisioning resources, or even deploying components based on a simple natural language description. The vision is that of a "platform agent" or an "AI persona" that learns and adapts to the specific needs of the team and the individual developer, constantly optimizing their path and facilitating the adoption of best practices. Platforms: The Launchpad for Ai-Powered Applications This synergy is two-way, a deep symbiotic relationship. 
If, on one hand, GenAI infuses new intelligence and vitality into platforms, on the other, your Internal Developer Platforms are, and will increasingly become, the essential launchpad for the unstoppable explosion of AI-powered applications. The complex and often winding journey of an artificial intelligence model—from the very first phase of experimentation and prototyping, through intensive training, to serving in production and scalable inference—is riddled with often daunting infrastructural complexities. Dedicated GPU clusters, specialized Machine Learning frameworks, complex data pipelines, and scalable, secure, and performant serving endpoints are by no means trivial for every single team to manage independently. And this is where your platform uniquely shines. It has the power to abstract away all the thorny and technical details of AI infrastructure, providing self-service and on-demand provisioning of the exact compute resources (CPU, various types of GPUs), storage (object storage, data lakes), and networking required for every single phase of the model's lifecycle. Imagine a developer who has just finished training a new model and needs to deploy an inference service. Instead of interacting with the Ops team for days or weeks, they simply request it through an intuitive self-service portal on the platform, and within minutes, the platform automatically provisions the necessary hardware (perhaps a dedicated GPU instance), deploys the model to a scalable endpoint (e.g., a serverless service or a container on a dedicated cluster), and, transparently, even generates a secure API key for access and consumption. This process eliminates days or weeks of manual configuration, of tickets and waiting times, transforming a complex and often frustrating MLOps challenge into a fluid, instant, and completely self-service operation. The platform manages not only serving but the entire lifecycle: from data preparation, to training clusters, to evaluation and A/B testing phases, all the way to post-deployment monitoring. Furthermore, platforms provide crucial golden paths for AI application development at the application layer. There's no longer a need for every team to reinvent the wheel for common AI patterns. Your platform can offer pre-built templates and codified best practices for integrating Large Language Models (LLMs), implementing patterns like Retrieval-Augmented Generation (RAG) with connectors to your internal data sources, or setting up complete pipelines for model monitoring and evaluation. Think of robust libraries and opinionated frameworks for prompt engineering, for managing model and dataset versions, for specific AI model observability (e.g., tools for bias detection, model interpretation, or drift management). The platform becomes a hub for collaboration on AI assets, facilitating the sharing and reuse of models, datasets, and components, including the development of AI agents. By embedding best practices and pre-integrating the most common and necessary AI services, every single developer, even one without a deep Machine Learning background, is empowered to infuse their applications with intelligent, cutting-edge capabilities. This not only democratizes AI development across the organization but unlocks unprecedented innovation that was previously limited to a few specialized teams. The Future Is Symbiotic: Your Next Move The era of AI-native development isn't an option; it's an imminent reality, and it urgently demands AI-native platforms. 
The marriage of GenAI and Platform Engineering isn't just an evolutionary step; it's a revolutionary leap destined to redefine the very foundations of our craft. GenAI makes platforms intrinsically smarter, more intuitive, more responsive, and consequently, incredibly more powerful. Platforms, in turn, provide the robust, self-service infrastructure and the well-paved roads necessary to massively accelerate the adoption and deployment of AI across the enterprise, transforming potential into reality. Are you ready to stop building for AI and start building with AI? Now is the time to act. Identify the most painful bottlenecks in your current DevEx and think about how GenAI could transform them. Prioritize the creation of self-service capabilities for AI infrastructure, making model deployment as simple as that of a traditional microservice. Cultivate a culture of "platform as a product", where AI is not just a consumer, but a fundamental feature of the platform itself. The future of software development isn't just about AI-powered applications; it's about an AI-powered development experience that completely redefines the concepts of productivity, creativity, and the very act of value creation. Embrace this unstoppable alliance, and unlock the next fascinating frontier of innovation. The time of static platforms is over. The era of intelligent platforms has just begun.

By Graziano Casto DZone Core CORE
Misunderstanding Agile: Bridging The Gap With A Kaizen Mindset

In recent years, Agile has become closely associated with modern software development, promoting customer-focused value delivery, regular feedback loops, and empowered teams. However, beneath the familiar terminology, many technical professionals are beginning to question whether Agile is achieving its intended outcomes or simply adding complexity. Many experienced developers and engineers voice discontent with excessive processes, poorly executed rituals, and a disconnect between Agile principles and the realities of their daily work. As organizations push for broader Agile adoption, understanding the roots of this discontent is crucial — not only for improving team morale but also for ensuring that Agile practices genuinely add value rather than becoming just another management fad. The Agile Manifesto The Agile Manifesto defines a set of values and principles that guide software development (and other products). It inspires various frameworks and methods to support iterative delivery, early and continuous value creation, team collaboration, and continuous improvement through regular feedback and adaptation. Teams may misinterpret their core purpose when implementing Agile methodologies that do not adhere to their foundational principles. This misinterpretation can distort the framework’s adaptability and focus on customer-centric value delivery. The sooner we assess the health of Agile practices and take corrective action, the greater the benefits for business outcomes and team morale. Feedback on Agile Practices Here are some common feedback themes based on Scrum teams' perceptions of their experience with Agile practices. 1. Disconnect Between Agile Theory and Practice The Agile Manifesto sounds excellent, but real-world Agile feels like “Agile theater” with ceremonies and buzzwords. Cause: Many teams adopt Agile practices solely to undergo the process without embracing its values. Change in perception: Recognize the difference between doing vs. being Agile. Foster a culture of self-organized teams delivering value with continuous improvement to customers. 2. Lack of Autonomy Agile can feel prescriptive, with strict roles and rituals that constrain engineers. Cause: An overly rigid application of Agile can stifle creativity and reduce a sense of ownership. Engineers thrive when given the freedom to solve problems rather than being confined to a prescriptive approach. Change in perception: Agile teams are empowered to make decisions. They don’t dwell on obstacles—they take ownership, lead through collaboration, and focus on delivering solutions with achievable delivery commitments. 3. Misuse of Agile as a Management Tool Agile is used for micromanagement to track velocity and demand commitments. Cause: Agile is sometimes misunderstood to focus on metrics over outcomes. When velocity is prioritized over value, the purpose gets lost. Change in perception: Focus on principles and purpose, not just processes. Processes aren’t about restriction, but repeatable and reliable success. Agile processes support the team by reinforcing what works and making success scalable. 4. Lack of Visible Improvement Despite Agile processes, teams still face delays, unclear requirements, or poor decisions. Cause: When teams struggle to show visible improvement, foundational elements — like a clear roadmap and meaningful engagement with engineers around the product vision — are often missing. 
Change in perception: Anchor Agile practices to tangible outcomes, such as faster feedback loops, improved quality, and reduced defects. Continuously inspect and adapt the process and product direction, ensuring both evolve together to drive meaningful progress. How to Bridge the Gap With Kaizen The disconnect between Agile’s theoretical benefits and practical execution can undermine empowerment and autonomy for a self-organized team, ultimately producing outcomes antithetical to the methodology’s intent of delivering iterative, user-focused solutions. Without proper contextualization and leadership buy-in, such implementations risk reducing Agile to a superficial process rather than a cultural shift toward continuous improvement. As the Japanese philosophy of Kaizen reminds us, meaningful change happens incrementally. Agile retrospectives embody this mindset. When the process isn't working, the team must come together — not to assign blame but to reflect, realign, and evolve. Leveraging the Power of Retrospective for Continuous Improvement Misalignment with the value statement is a core reason Agile processes fail. Agile teams should go beyond surface-level issues and explore more profound, value-driven questions in the retrospective to get the most out of them. Some of the recommended core areas for effective Agile retrospectives: Value Alignment What does “value” mean to us in this sprint or project? Are we clear on what our customer truly needs right now? Flow and Process Efficiency Where did work get blocked and delayed, and is the team aware of the communication path to seek support? Are our ceremonies (stand-ups, planning, reviews) meaningful, valuable, or just rituals? Commitment and Focus Were our sprint goals clear and achievable? Did we commit to too much or too little? Customer Centricity Did we receive or act on honest feedback from users or stakeholders? Do we know how the work impacted the customer? Suggested Template for Agile Retrospective Takeaways Use this template to capture and communicate the outcomes of your retrospective. It helps ensure accountability, transparency, and alignment going forward. A structured retrospective framework for teams to reflect on performance and improve workflows. 1. Keep doing what’s working well: Practical and valuable habits and Practices. What reinforces team strengths and morale? Examples: Effective and outcome-based meeting Collaboration for efficient dependency management 2. Do less of what we are doing too much of: Process overdose. Encourage balance and efficiency. Overused activities are not always valuable. Examples: Too many long meetings drain team morale and disrupt daily progress. Excessive code reviews on trivial commits delay code merge and integration. 3. Stop doing what’s not working and should be eliminated: Identify waste or negative patterns. Break unhealthy habits that reduce productivity or hurt team morale. Examples: Starting work before stories and the Definition of Done are fully defined - action before understanding purpose, business value, and success criteria Skipping retrospectives - detached from improvement 4. Start doing what new practices or improvements we should try: Encourages innovation, experimentation, and growth. A great place to introduce ideas that the team hasn't tried yet. Examples: Add a mid-sprint check-in Start using sprint goals more actively Conclusion Agile is based on the principle of progressing through continuous improvement and incrementally delivering value. 
Retrospective meetings are crucial in this process, as they allow teams to pause, reflect, and realign themselves to ensure they are progressing in the right direction. This approach aligns with the Kaizen philosophy of ongoing improvement.

By Pabitra Saikia
Automating Sentiment Analysis Using Snowflake Cortex

In this hands-on tutorial, you'll learn how to automate sentiment analysis and categorize customer feedback using Snowflake Cortex, all through a simple SQL query without needing to build heavy and complex machine learning algorithms. No MLOps is required. We'll work with sample data simulating real customer feedback comments about a fictional company, "DemoMart," and classify each customer feedback entry using Cortex's built-in function. We'll determine sentiment (positive, negative, neutral) and label the feedback into different categories. The goal is to: Load a sample dataset of customer feedback into a Snowflake table.Use the built-in LLM-powered classification (CLASSIFY_TEXT) to tag each entry with a sentiment and classify the feedback into a specific category. Automate this entire workflow to run weekly using Snowflake Task.Generate insights from the classified data. Prerequisites A Snowflake account with access to Snowflake CortexRole privileges to create tables, tasks, and proceduresBasic SQL knowledge Step 1: Create Sample Feedback Table We'll use a sample dataset of customer feedback that covers products, delivery, customer support, and other areas. Let's create a table in Snowflake to store this data. Here is the SQL for creating the required table to hold customer feedback. SQL CREATE OR REPLACE TABLE customer.csat.feedback ( feedback_id INT, feedback_ts DATE, feedback_text STRING ); Now, you can load the data into the table using Snowflake's Snowsight interface. The sample data "customer_feedback_demomart.csv" is available in the GitHub repo. You can download and use it. Step 2: Use Cortex to Classify Sentiment and Category Let's read and process each row from the feedback table. Here's the magic. This single query classifies each piece of feedback for both sentiment and category: SQL SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING AS sentiment, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_category FROM customer.csat.feedback LIMIT 10; I have used the CLASSIFY_TEXT Function available within Snowflake's cortex to derive the sentiment based on the feedback_text and further classify it into a specific category the feedback is associated with, such as 'Product', 'Customer Service', 'Delivery', and so on. P.S.: You can change the categories based on your business needs. Step 3: Store Classified Results Let's store the classified results in a separate table for further reporting and analysis purposes. For this, I have created a table with the name feedback_classified as shown below. SQL CREATE OR REPLACE TABLE customer.csat.feedback_classified ( feedback_id INT, feedback_ts DATE, feedback_text STRING, sentiment STRING, feedback_category STRING ); Initial Bulk Load Now, let's do an initial bulk classification for all existing data before moving on to the incremental processing of newly arriving data. 
SQL -- Initial Load INSERT INTO customer.csat.feedback_classified SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_label, CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP FROM customer.csat.feedback; Once the initial load is completed successfully, let's build an SQL that fetches only incremental data based on the processed_ts column value. For the incremental load, we need fresh data with customer feedback. For that, let's insert ten new records into our raw table customer.csat.feedback SQL INSERT INTO customer.csat.feedback (feedback_id, feedback_ts, feedback_text) VALUES (5001, CURRENT_DATE, 'My DemoMart order was delivered to the wrong address again. Very disappointing.'), (5002, CURRENT_DATE, 'I love the new packaging DemoMart is using. So eco-friendly!'), (5003, CURRENT_DATE, 'The delivery speed was slower than promised. Hope this improves.'), (5004, CURRENT_DATE, 'The product quality is excellent, I’m genuinely impressed with DemoMart.'), (5005, CURRENT_DATE, 'Customer service helped me cancel and reorder with no issues.'), (5006, CURRENT_DATE, 'DemoMart’s website was down when I tried to place my order.'), (5007, CURRENT_DATE, 'Thanks DemoMart for the fast shipping and great support!'), (5008, CURRENT_DATE, 'Received a damaged item. This is the second time with DemoMart.'), (5009, CURRENT_DATE, 'DemoMart app is very user-friendly. Shopping is a breeze.'), (5010, CURRENT_DATE, 'The feature I wanted is missing. Hope DemoMart adds it soon.'); Step 4: Automate Incremental Data Processing With TASK Now that we have newly added (incremental) fresh data into our raw table, let's create a task to pick up only new data and classify it automatically. We will schedule this task to run every Sunday at midnight UTC. SQL --Creating task CREATE OR REPLACE TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED WAREHOUSE = COMPUTE_WH SCHEDULE = 'USING CRON 0 0 * * 0 UTC' -- Run evey Sunday at midnight UTC AS INSERT INTO customer.csat.feedback_classified SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_label, CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP FROM customer.csat.feedback WHERE feedback_ts > (SELECT COALESCE(MAX(PROCESSED_TIMESTAMP),'1900-01-01') FROM CUSTOMER.CSAT.FEEDBACK_CLASSIFIED ); This will automatically run every Sunday at midnight UTC, process any newly arrived customer feedback, and classify it. 
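One operational detail the steps above do not show: Snowflake creates tasks in a suspended state, so the weekly schedule will not fire until the task is resumed. The following is a minimal sketch, assuming the current role holds (or is granted) the account-level EXECUTE TASK privilege; the task name matches the one created above. SQL
-- Newly created tasks are suspended by default; resume it so the CRON schedule takes effect
ALTER TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED RESUME;
-- Optionally trigger a single run right away to verify the incremental load
EXECUTE TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED;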
Step 5: Visualize Insights You can now build dashboards in Snowsight to see weekly trends using a simple query like this: SQL SELECT feedback_category, sentiment, COUNT(*) AS total FROM customer.csat.feedback_classified GROUP BY feedback_category, sentiment ORDER BY total DESC; Conclusion With just a few lines of SQL, you: Ingested raw feedback into a Snowflake table.Used Snowflake Cortex to classify customer feedback and derive sentiment and feedback categoriesAutomated the process to run weeklyBuilt insights into the classified feedback for business users/leadership team to act upon by category and sentiment This approach is ideal for support teams, product teams, and leadership, as it allows them to continuously monitor customer experience without building or maintaining ML infrastructure. GitHub I have created a GitHub page with all the code and sample data. You can access it freely. The whole dataset generator and SQL scripts are available on GitHub.

By Rajanikantarao Vellaturi
Converting List to String in Terraform

In Terraform, you will often need to convert a list to a string when passing values to configurations that require a string format, such as resource names, cloud instance metadata, or labels. Terraform uses HCL (HashiCorp Configuration Language), so handling lists requires functions like join() or format(), depending on the context. How to Convert a List to a String in Terraform The join() function is the most effective way to convert a list into a string in Terraform. This concatenates list elements using a specified delimiter, making it especially useful when formatting data for use in resource names, cloud tags, or dynamically generated scripts. The join(", ", var.list_variable) function, where list_variable is the name of your list variable, merges the list elements with ", " as the separator. Here’s a simple example: Shell variable "tags" { default = ["dev", "staging", "prod"] } output "tag_list" { value = join(", ", var.tags) } The output would be: Shell "dev, staging, prod" Example 1: Formatting a Command-Line Alias for Multiple Commands In DevOps and development workflows, it’s common to run multiple commands sequentially, such as updating repositories, installing dependencies, and deploying infrastructure. Using Terraform, you can dynamically generate a shell alias that combines these commands into a single, easy-to-use shortcut. Shell variable "commands" { default = ["git pull", "npm install", "terraform apply -auto-approve"] } output "alias_command" { value = "alias deploy='${join(" && ", var.commands)}'" } Output: Shell "alias deploy='git pull && npm install && terraform apply -auto-approve'" Example 2: Creating an AWS Security Group Description Imagine you need to generate a security group rule description listing allowed ports dynamically: Shell variable "allowed_ports" { default = [22, 80, 443] } resource "aws_security_group" "example" { name = "example_sg" description = "Allowed ports: ${join(", ", [for p in var.allowed_ports : tostring(p)])}" dynamic "ingress" { for_each = var.allowed_ports content { from_port = ingress.value to_port = ingress.value protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } } } The join() function, combined with a list comprehension, generates a dynamic description like "Allowed ports: 22, 80, 443". This ensures the security group documentation remains in sync with the actual rules. Alternative Methods For most use cases, the join() function is the best choice for converting a list into a string in Terraform, but the format() and jsonencode() functions can also be useful in specific scenarios. 1. Using format() for Custom Formatting The format() function helps control the output structure while joining list items. It does not directly convert lists to strings, but it can be used in combination with join() to achieve custom formatting. Shell variable "ports" { default = [22, 80, 443] } output "formatted_ports" { value = format("Allowed ports: %s", join(" | ", var.ports)) } Output: Shell "Allowed ports: 22 | 80 | 443" 2. Using jsonencode() for JSON Output When passing structured data to APIs or Terraform modules, you can use the jsonencode() function, which converts a list into a JSON-formatted string. Shell variable "tags" { default = ["dev", "staging", "prod"] } output "json_encoded" { value = jsonencode(var.tags) } Output: Shell "["dev", "staging", "prod"]" Unlike join(), this format retains the structured array representation, which is useful for JSON-based configurations. 
Creating a Literal String Representation in Terraform Sometimes you need to convert a list into a literal string representation, meaning the output should preserve the exact structure as a string (e.g., including brackets, quotes, and commas like a JSON array). This is useful when passing data to APIs, logging structured information, or generating configuration files. For most cases, jsonencode() is the best option due to its structured formatting and reliability in API-related use cases. However, if you need a simple comma-separated string without additional formatting, join() is the better choice. Common Scenarios for List-to-String Conversion in Terraform Converting a list to a string in Terraform is useful in multiple scenarios where Terraform requires string values instead of lists. Here are some common use cases: Naming resources dynamically: When creating resources with names that incorporate multiple dynamic elements, such as environment, application name, and region, these components are often stored as a list for modularity. Converting them into a single string allows for consistent and descriptive naming conventions that comply with provider or organizational naming standards.Tagging infrastructure with meaningful identifiers: Tags are often key-value pairs where the value needs to be a string. If you’re tagging resources based on a list of attributes (like team names, cost centers, or project phases), converting the list into a single delimited string ensures compatibility with tagging schemas and improves downstream usability in cost analysis or inventory tools.Improving documentation via descriptions in security rules: Security groups, firewall rules, and IAM policies sometimes allow for free-form text descriptions. Providing a readable summary of a rule’s purpose, derived from a list of source services or intended users, can help operators quickly understand the intent behind the configuration without digging into implementation details.Passing variables to scripts (e.g., user_data in EC2 instances): When injecting dynamic values into startup scripts or configuration files (such as a shell script passed via user_data), you often need to convert structured data like lists into strings. This ensures the script interprets the input correctly, particularly when using loops or configuration variables derived from Terraform resources.Logging and monitoring, ensuring human-readable outputs: Terraform output values are often used for diagnostics or integration with logging/monitoring systems. Presenting a list as a human-readable string improves clarity in logs or dashboards, making it easier to audit deployments and troubleshoot issues by conveying aggregated information in a concise format. Key Points Converting lists to strings in Terraform is crucial for dynamically naming resources, structuring security group descriptions, formatting user data scripts, and generating readable logs. Using join() for readable concatenation, format() for creating formatted strings, and jsonencode() for structured output ensures clarity and consistency in Terraform configurations.

By Mariusz Michalowski
How to Install and Set Up Jenkins With Docker Compose

Jenkins is an open-source CI/CD tool written in Java that is used for organising CI/CD pipelines. Currently, at the time of writing this blog, it has 24k stars and 9.1k forks on GitHub. With over 2000 plugin support, Jenkins is a well-known tool in the DevOps world. The following are multiple ways to install and set up Jenkins: using the Jenkins Installer package for Windows, using Homebrew for macOS, using the Generic Java Package (war), using Docker, using Kubernetes, or using apt for Ubuntu/Debian Linux OS. In this tutorial blog, I will cover the step-by-step process to install and set up Jenkins using Docker Compose for an efficient and seamless CI/CD experience. Using Docker with Jenkins allows users to set up a Jenkins instance quickly with minimal manual configuration. It also ensures portability and scalability: with Docker Compose, users can easily set up Jenkins and its required services, such as volumes and networks, using a single YAML file. This allows the users to easily manage and replicate the setup in different environments. Installing Jenkins Using Docker Compose Installing Jenkins with Docker Compose makes the setup process simple and efficient, and allows us to define configurations in a single file. This approach removes the complexity and difficulty faced while installing Jenkins manually and ensures easy deployment, portability, and quick scaling. Prerequisite As a prerequisite, Docker Desktop needs to be installed, up and running on the local machine. Docker Compose is included in Docker Desktop along with Docker Engine and Docker CLI. Jenkins With Docker Compose Jenkins could be instantly set up by running the following docker-compose command using the terminal: Plain Text docker compose up -d This docker-compose command could be run by navigating to the folder where the Docker Compose file is placed. So, let’s create a new folder jenkins-demo and inside this folder, let’s create another new folder jenkins-configuration and a new file docker-compose.yaml. The following is the folder structure: Plain Text jenkins-demo/ ├── jenkins-configuration/ └── docker-compose.yaml The following content should be added to the docker-compose.yaml file. YAML # docker-compose.yaml version: '3.8' services: jenkins: image: jenkins/jenkins:lts privileged: true user: root ports: - 8080:8080 - 50000:50000 container_name: jenkins volumes: - /Users/faisalkhatri/jenkins-demo/jenkins-configuration:/var/jenkins_home - /var/run/docker.sock:/var/run/docker.sock Decoding the Docker Compose File The first line in the file is a comment. The services block starts from the second line, which includes the details of the Jenkins service. The Jenkins service block contains the image, user, and port details. The Jenkins service will run the latest Jenkins image with root privileges and name the container jenkins. The ports are responsible for mapping container ports to the host machine. The details of these ports are as follows: 8080:8080: This will map port 8080 inside the container to port 8080 on the host machine. It is important, as it is required for accessing the Jenkins web interface. It will help us in accessing Jenkins in the browser by navigating to http://localhost:8080. 50000:50000: This will map port 50000 inside the container to port 50000 on the host machine. It is the JNLP (Java Network Launch Protocol) agent port, which is used for connecting Jenkins build agents to the Jenkins Controller instance.
It is important, as we would be using distributed Jenkins setups, where remote build agents connect to the Jenkins Controller instance. The privileged: true setting will grant the container full access to the host system and allow running the process as the root user on the host machine. This will enable the container to perform the following actions : Access all the host devicesModify the system configurationsMount file systemsManage network interfacesPerform admin tasks that a regular container cannot perform These actions are important, as Jenkins may require permissions to run specific tasks while interacting with the host system, like managing Docker containers, executing system commands, or modifying files outside the container. Any data stored inside the container is lost when the container stops or is removed. To overcome this issue, Volumes are used in Docker to persist data beyond the container’s lifecycle. We will use Docker Volumes to keep the Jenkins data intact, as it is needed every time we start Jenkins. Jenkins data would be stored in the jenkins-configuration folder on the local machine. The /Users/faisalkhatri/jenkins-demo/jenkins-configuration on the host is mapped to /var/jenkins_home in the container. The changes made inside the container in the respective folder will reflect on the folder on the host machine and vice versa. This line /var/run/docker.sock:/var/run/docker.sock, mounts the Docker socket from the host into the container, allowing the Jenkins container to directly communicate with the Docker daemon running on the host machine. This enables Jenkins, which is running inside the container, to manage and run Docker commands on the host, allowing it to build and run other Docker containers as a part of CI/CD pipelines. Installing Jenkins With Docker Compose Let’s run the installation process step by step as follows: Step 1 — Running Jenkins Setup Open a terminal, navigate to the jenkins-demo folder, and run the following command: Plain Text docker compose up -d After the command is successfully executed, open any browser on your machine and navigate to https://localhost:8080, you should be able to find the Unlock Jenkins screen as shown in the screenshot below: Step 2 — Finding the Jenkins Password From the Docker Container The password to unlock Jenkins could be found by navigating to the jenkins container (remember we had given the name jenkins to the container in the Docker Compose file) and checking out its logs by running the following command on the terminal: Plain Text docker logs jenkins Copy the password from the logs, paste it in the Administrator password field on the Unlock Jenkins screen in the browser, and click on the Continue button. Step 3 — Setting up Jenkins The “Getting Started” screen will be displayed next, which will prompt us to install plugins to set up Jenkins. Select the Install suggested plugins and proceed with the installation. It will take some time for the installations to complete. Step 4 — Creating Jenkins user After the installation is complete, Jenkins will show the next screen to update the user details. It is recommended to update the user details with a password and click on Save and Continue. This username and password can then be used to log in to Jenkins. Step 5 — Instance Configuration In this window, we can update the Jenkins accessible link so it can be further used to navigate and run Jenkins. However, we can leave it as it is now — http://localhost:8080. Click on the Save and Finish button to complete the set up. 
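As an optional sanity check during any of these steps (the commands below are standard Docker CLI usage, not part of the original walkthrough), you can confirm that the container is running and read the unlock password directly from the persisted Jenkins home instead of the container logs: Plain Text
docker compose ps
docker exec jenkins cat /var/jenkins_home/secrets/initialAdminPassword
Run docker compose ps from the jenkins-demo folder; the jenkins service should be reported as running.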
With this, the Jenkins installation and set up are complete; we are now ready to use Jenkins. Summary Docker is the go-to tool for instantly spinning up a Jenkins instance. Using Docker Compose, we installed Jenkins successfully in just 5 simple steps. Once Jenkins is up and started, we can install the required plugin and set up CI/CD workflows as required. Using Docker Volumes allows us to use Jenkins seamlessly, as it saves the instance data between restarts. In the next tutorial, we will learn about installing and setting up Jenkins agents that will help us run the Jenkins jobs.

By Faisal Khatri DZone Core CORE
Beyond Java Streams: Exploring Alternative Functional Programming Approaches in Java

Few concepts in Java software development have changed how we approach writing code in Java than Java Streams. They provide a clean, declarative way to process collections and have thus become a staple in modern Java applications. However, for all their power, Streams present their own challenges, especially where flexibility, composability, and performance optimization are priorities. What if your programming needs more expressive functional paradigms? What if you are looking for laziness and safety beyond what Streams provide and want to explore functional composition at a lower level? In this article, we will be exploring other functional programming techniques you can use in Java that do not involve using the Streams API. Java Streams: Power and Constraints Java Streams are built on a simple premise—declaratively process collections of data using a pipeline of transformations. You can map, filter, reduce, and collect data with clean syntax. They eliminate boilerplate and allow chaining operations fluently. However, Streams fall short in some areas: They are not designed for complex error handling.They offer limited lazy evaluation capabilities.They don’t integrate well with asynchronous processing.They lack persistent and immutable data structures. One of our fellow DZone members wrote a very good article on "The Power and Limitations of Java Streams," which describes both the advantages and limitations of what you can do using Java Streams. I agree that Streams provide a solid basis for functional programming, but I suggest looking around for something even more powerful. The following alternatives are discussed within the remainder of this article, expanding upon points introduced in the referenced piece. Vavr: A Functional Java Library Why Vavr? Provides persistent and immutable collections (e.g., List, Set, Map)Includes Try, Either, and Option types for robust error handlingSupports advanced constructs like pattern matching and function composition Vavr is often referred to as a "Scala-like" library for Java. It brings in a strong functional flavor that bridges Java's verbosity and the expressive needs of functional paradigms. Example: Java Option<String> name = Option.of("Bodapati"); String result = name .map(n -> n.toUpperCase()) .getOrElse("Anonymous"); System.out.println(result); // Output: BODAPATI Using Try, developers can encapsulate exceptions functionally without writing try-catch blocks: Java Try<Integer> safeDivide = Try.of(() -> 10 / 0); System.out.println(safeDivide.getOrElse(-1)); // Output: -1 Vavr’s value becomes even more obvious in concurrent and microservice environments where immutability and predictability matter. Reactor and RxJava: Going Asynchronous Reactive programming frameworks such as Project Reactor and RxJava provide more sophisticated functional processing streams that go beyond what Java Streams can offer, especially in the context of asynchrony and event-driven systems. Key Features: Backpressure control and lazy evaluationAsynchronous stream compositionRich set of operators and lifecycle hooks Example: Java Flux<Integer> numbers = Flux.range(1, 5) .map(i -> i * 2) .filter(i -> i % 3 == 0); numbers.subscribe(System.out::println); Use cases include live data feeds, user interaction streams, and network-bound operations. In the Java ecosystem, Reactor is heavily used in Spring WebFlux, where non-blocking systems are built from the ground up. 
RxJava, on the other hand, has been widely adopted in Android development where UI responsiveness and multithreading are critical. Both libraries teach developers to think reactively, replacing imperative patterns with a declarative flow of data. Functional Composition with Java’s Function Interface Even without Streams or third-party libraries, Java offers the Function<T, R> interface that supports method chaining and composition. Example: Java Function<Integer, Integer> multiplyBy2 = x -> x * 2; Function<Integer, Integer> add10 = x -> x + 10; Function<Integer, Integer> combined = multiplyBy2.andThen(add10); System.out.println(combined.apply(5)); // Output: 20 This simple pattern is surprisingly powerful. For example, in validation or transformation pipelines, you can modularize each logic step, test them independently, and chain them without side effects. This promotes clean architecture and easier testing. JEP 406 — Pattern Matching for Switch Pattern matching, introduced in Java 17 as a preview feature, continues to evolve and simplify conditional logic. It allows type-safe extraction and handling of data. Example: Java static String formatter(Object obj) { return switch (obj) { case Integer i -> "Integer: " + i; case String s -> "String: " + s; default -> "Unknown type"; }; } Pattern matching isn’t just syntactic sugar. It introduces a safer, more readable approach to decision trees. It reduces the number of nested conditions, minimizes boilerplate, and enhances clarity when dealing with polymorphic data. Future versions of Java are expected to enhance this capability further with deconstruction patterns and sealed class integration, bringing Java closer to pattern-rich languages like Scala. Recursion and Tail Call Optimization Workarounds Recursion is fundamental in functional programming. However, Java doesn’t optimize tail calls, unlike languages like Haskell or Scala. That means recursive functions can easily overflow the stack. Vavr offers a workaround via trampolines: Java static Trampoline<Integer> factorial(int n, int acc) { return n == 0 ? Trampoline.done(acc) : Trampoline.more(() -> factorial(n - 1, n * acc)); } System.out.println(factorial(5, 1).result()); Trampolining ensures that recursive calls don’t consume additional stack frames. Though slightly verbose, this pattern enables functional recursion in Java safely. Conclusion: More Than Just Streams "The Power and Limitations of Java Streams" offers a good overview of what to expect from Streams, and I like how it starts with a discussion on efficiency and other constraints. So, I believe Java functional programming is more than just Streams. There is a need to adopt libraries like Vavr, frameworks like Reactor/RxJava, composition, pattern matching, and recursion techniques. To keep pace with the evolution of the Java enterprise platform, pursuing hybrid patterns of functional programming allows software architects to create systems that are more expressive, testable, and maintainable. Adopting these tools doesn’t require abandoning Java Streams—it means extending your toolbox. What’s Next? Interested in even more expressive power? Explore JVM-based functional-first languages like Kotlin or Scala. They offer stronger FP constructs, full TCO, and tighter integration with functional idioms. Want to build smarter, more testable, and concurrent-ready Java systems? Time to explore functional programming beyond Streams. The ecosystem is richer than ever—and evolving fast. 
What are your thoughts about functional programming in Java beyond Streams? Let’s talk in the comments!

By Rama Krishna Prasad Bodapati
Serverless IAM: Implementing IAM in Serverless Architectures with Lessons from the Security Trenches

When I first began working with serverless architectures in 2018, I quickly discovered that my traditional security playbook wasn't going to cut it. The ephemeral nature of functions, the distributed service architecture, and the multiplicity of entry points created a fundamentally different security landscape. After several years of implementing IAM strategies for serverless applications across various industries, I've compiled the approaches that have proven most effective in real-world scenarios. This article shares these insights, focusing on practical Python implementations that address the unique security challenges of serverless environments. The Shifting Security Paradigm in Serverless Traditional security models rely heavily on network perimeters and long-running servers where security agents can monitor activity. Serverless computing dismantles this model through several key characteristics: Execution lifetime measured in milliseconds: Functions that spin up, execute, and terminate in the blink of an eye make traditional agent-based security impracticalHighly distributed components: Instead of monolithic services, serverless apps often comprise dozens or hundreds of small functionsMultiple ingress points: Rather than funneling traffic through a single application gatewayComplex service-to-service communication patterns: With functions frequently calling other servicesPerformance sensitivity: Where security overhead can significantly impact cold start times During a financial services project last year, we learned this lesson the hard way when our initial security approach added nearly 800ms to function cold starts—unacceptable for an API that needed to respond in under 300ms total. Core Components of Effective Serverless IAM Through trial and error across multiple projects, I've found that serverless IAM strategies should address four key areas: 1. User and Service Authentication Authenticating users and services in a serverless context requires approaches optimized for stateless, distributed execution: JWT-based authentication: These stateless tokens align perfectly with the ephemeral nature of serverless functionsOpenID Connect (OIDC): For standardized authentication flows that work across service boundariesAPI keys and client secrets: When service-to-service authentication is requiredFederated identity: Leveraging identity providers to offload authentication complexity 2. Authorization and Access Control After verifying identity, you need robust mechanisms to control access: Role-based access control (RBAC): Assigning permissions based on user rolesAttribute-based access control (ABAC): More dynamic permissions based on user attributes and contextPolicy enforcement points: Strategic locations within your architecture where access decisions occur 3. Function-Level Permissions The functions themselves need careful permission management: Principle of least privilege: Granting only the minimal permissions requiredFunction-specific IAM roles: Approving tailored permissions for each functionResource-based policies: Controlling which identities can invoke your functions 4. 
4. Secrets Management

Secure handling of credentials and sensitive information:

- Managed secrets services: Cloud-native solutions for storing and accessing secrets
- Environment variables: For injecting configuration at runtime
- Parameter stores: For less sensitive configuration information

Provider-Specific Implementation Patterns

Having implemented serverless security across major cloud providers, I've developed practical patterns for each platform. These examples reflect real-world implementations with necessary simplifications for clarity.

AWS: Pragmatic IAM Approaches

AWS offers several robust options for serverless authentication.

Authentication with Amazon Cognito

Here's a streamlined example of validating Cognito tokens in a Lambda function, with performance optimizations I've found effective in production:

Python

# Example: Validating Cognito tokens in a Lambda function
import json
import os
import boto3
import jwt
import requests
from jwt.algorithms import RSAAlgorithm

# Cache of JWKs - crucial for performance
jwks_cache = {}

def lambda_handler(event, context):
    try:
        # Extract token from Authorization header
        auth_header = event.get('headers', {}).get('Authorization', '')
        if not auth_header or not auth_header.startswith('Bearer '):
            return {
                'statusCode': 401,
                'body': json.dumps({'message': 'Missing or invalid authorization header'})
            }

        token = auth_header.replace('Bearer ', '')

        # Verify the token
        decoded_token = verify_token(token)

        # Process authenticated request with user context
        user_id = decoded_token.get('sub')
        user_groups = decoded_token.get('cognito:groups', [])

        # Your business logic here, using the authenticated user context
        response_data = process_authorized_request(user_id, user_groups, event)

        return {
            'statusCode': 200,
            'body': json.dumps(response_data)
        }
    except jwt.ExpiredSignatureError:
        return {
            'statusCode': 401,
            'body': json.dumps({'message': 'Token expired'})
        }
    except Exception as e:
        print(f"Authentication error: {str(e)}")
        return {
            'statusCode': 401,
            'body': json.dumps({'message': 'Authentication failed'})
        }

def verify_token(token):
    # Decode the token header
    header = jwt.get_unverified_header(token)
    kid = header['kid']

    # Get the public keys if not cached
    region = os.environ['AWS_REGION']
    user_pool_id = os.environ['USER_POOL_ID']

    if not jwks_cache:
        keys_url = f'https://cognito-idp.{region}.amazonaws.com/{user_pool_id}/.well-known/jwks.json'
        jwks = requests.get(keys_url).json()
        jwks_cache.update(jwks)

    # Find the key that matches the kid in the token
    key = None
    for jwk in jwks_cache['keys']:
        if jwk['kid'] == kid:
            key = jwk
            break

    if not key:
        raise Exception('Public key not found')

    # Construct the public key
    public_key = RSAAlgorithm.from_jwk(json.dumps(key))

    # Verify the token
    payload = jwt.decode(
        token,
        public_key,
        algorithms=['RS256'],
        audience=os.environ['APP_CLIENT_ID']
    )

    return payload

This pattern has performed well in production, with the key caching strategy reducing token verification time by up to 80% compared to our initial implementation.
Secrets Management with AWS Secrets Manager

After securing several high-compliance applications, I've found this pattern for secrets management to be both secure and performant:

Python

# Example: Using AWS Secrets Manager in Lambda with caching
import json
import time
import boto3
import os
from botocore.exceptions import ClientError

# Cache for secrets to minimize API calls
secrets_cache = {}
secrets_ttl = {}
SECRET_CACHE_TTL = 300  # 5 minutes in seconds

def lambda_handler(event, context):
    try:
        # Get the secret - using cache if available and not expired
        api_key = get_secret('payment-api-key')

        # Use secret for external API call
        result = call_payment_api(api_key, event.get('body', {}))

        return {
            'statusCode': 200,
            'body': json.dumps({'transactionId': result['id']})
        }
    except ClientError as e:
        print(f"Error retrieving secret: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps({'message': 'Internal error'})
        }

def get_secret(secret_id):
    current_time = int(time.time())

    # Return cached secret if valid
    if secret_id in secrets_cache and secrets_ttl.get(secret_id, 0) > current_time:
        return secrets_cache[secret_id]

    # Create a Secrets Manager client
    secrets_manager = boto3.client('secretsmanager')

    # Retrieve secret
    response = secrets_manager.get_secret_value(SecretId=secret_id)

    # Parse the secret
    if 'SecretString' in response:
        secret_data = json.loads(response['SecretString'])

        # Cache the secret with TTL
        secrets_cache[secret_id] = secret_data
        secrets_ttl[secret_id] = current_time + SECRET_CACHE_TTL

        return secret_data
    else:
        raise Exception("Secret is not a string")

The caching strategy here has been crucial in high-volume applications, where we've seen up to a 95% reduction in Secrets Manager API calls while maintaining reasonable security through a controlled TTL.
Azure Serverless IAM Implementation

When working with Azure Functions, I've developed these patterns for robust security.

Authentication with Azure Active Directory (Entra ID)

For enterprise applications on Azure, this pattern has provided a good balance of security and performance:

Python

# Example: Validating an AAD (Entra ID) token in an Azure Function
import json
import os
import jwt
import requests
import azure.functions as func
from jwt.algorithms import RSAAlgorithm
import logging
from datetime import datetime, timedelta

# Cache for JWKS with TTL
jwks_cache = {}
jwks_timestamp = None
JWKS_CACHE_TTL = timedelta(hours=24)  # Refresh keys daily

def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        # Extract token
        auth_header = req.headers.get('Authorization', '')
        if not auth_header or not auth_header.startswith('Bearer '):
            return func.HttpResponse(
                json.dumps({'message': 'Missing or invalid authorization header'}),
                mimetype="application/json",
                status_code=401
            )

        token = auth_header.replace('Bearer ', '')

        # Validate token
        start_time = datetime.now()
        decoded_token = validate_token(token)
        validation_time = (datetime.now() - start_time).total_seconds()

        # Log performance for monitoring
        logging.info(f"Token validation completed in {validation_time} seconds")

        # Process authenticated request
        user_email = decoded_token.get('email', 'unknown')
        user_name = decoded_token.get('name', 'User')

        return func.HttpResponse(
            json.dumps({
                'message': f'Hello, {user_name}',
                'email': user_email,
                'authenticated': True
            }),
            mimetype="application/json",
            status_code=200
        )
    except Exception as e:
        logging.error(f"Authentication error: {str(e)}")
        return func.HttpResponse(
            json.dumps({'message': 'Authentication failed'}),
            mimetype="application/json",
            status_code=401
        )

def validate_token(token):
    global jwks_cache, jwks_timestamp

    # Decode without verification to get the kid
    header = jwt.get_unverified_header(token)
    kid = header['kid']

    # Get tenant ID from environment
    tenant_id = os.environ['TENANT_ID']

    # Get the keys if not cached or expired
    current_time = datetime.now()
    if not jwks_cache or not jwks_timestamp or current_time - jwks_timestamp > JWKS_CACHE_TTL:
        keys_url = f'https://login.microsoftonline.com/{tenant_id}/discovery/v2.0/keys'
        jwks = requests.get(keys_url).json()
        jwks_cache = jwks
        jwks_timestamp = current_time
        logging.info("JWKS cache refreshed")

    # Find the key matching the kid
    key = None
    for jwk in jwks_cache['keys']:
        if jwk['kid'] == kid:
            key = jwk
            break

    if not key:
        raise Exception('Public key not found')

    # Construct the public key
    public_key = RSAAlgorithm.from_jwk(json.dumps(key))

    # Verify the token
    client_id = os.environ['CLIENT_ID']
    issuer = f'https://login.microsoftonline.com/{tenant_id}/v2.0'

    payload = jwt.decode(
        token,
        public_key,
        algorithms=['RS256'],
        audience=client_id,
        issuer=issuer
    )

    return payload

The key implementation detail here is the TTL-based JWKS cache, which has dramatically improved performance while ensuring keys are periodically refreshed.
Google Cloud Serverless IAM Implementation

For Google Cloud Functions, these patterns have proven effective in production environments.

Authentication with Firebase

This approach works well for consumer-facing applications with Firebase Authentication:

Python

# Example: Validating a Firebase Auth token in a Cloud Function
import json
import firebase_admin
from firebase_admin import auth
from firebase_admin import credentials
import time
import logging
from functools import wraps

# Initialize Firebase Admin SDK (with exception handling for warm instances)
try:
    app = firebase_admin.get_app()
except ValueError:
    cred = credentials.ApplicationDefault()
    firebase_admin.initialize_app(cred)

def require_auth(f):
    @wraps(f)
    def decorated_function(request):
        # Performance tracking
        start_time = time.time()

        # Get the ID token
        auth_header = request.headers.get('Authorization', '')
        if not auth_header or not auth_header.startswith('Bearer '):
            return json.dumps({'error': 'Unauthorized - Missing token'}), 401, {'Content-Type': 'application/json'}

        id_token = auth_header.split('Bearer ')[1]

        try:
            # Verify the token
            decoded_token = auth.verify_id_token(id_token)

            # Check if token is issued in the past
            auth_time = decoded_token.get('auth_time', 0)
            if auth_time > time.time():
                return json.dumps({'error': 'Invalid token auth time'}), 401, {'Content-Type': 'application/json'}

            # Track performance
            validation_time = time.time() - start_time
            logging.info(f"Token validation took {validation_time*1000:.2f}ms")

            # Add user info to request
            request.user = {
                'uid': decoded_token['uid'],
                'email': decoded_token.get('email'),
                'email_verified': decoded_token.get('email_verified', False),
                'auth_time': auth_time
            }

            # Continue to the actual function
            return f(request)
        except Exception as e:
            logging.error(f'Error verifying authentication token: {e}')
            return json.dumps({'error': 'Unauthorized'}), 401, {'Content-Type': 'application/json'}

    return decorated_function

@require_auth
def secure_function(request):
    # The function only executes if auth is successful
    user = request.user

    return json.dumps({
        'message': f'Hello, {user["email"]}!',
        'userId': user['uid'],
        'verified': user['email_verified']
    }), 200, {'Content-Type': 'application/json'}

The decorator pattern has been particularly valuable, standardizing authentication across dozens of functions in larger projects.

Hard-Earned Lessons and Best Practices

After several years of implementing serverless IAM in production, I've learned these critical lessons.

1. Implement Least Privilege with Precision

One of our earlier projects granted overly broad permissions to Lambda functions. This came back to haunt us when a vulnerability in a dependency was exploited, giving the attacker more access than necessary. Now, we religiously follow function-specific permissions:

YAML

# AWS SAM example with precise permissions
Resources:
  ProcessPaymentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: payment_handler.lambda_handler
      Runtime: python3.9
      Policies:
        - DynamoDBReadPolicy:
            TableName: !Ref CustomerTable
        - SSMParameterReadPolicy:
            ParameterName: /prod/payment/api-key
        - Statement:
            - Effect: Allow
              Action:
                - secretsmanager:GetSecretValue
              Resource: !Sub arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:payment/*

2. Implement Smart Caching for Performance

Authentication processes can significantly impact cold start times. Our testing showed that a poorly implemented token validation flow could add 300-500ms to function execution time.
This optimized caching approach has been effective in real-world applications:

Python

# Example: Smart caching for token validation
import hashlib
import jwt
import time
import threading

# Thread-safe token cache with TTL
class TokenCache:
    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.lock = threading.RLock()
        self.ttl = ttl_seconds

    def get(self, token_hash):
        with self.lock:
            cache_item = self.cache.get(token_hash)
            if not cache_item:
                return None

            expiry, user_data = cache_item
            if time.time() > expiry:
                # Token cache entry expired
                del self.cache[token_hash]
                return None

            return user_data

    def set(self, token_hash, user_data):
        with self.lock:
            expiry = time.time() + self.ttl
            self.cache[token_hash] = (expiry, user_data)

# Initialize cache
token_cache = TokenCache()

def get_token_hash(token):
    # Create a hash of the token for the cache key
    return hashlib.sha256(token.encode()).hexdigest()

def validate_token(token):
    # Check cache first
    token_hash = get_token_hash(token)
    cached_user = token_cache.get(token_hash)

    if cached_user:
        print("Cache hit for token validation")
        return cached_user

    print("Cache miss - validating token")

    # Actual token validation logic here (placeholder: signature verification omitted)
    decoded = jwt.decode(token, options={'verify_signature': False})

    # Extract user data
    user_data = {
        'sub': decoded.get('sub'),
        'email': decoded.get('email'),
        'roles': decoded.get('roles', [])
    }

    # Cache the result
    token_cache.set(token_hash, user_data)

    return user_data

In high-volume applications, intelligent caching like this has improved average response times by 30-40%.

3. Implement Proper Defense in Depth

During a security audit of a serverless financial application, we discovered that while our API Gateway had authentication enabled, several functions weren't verifying the JWT token payload. This created a vulnerability where valid but expired tokens could be reused. We now implement defense in depth consistently:

Python

# Example: Multiple validation layers
def process_order(event, context):
    try:
        # 1. Verify authentication token (already checked by API Gateway, but verify again)
        auth_result = verify_token(event)
        if not auth_result['valid']:
            return {
                'statusCode': 401,
                'body': json.dumps({'error': auth_result['error']})
            }

        user = auth_result['user']

        # 2. Validate input data structure
        body = json.loads(event.get('body', '{}'))
        validation_errors = validate_order_schema(body)
        if validation_errors:
            return {
                'statusCode': 400,
                'body': json.dumps({'errors': validation_errors})
            }

        # 3. Verify business-level authorization
        auth_result = check_order_authorization(user, body)
        if not auth_result['authorized']:
            return {
                'statusCode': 403,
                'body': json.dumps({'error': auth_result['reason']})
            }

        # 4. Process with proper input sanitization
        processed_data = sanitize_order_input(body)

        # 5. Execute with error handling
        result = create_order(user['id'], processed_data)

        # 6. Return success with minimal information
        return {
            'statusCode': 200,
            'body': json.dumps({'orderId': result['id']})
        }
    except Exception as e:
        # Log detailed error internally but return generic message
        log_detailed_error(e)
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'An unexpected error occurred'})
        }

This approach has proven effective in preventing various attack vectors.

4. Build Secure Service-to-Service Communication

One of the more challenging aspects of serverless security is function-to-function communication.
In a recent project, we implemented this pattern for secure internal communication:

Python

# Example: Service-to-service communication with JWT
import json
import jwt
import time
import os
import requests

def generate_service_token(service_name, target_service):
    # Create a signed JWT for service-to-service auth
    secret = os.environ['SERVICE_JWT_SECRET']

    payload = {
        'iss': service_name,
        'sub': f'service:{service_name}',
        'aud': target_service,
        'iat': int(time.time()),
        'exp': int(time.time() + 60),  # Short-lived token (60 seconds)
        'scope': 'service'
    }

    return jwt.encode(payload, secret, algorithm='HS256')

def call_order_service(customer_id, order_data):
    service_token = generate_service_token('payment-service', 'order-service')

    # Call the order service with the token
    response = requests.post(
        os.environ['ORDER_SERVICE_URL'],
        json={
            'customerId': customer_id,
            'orderDetails': order_data
        },
        headers={
            'Authorization': f'Bearer {service_token}',
            'Content-Type': 'application/json'
        }
    )

    if response.status_code != 200:
        raise Exception(f"Order service error: {response.text}")

    return response.json()

This pattern ensures that even if one function is compromised, the attacker has limited time to exploit the service token.
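The example above shows the calling side only. For completeness, here is a minimal, illustrative sketch of what verification might look like on the receiving service, assuming the same shared SERVICE_JWT_SECRET and the hypothetical service names used above:

Python

# Illustrative only: verifying the short-lived service token on the receiving side.
# Assumes the same SERVICE_JWT_SECRET and hypothetical service names as above.
import os
import jwt

def verify_service_token(token, expected_audience='order-service'):
    """Return the verified claims, or raise if the token is invalid."""
    secret = os.environ['SERVICE_JWT_SECRET']

    # PyJWT checks the signature, expiry ('exp'), and audience ('aud') for us
    claims = jwt.decode(
        token,
        secret,
        algorithms=['HS256'],
        audience=expected_audience
    )

    # Application-level check: only accept tokens minted for service calls
    if claims.get('scope') != 'service':
        raise jwt.InvalidTokenError('Not a service-scoped token')

    return claims

def order_service_handler(event, context):
    auth_header = event.get('headers', {}).get('Authorization', '')
    if not auth_header.startswith('Bearer '):
        return {'statusCode': 401, 'body': '{"error": "Missing service token"}'}

    try:
        claims = verify_service_token(auth_header.replace('Bearer ', ''))
    except jwt.PyJWTError:
        return {'statusCode': 401, 'body': '{"error": "Invalid service token"}'}

    # claims['iss'] identifies the calling service, e.g. 'payment-service'
    return {'statusCode': 200, 'body': '{"status": "accepted"}'}

A shared symmetric secret is the simplest option; an asymmetric key pair or a provider-managed mechanism can reduce the blast radius if one service's configuration leaks.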
5. Implement Comprehensive Security Monitoring

After a security incident where unauthorized token usage went undetected for days, we implemented enhanced security monitoring:

Python

# Example: Enhanced security logging for authentication
import json
import os
import jwt
import time
import logging
from datetime import datetime
import traceback

def log_auth_event(event_type, user_id, ip_address, success, details=None):
    """Log authentication events in a standardized format"""
    log_entry = {
        'timestamp': datetime.utcnow().isoformat(),
        'event': f'auth:{event_type}',
        'userId': user_id,
        'ipAddress': ip_address,
        'success': success,
        'region': os.environ.get('AWS_REGION', 'unknown'),
        'functionName': os.environ.get('AWS_LAMBDA_FUNCTION_NAME', 'unknown')
    }

    if details:
        log_entry['details'] = details

    # Log in JSON format for easy parsing
    logging.info(json.dumps(log_entry))

def authenticate_user(event):
    try:
        # Extract IP from request context
        ip_address = event.get('requestContext', {}).get('identity', {}).get('sourceIp', 'unknown')

        # Extract and validate token
        auth_header = event.get('headers', {}).get('Authorization', '')
        if not auth_header or not auth_header.startswith('Bearer '):
            log_auth_event('token_missing', 'anonymous', ip_address, False)
            return {'authenticated': False, 'error': 'Missing authentication token'}

        token = auth_header.replace('Bearer ', '')

        # Track timing for performance monitoring
        start_time = time.time()

        try:
            # Validate token (implementation details omitted)
            decoded_token = validate_token(token)
            validation_time = time.time() - start_time

            user_id = decoded_token.get('sub', 'unknown')

            # Log successful authentication
            log_auth_event('login', user_id, ip_address, True, {
                'validationTimeMs': round(validation_time * 1000),
                'tokenExpiry': datetime.fromtimestamp(decoded_token.get('exp')).isoformat()
            })

            return {
                'authenticated': True,
                'user': {
                    'id': user_id,
                    'email': decoded_token.get('email'),
                    'roles': decoded_token.get('roles', [])
                }
            }
        except jwt.ExpiredSignatureError:
            # Extract user ID from expired token for logging
            try:
                expired_payload = jwt.decode(token, options={'verify_signature': False})
                user_id = expired_payload.get('sub', 'unknown')
            except Exception:
                user_id = 'unknown'

            log_auth_event('token_expired', user_id, ip_address, False)
            return {'authenticated': False, 'error': 'Authentication token expired'}
        except Exception as e:
            log_auth_event('token_invalid', 'unknown', ip_address, False, {
                'error': str(e),
                'tokenFragment': token[:10] + '...' if len(token) > 10 else token
            })
            return {'authenticated': False, 'error': 'Invalid authentication token'}
    except Exception as e:
        # Unexpected error in authentication process
        error_details = {
            'error': str(e),
            'trace': traceback.format_exc()
        }
        log_auth_event('auth_error', 'unknown', 'unknown', False, error_details)
        return {'authenticated': False, 'error': 'Authentication system error'}

This comprehensive logging approach has helped us identify suspicious patterns and potential attacks before they succeed.

Advanced Patterns from Production Systems

As our serverless systems have matured, we've implemented several advanced patterns that have proven valuable.

1. Fine-Grained Authorization with OPA

For a healthcare application with complex authorization requirements, we implemented Open Policy Agent (OPA):

Python

# Example: Using OPA for authorization in AWS Lambda
import json
import os
import requests
from datetime import datetime

def check_authorization(user, resource, action):
    """Check if user is authorized to perform action on resource using OPA"""
    # Create authorization query
    auth_query = {
        'input': {
            'user': {
                'id': user['id'],
                'roles': user['roles'],
                'department': user.get('department'),
                'attributes': user.get('attributes', {})
            },
            'resource': resource,
            'action': action,
            'context': {
                'environment': os.environ.get('ENVIRONMENT', 'dev'),
                'timestamp': datetime.utcnow().isoformat()
            }
        }
    }

    # Query OPA for authorization decision
    try:
        opa_url = os.environ['OPA_URL']
        response = requests.post(
            f"{opa_url}/v1/data/app/authz/allow",
            json=auth_query,
            timeout=0.5  # Set reasonable timeout
        )

        # Parse response
        if response.status_code == 200:
            result = response.json()
            is_allowed = result.get('result', False)

            # Log authorization decision
            log_auth_event(
                'authorization',
                user['id'],
                'N/A',
                is_allowed,
                {
                    'resource': resource.get('type') + ':' + resource.get('id'),
                    'action': action,
                    'allowed': is_allowed
                }
            )

            return {
                'authorized': is_allowed,
                'reason': None if is_allowed else "Not authorized for this operation"
            }
        else:
            # OPA service error
            log_auth_event(
                'authorization_error',
                user['id'],
                'N/A',
                False,
                {
                    'statusCode': response.status_code,
                    'response': response.text
                }
            )

            # Fall back to deny by default
            return {
                'authorized': False,
                'reason': "Authorization service error"
            }
    except Exception as e:
        # Error communicating with OPA
        log_auth_event(
            'authorization_error',
            user['id'],
            'N/A',
            False,
            {'error': str(e)}
        )

        # Default deny on errors
        return {
            'authorized': False,
            'reason': "Authorization service unavailable"
        }

This approach has allowed us to implement complex authorization rules that would be unwieldy to code directly in application logic.
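For context, a call site for check_authorization might look like the following sketch. The resource dictionary shape (type and id fields) is inferred from the logging code above and is an assumption, as is the handler itself:

Python

# Illustrative call site for check_authorization(); the resource shape is an
# assumption inferred from the logging in the example above.
import json

def get_patient_record(event, context):
    user = event['user']  # assume an upstream authenticator populated this
    resource = {'type': 'patient-record', 'id': event['pathParameters']['recordId']}

    decision = check_authorization(user, resource, 'read')
    if not decision['authorized']:
        return {'statusCode': 403, 'body': json.dumps({'error': decision['reason']})}

    # ... fetch and return the record here ...
    return {'statusCode': 200, 'body': json.dumps({'recordId': resource['id']})}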
2. Multi-Tenant Security Pattern

For SaaS applications with multi-tenant requirements, we've developed this pattern:

Python

# Example: Multi-tenant request handling in AWS Lambda
import json
import boto3
import os
from boto3.dynamodb.conditions import Key

def lambda_handler(event, context):
    try:
        # Authenticate user
        auth_result = authenticate_user(event)
        if not auth_result['authenticated']:
            return {
                'statusCode': 401,
                'body': json.dumps({'error': auth_result['error']})
            }

        user = auth_result['user']

        # Extract tenant ID from token or path parameter
        requested_tenant_id = event.get('pathParameters', {}).get('tenantId')
        user_tenant_id = user.get('tenantId')

        # Security check: User can only access their assigned tenant
        if not user.get('isAdmin', False) and requested_tenant_id != user_tenant_id:
            log_auth_event(
                'tenant_access_denied',
                user['id'],
                get_source_ip(event),
                False,
                {
                    'requestedTenant': requested_tenant_id,
                    'userTenant': user_tenant_id
                }
            )

            return {
                'statusCode': 403,
                'body': json.dumps({'error': 'Access denied to this tenant'})
            }

        # Create tenant-specific DynamoDB client
        dynamodb = boto3.resource('dynamodb')
        table = dynamodb.Table(os.environ['DATA_TABLE'])

        # Query with tenant isolation to prevent data leakage
        result = table.query(
            KeyConditionExpression=Key('tenantId').eq(requested_tenant_id)
        )

        # Audit the data access
        log_data_access(
            user['id'],
            requested_tenant_id,
            'query',
            result['Count']
        )

        return {
            'statusCode': 200,
            'body': json.dumps({
                'items': result['Items'],
                'count': result['Count']
            })
        }
    except Exception as e:
        # Log the error but return generic message
        log_error(str(e), event)
        return {
            'statusCode': 500,
            'body': json.dumps({'error': 'Internal server error'})
        }

This pattern has successfully prevented tenant data leakage even in complex multi-tenant systems.

Conclusion: Security is a Journey, Not a Destination

Implementing IAM in serverless architectures requires a different mindset from traditional application security. Rather than focusing on perimeter security, the emphasis shifts to identity-centric, fine-grained permissions that align with the distributed nature of serverless applications.

Through my journey implementing serverless security across various projects, I've found that success depends on several key factors:

- Designing with least privilege from the start: It's much harder to reduce permissions later than to grant them correctly initially
- Balancing security with performance: Intelligent caching and optimization strategies are essential
- Building defense in depth: No single security control should be your only line of defense
- Monitoring and responding to security events: Comprehensive logging and alerting provide visibility
- Continuously adapting security practices: Serverless security is evolving rapidly as the technology matures

The serverless paradigm has fundamentally changed how we approach application security. By embracing these changes and implementing the patterns described in this article, you can build serverless applications that are both secure and scalable.

Remember that while cloud providers secure the underlying infrastructure, the security of your application logic, authentication flows, and data access patterns remains your responsibility. The shared responsibility model is especially important in serverless architectures, where the division of security duties is less clear than in traditional deployments.
As serverless adoption continues to grow, expect to see more sophisticated security solutions emerge that address the unique challenges of highly distributed, ephemeral computing environments. By implementing the practices outlined here, you'll be well-positioned to leverage these advancements while maintaining strong security fundamentals.

By Mahesh Vaijainthymala Krishnamoorthy
Defining Effective Microservice Boundaries - A Practical Approach To Avoiding The Most Common Mistakes

Have you found yourself staring at an entire whiteboard filled with boxes and arrows, pondering whether this would become the next awesome microservices architecture or the worst distributed monolith that ever existed? Same here, and more often than I would like to admit.

Last month, I was talking to one of my cofounder friends, and he mentioned, “We have 47 services!” with pride. Two weeks later, I was going through their docs and found that deploying a simple feature required changes in six of their services. What I thought was their “microservices” architecture turned out to be a monolith split into pieces, with all the complexity of distribution and none of the benefits.

Perhaps the most critically important, and the most underappreciated, step in this architectural style is correctly partitioning microservices. Done well, it increases independent deployability, fault isolation, and team velocity. Mess it up, and welcome to a distributed system that is a thousand times harder to maintain than the monolith you wanted to replace.

The Anti-Patterns: How Boundaries Fail

The Distributed Monolith: Death by a Thousand Cuts

An application made up of multiple tightly interdependent services is a pattern I encounter frequently, known as the “distributed monolith.” It occurs when the complexity of distribution exists, but not the advantages. Here are some indicators that you are running a distributed monolith rather than true microservices:

- A single modification requires adaptations across multiple services.
- One service's dependency going down breaks the others.
- Release planning requires complex cross-team coordination.

A team I recently interacted with had to coordinate a deployment across eight services just to add a field to their user profile. That is neither microservices nor service-oriented; that is just an unnecessarily intricate web of self-inflicted torture.

The Shared Database Trap

“We all need to use the same data!” is the alarm bell that usually precedes this trap. Having many services access the same database tables creates hidden coupling that eliminates the isolation advantages your architecture is supposed to provide. I saw a retail company suffer four hours of downtime on Black Friday when their inventory service changed a database schema that their order service relied on.

Nanoservice Growth: Over-Indulging on the Good Stuff

This can also go in the opposite direction, which I sometimes refer to as “nanoservice madness.” You create an endless number of services, and your architecture turns into something resembling spaghetti. One of the gaming companies I consulted for was creating individual microservices for user achievements, user preferences, user friends, and even user authentication and profile. Each of these services had its own deployment pipeline, database, and even an on-call rotation. The operational overhead was too much for their small team.

A Defined Scenario: Redesigning an E-Commerce Boundary

Let me show you an actual scenario from last year.
I was consulting for an e-commerce business that had a typical case of a “distributed monolith.” Their initial architecture was something along the lines of this:

YAML

# Original architecture with poor boundaries
services:
  product-service:
    responsibilities:
      - Management of product catalogs
      - Inventory management
      - Rules associated with pricing
      - Discount calculations
    database: shared_product_db
    dependencies:
      - user-service
      - order-service

  order-service:
    responsibilities:
      - Management and creation of orders
      - Processing of payments
      - Coordination of shipping
    database: shared_order_db
    dependencies:
      - product-service
      - user-service

  user-service:
    responsibilities:
      - User profiles
      - Authentication
      - Authorization
      - User preferences
    database: shared_user_db
    dependencies:
      - product-service

The problems were obvious. Each service had a reasonable set of responsibilities, but they were riddled with circular dependencies and knew too much about each other. Changes required coordinating at least three separate teams, which is a disaster waiting to happen.

Their business experts were with us for a week. By the end of day one, the sticky notes had taken over the walls. The product team was in a heated debate with the inventory folks over who “owned” the concept of a product being “in stock.” It was chaotic, but by the end of the week, we had much clearer boundaries. The end result is as follows:

YAML

services:
  catalog-service:
    responsibilities:
      - Product information
      - Categorization
      - Search
    database: catalog_db
    dependencies: []

  inventory-service:
    responsibilities:
      - Stock tracking
      - Reservations
    database: inventory_db
    dependencies: []

  pricing-service:
    responsibilities:
      - Base prices
      - Discounts
      - Promotions
    database: pricing_db
    dependencies: []

  order-service:
    responsibilities:
      - Order creation
      - Tracking
      - History
    database: order_db
    dependencies:
      - catalog-service
      - inventory-service
      - pricing-service (all async)

  payment-service:
    responsibilities:
      - Payment processing
      - Refunds
    database: payment_db
    dependencies: []

  user-profile-service:
    responsibilities:
      - Profile management
      - Preferences
    database: user_profile_db
    dependencies: []

  auth-service:
    responsibilities:
      - Authentication
      - Authorization
    database: auth_db
    dependencies: []

I can hear your initial reaction: “You went from 3 services to 7? That is increasing complexity, not decreasing it.” The thing is, every service now has one dedicated responsibility, the dependencies are reduced and mostly asynchronous (see the sketch below), and each service is fully in control of its own data.
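To illustrate what “mostly asynchronous” can mean in practice, here is a small, illustrative sketch of the order-to-inventory interaction as a published domain event rather than a direct call. The topic name, event shape, and use of SNS here are assumptions for this sketch, not the client's actual implementation:

Python

# Illustrative only: one way an asynchronous dependency between the redesigned
# services could work. Names and event shape are assumptions.
import json
import os
import boto3

sns = boto3.client('sns')

def place_order(order):
    """order-service: persist the order, then announce it asynchronously."""
    # ... write the order to order_db here ...

    # Publish a domain event instead of calling inventory-service directly
    sns.publish(
        TopicArn=os.environ['ORDER_EVENTS_TOPIC_ARN'],
        Subject='OrderPlaced',
        Message=json.dumps({
            'eventType': 'OrderPlaced',
            'orderId': order['id'],
            'items': order['items']  # e.g. [{'sku': 'ABC-1', 'quantity': 2}]
        })
    )

def handle_order_event(event, context):
    """inventory-service: Lambda subscribed to the topic, reserves stock."""
    for record in event['Records']:
        message = json.loads(record['Sns']['Message'])
        if message['eventType'] != 'OrderPlaced':
            continue
        for item in message['items']:
            reserve_stock(item['sku'], item['quantity'])  # inventory_db update

def reserve_stock(sku, quantity):
    # Placeholder for the inventory-service's own persistence logic
    print(f"Reserving {quantity} x {sku}")

Because the two services only share an event contract, the inventory team can change its schema or scale independently without the order team coordinating a release.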
The outcome was dramatic. The average time to implement new features decreased by 60%, while deployment frequency went up by 300%. Their Black Friday sale was the real test for us six months later. Each service scaled according to its own load patterns rather than over-provisioning resources as they had the previous year: while the catalog service required 20 instances, payment only needed five. In the middle of the night, their CTO texted me a beer emoji, the universal sign of a successful launch.

A Practical Strategy for Finding the Right Boundaries

Start With Domain-Driven Design (But Make It Work)

As much as Domain-Driven Design (DDD) purists would like to disagree, you don’t need to be a purist to benefit from DDD’s tools for exploring boundaries. Start with Event Storming, a workshop approach where you gather domain experts and developers to construct business process models using sticky notes representing domain events, commands, and aggregates. This type of collaboration often exposes boundaries that already exist naturally in your domain.

The “Two Pizza Team” Rule Still Works

Amazon continues to enforce its famous rule that a team should be small enough to be fed by two pizzas. The same logic applies to microservices: the team that owns a service should fit in a single meeting room. If a service grows so complicated that it takes more than 5-8 engineers to maintain it, that's often an indication it should be split. The inverse is also true: if you have 20 services and only 5 engineers, there is a good chance you’ve become too fine-grained.

The Cognitive Load Test

An approach I adopted in 2025, which I call the “cognitive load test” for boundary determination, has proven very effective. It’s straightforward: can a new team member understand the goals, duties, and functions of the service within a day? If not, the service probably does too much or is fragmented in confusing ways.

Actionable Insights for 2025 and Beyond

The Strangler Fig Pattern: Expand Your Horizons

When remodeling an existing system, don’t sweat getting the boundaries right on the first attempt. Implement the Strangler Fig pattern, which gradually replaces parts of a monolithic architecture with well-structured microservices (it is named after a vine that gradually overtakes its host tree). A healthcare client of mine tried to design the perfect microservices architecture for 18 months without writing a single line of code; by the time they finished, changing business requirements had made much of the design obsolete.

The Newest Pattern: Boundary Observability

A trend I've started noticing in 2025 is something I call “boundary observability”: monitoring cross-service dependencies and data consistency. Tools such as ServiceMesh and BoundaryGuard will notify you when services are getting too chatty or when data redundancy is posing a consistency threat.

Concluding Remarks: Boundaries Are a Journey, Not a Destination

After assisting countless businesses with adopting microservices, I've learned that domain boundaries shift as business needs change, and understanding them is a continuous effort. When there is clear value in doing so, start with coarse-grained services and split them as you learn more. Boundaries are judgment calls: there is a fine line between data that should be shared and data that should be duplicated, so be reasonable. Most importantly, pay attention to the problems and pain your teams face; there is a strong chance they are clues to boundary issues.

As my mentor used to say, “the best microservices architecture isn’t the one that looks prettiest on a whiteboard—it’s the one that lets your teams ship features quickly and safely without wanting to quit every other Tuesday.”

By Mohit Menghnani
