DZone Spotlight

Monday, June 16
Code of Shadows: Master Shifu and Po Use Functional Java to Solve the Decorator Pattern Mystery

By Shamik Mitra
It was a cold, misty morning at the Jade Palace. The silence was broken not by combat… but by a mysterious glitch in the logs.

Po (rushing in): "Shifu! The logs… they're missing timestamps!"

Shifu (narrowing his eyes): "This is no accident, Po. This is a breach in the sacred code path. The timekeeper has been silenced."

Traditional OOP Decorator

Shifu unfurled an old Java scroll:

Java

//Interface
package com.javaonfly.designpatterns.decorator.oops;

public interface Loggable {
    public void logMessage(String message);
}

//Implementation
package com.javaonfly.designpatterns.decorator.oops.impl;

import com.javaonfly.designpatterns.decorator.oops.Loggable;

public class SimpleLogger implements Loggable {
    @Override
    public void logMessage(String message) {
        System.out.println(message);
    }
}

//Decorator implementation
class TimestampLogger implements Loggable {
    private Loggable wrapped;

    public TimestampLogger(Loggable wrapped) {
        this.wrapped = wrapped;
    }

    public void logMessage(String message) {
        String timestamped = "[" + System.currentTimeMillis() + "] " + message;
        wrapped.logMessage(timestamped);
    }
}

//Calling the decorator
public class Logger {
    public static void main(String[] args) {
        Loggable simpleLogger = new SimpleLogger();
        simpleLogger.logMessage("This is a simple log message.");

        Loggable timestampedLogger = new TimestampLogger(simpleLogger);
        timestampedLogger.logMessage("This is a timestamped log message.");
    }
}

//Output
This is a simple log message.
[1748594769477] This is a timestamped log message.

Po: "Wait, we're creating all these classes just to add a timestamp?"

Shifu: "That is the illusion of control. Each wrapper adds bulk. True elegance lies in Functional Programming."

Functional Decorator Pattern With Lambdas

Shifu waved his staff and rewrote the scroll:

Java

package com.javaonfly.designpatterns.decorator.fp;

import java.util.function.Function;

public class Logger {

    //higher-order function: accepts logging functions and composes them
    public void decoratedLogMessage(Function<String, String> simpleLogger,
                                    Function<String, String> timestampLogger) {
        String message = simpleLogger.andThen(timestampLogger).apply("This is a log message.");
        System.out.println(message);
    }

    public static void main(String[] args) {
        Logger logger = new Logger();

        Function<String, String> simpleLogger = message -> {
            System.out.println(message);
            return message;
        };

        Function<String, String> timestampLogger = message -> {
            String timestampedMessage = "[" + System.currentTimeMillis() + "] " + message;
            return timestampedMessage;
        };

        logger.decoratedLogMessage(simpleLogger, timestampLogger);
    }
}

//Output
This is a log message.
[1748595357335] This is a log message.

Po (blinking): "So... no more wrappers, just function transformers?"

Shifu (nodding wisely): "Yes, Po. In Functional Programming, functions are first-class citizens. The Function<T, R> interface lets us compose behavior. Each transformation can be chained using andThen, like stacking skills in Kung Fu."
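A quick aside that is not part of the original dialogue: with java.util.function.Function, ordering matters. andThen applies the current function first and its argument second, while compose flips that order. A minimal sketch, where the upper-casing lambda is purely illustrative:

Java

import java.util.function.Function;

public class CompositionOrder {
    public static void main(String[] args) {
        Function<String, String> timestampLogger =
                message -> "[" + System.currentTimeMillis() + "] " + message;
        Function<String, String> shout = message -> message.toUpperCase();

        // andThen: the timestamp is added first, then the whole string is upper-cased
        System.out.println(timestampLogger.andThen(shout).apply("log me"));

        // compose: the argument runs first, so the message is upper-cased
        // and the timestamp is prepended afterwards
        System.out.println(timestampLogger.compose(shout).apply("log me"));
    }
}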
Breaking Down the Code – Functional Wisdom Explained

Po (scratching his head): "Shifu, what exactly is this Function<T, R> thing? Is it some kind of scroll?"

Shifu (gently): "Ah, Po. It is not a scroll. It is a powerful interface from the java.util.function package—a tool forged in the fires of Java 8. Function<T, R> represents a function that accepts an input of type T and produces a result of type R."

In our case:

Java

Function<String, String> simpleLogger

This means: "Take a String message, and return a modified String message." Each logger lambda—like simpleLogger and timestampLogger—does exactly that.

The Art of Composition — andThen

Po (eyes wide): "But how do they all work together? Like… kung fu moves in a combo?"

Shifu (smiling): "Yes. That combo is called composition. And the technique is called andThen."

Java

simpleLogger.andThen(timestampLogger)

This means:

• First, execute simpleLogger, which prints the message and passes it on.
• Then, take the result and pass it to timestampLogger, which adds the timestamp.

This is function chaining—the essence of functional design.

Java

String message = simpleLogger
        .andThen(timestampLogger)
        .apply("This is a log message.");

Like chaining martial arts techniques, each function passes its result to the next—clean, fluid, precise.

Po: "So the message flows through each function like a river through stones?"

Shifu: "Exactly. That is the way of the Stream."

Functional Flow vs OOP Structure

Shifu (serenely): "Po, unlike the OOP approach where you must wrap one class inside another—creating bulky layers—the functional approach lets you decorate behavior on the fly, without classes or inheritance."

• No need to create SimpleLogger, TimestampLogger, or interfaces.
• Just use Function<String, String> lambdas and compose them.

The Secret to Clean Code

"A true master does not add weight to power. He adds precision to purpose." – Master Shifu

This approach:

• Eliminates boilerplate.
• Encourages reusability.
• Enables testability (each function can be unit-tested in isolation).
• Supports dynamic behavior chaining.

Po's New Move: Making the Logger Generic

After mastering the basics, Po's eyes sparkled with curiosity.

Po: "Shifu, what if I want this technique to work with any type—not just strings?"

Shifu (with a deep breath): "Yes, of course you can! Try to write it, Dragon Warrior."

Po meditated for a moment, and then rewrote the logger:

Java

public <T> void decoratedLogMessage(Function<T, T>... loggers) {
    Function<T, T> pipeline = Arrays.stream(loggers).sequential().reduce(Function.identity(), Function::andThen);
    T message = pipeline.apply((T) "This is a log message.");
    System.out.println(message);
}

Po (bowing): "Master Shifu, after learning to compose logging functions using Function<String, String>, I asked myself — what if I could decorate not just strings, but any type of data? Numbers, objects, anything! So I used generics and built this move..."

Java

public <T> void decoratedLogMessage(Function<T, T>... loggers) {

"This declares a generic method where T can be any type — String, Integer, or even a custom User object. The method takes a varargs of Function<T, T> — that means a flexible number of functions that take and return the same type."

Java

Function<T, T> pipeline = Arrays.stream(loggers).sequential().reduce(Function.identity(), Function::andThen);

"I stream all the logger functions and reduce them into a single pipeline function using Function::andThen. Function.identity() is the neutral starting point — like standing still before striking. Function::andThen chains each logger — like chaining combos in kung fu!"

Java

T message = pipeline.apply((T) "This is a log message.");

"I apply the final pipeline function to a sample input. Since this time I tested it with a String, I cast it as (T). But this method can now accept any type!"

Shifu (smiling, eyes narrowing with pride): "You've taken the form beyond its scroll, Po. You have learned not just to use functions—but to respect their essence. This generic version... is the true Dragon Scroll of the Decorator."

Modified Code by Po

Java

package com.javaonfly.designpatterns.decorator.fp;

import java.util.Arrays;
import java.util.function.Function;

public class Logger {

    public <T> void decoratedLogMessage(Function<T, T>... loggers) {
        Function<T, T> pipeline = Arrays.stream(loggers).sequential().reduce(Function.identity(), Function::andThen);
        T message = pipeline.apply((T) "This is a log message.");
        System.out.println(message);
    }

    public static void main(String[] args) {
        Logger logger = new Logger();

        Function<String, String> simpleLogger = message -> {
            System.out.println(message);
            return message;
        };

        Function<String, String> timestampLogger = message -> {
            String timestampedMessage = "[" + System.currentTimeMillis() + "] " + message;
            return timestampedMessage;
        };

        Function<String, String> jadeLogger = message -> {
            String jadeLoggedMessage = "[jadelog] " + message;
            return jadeLoggedMessage;
        };

        logger.decoratedLogMessage(simpleLogger, timestampLogger, jadeLogger);
    }
}

//Output
This is a log message.
[jadelog] [1748598136677] This is a log message.

Wisdom Scroll: OOP vs. Functional Decorator

Feature | OOP Decorator | Functional Decorator
Needs Class | Yes | No
Uses Interface | Yes | Optional
Composability | Rigid | Elegant
Boilerplate | High | Minimal
Flexibility | Moderate | High (thanks to lambdas)

Final Words from Master Shifu

"Po, the world of code is full of distractions—designs that look powerful but slow us down. A true Kung Fu developer learns to adapt. To decorate without weight. To enhance without inheritance. To flow with functions, not fight the structure."
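A closing aside that is not part of the original scroll: Po's generic move needs the unchecked (T) cast only because the sample message is hard-coded inside the method. A minimal variation (my own sketch, under the assumption that the caller supplies the message as a parameter) avoids the cast entirely and works for any type:

Java

import java.util.Arrays;
import java.util.function.Function;

public class GenericLogger {

    // The message to decorate is now a parameter, so no cast is needed.
    @SafeVarargs
    public static <T> T decorate(T message, Function<T, T>... loggers) {
        Function<T, T> pipeline = Arrays.stream(loggers)
                .reduce(Function.identity(), Function::andThen);
        return pipeline.apply(message);
    }

    public static void main(String[] args) {
        String decorated = decorate("This is a log message.",
                msg -> "[" + System.currentTimeMillis() + "] " + msg,
                msg -> "[jadelog] " + msg);
        System.out.println(decorated);

        // The same pipeline technique now works for non-String types too.
        Integer doubledThenIncremented = decorate(21, n -> n * 2, n -> n + 1);
        System.out.println(doubledThenIncremented); // 43
    }
}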
Want to Become a Senior Software Engineer? Do These Things

By Seun Matt
In my experience working with and leading software engineers, I have seen mid-level engineers produce outcomes worthy of a Senior, and Seniors who are only so in title. High-performing mid-levels eventually overtook under-performing seniors. How you become a Senior Software Engineer is important. If you become a Senior because you're the last man standing or the one with the longest tenure, I am afraid that future upward movement may be challenging, especially if you decide to go elsewhere.

I have been fortunate to directly mentor a couple of engineers to become Senior, and to witness the journey of others. In this article, I am going to discuss the day-to-day activities that distinguish the best and how you can join their ranks.

Know the Basics

Bruce Lee famously said: "I fear not the man who has practised 10,000 kicks once, but I fear the man who has practised one kick 10,000 times." This is an homage to the importance of getting the basics right.

1. Write Clean Code

If you want to become a Senior Software Engineer, you have to write clean and reliable code. The pull request you authored should not read like a Twitter thread because of the myriad of corrections. Your code contributions should completely address the assigned task. If the task is to create a function that sums two numbers, then in addition to the + operation, add validations. Take care of null cases, and use the correct data types in the function parameters. Think about number overflow and other edge cases. This is what it means to have your code contribution address the task at hand completely. Pay attention to the coding standard and ensure your code changes adhere to it.

When you create pull requests that do not require too many corrections, work as expected, and more, you'll be able to complete more tasks per sprint and become one of the top contributors on the team. You see where this is going already.

You should pay attention to the smallest details in your code. Perform null checks, and use the appropriate data types and more. For example, in Java, do not use Integer everywhere just because you can; it takes more memory and may impair the performance of your application in production. Instead of writing multiple, nested if...else constructs, use early return.

Don't do this:

Java

public boolean sendEmailToUser(User user) {
    if (user != null && !user.getEmail().isEmpty()) {
        String template = "src/path/to/email/template";
        template = template
                .replace("username", user.getFirstName() + " " + user.getLastName())
                .replace("link", "https://password-reset.example.com");
        emailService.sendEmail(template);
        return true;
    }
    return false;
}

Do this instead. It's cleaner and more readable:

Java

public boolean sendEmailToUser(User user) {
    if (user == null || user.getEmail().isEmpty()) {
        return false;
    }

    String template = "src/path/to/email/template";
    template = template
            .replace("username", user.getFirstName() + " " + user.getLastName())
            .replace("link", "https://password-reset.example.com");
    emailService.sendEmail(template);
    return true;
}

Ensure you handle different scenarios in your logic. If you are making external HTTP calls, ensure there is exception handling that caters to 5XX and 4XX responses. Validate that the return payload has the expected data points. Implement retry logic where applicable.

Write the simplest and most performant version of your logic. Needlessly fanciful and complicated code only impresses one person: your current self. Your future self will wonder what on earth you had to drink the day you wrote that code, to say nothing of how other people will perceive it down the line. What typically happens to such complicated, unmaintainable code is that it gets rewritten and deprecated. So, if your goal is to leave a legacy behind, needlessly complicated, non-performant, hard-to-maintain code will not help.
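To ground the earlier point about external HTTP calls, here is a minimal, illustrative sketch (my own, not from the article) that treats 4XX and 5XX responses differently and retries transient failures with a simple backoff; the URL, attempt count, and backoff values are placeholder assumptions:

Java

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResilientClient {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Retries only transient failures (5XX or I/O errors); 4XX is treated as a caller problem.
    public static String fetchWithRetry(String url, int maxAttempts) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
                int status = response.statusCode();

                if (status >= 200 && status < 300) {
                    return response.body();
                }
                if (status >= 400 && status < 500) {
                    // Client errors will not succeed on retry; fail fast with context.
                    throw new IllegalStateException("Client error " + status + " calling " + url);
                }
                // 5XX: fall through and retry after a short backoff.
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    throw e;
                }
            }
            if (attempt < maxAttempts) {
                Thread.sleep(500L * attempt); // simple linear backoff between attempts
            }
        }
        throw new IOException("Exhausted " + maxAttempts + " attempts calling " + url);
    }
}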
If you're using reactive Java programming, please do not write deeply nested code, the callback hell of JavaScript. Use functional programming to separate the different aspects and have a single clean pipeline.

2. Write Readable Code

In addition to writing clean code, your code should be readable. Don't write code as if you're a minifier of some sort. Use whitespace properly. Coding is akin to creating art. Write beautiful code that others want to read.

Use the right variable names. var a = 1 + 2; might make sense now, until you need to troubleshoot and begin to wonder what on earth a is. Now you have to run the application in debug mode and observe the values to decode what it means. This extra step (read: extra minutes or hours) could have been avoided from the outset.

Write meaningful comments and Javadoc. Please don't do this and call it a Javadoc:

Java

/**
 * @author smatt
 */

We will be able to tell you're the author of the code when we do git blame. Therefore, kindly stop adding a Javadoc to a method or class just to put your name there. You're contributing to the company's codebase, not an open-source repo on GitHub. Moreover, if your contribution is substantial enough, we will definitely remember you wrote it.

Writing meaningful comments and Javadoc is all the more necessary when you're writing special business logic. Your comment or Javadoc can be the saving grace for your future self or a colleague when that business logic needs to be improved. I once spent about two weeks trying to understand the logic for generating "identifiers". It wasn't funny. Brilliant logic, but it took me weeks to appreciate it. A well-written Javadoc and documentation could have saved me some hours.

Avoid spelling mistakes in variable names, comments, function names, etc. Unless your codebase is not in English, please use comprehensible English variable names. We should not need Alan Turing to infer the function of a variable, a method, or a class from its name. Think about it: this is why the Java ecosystem seems to have long method names. We would rather have long names with explicit meaning than require a codex to translate one.

Deepen Your Knowledge

Software engineering is a scientific and knowledge-based profession. What you know counts a lot towards your growth. What you know how to do is the currency of the trade. If you want to become a Senior Software Engineer, you need to know how to use the tools and platforms employed in your organization. I have interviewed great candidates who did not get the maximum available offer because they only knew as far as committing to the production branch. When it comes to how the deployment pipeline works, or how the logs, alerts, and other observability components work, they don't know; "the DevOps team handles that one."

As a Senior Software Engineer, you need to be able to follow your code everywhere: from the product requirements and technical specification through slicing, refinement, writing, code reviews, deployment, monitoring, and support. This is how you establish your knowledge and become a "Senior".

Your organization uses Kibana or Grafana for log visualization, or New Relic, Datadog, etc. Do you know how to filter for all the logs from a single service?
Do you know how to view the logs for a single HTTP request? Let's say you have an APM platform, such as Datadog, New Relic, or Grafana. Do you know how to set up alerts? Can you interpret an alert, or do you believe your work is limited to writing code and merging to master, while everything else should be handled by "other people"? If you want to become a Senior Software Engineer, you have to learn how these things are set up and how they work, and be able to fix them if they break and improve them too.

Currently, you're not a Senior Software Engineer, but have you ever wondered what your "Senior Engineer" or "Tech Lead" had to do before assigning a task to you? There are important steps that happen before and after the writing of the code. It is expected that a Senior Software Engineer should know them and be able to do them well. If your company writes technical specifications or holds refinement sessions, poker planning, or ticket slicing, don't be satisfied just being in attendance. Attach yourself to someone who's already leading these and ask to help them out. When given the opportunity, pour your heart into it. Get feedback, and you become better over time.

If you want to become a Senior Software Engineer, be the embodiment of your organization's coding standard. If there's none — jackpot! Research and implement one. In this process, you'll move from someone who ONLY executes to someone who's involved in the execution planning and thus ready for more responsibility, a.k.a. the Senior Software Engineer role.

Still on deepening your knowledge, you should know how the application in your custody works. One rule of thumb I have for myself is this: "Seun, if you leave this place today and someone asks you to come and build similar things, will you be able to do it?" It's a simple concept, but powerful. If your team has special implementations and logic somewhere that's hard to understand, make it your job to understand them. Be the guy who knows the hard stuff. If there's a special authentication mechanism, be the one who knows all about it. If there's a special user flow that gets confusing, be the one who knows it off-hand. Be the guy in Engineering who knows how the CI/CD pipeline works and is able to fix issues. Ask yourself this: do you know how your organization's deployment pipeline works, or do you just write your code and pass it on to someone else to deploy?

Without deepening your knowledge, you will not be equipped to take on more responsibilities, help during an incident, or proffer solutions to major pain points. Be the resident expert, and you can be guaranteed that your ascension will be swift.

Be Proactive and Responsible

I once interviewed someone who seemed to have a good grasp of the basics and the coding aspect. However, we were able to infer from their submissions that they had never led any project before. While they may be a good executor, they're not yet ready for the Senior role. Volunteer and actively seek opportunities to lead initiatives and do hard things in your organization. Your team is about to start a new project or build a new feature? Volunteer to be the technical owner. When you are given the opportunity, give it your best. Write the best technical specification there is, and review pull requests from other people in time. Submit the most code contributions, and organize and carry out water-tight user acceptance tests.
When the feature or project is in production, follow up with it to ensure it is doing exactly what it is supposed to do and delivering value for the business. Do these, and now you have things to reference when your name comes up for a promotion.

Furthermore, take responsibility for team and organizational challenges. For example, no one wants to update a certain codebase because the test cases are flaky and boring. Be the guy who fixes that without being asked. Solving a problem of that magnitude shows you as a dependable team member who is hungry for more. Another example: your CI/CD pipeline takes 60 minutes to run. Why not be the person who takes some time out to optimize it? If you get it from 60 minutes to 45 minutes, that's a 25% improvement in the pipeline. Compute the number of times the job has to run per day and multiply that by five days a week, and you could be looking at something like 375 minutes of saved man-hours. Holy karamba! That's from a single initiative. Now, that's a Senior-worthy outcome. I'm sure that if you look at your Engineering organization, there are countless issues to fix and things to improve. You just need to do them.

Another practical thing you can do to demonstrate proactivity and responsibility is simply being available. You most likely have heard something like "the available becomes the desired." There's an ongoing production incident and a bridge call? Join the call. Try to contribute as much as possible to help solve the problem. Attend the post-incident review calls and contribute. You see, by joining these calls, you'll see and learn how the current Seniors troubleshoot issues. You'll see how they use the tools, and you'll learn a thing or two. It may not be a production incident; it may be a customer support ticket, or an alert on Slack about something going wrong. Don't just "look and pass" or suddenly go offline. Show some care, attempt to fix it, and you can only get better at it.

The best thing about getting better at it is that you become a critical asset to the company. While it is true that anyone is expendable, provided the company is ready to bear the cost, I have also been in a salary review session where some people got a little bit more than others on the same level because they were considered a "critical asset." It's a thing, and whether you know it or not, it applies. Be proactive (do things without being told to do so) and be responsible (take ownership), and see yourself grow beyond your imagination.

Be a Great Communicator

Being a great communicator is crucial to your career progression to the Senior Software Engineer role. The reason is that you work with other human beings, and they are not mind readers. Are you responding to the customer support ticket or to an alert on Slack? Say so in that Slack thread so other people can do something else. Are you blocked? Please say so. Mention what exactly you have tried and ask for ideas. We don't want to find out on the eve of the delivery date that you've been stuck for the past week and that the team won't be able to meet its deadline. When other people ask for help, and you are able to, please unblock them. You only get better by sharing with your team.

Adopt open communication as much as possible. It will save you from having to private message 10 different people on the same subject to reach a decision. Just start a Slack thread in the channel, and everyone can contribute and reach a decision faster. It also helps with accountability and responsibility.

What If?
"Seun Matt, what if I do all these things and still do not get promoted? I am the resident expert, I know my stuff, I am proactive, and I am a great communicator. I have the basics covered, and I even set the standards. Despite all of this, I have not been promoted in years."

I hear you loud and clear, my friend. We have all been there at some point. There are times when an organization is not able to give pay raises or do promotions due to economic challenges, lack of profitability, and other prevailing market forces. Remember, the companies we work for do not print money; it is from their profit that they fund promotions, raises, and bonuses.

For you, it is a win-win situation no matter how you look at it. The skill sets you now have and the path you have taken are all yours, and they are applicable at the next company. In this profession, your next salary is influenced by what you're doing at your current place and what you've done in the past. So, even if it does not work out in your current place, when you put yourself out there, you'll get a better offer, all things being equal.

Conclusion

No matter where you work, ensure you have a good experience. "I have done it before" trumps the number of years. In software engineering, experience is not just about the number of years you've worked in a company or have been coding. Experience has to do with the number of challenges you have solved yourself. How many "I have seen this before" moments do you have under your belt? Being passive or just clocking in 9–5 will not get you that level of experience. You need to participate and lead. The interesting part is that the salary in your next job will be determined by the experience you're garnering in your current work.

Doing all of the above is a call to duty. It requires extra work, and it has extra rewards for those able to pursue it. Stay curious, see you at the top, and happy coding!

Trend Report

Generative AI

AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry exponentially, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide.

Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more.

We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.

Refcard #158

Machine Learning Patterns and Anti-Patterns

By Tuhin Chattopadhyay

Refcard #269

Getting Started With Data Quality

By Miguel Garcia

More Articles

Understanding the Circuit Breaker: A Key Design Pattern for Resilient Systems

Reliability is critical, especially when services are interconnected and failures in one component can have a cascading effect on other services. The Circuit Breaker Pattern is an important design pattern used to build fault-tolerant and resilient systems, particularly in microservices architectures. This article explains the fundamentals of the circuit breaker pattern, its benefits, and how to implement it to protect your systems from failure.

What Is the Circuit Breaker Pattern?

The Circuit Breaker Pattern is inspired by the electrical circuit breakers you see at home, which are designed to detect faults and stop the flow of electricity when problems occur. In software, this pattern monitors service interactions and prevents continuous calls and retries to a failing or failed service, which would otherwise overload a service that is already struggling. By "breaking" the circuit between services, this pattern allows a system to gracefully handle failures and avoid cascading problems.

How Does It Actually Work?

[Diagram: state diagram showing the different states of the circuit breaker pattern]

The circuit breaker has three distinct states: Closed, Open, and Half-Open.

• Closed State: Normally, the circuit breaker is "closed," meaning the loop is closed and requests flow as usual between services. (In electrical terms, the wires are connected to allow the flow of electricity.)
• Open State: When the circuit breaker is open, it immediately rejects requests to the failing service, preventing further stress on the service and giving it time to recover. During this time, fallback mechanisms can be triggered, such as returning cached data or default responses.
• Half-Open State: Following a defined timeout, the circuit breaker switches to the half-open state and allows a limited number of requests through to determine whether the service has been restored. If those requests succeed, the circuit breaker is closed again; otherwise, it goes back to the open state.

The main idea behind this design pattern is to prevent a failing service from pulling down the entire system and to provide a way to recover once the service becomes healthy.

[Diagram: electrical analogy to remember the open and closed states]
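Before reaching for a library, it can help to see those three states as a tiny state machine. The following is a minimal, illustrative sketch of the idea only (not production code and not tied to any particular framework); the consecutive-failure threshold and open-duration timer are simplifying assumptions:

Java

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final Duration openDuration;  // how long to stay open before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;   // timeout elapsed: allow a probe request through
            } else {
                return fallback.get();     // still open: fail fast with the fallback
            }
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;          // success closes (or keeps closed) the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip the breaker and start the open timer
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}

With this sketch, a call such as breaker.call(() -> inventoryService.fetch(), () -> cachedInventory) fails fast with the fallback while the breaker is open; the service and fallback names here are placeholders.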
Why Use the Circuit Breaker Pattern?

In complex distributed systems, failures are unavoidable. Here are some real reasons why the circuit breaker pattern is essential:

• Preventing cascading failures: In a microservices architecture, if one service fails and others depend on it, the failure can spread across the entire system. The circuit breaker stops this by isolating the faulty service.
• Improving system stability: By stopping requests to a failing service, the circuit breaker prevents resource burn-down and lowers the load on dependent services, helping to stabilize the system.
• Better UX: Instead of having requests hang for too long or return unhandled errors, the circuit breaker allows for graceful degradation by serving fallback responses, improving the user experience even during failures.
• Automated recovery: The half-open state allows the system to automatically test the health of a service and recover without manual intervention.

How to Implement the Circuit Breaker Pattern

The implementation of the circuit breaker pattern depends on the specific stack you're using, but the standard approach remains the same. Below is a high-level overview of how to implement it:

• Set failure thresholds: Define the conditions under which the circuit breaker should open. This can be based on consecutive failures, error rates, or timeouts.
• Monitor requests: Continuously track the success or failure of requests to a service. If the failure threshold is reached, trip the circuit breaker.
• Handle the open state: When the circuit breaker is open, reject further requests to the service and trigger fallback mechanisms.
• Implement the half-open state: After some timeout, let a limited number of requests hit the service to test whether it has recovered. If they are successful, close the circuit breaker.
• Provide fallback mechanisms: During failures, fallback mechanisms can provide default responses, use cached data, or switch to alternate services.

The following example demonstrates how to implement a circuit breaker in Java using the widely adopted Resilience4j library. Resilience4j is a powerful Java library designed to help you implement resilience patterns, such as the Circuit Breaker, Rate Limiter, Retry, Bulkhead, and Time Limiter patterns. One of the main advantages of Resilience4j is its flexibility and easy configuration. Correct configuration of these resilience patterns allows developers to fine-tune their systems for maximum fault tolerance, improved stability, and better performance in the face of errors.

Java

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try; // Try comes from the Vavr library used in Resilience4j examples

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {
    public static void main(String[] args) {
        // Create a custom configuration for the Circuit Breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(5))
                .ringBufferSizeInHalfOpenState(5)
                .ringBufferSizeInClosedState(20)
                .build();

        // Create a CircuitBreakerRegistry with a custom global configuration
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);

        // Get or create a CircuitBreaker from the CircuitBreakerRegistry
        CircuitBreaker circuitBreaker = registry.circuitBreaker("myService");

        // Decorate the service call with the circuit breaker
        // (myService stands in for whatever client or gateway the real application calls)
        Supplier<String> decoratedSupplier = CircuitBreaker
                .decorateSupplier(circuitBreaker, myService::call);

        // Execute the decorated supplier and handle the result
        Try<String> result = Try.ofSupplier(decoratedSupplier)
                .recover(throwable -> "Fallback response");

        System.out.println(result.get());
    }
}

In this example, the circuit breaker is configured to open if 50% of the requests fail. It stays open for 5 seconds before entering the half-open state, during which it allows 5 requests to test the service. If those requests are successful, it closes the circuit breaker, allowing normal operation to resume.

Important Configuration Options for the Circuit Breaker in Resilience4j

Resilience4j provides a flexible and robust implementation of the Circuit Breaker Pattern, allowing developers to configure various aspects to tailor the behavior to their application's needs. Correct configuration is crucial to balancing fault tolerance, system stability, and recovery mechanisms. Below are the key configuration options for Resilience4j's Circuit Breaker:

1. Failure Rate Threshold

This is the percentage of failed requests that will cause the circuit breaker to transition from the Closed state (normal operation) to the Open state (where requests are blocked). The purpose is to control when the circuit breaker should stop forwarding requests to a failing service.
For example, a threshold of 50% means the circuit breaker will open after half of the requests fail.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50) // Open the circuit when 50% of requests fail
        .build();

2. Wait Duration in Open State

The time the circuit breaker remains in the Open state before it transitions to the Half-Open state, where it starts allowing a limited number of requests to test whether the service has recovered. This prevents retrying failed services immediately, giving the downstream service time to recover before testing it again.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .waitDurationInOpenState(Duration.ofSeconds(30)) // Wait for 30 seconds before transitioning to Half-Open
        .build();

3. Ring Buffer Size in Closed State

The number of requests that the circuit breaker records while in the Closed state (before failure rates are evaluated). This acts as a sliding window for error monitoring and helps the circuit breaker determine the failure rate based on recent requests. A larger ring buffer size means more data points are considered when deciding whether to open the circuit.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .ringBufferSizeInClosedState(50) // Consider the last 50 requests to calculate the failure rate
        .build();

4. Ring Buffer Size in Half-Open State

The number of permitted requests in the Half-Open state before deciding whether to close the circuit or revert to the Open state based on success or failure rates. This determines how many requests will be tested in the half-open state to decide whether the service is stable enough to close the circuit or whether it is still failing.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .ringBufferSizeInHalfOpenState(5) // Test with 5 requests in Half-Open state
        .build();

5. Sliding Window Type and Size

Defines how failure rates are measured: either by a count-based sliding window or a time-based sliding window. This provides flexibility in how failure rates are computed. A count-based window is useful in high-traffic systems, whereas a time-based window works well in low-traffic environments.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slidingWindowType(SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(100) // Use a count-based window with the last 100 requests
        .build();

6. Minimum Number of Calls

Specifies the minimum number of requests required before the failure rate is evaluated. This prevents the circuit breaker from opening prematurely when there isn't enough data to calculate a meaningful failure rate, especially during low traffic.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .minimumNumberOfCalls(20) // Require at least 20 calls before evaluating failure rate
        .build();

7. Permitted Number of Calls in Half-Open State

The number of requests allowed to pass through in the Half-Open state to check whether the service has recovered. After transitioning to the half-open state, this setting controls how many requests are allowed through to evaluate service recovery. A smaller value can catch issues faster, while a larger value helps ensure that temporary issues don't result in reopening the circuit.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .permittedNumberOfCallsInHalfOpenState(5) // Test recovery with 5 requests
        .build();

8. Slow Call Duration Threshold

Defines the threshold for a slow call. Calls taking longer than this threshold are considered "slow" and can contribute to the failure rate.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slowCallDurationThreshold(Duration.ofSeconds(2)) // Any call over 2 seconds is considered slow
        .build();

9. Slow Call Rate Threshold

The percentage of "slow" calls that will trigger the circuit breaker to open, similar to the failure rate threshold. This detects services that are degrading in performance before they fail outright, allowing systems to respond to performance issues early.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slowCallRateThreshold(50) // Open the circuit when 50% of calls are slow
        .build();

10. Automatic Transition from Open to Half-Open

Controls whether the circuit breaker automatically transitions from the Open state to the Half-Open state after the set wait duration. This enables the system to recover automatically by testing the service periodically, avoiding the need for manual intervention.

Java

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .automaticTransitionFromOpenToHalfOpenEnabled(true) // Enable automatic transition
        .build();

11. Fallback Mechanism

Helps configure fallback actions when the circuit breaker is open and requests are blocked. This prevents cascading failures and improves UX by serving cached data or default responses.

Java

Try<String> result = Try.ofSupplier(
        CircuitBreaker.decorateSupplier(circuitBreaker, service::call)
).recover(throwable -> "Fallback response");

Conclusion

The Circuit Breaker Pattern is a vital tool in building resilient, fault-tolerant systems. By preventing cascading failures, improving system stability, and enabling graceful recovery, it plays a crucial role in modern software architecture, especially in microservices environments. Whether you're building a large-scale enterprise application or a smaller distributed system, the circuit breaker can be a game-changer in maintaining reliable operations under failure conditions.

By Narendra Lakshmana gowda
AI Agent Architectures: Patterns, Applications, and Implementation Guide

Architecture is something I am very much interested in. As I was exploring AI agents, I was curious to understand agentic architectures. That led me to this awesome resource, The 2025 Guide to AI Agents, published by IBM on their Think page. One of the sections of the guide is about architecture. The architecture section explains that agentic architecture refers to the design and structure enabling AI agents to automate workflows, reason through tasks, and utilize tools to achieve their objectives. This architecture is built to support autonomous, goal-driven behavior by allowing agents to perceive their environment, process information, and act independently within defined rules and constraints. It often incorporates frameworks that facilitate collaboration between multiple agents, known as multi-agent systems, and provides the necessary infrastructure for integrating with external tools, APIs, and data sources. By leveraging agentic architecture, organizations can create scalable, flexible AI solutions that automate complex business processes and adapt to changing requirements.

Introduction to AI Agent Architectures

AI agent architectures provide structural blueprints for designing intelligent systems that perceive environments, process information, and execute actions. These frameworks define how components interact, manage data flow, and make decisions, critically impacting performance, scalability, and adaptability. As AI systems evolve from narrow applications to complex reasoning engines, architectural choices determine their ability to handle uncertainty, integrate new capabilities, and operate in dynamic environments. This guide explores essential patterns with practical implementation insights. Here are some core architecture patterns:

1. Orchestrator-Worker Architecture

The orchestrator-worker pattern represents a centralized approach to task management where a single intelligent controller (the orchestrator) maintains global oversight of system operations. This architecture excels at decomposing complex problems into manageable subtasks, distributing them to specialized worker agents, and synthesizing partial results into complete solutions. The orchestrator serves as the system's "brain," making strategic decisions about task allocation, monitoring worker performance, and implementing fallback strategies when errors occur. Workers operate as domain-specific experts, focusing solely on executing their assigned tasks with maximum efficiency. This separation of concerns enables parallel processing while maintaining centralized control, which is particularly valuable when auditability, reproducibility, or coordinated error recovery is required.

[Diagram: orchestrator-worker architecture]

Concept

The central coordinator decomposes tasks, assigns subtasks to specialized workers, and synthesizes results.

Key Components

• Orchestrator (task decomposition/assignment)
• Worker pool (specialized capabilities)
• Task queue (work distribution)
• Result aggregator

When to Use

• Complex workflows requiring multiple capabilities
• Systems needing centralized monitoring
• Applications with parallelizable tasks

Real-World Case

Banking fraud detection: The orchestrator routes transactions to workers analyzing patterns, location data, and behavior history. Suspicious cases trigger human review.
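To make the orchestrator-worker idea concrete, here is a deliberately simplified sketch (my own illustration, not from the IBM guide) in which an orchestrator fans a transaction out to three specialized workers and aggregates their findings; the fraud-check names and logic are placeholder assumptions:

Java

import java.util.List;
import java.util.concurrent.*;

public class FraudCheckOrchestrator {

    // Each worker produces a finding for the same transaction.
    record Finding(String worker, boolean suspicious) {}

    public static void main(String[] args) throws Exception {
        ExecutorService workerPool = Executors.newFixedThreadPool(3);
        String transaction = "txn-42"; // illustrative payload

        // Orchestrator: decompose the job into subtasks for specialized workers.
        List<Callable<Finding>> subtasks = List.of(
                () -> new Finding("pattern-analysis", transaction.hashCode() % 7 == 0),
                () -> new Finding("location-check", false),
                () -> new Finding("behavior-history", false)
        );

        // Fan out, then aggregate partial results into a single decision.
        List<Future<Finding>> futures = workerPool.invokeAll(subtasks);
        boolean flagForReview = false;
        for (Future<Finding> future : futures) {
            Finding finding = future.get();
            System.out.println(finding.worker() + " -> suspicious=" + finding.suspicious());
            flagForReview |= finding.suspicious();
        }

        System.out.println(flagForReview ? "Route to human review" : "Auto-approve");
        workerPool.shutdown();
    }
}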
2. Hierarchical Architecture

Hierarchical architectures model organizational command structures by arranging decision-making into multiple layers of abstraction. At the highest level, strategic planners operate with long-term horizons and broad objectives, while successive layers handle progressively more immediate concerns until reaching real-time actuators at the base level. This architecture naturally handles systems where different time scales of decision-making coexist; for example, an autonomous vehicle simultaneously plans a multi-day route (strategic), navigates city blocks (tactical), and adjusts wheel torque (execution). Information flows bi-directionally: sensor data aggregates upward through abstraction layers while commands propagate downward with increasing specificity. The hierarchy provides inherent fail-safes, as lower layers can implement emergency behaviors when higher-level planning becomes unresponsive.

Concept

Multi-layered control with increasing abstraction levels (strategic → tactical → execution).

Key Components

• Strategic layer (long-term goals)
• Tactical layer (resource allocation)
• Execution layer (real-time control)
• Feedback loops between layers

[Diagram: hierarchical architecture]

When to Use

• Systems with natural command chains
• Problems requiring different time-scale decisions
• Safety-critical applications

Real-World Case

Smart factory: The strategic layer optimizes quarterly production, the tactical layer schedules weekly shifts, and the execution layer controls robotic arms in real time.

3. Blackboard Architecture

The blackboard pattern mimics human expert panels solving complex problems through collaborative contribution. At its core lies a shared data space (the blackboard), where knowledge sources — independent specialists such as image recognizers, database query engines, or statistical analyzers — post partial solutions and read others' contributions. Unlike orchestrated systems, no central controller directs the problem-solving; instead, knowledge sources activate opportunistically when their expertise becomes relevant to the evolving solution. This emergent behavior makes blackboard systems uniquely suited for ill-defined problems where solution paths are unpredictable, such as medical diagnosis or scientific discovery. The architecture naturally accommodates contradictory hypotheses (represented as competing entries on the blackboard) and converges toward consensus through evidence accumulation.

Concept

Independent specialists contribute to a shared data space ("blackboard"), collaboratively evolving solutions.

Key Components

• Blackboard (shared data repository)
• Knowledge sources (specialized agents)
• Control mechanism (activation coordinator)

[Diagram: blackboard architecture]

When to Use

• Ill-defined problems with multiple approaches
• Diagnostic systems requiring expert collaboration
• Research environments

Real-World Case

Oil rig monitoring: Geologists, engineers, and equipment sensors contribute data to predict maintenance needs and drilling risks.

4. Event-Driven Architecture

Event-driven architectures treat system state changes as first-class citizens, with components reacting to asynchronous notifications rather than polling for updates. This paradigm shift enables highly responsive systems that scale efficiently under variable loads. Producers (sensors, user interfaces, or other agents) emit events when state changes occur — a temperature threshold breach, a new chat message arrival, or a stock price movement. Consumers subscribe to relevant events through a message broker, which handles routing, persistence, and delivery guarantees. The architecture's inherent decoupling allows components to evolve independently, making it ideal for distributed systems and microservices. Event sourcing variants maintain complete system state as an ordered log of events, enabling time-travel debugging and audit capabilities unmatched by traditional architectures.

Concept

Agents communicate through asynchronous events triggered by state changes.

Key Components

• Event producers (sensors/user inputs)
• Message broker (event routing)
• Event consumers (processing agents)
• State stores

[Diagram: event-driven architecture]

When to Use

• Real-time reactive systems
• Decoupled components with independent scaling
• IoT and monitoring applications

Real-World Case

Smart building: Motion detectors trigger lighting adjustments, energy price changes activate HVAC optimization, and smoke sensors initiate evacuation protocols.
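As a small illustration of the event-driven pattern (again my own sketch, not from the guide), the following uses a thread-safe queue as a stand-in for a message broker, with one producer emitting smart-building events and one consumer reacting to them; the event names are placeholder assumptions:

Java

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SmartBuildingEvents {

    record Event(String type, String payload) {}

    public static void main(String[] args) throws InterruptedException {
        // A thread-safe queue stands in for the message broker.
        BlockingQueue<Event> broker = new LinkedBlockingQueue<>();

        // Producer: emits events when state changes occur.
        Thread producer = new Thread(() -> {
            List<Event> readings = List.of(
                    new Event("motion-detected", "lobby"),
                    new Event("energy-price-change", "peak tariff"),
                    new Event("smoke-detected", "floor 3"));
            readings.forEach(broker::add);
        });

        // Consumer: reacts asynchronously to whatever arrives.
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    Event event = broker.take();
                    switch (event.type()) {
                        case "motion-detected" -> System.out.println("Lights on in " + event.payload());
                        case "energy-price-change" -> System.out.println("HVAC optimization: " + event.payload());
                        case "smoke-detected" -> System.out.println("Evacuation protocol: " + event.payload());
                        default -> System.out.println("Ignored: " + event.type());
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}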
5. Multi-Agent Systems (MAS)

Multi-agent systems distribute intelligence across autonomous entities that collaborate through negotiation rather than central command. Each agent maintains its own goals, knowledge base, and decision-making processes, interacting with peers through standardized protocols like contract net (task auctions) or voting mechanisms. This architecture excels in environments where central control is impractical, such as disaster-response robots exploring rubble, blockchain oracles providing decentralized data feeds, or competing traders in financial markets. MAS implementations carefully balance local autonomy against global coordination needs through incentive structures and communication protocols. The architecture's resilience comes from redundancy — agent failures rarely cripple the system — while emergent behaviors can produce innovative solutions unpredictable from individual agent designs.

[Diagram: multi-agent systems]

Concept

Autonomous agents collaborate through negotiation to achieve individual or collective goals.

Key Components

• Autonomous agents
• Communication protocols (FIPA/ACL)
• Coordination mechanisms (auctions/voting)
• Environment model

When to Use

• Distributed problems without a central authority
• Systems requiring high fault tolerance
• Competitive or collaborative environments

Real-World Case

Port logistics: Cranes, trucks, and ships negotiate berthing schedules and container transfers using contract-net protocols.
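A toy illustration of negotiation-based coordination, loosely inspired by contract-net task auctions (this is my own simplification, not a FIPA-compliant implementation): each agent prices a task from its own local state, and the cheapest bid wins the contract. The port-logistics names are placeholder assumptions:

Java

import java.util.Comparator;
import java.util.List;

public class ContainerTransferAuction {

    // Each agent prices a task independently, based on its own local state.
    record Agent(String name, double busyness) {
        double bid(String task) {
            // A busier agent quotes a higher cost for the same task.
            return task.length() * (1.0 + busyness);
        }
    }

    public static void main(String[] args) {
        String task = "move container C-17 to berth 4";
        List<Agent> agents = List.of(
                new Agent("crane-1", 0.2),
                new Agent("crane-2", 0.7),
                new Agent("truck-9", 0.4));

        // Call for proposals: every agent submits a bid, the cheapest one wins the contract.
        Agent winner = agents.stream()
                .min(Comparator.comparingDouble((Agent a) -> a.bid(task)))
                .orElseThrow();

        agents.forEach(a -> System.out.printf("%s bids %.2f%n", a.name(), a.bid(task)));
        System.out.println("Contract awarded to " + winner.name());
    }
}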
6. Reflexive vs. Deliberative Architectures

These contrasting paradigms represent two fundamental approaches to agent decision-making. Reflexive architectures implement direct stimulus-response mappings through condition-action rules ("if temperature > 100°C then shutdown"), providing ultra-fast reactions at the cost of contextual awareness. They excel in safety-critical applications like industrial emergency stops or network intrusion prevention. Deliberative architectures instead maintain internal world models, using planning algorithms to sequence actions toward goals while considering constraints. Though computationally heavier, they enable sophisticated behaviors like supply chain optimization or clinical treatment planning. Hybrid implementations often layer reflexive systems atop deliberative bases — autonomous vehicles use deliberative route planning but rely on reflexive collision avoidance when milliseconds matter.

Reflexive

Concept: Direct stimulus-response mapping without internal state.

• Structure: Condition-action rules
• Use: Time-critical reactions
• Case: Industrial E-Stop, which immediately cuts power when a safety breach is detected

Deliberative

Concept: Internal world model with planning/reasoning.

• Structure: Perception → Model Update → Planning → Action
• Use: Complex decision-making
• Case: Supply chain optimization, which simulates multiple scenarios before committing resources

Hybrid Approach

Autonomous vehicles: A reflexive layer handles collision avoidance while the deliberative layer plans routes.

7. Memory-Augmented Architectures

Memory-augmented architectures explicitly separate processing from knowledge retention, overcoming the context window limitations of stateless systems. These designs incorporate multiple memory systems: working memory for immediate task context, episodic memory for experience recording, and semantic memory for factual knowledge. Retrieval mechanisms range from simple keyword lookup to sophisticated vector similarity searches across embedding spaces. The architecture enables continuous learning, as new experiences update memory content without requiring model retraining, and supports reasoning across extended timelines. Modern implementations combine neural networks with symbolic knowledge graphs, allowing both pattern recognition and logical inference over memorized content. This proves invaluable for applications like medical diagnosis systems that must recall patient histories while staying current with the latest research.

Concept

Agents with explicit memory systems for long-term context.

Key Components

• Short-term memory (working context)
• Long-term memory (vector databases/knowledge graphs)
• Retrieval mechanisms (semantic search)
• Memory update policies

When to Use

• Conversational agents requiring context
• Systems needing continuous learning
• Applications leveraging historical data

Real-World Case

Medical assistant: Recalls patient history, researches the latest treatments, and maintains consultation context across sessions.

Architecture Selection Table

Architecture | Best For | Strengths | Limitations | Implementation Complexity
Orchestrator-Worker | Complex task coordination | Centralized control, auditability | Single point of failure | Medium
Hierarchical | Large-scale systems | Clear responsibility chains | Communication bottlenecks | High
Blackboard | Collaborative problem-solving | Flexible expertise integration | Unpredictable timing | High
Event-Driven | Real-time reactive systems | Loose coupling, scalability | Event tracing difficulties | Medium
Multi-Agent | Distributed environments | High fault tolerance | Coordination complexity | High
Reflexive | Time-critical responses | Low latency, simplicity | Limited intelligence | Low
Deliberative | Strategic planning | Sophisticated reasoning | Computational overhead | High
Memory-Augmented | Contextual applications | Long-term knowledge retention | Memory management costs | Medium-High

Conclusion

The most effective implementations combine patterns strategically, such as using hierarchical organization for enterprise-scale systems with event-driven components for real-time responsiveness, or memory-augmented orchestrators that manage specialized workers. As AI systems advance, architectures will increasingly incorporate self-monitoring and dynamic reconfiguration capabilities, enabling systems that evolve their own organization based on performance requirements. Selecting the right architectural foundation remains the most critical determinant of an AI system's long-term viability and effectiveness.

For AI Developer tools, check my article here.

By Vidyasagar (Sarath Chandra) Machupalli FBCS
The Missing Infrastructure Layer: Why AI's Next Evolution Requires Distributed Systems Thinking

The recent announcement of KubeMQ-Aiway caught my attention not as another AI platform launch, but as validation of a trend I've been tracking across the industry. After spending the last two decades building distributed systems and the past three years deep in AI infrastructure consulting, the patterns are becoming unmistakable: we're at the same inflection point that microservices faced a decade ago.

The Distributed Systems Crisis in AI

We've been here before. In the early 2010s, as monolithic architectures crumbled under scale pressures, we frantically cobbled together microservices with HTTP calls and prayed our systems wouldn't collapse. It took years to develop proper service meshes, message brokers, and orchestration layers that made distributed systems reliable rather than just functional. The same crisis is unfolding with AI systems, but the timeline is compressed. Organizations that started with single-purpose AI models are rapidly discovering they need multiple specialized agents working in concert, and their existing infrastructure simply wasn't designed for this level of coordination complexity.

Why Traditional Infrastructure Fails AI Agents

Across my consulting engagements, I'm seeing consistent patterns of infrastructure failure when organizations try to scale AI beyond proof-of-concepts:

• HTTP communication breaks down: Traditional request-response patterns work for stateless operations but fail when AI agents need to maintain context across extended workflows, coordinate parallel processing, or handle operations that take minutes rather than milliseconds. The synchronous nature of HTTP creates cascading failures that bring down entire AI workflows.
• Context fragmentation destroys intelligence: AI agents aren't just processing data — they're maintaining conversational state and building accumulated knowledge. When that context gets lost at service boundaries or fragmented across sessions, the system's collective intelligence degrades dramatically.
• Security models are fundamentally flawed: Most AI implementations share credentials through environment variables or configuration files. This creates lateral movement risks and privilege escalation vulnerabilities that traditional security models weren't designed to handle.
• Architectural constraints force bad decisions: Tool limitations in current AI systems force teams into anti-patterns, such as building meta-tools, fragmenting capabilities, or implementing complex dynamic loading mechanisms. Each workaround introduces new failure modes and operational complexity.

Evaluating the KubeMQ-Aiway Technical Solution

KubeMQ-Aiway is "the industry's first purpose-built connectivity hub for AI agents and Model-Context-Protocol (MCP) servers. It enables seamless routing, security, and scaling of all interactions — whether synchronous RPC calls or asynchronous streaming — through a unified, multi-tenant-ready infrastructure layer." In other words, it's the hub that manages and routes messages between systems, services, and AI agents.

Through their early access program, I recently explored KubeMQ-Aiway's architecture. Several aspects stood out as particularly well-designed for these challenges:

• Unified aggregation layer: Rather than forcing point-to-point connections between agents, they've created a single integration hub that all agents and MCP servers connect through. This is architecturally sound — it eliminates the N-squared connection problem that kills system reliability at scale. More importantly, it provides a single point of control for monitoring, security, and operational management.
• Multi-pattern communication architecture: The platform supports both synchronous and asynchronous messaging natively, with pub/sub patterns and message queuing built in. This is crucial because AI workflows aren't purely request-response — they're event-driven processes that need fire-and-forget capabilities, parallel processing, and long-running operations. The architecture includes automatic retry mechanisms, load balancing, and connection pooling that are essential for production reliability.
• Virtual MCP implementation: This is particularly clever — instead of trying to increase tool limits within existing LLM constraints, they've abstracted tool organization at the infrastructure layer. Virtual MCPs allow logical grouping of tools by domain or function while presenting a unified interface to the AI system. It's the same abstraction pattern that made container orchestration successful.
• Role-based security model: The built-in moderation system implements proper separation of concerns with consumer and administrator roles. More importantly, it handles credential management at the infrastructure level rather than forcing applications to manage secrets. This includes end-to-end encryption, certificate-based authentication, and comprehensive audit logging — security patterns that are proven in distributed systems but rarely implemented correctly in AI platforms.

Technical Architecture Deep Dive

What also impresses me is their attention to distributed systems fundamentals:

• Event sourcing and message durability: The platform maintains a complete audit trail of agent interactions, which is essential for debugging complex multi-agent workflows. Unlike HTTP-based systems, where you lose interaction history, this enables replay and analysis capabilities that are crucial for production systems.
• Circuit breaker and backpressure patterns: Built-in failure isolation prevents cascade failures when individual agents malfunction or become overloaded. The backpressure mechanisms ensure that fast-producing agents don't overwhelm slower downstream systems — a critical capability when dealing with AI agents that can generate work at unpredictable rates.
• Service discovery and health checking: Agents can discover and connect to other agents dynamically without hardcoded endpoints. The health checking ensures that failed agents are automatically removed from routing tables, maintaining system reliability.
• Context preservation architecture: Perhaps most importantly, they've solved the context management problem that plagues most AI orchestration attempts. The platform maintains conversational state and working memory across agent interactions, ensuring that the collective intelligence of the system doesn't degrade due to infrastructure limitations.

Production Readiness Indicators

From an operational perspective, KubeMQ-Aiway demonstrates several characteristics that distinguish production-ready infrastructure from experimental tooling:

• Observability: Comprehensive monitoring, metrics, and distributed tracing for multi-agent workflows. This is essential for operating AI systems at scale, where debugging requires understanding complex interaction patterns.
• Scalability design: The architecture supports horizontal scaling of both the infrastructure layer and individual agents without requiring system redesign. This is crucial as AI workloads are inherently unpredictable and bursty.
• Operational simplicity: Despite the sophisticated capabilities, the operational model is straightforward — agents connect to a single aggregation point rather than requiring complex service mesh configurations.

Market Timing and Competitive Analysis

The timing of this launch is significant. Most organizations are hitting the infrastructure wall with their AI implementations right now, but existing solutions are either too simplistic (basic HTTP APIs) or too complex (trying to adapt traditional service meshes for AI workloads). KubeMQ-Aiway appears to have found the right abstraction level — sophisticated enough to handle complex AI orchestration requirements, but simple enough for development teams to adopt without becoming distributed systems experts. Compared to building similar capabilities internally, the engineering effort would be substantial. The distributed systems expertise required, combined with AI-specific requirements, represents months or years of infrastructure engineering work that most organizations can't justify when production AI solutions are available.

Strategic Implications

For technology leaders, the emergence of production-ready AI infrastructure platforms changes the strategic calculation around AI implementation. The question shifts from "should we build AI infrastructure?" to "which platform enables our AI strategy most effectively?" Early adopters of proper AI infrastructure are successfully running complex multi-agent systems at production scale while their competitors struggle with basic agent coordination. This gap will only widen as AI implementations become more sophisticated.

The distributed systems problems in AI won't solve themselves, and application-layer workarounds don't scale. Infrastructure solutions like KubeMQ-Aiway represent how AI transitions from experimental projects to production systems that deliver sustainable business value. Organizations that recognize this pattern and invest in proven AI infrastructure will maintain a competitive advantage over those that continue trying to solve infrastructure problems at the application layer.

Have a really great day!

By John Vester DZone Core CORE
Memory Leak Due to Uncleared ThreadLocal Variables
Memory Leak Due to Uncleared ThreadLocal Variables

In Java, we commonly use static, instance (member), and local variables. Occasionally, we use ThreadLocal variables. When a variable is declared as ThreadLocal, it will only be visible to that particular thread. ThreadLocal variables are extensively used in frameworks such as Log4J and Hibernate. If these ThreadLocal variables aren’t removed after their use, they will accumulate in memory and have the potential to trigger an OutOfMemoryError. In this post, let’s learn how to troubleshoot memory leaks that are caused by ThreadLocal variables. ThreadLocal Memory Leak Here is a sample program that simulates a ThreadLocal memory leak. Plain Text 01: public class ThreadLocalOOMDemo { 02: 03: private static final ThreadLocal<String> threadString = new ThreadLocal<>(); 04: 05: private static final String text = generateLargeString(); 06: 07: private static int count = 0; 08: 09: public static void main(String[] args) throws Exception { 10: while (true) { 11: 12: Thread thread = new Thread(() -> { 13: threadString.set("String-" + count + text); 14: try { 15: Thread.sleep(Long.MAX_VALUE); // Keep thread alive 16: } catch (InterruptedException e) { 17: Thread.currentThread().interrupt(); 18: } 19: }); 20: 21: thread.start(); 22: count++; 23: System.out.println("Started thread #" + count); 24: } 25: } 26: 27: private static String generateLargeString() { 28: StringBuilder sb = new StringBuilder(5 * 1024 * 1024); 29: while (sb.length() < 5 * 1024 * 1024) { 30: sb.append("X"); 31: } 32: return sb.toString(); 33: } 34:} 35: Before continuing to read, please take a moment to review the above program closely. In the above program, in line #3, ‘threadString’ is declared as a ‘ThreadLocal’ variable. In line #10, the program is infinitely (i.e., ‘while (true)’ condition) creating new threads. In line #13, to each created thread, it’s setting a large string (i.e., ‘String-1XXXXXXXXXXXXXXXXXXXXXXX…’) as a ThreadLocal variable. The program never removes the ThreadLocal variable once it’s created. So, in a nutshell, the program is creating new threads infinitely and slapping each new thread with a large string as its ThreadLocal variable and never removing it. Thus, when the program is executed, ThreadLocal variables will continuously accumulate into memory and finally result in ‘java.lang.OutOfMemoryError: Java heap space’. How to Diagnose ThreadLocal Memory Leak? You want to follow the steps highlighted in this post to diagnose the OutOfMemoryError: Java Heap Space. In a nutshell, you need to do: 1. Capture Heap Dump You need to capture a heap dump from the application, right before the JVM throws an OutOfMemoryError. In this post, eight options for capturing a heap dump are discussed. You may choose the option that best suits your needs. My favorite option is to pass the ‘-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<FILE_PATH_LOCATION>‘ JVM arguments to your application at the time of startup. Example: Shell -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/tmp/heapdump.hprof When you pass the above arguments, JVM will generate a heap dump and write it to ‘/opt/tmp/heapdump.hprof’ file whenever an OutOfMemoryError is thrown. 2. Analyze Heap Dump Once a heap dump is captured, you need to analyze the dump. In the next section, we will discuss how to do heap dump analysis. Heap Dump Analysis: ThreadLocal Memory Leak Heap dumps can be analyzed using various heap dump analysis tools, such as HeapHero, JHat, and JVisualVM. 
Here, let's analyze the heap dump captured from this program using the HeapHero tool. HeapHero flags memory leak using ML algorithm The HeapHero tool utilizes machine learning algorithms internally to detect whether any memory leak patterns are present in the heap dump. Above is the screenshot from the heap dump analysis report, flagging a warning that there are 66 instances of 'java.lang.Thread' objects, which together occupy 97.13% of overall memory. It's a strong indication that the application is suffering from a memory leak and that it originates from the 'java.lang.Thread' objects. Largest Objects section highlights Threads consuming majority of heap space The 'Largest Objects' section in the HeapHero analysis report shows all the top memory-consuming objects, as shown in the above figure. Here you can clearly notice that all of these objects are of type 'java.lang.Thread' and each of them occupies ~10MB of memory. This clearly shows the culprit objects that are responsible for the memory leak. Outgoing Reference section shows the ThreadLocal strings The tool also gives you the capability to drill down into an object to investigate its content. When you drill down into any one of the Threads reported in the 'Largest Objects' section, you can see all its child objects. In the above figure, you can see the actual ThreadLocal string 'String-1XXXXXXXXXXXXXXXXXXXXXXX…' being reported. Basically, this is the string that was set in line #13 of the above program. Thus, the tool helps you to pinpoint the memory-leaking object and its source with ease. How to Prevent ThreadLocal Memory Leak Once ThreadLocal variables are used, always call: Java threadString.remove(); This clears the ThreadLocal variable's value from the current thread and avoids potential memory leaks (see the sketch at the end of this post for a safe usage pattern). Conclusion Uncleared ThreadLocal variables are a subtle issue; however, when left unnoticed, they can accumulate over a period of time and have the potential to bring down the entire application. By being disciplined about removing the ThreadLocal variable after its use, and by using tools like HeapHero for faster root cause analysis, you can protect your applications from hard-to-detect outages.
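To make the prevention step concrete, here is a minimal, hypothetical sketch (the pooled-executor setup and the class name are my own illustration, not part of the program above) showing remove() placed in a finally block so the value is always cleared, even when the task fails:
Java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalCleanupDemo {

    // Each worker thread gets its own copy of this value.
    private static final ThreadLocal<String> threadString = new ThreadLocal<>();

    public static void main(String[] args) {
        // Pooled threads are long-lived, so uncleared ThreadLocal values
        // would otherwise survive across unrelated tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            final int taskId = i;
            pool.submit(() -> {
                threadString.set("String-" + taskId);
                try {
                    // ... do the actual work that reads threadString.get() ...
                    System.out.println(Thread.currentThread().getName()
                            + " -> " + threadString.get());
                } finally {
                    // Always clear the value so it cannot accumulate on the
                    // pooled thread after the task completes.
                    threadString.remove();
                }
            });
        }
        pool.shutdown();
    }
}
The pattern matters most with thread pools, where worker threads outlive individual tasks; without the remove() call, each long-lived pooled thread keeps the last value set on it reachable indefinitely.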

By Ram Lakshmanan DZone Core CORE
Mastering Fluent Bit: Controlling Logs With Fluent Bit on Kubernetes (Part 4)
Mastering Fluent Bit: Controlling Logs With Fluent Bit on Kubernetes (Part 4)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit to get control of logs on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to use a Fluent Bit telemetry pipeline on a Kubernetes cluster to take control of all the logs being generated. What Is Fluent Bit? Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The backstory is that the project began as a log parsing project, using Fluentd, which joined the CNCF in 2016 and achieved Graduated status in 2019. Once it became apparent that the world was heading towards cloud-native Kubernetes environments, it also became clear that Fluentd was not designed to meet the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need to have a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is history, culminating in the release of v4 last week. Fluent Bit has become so much more than a flexible and lightweight log pipeline solution; it is now able to process metrics and traces, and has become a telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves! Why Control Logs on a Kubernetes Cluster? When you dive into the cloud native world, you are deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things are generating telemetry data, and Fluent Bit is a wonderfully simple way to take control of them across a Kubernetes cluster. It provides a way of collecting everything through a central telemetry pipeline as you go, while providing the ability to parse, filter, and route all your telemetry data. For developers, this article will demonstrate using Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload. Finally, all examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machine. Where to Get Started To ensure you are ready to start controlling your Kubernetes cluster logs, the rest of this article assumes you have completed the previous article.
This ensures you are running a two-node Kubernetes cluster with a workload running in the form of Ghost CMS, and Fluent Bit is installed to collect all container logs. If you did not work through the previous article, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to spin up the Kubernetes cluster with the above setup on your local machine. Using either path, once set up, you are able to see the logs from Fluent Bit containing everything generated on this running cluster. This would be the logs across three namespaces: kube-system, ghost, and logging. You can verify that they are up and running by browsing those namespaces, shown here on my local machine: Go $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace kube-system NAME READY STATUS RESTARTS AGE coredns-668d6bf9bc-jrvrx 1/1 Running 0 69m coredns-668d6bf9bc-wbqjk 1/1 Running 0 69m etcd-2node-control-plane 1/1 Running 0 69m kindnet-fmf8l 1/1 Running 0 69m kindnet-rhlp6 1/1 Running 0 69m kube-apiserver-2node-control-plane 1/1 Running 0 69m kube-controller-manager-2node-control-plane 1/1 Running 0 69m kube-proxy-b5vjr 1/1 Running 0 69m kube-proxy-jxpqc 1/1 Running 0 69m kube-scheduler-2node-control-plane 1/1 Running 0 69m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace ghost NAME READY STATUS RESTARTS AGE ghost-dep-8d59966f4-87jsf 1/1 Running 0 77m ghost-dep-mysql-0 1/1 Running 0 77m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace logging NAME READY STATUS RESTARTS AGE fluent-bit-7qjmx 1/1 Running 0 41m The initial configuration for the Fluent Bit instance is to collect all container logs, from all namespaces, shown in the fluent-bit-helm.yaml configuration file used in our setup, highlighted in bold below: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri outputs: - name: stdout match: '*' To see all the logs collected, we can dump the Fluent Bit log file as follows, using the pod name we found above: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-7qjmx --namespace logging [OUTPUT-CUT-DUE-TO-LOG-VOLUME] ... If you browse the output, you will notice error messages, info messages, and, if you look hard enough, some logs from Ghost's MySQL workload, the Ghost CMS workload, and even your Fluent Bit instance. As a developer working on your cluster, how can you find anything useful in this flood of logging? The good thing is you do have a single place to look for them! Another point to mention is that by using the Fluent Bit tail input plugin and setting it to read from the beginning of each log file, we have ensured that our log telemetry data is taken from all our logs. If we didn't set this to collect from the beginning of the log file, our telemetry pipeline would miss everything that was generated before the Fluent Bit instance started. This ensures we have the workload startup messages and can test on standard log telemetry events each time we modify our pipeline configuration. Let's start taking control of our logs and see how we, as developers, can make some use of the log data we want to see during our local development testing.
Taking Back Control The first thing we can do is to focus our log collection efforts on just the workload we are interested in, and in this example, we are looking to find problems with our Ghost CMS deployment. As you are not interested in the logs from anything happening in the kube-system namespace, you can narrow the focus of your Fluent Bit input plugin to only examine Ghost log files. This can be done by making a new configuration file called myfluent-bit-heml.yaml file and changing the default path as follows in bold: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' The next step is to update the Fluent Bit instance with a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-mzktk 1/1 Running 0 28s Now, explore the logs being collected by Fluent Bit and notice that all the kube-system namespace logs are no longer there, and we can focus on our deployed workload. Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-mzktk --nanmespace logging ... [11] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.278137067, {}], {"time"=>"2025-05-18T15:51:26.278137067Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.27 INFO ==> Configuring database"}] [12] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.318427288, {}], {"time"=>"2025-05-18T15:51:26.318427288Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.31 INFO ==> Setting up Ghost"}] [13] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.211337893, {}], {"time"=>"2025-05-18T15:51:31.211337893Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.21 INFO ==> Configuring Ghost URL to http://127.0.0.1:2368"}] [14] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.234609188, {}], {"time"=>"2025-05-18T15:51:31.234609188Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.23 INFO ==> Passing admin user creation wizard"}] [15] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.243222300, {}], {"time"=>"2025-05-18T15:51:31.2432223Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.24 INFO ==> Starting Ghost in background"}] [16] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583519.424206501, {}], {"time"=>"2025-05-18T15:51:59.424206501Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:59.42 INFO ==> Stopping Ghost"}] [17] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583520.921096963, {}], 
{"time"=>"2025-05-18T15:52:00.921096963Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:00.92 INFO ==> Persisting Ghost installation"}] [18] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583521.008567054, {}], {"time"=>"2025-05-18T15:52:01.008567054Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:01.00 INFO ==> ** Ghost setup finished! **"}] ... This is just a selection of log lines from the total output. If you look closer, you see these logs have their own sort of format, so let's standardize them so that JSON is the output format and make the various timestamps a bit more readable by changing your Fluent Bit output plugin configuration as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-gqsc8 1/1 Running 0 42s Now, explore the logs being collected by Fluent Bit and notice the output changes: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-gqsc8 --nanmespace logging ... {"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"} {"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"} {"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"} {"date":"2025-06-05 13:49:59.387736","time":"2025-06-05T13:49:59.387736981Z","stream":"stdout","_p":"F","log":""} {"date":"2025-06-05 13:49:59.451176","time":"2025-06-05T13:49:59.451176821Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.45 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Starting Ghost **"} {"date":"2025-06-05 13:50:00.171207","time":"2025-06-05T13:50:00.171207812Z","stream":"stdout","_p":"F","log":""} ... Now, if we look closer at the array of messages and being the developer we are, we've noticed a mix of stderr and stdout log lines. Let's take control and trim out all the lines that do not contain stderr, as we are only interested in what is broken. 
We need to add a filter section to our Fluent Bit configuration using the grep filter and targeting a regular expression to select the keys stream or stderr as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri filters: - name: grep match: '*' regex: stream stderr outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-npn8n 1/1 Running 0 12s Now, explore the logs being collected by Fluent Bit and notice the output changes: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-npn8n --nanmespace logging ... {"date":"2025-06-05 13:49:34.807524","time":"2025-06-05T13:49:34.807524266Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.80 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring database"} {"date":"2025-06-05 13:49:34.860722","time":"2025-06-05T13:49:34.860722188Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.86 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Setting up Ghost"} {"date":"2025-06-05 13:49:36.289847","time":"2025-06-05T13:49:36.289847086Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.28 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring Ghost URL to http://127.0.0.1:2368"} {"date":"2025-06-05 13:49:36.373376","time":"2025-06-05T13:49:36.373376803Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Passing admin user creation wizard"} {"date":"2025-06-05 13:49:36.379461","time":"2025-06-05T13:49:36.379461971Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Starting Ghost in background"} {"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"} {"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"} {"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"} ... We are no longer seeing standard output log events, as our telemetry pipeline is now filtering to only show standard error-tagged logs! This exercise has shown how to format and prune our logs using our Fluent Bit telemetry pipeline on a Kubernetes cluster. Now let's look at how to enrich our log telemetry data. We are going to add tags to every standard error line pointing the on-call developer to the SRE they need to contact. 
To do this, we expand our filter section of the Fluent Bit configuration using the modify filter and targeting the keys stream or stderr to remove those keys and add two new keys, STATUS and ACTION, as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri filters: - name: grep match: '*' regex: stream stderr - name: modify match: '*' condition: Key_Value_Equals stream stderr remove: stream add: - STATUS REALLY_BAD - ACTION CALL_SRE outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-ftfs4 1/1 Running 0 32s Now, explore the logs being collected by Fluent Bit and notice the output changes where the stream key is missing and two new ones have been added at the end of each error log event: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-ftfs4 --nanmespace logging ... [CUT-LINE-FOR-VIEWING] Configuring database"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Setting up Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Configuring Ghost URL to http://127.0.0.1:2368"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Passing admin user creation wizard"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Starting Ghost in background"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Stopping Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] Persisting Ghost installation"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} [CUT-LINE-FOR-VIEWING] ** Ghost setup finished! **"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"} ... Now we have a running Kubernetes cluster, with two nodes generating logs, a workload in the form of a Ghost CMS generating logs, and using a Fluent Bit telemetry pipeline to gather and take control of our log telemetry data. Initially, we found that gathering all log telemetry data was flooding too much information to be able to sift out the important events for our development needs. We then started taking control of our log telemetry data by narrowing our collection strategy, by filtering, and finally by enriching our telemetry data. More in the Series In this article, you learned how to use Fluent Bit on a Kubernetes cluster to take control of your telemetry data. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, integrating Fluent Bit telemetry pipelines with OpenTelemetry.

By Eric D. Schabell DZone Core CORE
HTAP Using a Star Query on MongoDB Atlas Search Index
HTAP Using a Star Query on MongoDB Atlas Search Index

MongoDB is often chosen for online transaction processing (OLTP) due to its flexible document model, which can align with domain-specific data structures and access patterns. Beyond basic transactional workloads, MongoDB also supports search capabilities through Atlas Search, built on Apache Lucene. When combined with the aggregation pipeline, this enables limited online analytical processing (OLAP) functionality suitable for near-real-time analytics. Because MongoDB uses a unified document model, these analytical queries can run without restructuring the data, allowing for certain hybrid transactional and analytical (HTAP) workloads. This article explores such a use case in the context of healthcare. Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, to support efficient ad-hoc queries. This typically requires a dimensional schema, or star schema, which is distinct from the normalized operational schema used for transactional updates. MongoDB can support a similar querying approach using its document schema, which is often designed for operational use. By adding an Atlas Search index to the collection storing transactional data, certain analytical queries can be supported without restructuring the schema. To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilized the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning. Import the Dataset The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records. I downloaded the CSV files and uncompressed them: curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw" unzip medsynora-dw.zip I imported each file into a collection, using mongoimport from the MongoDB Database Tools: for i in "MedSynora DW"/*.csv do mongoimport -d "MedSynoraDW" --file="$i" --type=csv --headerline -c "$(basename "$i" .csv)" -j 8 done For this demo, I'm interested in two fact tables: FactEncounter and FactLabTest. 
Here are the fields described in the file headers: # head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv ==> MedSynora DW/FactEncounter.csv <== Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent ==> MedSynora DW/FactLabTests.csv <== Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue The fact tables referenced the following dimensions: # head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv ==> MedSynora DW/DimDisease.csv <== Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit ==> MedSynora DW/DimDoctor.csv <== Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count ==> MedSynora DW/DimInsurance.csv <== InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments ==> MedSynora DW/DimPatient.csv <== Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type ==> MedSynora DW/DimRoom.csv <== RoomKey,Care_Level,Room Type Here is the dimensional model, often referred to as a "star schema" because the fact tables are located at the center, referencing the dimensions. Because of normalization, when facts contain a one-to-many composition, it is described in two CSV files to fit into two SQL tables: Star schema with facts and dimensions. The facts are stored in two tables in CSV files or a SQL database, but on a single collection in MongoDB. It holds the fact measures and dimension keys, which reference the key of the dimension collections. MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship. 
Since this represents the appropriate data model for an operational database with such data, I created a new collection using an aggregation pipeline to replace the two imported from the normalized CSV: db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 }); db.FactEncounter.aggregate([ { $lookup: { from: "FactLabTests", localField: "Encounter_ID", foreignField: "Encounter_ID", as: "LabTests" } }, { $addFields: { LabTests: { $map: { input: "$LabTests", as: "test", in: { Phase: "$$test.Phase", LabType: "$$test.LabType", TestName: "$$test.TestName", TestValue: "$$test.TestValue" } } } } }, { $out: "FactEncounterLabTests" } ]); Here is how one document looks: AtlasLocalDev atlas [direct: primary] MedSynoraDW> db.FactEncounterLabTests.find().limit(1) [ { _id: ObjectId('67fc3d2f40d2b3c843949c97'), Encounter_ID: 2158, Patient_ID: 'TR479', Disease_ID: 1632, ResponsibleDoctorID: 905, InsuranceKey: 82, RoomKey: 203, CheckinDate: '2024-01-23 11:09:00', CheckoutDate: '2024-03-29 17:00:00', CheckinDateKey: 20240123, CheckoutDateKey: 20240329, Patient_Severity_Score: 63.2, RadiologyType: 'None', RadiologyProcedureCount: 0, EndoscopyType: 'None', EndoscopyProcedureCount: 0, CompanionPresent: 'True', LabTests: [ { Phase: 'Admission', LabType: 'CBC', TestName: 'Lymphocytes_abs (10^3/µl)', TestValue: 1.34 }, { Phase: 'Admission', LabType: 'Chem', TestName: 'ALT (U/l)', TestValue: 20.5 }, { Phase: 'Admission', LabType: 'Lipids', TestName: 'Triglycerides (mg/dl)', TestValue: 129.1 }, { Phase: 'Discharge', LabType: 'CBC', TestName: 'RBC (10^6/µl)', TestValue: 4.08 }, ... In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections. Atlas Search Index Search indexes are distinct from regular indexes, which rely on a single composite key, as they can index multiple fields without requiring a specific order to establish a key. This feature makes them perfect for ad-hoc queries, where the filtering dimensions are not predetermined. I created a single Atlas Search index encompassing all dimensions and measures I intended to use in predicates, including those in embedded documents. db.FactEncounterLabTests.createSearchIndex( "SearchFactEncounterLabTests", { mappings: { dynamic: false, fields: { "Encounter_ID": { "type": "number" }, "Patient_ID": { "type": "token" }, "Disease_ID": { "type": "number" }, "InsuranceKey": { "type": "number" }, "RoomKey": { "type": "number" }, "ResponsibleDoctorID": { "type": "number" }, "CheckinDate": { "type": "token" }, "CheckoutDate": { "type": "token" }, "LabTests": { "type": "document" , fields: { "Phase": { "type": "token" }, "LabType": { "type": "token" }, "TestName": { "type": "token" }, "TestValue": { "type": "number" } } } } } } ); Since I don't need extra text searching on the keys, I designated the character string ones as token. I labeled the integer keys as number. Generally, the keys are utilized for equality predicates. However, some can be employed for ranges when the format permits, such as check-in and check-out dates formatted as YYYY-MM-DD. In relational databases, the star schema approach involves limiting the number of columns in fact tables due to their typically large number of rows. 
Dimension tables, which are generally smaller, can include more columns and are often denormalized in SQL databases, making the star schema more common than the snowflake schema. Similarly, in document modeling, embedding all dimension fields can increase the size of fact documents unnecessarily, so referencing dimension collections is often preferred. MongoDB’s data modeling principles allow it to be queried similarly to a star schema without additional complexity, as its design aligns with common application access patterns. Star Query A star schema allows processing queries which filter fields within dimension collections in several stages: In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection. This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted. For an example query, I aimed to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions. Search Aggregation Pipeline To optimize the fact collection process and apply all filters, I began with a simple aggregation pipeline that started with a search on the search index. This enabled filters to be applied directly to the fields in the fact collection, while additional filters were incorporated in the first stage of the star query. I used a local variable with a compound operator to facilitate adding more filters for each dimension during this stage. Before proceeding through the star query stages to add filters on dimensions, my query included a filter on the lab type, which was part of the fact collection and indexed. const search = { "$search": { "index": "SearchFactEncounterLabTests", "compound": { "must": [ { "in": { "path": "LabTests.LabType" , "value": "Lipids" } }, ] }, "sort": { CheckoutDate: -1 } } } I added a sort operation to order the results by check-out date in descending order. This illustrated the advantage of sorting during the index search rather than in later stages of the aggregation pipeline, especially when a limit was applied. I used this local variable to add more filters in Stage 1 of the star query, so that it could be executed for Stage 2 and collect documents for Stage 3. Stage 1: Query the Dimension Collections In the first phase of the star query, I obtained the dimension keys from the dimension collections. For every dimension with a filter, I retrieved the dimension keys using a find() on the dimension collection and appended a must condition to the compound of the fact index search. 
The following added the conditions on the Patient (female patients over 170 cm): search["$search"]["compound"]["must"].push( { in: { path: "Patient_ID", // Foreign Key in Fact value: db.DimPatient.find( // Dimension collection {Gender: "Female", Height: { "$gt": 170 } } // filter on Dimension ).map(doc => doc["Patient_ID"]).toArray() } // Primary Key in Dimension }) The following added the conditions on the Doctor (Japanese): search["$search"]["compound"]["must"].push( { in: { path: "ResponsibleDoctorID", // Foreign Key in Fact value: db.DimDoctor.find( // Dimension collection {"Doctor Nationality": "Japanese" } // filter on Dimension ).map(doc => doc["Doctor_ID"]).toArray() } // Primary Key in Dimension }) The following added the condition on the Room (Deluxe): search["$search"]["compound"]["must"].push( { in: { path: "RoomKey", // Foreign Key in Fact value: db.DimRoom.find( // Dimension collection {"Room Type": "Deluxe" } // filter on Dimension ).map(doc => doc["RoomKey"]).toArray() } // Primary Key in Dimension }) The following added the condition on the Disease (Hematology): search["$search"]["compound"]["must"].push( { in: { path: "Disease_ID", // Foreign Key in Fact value: db.DimDisease.find( // Dimension collection {"Disease Type": "Hematology" } // filter on Dimension ).map(doc => doc["Disease_ID"]).toArray() } // Primary Key in Dimension }) Finally, here's the condition on the Insurance coverage (greater than 80%): search["$search"]["compound"]["must"].push( { in: { path: "InsuranceKey", // Foreign Key in Fact value: db.DimInsurance.find( // Dimension collection {"Coverage Limit": { "$gt": 0.8 } } // filter on Dimension ).map(doc => doc["InsuranceKey"]).toArray() } // Primary Key in Dimension }) All these search criteria had the same structure: a find() on the dimension collection with the filters from the query, resulting in an array of dimension keys (similar to primary keys in a dimension table) that were used to search the fact documents by referencing them (like foreign keys in a fact table). Each of these steps queried the dimension collection to obtain a simple array of dimension keys, which were then added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions were pushed down into the query on the fact collection.
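For readers driving this pattern from application code rather than the mongosh shell, here is a minimal sketch of the same Stage 1 flow using the MongoDB Java sync driver. The connection string and the dimensionKeys helper are my own illustrative assumptions; the collection, index, and field names follow the article, and only two of the five dimension filters are shown for brevity:
Java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.ArrayList;
import java.util.List;

public class StarQueryStage1 {

    // Query a dimension collection and return the array of dimension keys
    // that will be pushed into the $search compound "must" clause.
    static List<Object> dimensionKeys(MongoDatabase db, String collection,
                                      Bson filter, String keyField) {
        List<Object> keys = new ArrayList<>();
        db.getCollection(collection)
          .find(filter)
          .projection(new Document(keyField, 1))
          .forEach(doc -> keys.add(doc.get(keyField)));
        return keys;
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("MedSynoraDW");

            // Stage 1: resolve each dimension filter into an array of keys.
            List<Object> patientIds = dimensionKeys(db, "DimPatient",
                    Filters.and(Filters.eq("Gender", "Female"), Filters.gt("Height", 170)),
                    "Patient_ID");
            List<Object> doctorIds = dimensionKeys(db, "DimDoctor",
                    Filters.eq("Doctor Nationality", "Japanese"),
                    "Doctor_ID");

            // Push the keys into the compound "must" clause of the $search stage,
            // mirroring the mongosh push() calls above.
            List<Document> must = new ArrayList<>();
            must.add(new Document("in", new Document("path", "LabTests.LabType").append("value", "Lipids")));
            must.add(new Document("in", new Document("path", "Patient_ID").append("value", patientIds)));
            must.add(new Document("in", new Document("path", "ResponsibleDoctorID").append("value", doctorIds)));

            Document search = new Document("$search",
                    new Document("index", "SearchFactEncounterLabTests")
                            .append("compound", new Document("must", must))
                            .append("sort", new Document("CheckoutDate", -1)));

            // The search document is then used as the first stage of the
            // aggregation pipeline on FactEncounterLabTests (see Stage 2 below).
            System.out.println(search.toJson());
        }
    }
}
The remaining dimension filters, the $lookup stages, and the final $project can be appended to the same pipeline list exactly as in the mongosh pipeline shown in the following stages.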
Stage 2: Query the Fact Search Index Using the results from the dimension queries, I built the following pipeline search step: AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search) { '$search': { index: 'SearchFactEncounterLabTests', compound: { must: [ { in: { path: 'LabTests.LabType', value: 'Lipids' } }, { in: { path: 'Patient_ID', value: [ 'TR551', 'TR751', 'TR897', 'TRGT201', 'TRJB261', 'TRQG448', 'TRSQ510', 'TRTP535', 'TRUC548', 'TRVT591', 'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896', 'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672', 'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003', 'TRE004', 'TRMN351', 'TRRY492', 'TRTI528', 'TRAKA962', 'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384', 'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419', 'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079', 'TR663', 'TR808', 'TR849', 'TRKA286', 'TRLC314', 'TRMG344', 'TRPT435', 'TRVZ597', 'TRXC626', 'TRACT773', 'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167', 'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514', 'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990', 'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374', 'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026', 'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354', 'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589', 'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987', 'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121', ... 6184 more items ] } }, { in: { path: 'ResponsibleDoctorID', value: [ 830, 844, 862, 921 ] } }, { in: { path: 'RoomKey', value: [ 203 ] } }, { in: { path: 'Disease_ID', value: [ 1519, 1506, 1504, 1510, 1515, 1507, 1503, 1502, 1518, 1517, 1508, 1513, 1509, 1512, 1516, 1511, 1505, 1514 ] } }, { in: { path: 'InsuranceKey', value: [ 83, 84 ] } } ] }, sort: { CheckoutDate: -1 } } MongoDB Atlas Search indexes, which are built on Apache Lucene, handle queries with multiple conditions and long arrays of values. In this example, a search operation uses the compound operator with the must clause to apply filters across attributes. This approach applies filters after resolving complex conditions into lists of dimension keys. Using the search operation defined above, I ran an aggregation pipeline to retrieve the document of interest: db.FactEncounterLabTests.aggregate([ search, ]) With my example, nine documents were returned in 50 milliseconds. Estimate the Count This approach works well for queries with multiple filters where individual conditions are not very selective but their combination is. Querying dimensions and using a search index on facts helps avoid scanning unnecessary documents. However, depending on additional operations in the aggregation pipeline, it is advisable to estimate the number of records returned by the search index to prevent expensive queries. In applications that allow multi-criteria queries, it is common to set a threshold and return an error or warning if the estimated number of documents exceeds it, prompting users to add more filters. To support this, you can run a $searchMeta operation on the index before a $search. 
For example, the following checks that the number of documents returned by the filter is less than 10,000: MedSynoraDW> db.FactEncounterLabTests.aggregate([ { "$searchMeta": { index: search["$search"].index, compound: search["$search"].compound, count: { "type": "lowerBound", threshold: 10000 } } } ]) [ { count: { lowerBound: Long('9') } } ] In my case, with nine documents, I can add more operations to the aggregation pipeline without expecting a long response time. If there are more documents than expected, additional steps in the aggregation pipeline may take longer. If tens or hundreds of thousands of documents are expected as input to a complex aggregation pipeline, the application may warn the user that the query execution will not be instantaneous, and may offer the choice to run it as a background job with a notification when done. With such a warning, the user may decide to add more filters, or a limit to work on a Top-n result, which will be added to the aggregation pipeline after a sorted search. Stage 3: Join Back to Dimensions for Projection The first step of the aggregation pipeline fetches all the documents needed for the result, and only those documents, using efficient access through the search index. Once filtering is complete, the smaller set of documents is used for aggregation or projection in the later stages of the aggregation pipeline. The third stage of the star query performs lookups on the dimensions to retrieve additional attributes needed for aggregation or projection. It might re-examine some collections used for filtering, which is not a problem since the dimensions remain small. For larger dimensions, the initial stage could save this information in a temporary array to avoid extra lookups, although this is often unnecessary. For example, when I wanted to display additional information about the patient and the doctor, I added two lookup stages to my aggregation pipeline: { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$lookup": { "from": "DimPatient", "localField": "Patient_ID", "foreignField": "Patient_ID", "as": "Patient" } }, For the simplicity of this demo, I imported the dimensions directly from the CSV file. In a well-designed database, the primary key for dimensions should be the document's _id field, and the collection ought to be established as a clustered collection. This design ensures efficient joins from fact documents. Most of the dimensions are compact and stay in memory. I added a final projection to fetch only the fields I needed.
The full aggregation pipeline, using the search defined above with filters and arrays of dimension keys, is: db.FactEncounterLabTests.aggregate([ search, { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$lookup": { "from": "DimPatient", "localField": "Patient_ID", "foreignField": "Patient_ID", "as": "Patient" } }, { "$project": { "Patient_Severity_Score": 1, "CheckinDate": 1, "CheckoutDate": 1, "Patient.name": { "$concat": [ { "$arrayElemAt": ["$Patient.First Name", 0] }, " ", { "$arrayElemAt": ["$Patient.Last Name", 0] } ] }, "ResponsibleDoctor.name": { "$concat": [ { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Name", 0] }, " ", { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Surname", 0] } ] } } } ]) On a small instance, it returned the following result in 50 milliseconds: [ { _id: ObjectId('67fc3d2f40d2b3c843949a97'), CheckinDate: '2024-02-12 17:00:00', CheckoutDate: '2024-03-30 13:04:00', Patient_Severity_Score: 61.4, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Niina Johanson' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f5c'), CheckinDate: '2024-04-29 06:44:00', CheckoutDate: '2024-05-30 19:53:00', Patient_Severity_Score: 57.7, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Cindy Wibisono' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f0e'), CheckinDate: '2024-10-06 13:43:00', CheckoutDate: '2024-11-29 09:37:00', Patient_Severity_Score: 55.1, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Asta Koch' } ] }, { _id: ObjectId('67fc3d2f40d2b3c8439523de'), CheckinDate: '2024-08-24 22:40:00', CheckoutDate: '2024-10-09 12:18:00', Patient_Severity_Score: 66, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Paloma Aguero' } ] }, { _id: ObjectId('67fc3d3040d2b3c843956f7e'), CheckinDate: '2024-11-04 14:50:00', CheckoutDate: '2024-12-31 22:59:59', Patient_Severity_Score: 51.5, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Aulikki Johansson' } ] }, { _id: ObjectId('67fc3d3040d2b3c84395e0ff'), CheckinDate: '2024-01-14 19:09:00', CheckoutDate: '2024-02-07 15:43:00', Patient_Severity_Score: 47.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Laura Potter' } ] }, { _id: ObjectId('67fc3d3140d2b3c843965ed2'), CheckinDate: '2024-01-03 09:39:00', CheckoutDate: '2024-02-09 12:55:00', Patient_Severity_Score: 57.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Gabriela Cassiano' } ] }, { _id: ObjectId('67fc3d3140d2b3c843966ba1'), CheckinDate: '2024-07-03 13:38:00', CheckoutDate: '2024-07-17 07:46:00', Patient_Severity_Score: 60.3, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Monica Zuniga' } ] }, { _id: ObjectId('67fc3d3140d2b3c843969226'), CheckinDate: '2024-04-06 11:36:00', CheckoutDate: '2024-04-26 07:02:00', Patient_Severity_Score: 62.9, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Stanislava Beranova' } ] } ] The star query approach focuses solely on filtering to obtain the input for further processing, while retaining the full power of aggregation pipelines. Additional Aggregation after Filtering When I have the set of documents efficiently filtered upfront, I can apply some aggregations before the projection. 
For example, the following grouped per doctor and counted the number of patients and the range of severity score: db.FactEncounterLabTests.aggregate([ search, { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$unwind": "$ResponsibleDoctor" }, { "$group": { "_id": { "doctor_id": "$ResponsibleDoctor.Doctor_ID", "doctor_name": { "$concat": [ "$ResponsibleDoctor.Doctor Name", " ", "$ResponsibleDoctor.Doctor Surname" ] } }, "min_severity_score": { "$min": "$Patient_Severity_Score" }, "max_severity_score": { "$max": "$Patient_Severity_Score" }, "patient_count": { "$sum": 1 } // Count the number of patients } }, { "$project": { "doctor_name": "$_id.doctor_name", "min_severity_score": 1, "max_severity_score": 1, "patient_count": 1 } } ]) My filters got documents from only one doctor and nine patients: [ { _id: { doctor_id: 862, doctor_name: 'Sayuri Shan Kou' }, min_severity_score: 47.6, max_severity_score: 66, patient_count: 9, doctor_name: 'Sayuri Shan Kou' } ] Using a MongoDB document model, this method enables direct analytical queries on the operational database, removing the need for a separate analytical database. The search index operates as the analytical component for the operational database and works with the MongoDB aggregation pipeline. Since the search index runs as a separate process, it can be deployed on a dedicated search node to isolate resource usage. When running analytics on an operational database, queries should be designed to minimize impact on the operational workload. Conclusion MongoDB’s document model with Atlas Search indexes supports managing and querying data following a star schema approach. By using a single search index on the fact collection and querying dimension collections for filters, it is possible to perform ad-hoc queries without replicating data into a separate analytical schema as typically done in relational databases. This method resembles the approach used in SQL databases, where a star schema data mart is maintained apart from the normalized operational database. In MongoDB, the document model uses embedding and referencing patterns similar to a star schema and is structured for operational transactions. Search indexes provide similar functionality without moving data to a separate system. The method, implemented as a three-stage star query, can be integrated into client applications to optimize query execution and enable near-real-time analytics on complex data. This approach supports hybrid transactional and analytical processing (HTAP) workloads.

By Franck Pachot
AI-Native Platforms: The Unstoppable Alliance of GenAI and Platform Engineering
AI-Native Platforms: The Unstoppable Alliance of GenAI and Platform Engineering

Let's be honest. Building developer platforms, especially for AI-native teams, is a complex art, a constant challenge. It's about finding a delicate balance: granting maximum autonomy to development teams without spiraling into chaos, and providing incredibly powerful, cutting-edge tools without adding superfluous complexity to their already dense workload. Our objective as Platform Engineers has always been to pave the way, remove obstacles, and accelerate innovation. But what if the next, inevitable phase of platform evolution wasn't just about what we build and provide, but what Generative AI can help us co-build, co-design, and co-manage? We're not talking about a mere incremental improvement, a minor optimization, or a marginal new feature. We're facing a genuine paradigm shift, a conceptual earthquake where artificial intelligence is no longer merely the final product of our efforts, the result of our development toils, but becomes the silent partner, the tireless ally that is already reimagining, rewriting, and redefining our entire development experience. This is the real gamble, the challenge that awaits us: transforming our platforms from simple toolsets, however sophisticated, into intelligent, dynamic, and self-optimizing ecosystems. A place where productivity isn't just high, but exceptionally high, and innovation flows frictionlessly. What if We Unlock 100% of Our Platform’s Potential? Your primary goal, like that of any good Platform Engineer, is already to make developers' lives simpler, faster, and, let's admit it, significantly more enjoyable. Now, imagine endowing your platform with genuine intelligence, with the ability to understand, anticipate, and even generate. GenAI, in this context, isn't just an additional feature that layers onto existing ones; it's the catalyst that is already fundamentally redefining the Developer Experience (DevEx), exponentially accelerating the entire software development lifecycle, and, even more fascinating, creating new, intuitive, and natural interfaces for interacting with the platform's intrinsic capabilities. Let's momentarily consider the most common and frustrating pain points that still afflict the average developer: the exhaustive and often fruitless hunt through infinite and fragmented documentation, the obligation to memorize dozens, if not hundreds, of specific and often cryptic CLI commands, or the tedious and repetitive generation of boilerplate code. With the intelligent integration of GenAI, your platform magically evolves into a true intelligent co-pilot. Imagine a developer who can simply express a request in natural language, as if speaking to an expert colleague: "Provision a new staging environment for my authentication microservice, complete with a PostgreSQL database, a dedicated Kafka topic, and integration with our monitoring system." The GenAI-powered platform not only understands the deep meaning and context of the request, not only translates the intention into a series of technical actions, but executes the operation autonomously, providing immediate feedback and magically configuring everything needed. This isn't mere automation, which we already know; it's a conversational interaction, deep and contextual, that almost completely zeroes out the developer's cognitive load, freeing their mind and creative energies to focus on innovation, not on the complex and often tedious infrastructural "plumbing". But the impact extends far beyond simple commands. 
GenAI can act as an omnipresent expert, an always-available and incredibly informed figure, providing real-time, contextual assistance. Imagine being stuck on a dependency error, a hard-to-diagnose configuration problem, or a security vulnerability. Instead of spending hours searching forums or asking colleagues, you can ask the platform directly. And it, magically, suggests practical solutions, directs you to relevant internal best practices (perhaps your own guides, finally usable in an intelligent way!), or even proposes complete code patches to solve the problem. It can proactively identify potential security vulnerabilities in the code you've just generated or modified, suggest intelligent refactorings to improve performance, or even scaffold entire new modules or microservices based on high-level descriptions. This drastically accelerates the entire software development lifecycle, making best practices inherent to the process and transforming bottlenecks into opportunities for automation. Your platform is no longer a mere collection of passive tools, but an intelligent and proactive partner at every single stage of the developer's workflow, from conception to implementation, from testing to deployment. Crucially, for this to work, the GenAI model must be fed with the right platform context. By ingesting all platform documentation, internal APIs, service catalogs, and architectural patterns, the AI becomes an unparalleled tool for discoverability of platform items. Developers can now query in natural language to find the right component, service, or golden path for their needs. Furthermore, this contextual understanding allows the AI to interrogate and access all data and assets within the platform itself, as well as from the applications being developed on it, providing insights and recommendations in real-time. This elevates the concept of a composable architecture, already enabled by your platform, to an entirely new level. With an AI co-pilot that not only knows all available platform items but also understands how to use them optimally and how others have used them effectively, the development of new composable applications or rapid Proofs of Concept (PoCs) becomes faster than ever before. The new interfaces enabled by GenAI go beyond mere suggestion. Think of natural language chatbot interfaces for giving commands, where the platform responds like a virtual assistant. Crucially, thanks to advancements like Model Context Protocol (MCP) or similar tool-use capabilities, the GenAI-powered platform can move beyond just "suggesting" and actively "doing". It can execute complex workflows, interact with external APIs, and trigger actions within your infrastructure. This fosters a true cognitive architecture where the model isn't just generating text but is an active participant in your operations, capable of generating architectural diagrams, provisioning resources, or even deploying components based on a simple natural language description. The vision is that of a "platform agent" or an "AI persona" that learns and adapts to the specific needs of the team and the individual developer, constantly optimizing their path and facilitating the adoption of best practices. Platforms: The Launchpad for AI-Powered Applications This synergy is two-way, a deep symbiotic relationship. 
If, on one hand, GenAI infuses new intelligence and vitality into platforms, on the other, your Internal Developer Platforms are, and will increasingly become, the essential launchpad for the unstoppable explosion of AI-powered applications. The complex and often winding journey of an artificial intelligence model—from the very first phase of experimentation and prototyping, through intensive training, to serving in production and scalable inference—is riddled with often daunting infrastructural complexities. Dedicated GPU clusters, specialized Machine Learning frameworks, complex data pipelines, and scalable, secure, and performant serving endpoints are by no means trivial for every single team to manage independently. And this is where your platform uniquely shines. It has the power to abstract away all the thorny and technical details of AI infrastructure, providing self-service and on-demand provisioning of the exact compute resources (CPU, various types of GPUs), storage (object storage, data lakes), and networking required for every single phase of the model's lifecycle. Imagine a developer who has just finished training a new model and needs to deploy an inference service. Instead of interacting with the Ops team for days or weeks, they simply request it through an intuitive self-service portal on the platform, and within minutes, the platform automatically provisions the necessary hardware (perhaps a dedicated GPU instance), deploys the model to a scalable endpoint (e.g., a serverless service or a container on a dedicated cluster), and, transparently, even generates a secure API key for access and consumption. This process eliminates days or weeks of manual configuration, of tickets and waiting times, transforming a complex and often frustrating MLOps challenge into a fluid, instant, and completely self-service operation. The platform manages not only serving but the entire lifecycle: from data preparation, to training clusters, to evaluation and A/B testing phases, all the way to post-deployment monitoring. Furthermore, platforms provide crucial golden paths for AI application development at the application layer. There's no longer a need for every team to reinvent the wheel for common AI patterns. Your platform can offer pre-built templates and codified best practices for integrating Large Language Models (LLMs), implementing patterns like Retrieval-Augmented Generation (RAG) with connectors to your internal data sources, or setting up complete pipelines for model monitoring and evaluation. Think of robust libraries and opinionated frameworks for prompt engineering, for managing model and dataset versions, for specific AI model observability (e.g., tools for bias detection, model interpretation, or drift management). The platform becomes a hub for collaboration on AI assets, facilitating the sharing and reuse of models, datasets, and components, including the development of AI agents. By embedding best practices and pre-integrating the most common and necessary AI services, every single developer, even one without a deep Machine Learning background, is empowered to infuse their applications with intelligent, cutting-edge capabilities. This not only democratizes AI development across the organization but unlocks unprecedented innovation that was previously limited to a few specialized teams. The Future Is Symbiotic: Your Next Move The era of AI-native development isn't an option; it's an imminent reality, and it urgently demands AI-native platforms. 
The marriage of GenAI and Platform Engineering isn't just an evolutionary step; it's a revolutionary leap destined to redefine the very foundations of our craft. GenAI makes platforms intrinsically smarter, more intuitive, more responsive, and consequently, incredibly more powerful. Platforms, in turn, provide the robust, self-service infrastructure and the well-paved roads necessary to massively accelerate the adoption and deployment of AI across the enterprise, transforming potential into reality. Are you ready to stop building for AI and start building with AI? Now is the time to act. Identify the most painful bottlenecks in your current DevEx and think about how GenAI could transform them. Prioritize the creation of self-service capabilities for AI infrastructure, making model deployment as simple as that of a traditional microservice. Cultivate a culture of "platform as a product", where AI is not just a consumer, but a fundamental feature of the platform itself. The future of software development isn't just about AI-powered applications; it's about an AI-powered development experience that completely redefines the concepts of productivity, creativity, and the very act of value creation. Embrace this unstoppable alliance, and unlock the next fascinating frontier of innovation. The time of static platforms is over. The era of intelligent platforms has just begun.

By Graziano Casto
Misunderstanding Agile: Bridging The Gap With A Kaizen Mindset
Misunderstanding Agile: Bridging The Gap With A Kaizen Mindset

In recent years, Agile has become closely associated with modern software development, promoting customer-focused value delivery, regular feedback loops, and empowered teams. However, beneath the familiar terminology, many technical professionals are beginning to question whether Agile is achieving its intended outcomes or simply adding complexity. Many experienced developers and engineers voice discontent with excessive processes, poorly executed rituals, and a disconnect between Agile principles and the realities of their daily work. As organizations push for broader Agile adoption, understanding the roots of this discontent is crucial — not only for improving team morale but also for ensuring that Agile practices genuinely add value rather than becoming just another management fad. The Agile Manifesto The Agile Manifesto defines a set of values and principles that guide software development (and other products). It inspires various frameworks and methods to support iterative delivery, early and continuous value creation, team collaboration, and continuous improvement through regular feedback and adaptation. Teams may misinterpret their core purpose when implementing Agile methodologies that do not adhere to their foundational principles. This misinterpretation can distort the framework’s adaptability and focus on customer-centric value delivery. The sooner we assess the health of Agile practices and take corrective action, the greater the benefits for business outcomes and team morale. Feedback on Agile Practices Here are some common feedback themes based on Scrum teams' perceptions of their experience with Agile practices. 1. Disconnect Between Agile Theory and Practice The Agile Manifesto sounds excellent, but real-world Agile feels like “Agile theater” with ceremonies and buzzwords. Cause: Many teams adopt Agile practices solely to undergo the process without embracing its values. Change in perception: Recognize the difference between doing vs. being Agile. Foster a culture of self-organized teams delivering value with continuous improvement to customers. 2. Lack of Autonomy Agile can feel prescriptive, with strict roles and rituals that constrain engineers. Cause: An overly rigid application of Agile can stifle creativity and reduce a sense of ownership. Engineers thrive when given the freedom to solve problems rather than being confined to a prescriptive approach. Change in perception: Agile teams are empowered to make decisions. They don’t dwell on obstacles—they take ownership, lead through collaboration, and focus on delivering solutions with achievable delivery commitments. 3. Misuse of Agile as a Management Tool Agile is used for micromanagement to track velocity and demand commitments. Cause: Agile is sometimes misunderstood to focus on metrics over outcomes. When velocity is prioritized over value, the purpose gets lost. Change in perception: Focus on principles and purpose, not just processes. Processes aren’t about restriction, but repeatable and reliable success. Agile processes support the team by reinforcing what works and making success scalable. 4. Lack of Visible Improvement Despite Agile processes, teams still face delays, unclear requirements, or poor decisions. Cause: When teams struggle to show visible improvement, foundational elements — like a clear roadmap and meaningful engagement with engineers around the product vision — are often missing. 
Change in perception: Anchor Agile practices to tangible outcomes, such as faster feedback loops, improved quality, and reduced defects. Continuously inspect and adapt the process and product direction, ensuring both evolve together to drive meaningful progress. How to Bridge the Gap With Kaizen The disconnect between Agile’s theoretical benefits and practical execution can undermine empowerment and autonomy for a self-organized team, ultimately producing outcomes antithetical to the methodology’s intent of delivering iterative, user-focused solutions. Without proper contextualization and leadership buy-in, such implementations risk reducing Agile to a superficial process rather than a cultural shift toward continuous improvement. As the Japanese philosophy of Kaizen reminds us, meaningful change happens incrementally. Agile retrospectives embody this mindset. When the process isn't working, the team must come together — not to assign blame but to reflect, realign, and evolve. Leveraging the Power of Retrospective for Continuous Improvement Misalignment with the value statement is a core reason Agile processes fail. Agile teams should go beyond surface-level issues and explore more profound, value-driven questions in the retrospective to get the most out of them. Some of the recommended core areas for effective Agile retrospectives: Value Alignment What does “value” mean to us in this sprint or project? Are we clear on what our customer truly needs right now? Flow and Process Efficiency Where did work get blocked and delayed, and is the team aware of the communication path to seek support? Are our ceremonies (stand-ups, planning, reviews) meaningful, valuable, or just rituals? Commitment and Focus Were our sprint goals clear and achievable? Did we commit to too much or too little? Customer Centricity Did we receive or act on honest feedback from users or stakeholders? Do we know how the work impacted the customer? Suggested Template for Agile Retrospective Takeaways Use this template to capture and communicate the outcomes of your retrospective. It helps ensure accountability, transparency, and alignment going forward. A structured retrospective framework for teams to reflect on performance and improve workflows. 1. Keep doing what’s working well: Practical and valuable habits and Practices. What reinforces team strengths and morale? Examples: Effective and outcome-based meeting Collaboration for efficient dependency management 2. Do less of what we are doing too much of: Process overdose. Encourage balance and efficiency. Overused activities are not always valuable. Examples: Too many long meetings drain team morale and disrupt daily progress. Excessive code reviews on trivial commits delay code merge and integration. 3. Stop doing what’s not working and should be eliminated: Identify waste or negative patterns. Break unhealthy habits that reduce productivity or hurt team morale. Examples: Starting work before stories and the Definition of Done are fully defined - action before understanding purpose, business value, and success criteria Skipping retrospectives - detached from improvement 4. Start doing what new practices or improvements we should try: Encourages innovation, experimentation, and growth. A great place to introduce ideas that the team hasn't tried yet. Examples: Add a mid-sprint check-in Start using sprint goals more actively Conclusion Agile is based on the principle of progressing through continuous improvement and incrementally delivering value. 
Retrospective meetings are crucial in this process, as they allow teams to pause, reflect, and realign themselves to ensure they are progressing in the right direction. This approach aligns with the Kaizen philosophy of ongoing improvement.

By Pabitra Saikia
Automating Sentiment Analysis Using Snowflake Cortex
Automating Sentiment Analysis Using Snowflake Cortex

In this hands-on tutorial, you'll learn how to automate sentiment analysis and categorize customer feedback using Snowflake Cortex, all through a simple SQL query without needing to build heavy and complex machine learning algorithms. No MLOps is required. We'll work with sample data simulating real customer feedback comments about a fictional company, "DemoMart," and classify each customer feedback entry using Cortex's built-in function. We'll determine sentiment (positive, negative, neutral) and label the feedback into different categories. The goal is to: Load a sample dataset of customer feedback into a Snowflake table. Use the built-in LLM-powered classification (CLASSIFY_TEXT) to tag each entry with a sentiment and classify the feedback into a specific category. Automate this entire workflow to run weekly using a Snowflake task. Generate insights from the classified data. Prerequisites A Snowflake account with access to Snowflake Cortex, role privileges to create tables, tasks, and procedures, and basic SQL knowledge. Step 1: Create Sample Feedback Table We'll use a sample dataset of customer feedback that covers products, delivery, customer support, and other areas. Let's create a table in Snowflake to store this data. Here is the SQL for creating the required table to hold customer feedback. SQL CREATE OR REPLACE TABLE customer.csat.feedback ( feedback_id INT, feedback_ts DATE, feedback_text STRING ); Now, you can load the data into the table using Snowflake's Snowsight interface. The sample data "customer_feedback_demomart.csv" is available in the GitHub repo. You can download and use it. Step 2: Use Cortex to Classify Sentiment and Category Let's read and process each row from the feedback table. Here's the magic. This single query classifies each piece of feedback for both sentiment and category: SQL SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING AS sentiment, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_category FROM customer.csat.feedback LIMIT 10; I have used the CLASSIFY_TEXT function available within Snowflake Cortex to derive the sentiment based on the feedback_text and further classify it into a specific category the feedback is associated with, such as 'Product', 'Customer Service', 'Delivery', and so on. P.S.: You can change the categories based on your business needs. Step 3: Store Classified Results Let's store the classified results in a separate table for further reporting and analysis purposes. For this, I have created a table with the name feedback_classified as shown below. It includes a processed_timestamp column that records when each row was classified. SQL CREATE OR REPLACE TABLE customer.csat.feedback_classified ( feedback_id INT, feedback_ts DATE, feedback_text STRING, sentiment STRING, feedback_category STRING, processed_timestamp TIMESTAMP_LTZ ); Initial Bulk Load Now, let's do an initial bulk classification for all existing data before moving on to the incremental processing of newly arriving data. 
SQL -- Initial Load INSERT INTO customer.csat.feedback_classified SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_label, CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP FROM customer.csat.feedback; Once the initial load is completed successfully, let's build an SQL statement that fetches only incremental data, using the most recent feedback_ts already classified as the high-water mark. For the incremental load, we need fresh data with customer feedback. For that, let's insert ten new records into our raw table customer.csat.feedback. SQL INSERT INTO customer.csat.feedback (feedback_id, feedback_ts, feedback_text) VALUES (5001, CURRENT_DATE, 'My DemoMart order was delivered to the wrong address again. Very disappointing.'), (5002, CURRENT_DATE, 'I love the new packaging DemoMart is using. So eco-friendly!'), (5003, CURRENT_DATE, 'The delivery speed was slower than promised. Hope this improves.'), (5004, CURRENT_DATE, 'The product quality is excellent, I’m genuinely impressed with DemoMart.'), (5005, CURRENT_DATE, 'Customer service helped me cancel and reorder with no issues.'), (5006, CURRENT_DATE, 'DemoMart’s website was down when I tried to place my order.'), (5007, CURRENT_DATE, 'Thanks DemoMart for the fast shipping and great support!'), (5008, CURRENT_DATE, 'Received a damaged item. This is the second time with DemoMart.'), (5009, CURRENT_DATE, 'DemoMart app is very user-friendly. Shopping is a breeze.'), (5010, CURRENT_DATE, 'The feature I wanted is missing. Hope DemoMart adds it soon.'); Step 4: Automate Incremental Data Processing With TASK Now that we have newly added (incremental) fresh data in our raw table, let's create a task to pick up only new data and classify it automatically. We will schedule this task to run every Sunday at midnight UTC. SQL -- Creating the task CREATE OR REPLACE TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED WAREHOUSE = COMPUTE_WH SCHEDULE = 'USING CRON 0 0 * * 0 UTC' -- Run every Sunday at midnight UTC AS INSERT INTO customer.csat.feedback_classified SELECT feedback_id, feedback_ts, feedback_text, SNOWFLAKE.CORTEX.CLASSIFY_TEXT(feedback_text, ['positive', 'negative', 'neutral']):label::STRING, SNOWFLAKE.CORTEX.CLASSIFY_TEXT( feedback_text, ['Product', 'Customer Service', 'Delivery', 'Price', 'User Experience', 'Feature Request'] ):label::STRING AS feedback_label, CURRENT_TIMESTAMP AS PROCESSED_TIMESTAMP FROM customer.csat.feedback WHERE feedback_ts > (SELECT COALESCE(MAX(feedback_ts), '1900-01-01'::DATE) FROM customer.csat.feedback_classified); This will automatically run every Sunday at midnight UTC, process any newly arrived customer feedback, and classify it. 
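One operational note: Snowflake creates tasks in a suspended state, so the schedule above will not fire until the task is resumed. Assuming the task name used in the previous step, resuming it (and optionally triggering an immediate test run) looks like this:
SQL
ALTER TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED RESUME;
-- Optional: run the task once right away instead of waiting for Sunday
EXECUTE TASK CUSTOMER.CSAT.FEEDBACK_CLASSIFIED;
Running SHOW TASKS IN SCHEMA customer.csat; afterwards should show the task state as started rather than suspended.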
Step 5: Visualize Insights You can now build dashboards in Snowsight to see weekly trends using a simple query like this: SQL SELECT feedback_category, sentiment, COUNT(*) AS total FROM customer.csat.feedback_classified GROUP BY feedback_category, sentiment ORDER BY total DESC; Conclusion With just a few lines of SQL, you: Ingested raw feedback into a Snowflake table. Used Snowflake Cortex to classify customer feedback and derive sentiment and feedback categories. Automated the process to run weekly. Built insights into the classified feedback for business users and the leadership team to act upon by category and sentiment. This approach is ideal for support teams, product teams, and leadership, as it allows them to continuously monitor customer experience without building or maintaining ML infrastructure. GitHub I have created a GitHub repo with all the code and sample data; the dataset generator and SQL scripts are freely available there.

By Rajanikantarao Vellaturi
Converting List to String in Terraform
Converting List to String in Terraform

In Terraform, you will often need to convert a list to a string when passing values to configurations that require a string format, such as resource names, cloud instance metadata, or labels. Terraform uses HCL (HashiCorp Configuration Language), so handling lists requires functions like join() or format(), depending on the context. How to Convert a List to a String in Terraform The join() function is the most effective way to convert a list into a string in Terraform. This concatenates list elements using a specified delimiter, making it especially useful when formatting data for use in resource names, cloud tags, or dynamically generated scripts. The join(", ", var.list_variable) function, where list_variable is the name of your list variable, merges the list elements with ", " as the separator. Here’s a simple example: Shell variable "tags" { default = ["dev", "staging", "prod"] } output "tag_list" { value = join(", ", var.tags) } The output would be: Shell "dev, staging, prod" Example 1: Formatting a Command-Line Alias for Multiple Commands In DevOps and development workflows, it’s common to run multiple commands sequentially, such as updating repositories, installing dependencies, and deploying infrastructure. Using Terraform, you can dynamically generate a shell alias that combines these commands into a single, easy-to-use shortcut. Shell variable "commands" { default = ["git pull", "npm install", "terraform apply -auto-approve"] } output "alias_command" { value = "alias deploy='${join(" && ", var.commands)}'" } Output: Shell "alias deploy='git pull && npm install && terraform apply -auto-approve'" Example 2: Creating an AWS Security Group Description Imagine you need to generate a security group rule description listing allowed ports dynamically: Shell variable "allowed_ports" { default = [22, 80, 443] } resource "aws_security_group" "example" { name = "example_sg" description = "Allowed ports: ${join(", ", [for p in var.allowed_ports : tostring(p)])}" dynamic "ingress" { for_each = var.allowed_ports content { from_port = ingress.value to_port = ingress.value protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] } } } The join() function, combined with a list comprehension, generates a dynamic description like "Allowed ports: 22, 80, 443". This ensures the security group documentation remains in sync with the actual rules. Alternative Methods For most use cases, the join() function is the best choice for converting a list into a string in Terraform, but the format() and jsonencode() functions can also be useful in specific scenarios. 1. Using format() for Custom Formatting The format() function helps control the output structure while joining list items. It does not directly convert lists to strings, but it can be used in combination with join() to achieve custom formatting. Shell variable "ports" { default = [22, 80, 443] } output "formatted_ports" { value = format("Allowed ports: %s", join(" | ", var.ports)) } Output: Shell "Allowed ports: 22 | 80 | 443" 2. Using jsonencode() for JSON Output When passing structured data to APIs or Terraform modules, you can use the jsonencode() function, which converts a list into a JSON-formatted string. Shell variable "tags" { default = ["dev", "staging", "prod"] } output "json_encoded" { value = jsonencode(var.tags) } Output: Shell "["dev", "staging", "prod"]" Unlike join(), this format retains the structured array representation, which is useful for JSON-based configurations. 
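To make the distinction concrete, here is a minimal sketch (the regions variable is illustrative) showing how the two functions render the same list:
Shell
variable "regions" {
  default = ["us-east-1", "eu-west-1"]
}

output "joined_regions" {
  # Renders as: us-east-1, eu-west-1
  value = join(", ", var.regions)
}

output "literal_regions" {
  # Renders as: ["us-east-1","eu-west-1"]
  value = jsonencode(var.regions)
}
The first output is a plain comma-separated string, while the second preserves the bracketed, quoted structure of the list, which is what the next section calls a literal string representation.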
Creating a Literal String Representation in Terraform Sometimes you need to convert a list into a literal string representation, meaning the output should preserve the exact structure as a string (e.g., including brackets, quotes, and commas like a JSON array). This is useful when passing data to APIs, logging structured information, or generating configuration files. For most cases, jsonencode() is the best option due to its structured formatting and reliability in API-related use cases. However, if you need a simple comma-separated string without additional formatting, join() is the better choice. Common Scenarios for List-to-String Conversion in Terraform Converting a list to a string in Terraform is useful in multiple scenarios where Terraform requires string values instead of lists. Here are some common use cases: Naming resources dynamically: When creating resources with names that incorporate multiple dynamic elements, such as environment, application name, and region, these components are often stored as a list for modularity. Converting them into a single string allows for consistent and descriptive naming conventions that comply with provider or organizational naming standards. Tagging infrastructure with meaningful identifiers: Tags are often key-value pairs where the value needs to be a string. If you’re tagging resources based on a list of attributes (like team names, cost centers, or project phases), converting the list into a single delimited string ensures compatibility with tagging schemas and improves downstream usability in cost analysis or inventory tools. Improving documentation via descriptions in security rules: Security groups, firewall rules, and IAM policies sometimes allow for free-form text descriptions. Providing a readable summary of a rule’s purpose, derived from a list of source services or intended users, can help operators quickly understand the intent behind the configuration without digging into implementation details. Passing variables to scripts (e.g., user_data in EC2 instances): When injecting dynamic values into startup scripts or configuration files (such as a shell script passed via user_data), you often need to convert structured data like lists into strings. This ensures the script interprets the input correctly, particularly when using loops or configuration variables derived from Terraform resources (a minimal sketch follows after the key points below). Logging and monitoring, ensuring human-readable outputs: Terraform output values are often used for diagnostics or integration with logging/monitoring systems. Presenting a list as a human-readable string improves clarity in logs or dashboards, making it easier to audit deployments and troubleshoot issues by conveying aggregated information in a concise format. Key Points Converting lists to strings in Terraform is crucial for dynamically naming resources, structuring security group descriptions, formatting user data scripts, and generating readable logs. Using join() for readable concatenation, format() for creating formatted strings, and jsonencode() for structured output ensures clarity and consistency in Terraform configurations.
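To close, here is a minimal sketch of the user_data scenario mentioned above; the AMI ID, resource names, and package list are illustrative assumptions rather than values taken from this article:
Shell
variable "packages" {
  default = ["nginx", "git", "htop"]
}

resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI ID
  instance_type = "t3.micro"

  # join() flattens the list into a single space-separated string the shell can consume
  user_data = <<-EOT
    #!/bin/bash
    yum install -y ${join(" ", var.packages)}
  EOT
}
At plan time, join(" ", var.packages) renders as "nginx git htop", so the startup script installs every package in the list without any manual string handling.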

By Mariusz Michalowski
