Monitoring and Managing the Growth of the MSDB System Database in SQL Server

By Arvind Toorpu
In SQL Server environments, system databases play crucial roles in ensuring smooth and reliable database operations. Among these, the msdb database is critical: it handles a variety of operational tasks, including job scheduling via SQL Server Agent, alert management, Database Mail configuration, and backup and restore history tracking. These functions are essential for automating routine maintenance, monitoring system health, and managing administrative workflows. However, the msdb database can grow unexpectedly large, especially in busy or long-running environments. Left unchecked, this growth can lead to performance degradation, longer response times for job execution, and potential issues with SQL Server Agent functionality. Understanding how to monitor and manage the size of msdb is therefore critical for database administrators who want to maintain optimal SQL Server performance.

Monitoring MSDB Growth

This article explains how to detect and monitor the growth of the msdb database using a targeted SQL query, so you can identify and address the root causes effectively. By regularly tracking the size and growth trends of msdb, administrators can proactively implement cleanup and maintenance strategies that keep the system database efficient and responsive. This proactive approach minimizes disruptions and helps maintain the reliability of automated SQL Server operations.

Why Monitor msdb Growth?

Growth in the msdb database typically results from:

- Retained job history for SQL Server Agent
- Database Mail logs
- Backup and restore history
- Maintenance plan logs

Unchecked growth may lead to disk space issues or slower query performance when accessing msdb.

Detecting Growth in the msdb Database

To analyze and monitor the growth of the msdb database, you can use the following query. It identifies the largest objects within msdb and reports their sizes and row counts.

MS SQL
USE msdb;
GO

SELECT TOP (10)
       o.[object_id]
     , obj = SCHEMA_NAME(o.[schema_id]) + '.' + o.name
     , o.[type]
     , i.total_rows
     , i.total_size
FROM sys.objects o
JOIN (
    SELECT i.[object_id]
         , total_size = CAST(SUM(a.total_pages) * 8. / 1024 AS DECIMAL(18,2))
         , total_rows = SUM(CASE WHEN i.index_id IN (0, 1) AND a.[type] = 1 THEN p.[rows] END)
    FROM sys.indexes i
    JOIN sys.partitions p       ON i.[object_id] = p.[object_id] AND i.index_id = p.index_id
    JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
    WHERE i.is_disabled = 0
      AND i.is_hypothetical = 0
    GROUP BY i.[object_id]
) i ON o.[object_id] = i.[object_id]
WHERE o.[type] IN ('V', 'U', 'S')
ORDER BY i.total_size DESC;

Understanding the Query

Key components:

- sys.objects: Provides details about all objects (e.g., tables, views, stored procedures) in the database.
- sys.indexes, sys.partitions, and sys.allocation_units: Combined to calculate the total size (in MB) and row count of each object.
- Filters: Exclude hypothetical and disabled indexes to focus on actual data usage, and limit results to user tables (U), system tables (S), and views (V).
- Sorting: Orders results by total size in descending order to highlight the largest objects.

Example Output

Object ID | Object Name | Object Type | Total Rows | Total Size (MB)
105657892 | dbo.backupset | U | 500,000 | 320.50
205476921 | dbo.sysmail_log | S | 120,000 | 150.25
304567112 | dbo.agent_job_history | U | 800,000 | 85.75
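The per-object view above usually points straight at the culprit, but it also helps to track msdb's overall data and log file sizes over time. A minimal check, using only standard catalog views and functions (thresholds and scheduling are up to your environment), might look like this:

MS SQL
USE msdb;
GO

-- Report each msdb file with its current size and free space, in MB.
-- size is stored in 8 KB pages, so size * 8 / 1024 converts to MB.
SELECT name                                                              AS file_name
     , type_desc                                                         AS file_type
     , CAST(size * 8.0 / 1024 AS DECIMAL(18,2))                          AS size_mb
     , CAST((size - FILEPROPERTY(name, 'SpaceUsed')) * 8.0 / 1024
            AS DECIMAL(18,2))                                            AS free_mb
FROM sys.database_files;

Captured on a schedule (for example, into a small history table), these numbers make growth trends obvious at a glance.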
Common Causes of msdb Growth

- Job history retention: SQL Server Agent retains extensive job history by default, which can grow significantly over time.
- Database Mail logs: Frequent email activity results in large logs stored in sysmail_log.
- Backup and restore history: Details of every backup and restore operation are stored in msdb.
- Maintenance plan logs: Maintenance plans generate logs that contribute to the database size.

Steps to Manage msdb Growth

1. Clear Old Job History

Reduce the retention period for job history or delete old entries:

MS SQL
EXEC msdb.dbo.sp_purge_jobhistory
    @job_name = NULL,             -- NULL clears history for all jobs
    @oldest_date = '2023-01-01';  -- Retain only recent entries

2. Clean Database Mail Logs

Purge old Database Mail log entries:

MS SQL
DELETE FROM sysmail_log
WHERE log_date < GETDATE() - 30;  -- Keep only the last 30 days

3. Clear Backup and Restore History

Use the sp_delete_backuphistory system procedure:

MS SQL
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = '2023-01-01';

4. Adjust Retention Settings

Modify retention settings to prevent excessive growth. This is the procedure SSMS calls when you change the SQL Server Agent job history limits:

MS SQL
EXEC msdb.dbo.sp_set_sqlagent_properties
    @jobhistory_max_rows = 1000,
    @jobhistory_max_rows_per_job = 100;

Monitoring Best Practices

To keep msdb growth under control, implement a robust monitoring strategy:

Automate monitoring: Set up a SQL Server Agent job to execute the size-checking query on a regular cadence (for example, daily or hourly) and capture the results in a centralized table or monitoring dashboard. When the size of key msdb tables or of the overall database exceeds a predefined threshold, such as 80% of the allocated space, the job can automatically send a summary report or trigger an alert. This proactive approach ensures you spot growth trends early, without relying on manual checks.

Enable alerts: Leverage SQL Server Agent's built-in alerting mechanism. Define alerts on specific performance conditions, such as rapid increases in log file usage or high log_reuse_wait_desc statuses, and configure them to notify administrators via email, PagerDuty, or other channels. With appropriate severity levels and response procedures in place, your team can address issues before they impact scheduled jobs or system mail.

Regular maintenance: Incorporate msdb cleanup tasks into your standard maintenance plan. Schedule system stored procedures such as sp_delete_backuphistory and sp_purge_jobhistory to run at off-peak hours, pruning old records according to your retention policy. Combine this with periodic index maintenance on msdb tables to reduce fragmentation and maintain query performance. Consistent housekeeping prevents unbounded growth and keeps SQL Server Agent and backup history running smoothly.
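As a concrete starting point for the automate-monitoring advice above, the sketch below creates a small history table and a daily SQL Server Agent job that records the largest msdb tables into it. The table and job names (dbo.msdb_size_history, 'Monitor msdb growth') and the schedule are illustrative placeholders; thresholds and alerting are left to your environment:

MS SQL
USE msdb;
GO

-- Hypothetical history table for size snapshots.
CREATE TABLE dbo.msdb_size_history (
    captured_at   DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
    object_name   SYSNAME       NOT NULL,
    total_rows    BIGINT        NULL,
    total_size_mb DECIMAL(18,2) NOT NULL
);
GO

-- A daily Agent job that snapshots the largest msdb tables.
EXEC msdb.dbo.sp_add_job @job_name = N'Monitor msdb growth';

EXEC msdb.dbo.sp_add_jobstep
    @job_name      = N'Monitor msdb growth',
    @step_name     = N'Capture size snapshot',
    @subsystem     = N'TSQL',
    @database_name = N'msdb',
    @command       = N'
        INSERT INTO dbo.msdb_size_history (object_name, total_rows, total_size_mb)
        SELECT TOP (10)
               SCHEMA_NAME(o.[schema_id]) + ''.'' + o.name,
               SUM(CASE WHEN p.index_id IN (0, 1) AND a.[type] = 1 THEN p.[rows] END),
               CAST(SUM(a.total_pages) * 8. / 1024 AS DECIMAL(18,2))
        FROM sys.objects o
        JOIN sys.partitions p       ON o.[object_id] = p.[object_id]
        JOIN sys.allocation_units a ON p.[partition_id] = a.container_id
        WHERE o.[type] = ''U''
        GROUP BY o.[schema_id], o.name
        ORDER BY 3 DESC;';

EXEC msdb.dbo.sp_add_jobschedule
    @job_name          = N'Monitor msdb growth',
    @name              = N'Daily at 06:00',
    @freq_type         = 4,        -- daily
    @freq_interval     = 1,
    @active_start_time = 060000;

EXEC msdb.dbo.sp_add_jobserver @job_name = N'Monitor msdb growth';  -- target the local server

From here, a reporting query or alert condition can compare the latest snapshot against earlier ones to flag unusual growth.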
Conclusion

The msdb database is vital to SQL Server's core operations, storing job schedules, backup history, and maintenance plan data. Left unchecked, its growth can degrade performance, slow job execution, and disrupt monitoring processes. Regularly monitoring msdb's size with the query provided above helps identify growth issues early, and common causes such as excessive backup or job history can be managed through routine cleanup with system stored procedures like sp_delete_backuphistory and sp_purge_jobhistory. Incorporating msdb maintenance into your overall SQL Server upkeep ensures smoother operations and better performance, and keeping msdb lean is essential for maintaining SQL Server's stability and supporting long-term scalability.
Scrum Smarter, Not Louder: AI Prompts Every Developer Should Steal

By Ella Mitkin
Most developers think AI’s only job is writing code, debugging tests, or generating documentation. But Scrum? That’s still a human mess, full of vague stories, chaotic meetings, and awkward silences. Here’s the truth: prompt engineering can turn AI into your secret Agile assistant—if you know how to talk to it. In this guide, we share field-tested, research-backed prompts that developers can use in real time to make Agile rituals smoother, smarter, and actually useful. Based on findings from Alamu et al. (2024), Verma et al. (2025), and Mitra & Lewis (2025), we show how prompt structures can turn your next standup, sprint planning, or retro into something that works for you, not just for your Scrum Master. Sprint Planning Prompts: From Chaos to Clarity Use Case: Defining scope, estimating work, and avoiding the “What’s this story even mean?” syndrome. Prompt: "As an expert in Agile backlog refinement, help me break down this story: '[insert story text]'. List sub-tasks with realistic developer effort in hours. Flag any missing requirements." Why it works: Adds structure to vague backlog items and creates an actionable breakdown, saving planning time. Prompt: "You are an Agile coach specialized in value prioritization. Here’s a list of five backlog items with estimated effort: [list]. Rank them based on business value impact, risk, and delivery speed." Why it works: Helps developers push back against arbitrary prioritization. Prompt: "Act as a Product Owner. Review these backlog stories: [list]. Suggest any that should be merged, split, or sent back for clarification based on user value." Why it works: Promotes clarity early, reduces mid-sprint surprises. Standups: Async, Remote, and Useful Again Use Case: Remote teams or developers who want to be more concise. Prompt: "Act as a standup facilitator. Summarize my work in these bullet points: [insert]. Highlight blockers and suggest one follow-up question I can ask the team." Why it works: Refines communication and highlights action. Prompt: "You are a Scrum lead tracking momentum. Based on this Git log and ticket status, generate a concise standup update (Yesterday/Today/Blockers): [insert data]." Why it works: Builds a data-driven update without fluff. Prompt: "As a burnout-aware Agile bot, review these updates: [insert]. Flag any signs of overload or repeated blockers, and suggest wellness check-in prompts." Why it works: Adds a human touch through AI. Retrospectives: Say What Needs Saying (Without the Drama) Use Case: Emotional tension, team friction, or addressing recurring issues. Prompt: "You are a retrospective expert. Analyze these notes: [insert retro notes or observations]. Suggest 3 ‘Start/Stop/Continue’ talking points that are tactful but honest." Why it works: Offers safe but direct feedback phrasing. Prompt: "As an Agile conflict mediator, suggest retro feedback for this situation: [describe team tension]. Focus on constructive language and psychological safety." Why it works: Coaches developers through conflict-aware participation. Prompt: "Act as an AI retro board tool. Cluster the following feedback into themes and suggest one lesson learned per theme: [feedback list]." Why it works: Organizes chaos into insight, fast. Ticket Crafting: User Stories That Actually Work Use Case: Turning chaos into structured tickets that meet expectations. Prompt: "As a certified Product Owner, help me rewrite this vague task into a full user story with acceptance criteria: [insert task]. 
Format it in the ‘As a… I want… so that…’ style and add 3 testable conditions." Why it works: Bridges development thinking with business expectations. Prompt: "You are a Jira expert and Agile coach. I need to document a technical debt ticket that meets DOD. Convert this explanation into a clean ticket description and add a checklist for completion." Why it works: Helps developers write what gets accepted and shipped. Prompt: "Act like a QA reviewer. Scan this user story: [story]. Suggest edge cases or acceptance tests we might have missed." Why it works: Avoids future rework by adding a testing lens early. Sprint Syncs and Review Prep: Impress Without Overthinking Use Case: Showing progress without turning into a status robot. Prompt: "Act like a Scrum Master prepping for Sprint Review. Based on this list of closed tasks, create a short impact summary and link to business goals." Why it works: Connects delivery to outcomes. Prompt: "As a technical demo expert, outline a 3-minute walkthrough script for this feature: [insert feature]. Include who it’s for, what problem it solves, and how it works." Why it works: Makes Sprint Reviews easier to navigate. Prompt: "Act as a release coordinator. Based on this sprint’s output, draft a release note with technical highlights, known limitations, and user-facing improvements." Why it works: Delivers value to internal and external stakeholders. This Is Not Cheating Using AI in Agile isn’t about faking it—it’s about making the system work for your brain. These prompts don’t replace human discussion. They just help developers show up prepared, focused, and less drained. So next time your backlog makes no sense, or your standup feels pointless, try typing instead of talking. Let the AI sharpen your edge—one prompt at a time. Why This Research Matters for Developers At a glance, integrating AI into Agile rituals may seem like a tool for managers or coaches, but developers stand to benefit just as much, if not more. That’s why so much current research is digging into the impact of prompt engineering specifically tailored for technical contributors. These aren't academic fantasies. They're responses to real developer pain points: vague tickets, unproductive standups, poorly scoped retros, and communication fatigue. Frameworks such as Prompt-Driven Agile Facilitation and Agile AI Copilot don’t just suggest AI can help—they show how developers can use targeted, structured prompts to support both solo and team productivity. These studies are increasingly reflecting the reality of hybrid work: asynchronous meetings, remote collaboration, and cross-functional handoffs. We’re seeing tools and bots being created that support retrospectives (Nguyen et al., 2025), sprint demos, and conflict resolution (Kumar et al., 2024), not because developers can't manage these—but because time and energy are finite. Prompt-based systems reduce friction and help technical teams align faster. They don't take the human out of Agile—they reduce the waste that prevents teams from being truly Agile. More importantly, this isn’t about creating robotic output. It’s about giving developers ownership of the process. These prompts act as a developer’s voice coach, technical writer, and backlog cleaner—all rolled into one. That’s why researchers are paying attention: prompt engineering isn't a passing trend. It's becoming a silent infrastructure in high-performing teams. 
So, if you’ve ever sat through a meaningless retro or received a user story that made no sense, know that AI isn't replacing your voice. It's amplifying it. You just need to know what to ask.

Research Foundations

- Prompt-Driven Agile Facilitation – Alamu et al. (2024)
- The Role of Prompt Engineering in Agile Development – Verma et al. (2025)
- Agile Standups with Conversational Agents – Mitra & Lewis (2025)
- Retrospectives Enhanced by Prompted AI Tools – Nguyen et al. (2025)
- Agile AI Copilot: Prompting and Pitfalls – Carlsen & Ghosh (2024)
- Guiding LLMs with Prompts in Agile Requirements Engineering – Feng & Liu (2023)
- Prompt-Based Chatbots in Agile Coaching – Kumar et al. (2024)
- AI Prompts in Agile Knowledge Management – Samadi & Becker (2025)

Trend Report

Generative AI

AI technology is now more accessible, more intelligent, and easier to use than ever before. Generative AI, in particular, has transformed nearly every industry exponentially, creating a lasting impact driven by its (delivered) promises of cost savings, manual task reduction, and a slew of other benefits that improve overall productivity and efficiency. The applications of GenAI are expansive, and thanks to the democratization of large language models, AI is reaching every industry worldwide.

Our focus for DZone's 2025 Generative AI Trend Report is on the trends surrounding GenAI models, algorithms, and implementation, paying special attention to GenAI's impacts on code generation and software development as a whole. Featured in this report are key findings from our research and thought-provoking content written by everyday practitioners from the DZone Community, with topics including organizations' AI adoption maturity, the role of LLMs, AI-driven intelligent applications, agentic AI, and much more.

We hope this report serves as a guide to help readers assess their own organization's AI capabilities and how they can better leverage those in 2025 and beyond.


Before You Microservice Everything, Read This

The way we build software systems is always evolving, and right now, everyone's talking about microservices. They've become popular because of cloud computing, containerization, and tools like Kubernetes. Lots of new projects use this approach, and even older systems are trying to switch over. But this discussion is about something else: the modular monolith, especially in comparison to microservices.

But why focus on this? Because it seems like the tech world has jumped on the microservice bandwagon a little too quickly, without really thinking about what's driving that decision. There's a common idea that microservices are the perfect solution to all the problems people have with traditional, monolithic application systems. From my own experience working with systems that are deployed in multiple pieces, I know this isn't true. Every way of building software has its good and bad points, and microservices are no different. They solve some problems, sure, but they also create new ones.

First, we need to get rid of the idea that a monolithic application can't be well-made. We also need to be clear about what we actually mean by "monolithic application," because people use that term in different ways. This post will focus on explaining what a modular monolith is.

Modular Monolith – What Is It?

When we're talking about technical stuff and business needs, especially how a system is put together, it's really important to be precise. We need to all be on the same page. So, let's define exactly what I mean by a modular monolith.

First, what's a "monolith"? Think of it like a statue carved from a single block of stone. In software, the "statue" is the system, and the "stone" is the code that runs it. So, a monolith system is one piece of running code, without any separate parts. Here are a couple of more technical explanations:

- Monolith system: A single-application software system designed so that different jobs (like handling data coming in and out, processing information, dealing with errors, and showing things to the user) are all mixed together, instead of being in separate, distinct pieces.
- Monolithic design: A traditional way of building software where the whole thing is self-contained, with all the parts connected and relying on each other. This is different from a modular approach, where the parts are more independent.

The phrases "mixed together" and "parts are connected and relying on each other" make single-application design sound messy and disorganized. But it doesn't have to be that way. To sum up, a monolith is just a system that's deployed as a single unit. Let's take a deeper look.

To understand what "modular" means, let's define it. Something is modular if it's made up of separate pieces that fit together to make a whole, or if it's built from parts that can be combined to create something bigger. In programming, modularity means designing and building something in separate sections. Modular programming is about dividing a program into separate, interchangeable modules, each responsible for a specific task. A module's "interface" shows what it provides and what it needs from other modules. Other modules can see the things defined in the interface, while the "implementation" is the actual code that makes those things happen.

For a modular design to be effective, each module should:

- Be independent and interchangeable.
- Contain everything it needs to do its job.
- Have a clearly defined interface.

Let's look at these in more detail.
Independence and Interchangeability Modules should be as independent as possible. They can't be completely separate, because they need to work with other modules. But they should depend on each other as little as possible. This is called Loose Coupling, Strong Cohesion. For example, imagine this code: C# // Poorly designed module with tight coupling public class OrderProcessor { private readonly InventoryService _inventoryService; private readonly PaymentService _paymentService; public OrderProcessor(InventoryService inventoryService, PaymentService paymentService) { _inventoryService = inventoryService; _paymentService = paymentService; } public void ProcessOrder(Order order) { _inventoryService.CheckStock(order); _paymentService.ProcessPayment(order); } } Here, OrderProcessor is tightly linked to InventoryService and PaymentService. If either of those services changes, OrderProcessor has to change too. This makes it less independent. A better way is to use interfaces to make the modules less dependent on each other: C# // Better design with loose coupling public interface IInventoryService { void CheckStock(Order order); } public interface IPaymentService { void ProcessPayment(Order order); } public class OrderProcessor { private readonly IInventoryService _inventoryService; private readonly IPaymentService _paymentService; public OrderProcessor(IInventoryService inventoryService, IPaymentService paymentService) { _inventoryService = inventoryService; _paymentService = paymentService; } public void ProcessOrder(Order order) { _inventoryService.CheckStock(order); _paymentService.ProcessPayment(order); } } Now, OrderProcessor depends on abstract definitions (IInventoryService and IPaymentService), which makes it more independent and easier to test or change. Modules Must Contain Everything They Need A module has to have everything it needs to do its job. In a modular monolith, a module is a business module designed to provide a complete set of features. This is called Vertical Slices, where each slice is a specific business function. For example, in an online store, you might have modules like OrderManagement, InventoryManagement, and PaymentProcessing. Each module has all the logic and parts needed to do its specific job. Modules Must Have a Defined Interface A final key to modularity is having a well-defined interface. Without a clear contract, true modular design isn't possible. A contract is the "entry point" to the module, and it should: Be clear and simple.Only include what clients need to know.Stay stable so it doesn't cause problems.Keep all the other details hidden. For example, you can define a module's contract using interfaces: C# public interface IOrderService { void PlaceOrder(Order order); Order GetOrder(int orderId); } public class OrderService : IOrderService { public void PlaceOrder(Order order) { // Implementation details } public Order GetOrder(int orderId) { // Implementation details } } Here, IOrderService is the contract, showing only the methods needed to work with the OrderService module. Conclusion Building a monolith system doesn't automatically mean it's badly designed, not modular, or low quality. A modular monolith is a single-application system built using modular principles. To make it highly modular, each module should: Be independent and interchangeable.Contain all the necessary parts to do its job (organized by business area).Have a well-defined interface or contract. 
By following these principles, you can create a modular monolith that has the simplicity of a single application with the flexibility of a modular design. This is especially useful for systems where microservices might make things too complicated.
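As a final, concrete illustration of these principles: in a modular monolith the host application typically wires modules together through their contracts only, in one composition root. The sketch below assumes the IOrderService/OrderService types from the examples above and uses Microsoft.Extensions.DependencyInjection; the AddOrderModule extension method is a naming convention invented here for illustration, not a framework API:

C#
using Microsoft.Extensions.DependencyInjection;

// Each module exposes a single registration entry point; its implementation
// types stay internal to the module's assembly or namespace.
public static class OrderModule
{
    public static IServiceCollection AddOrderModule(this IServiceCollection services)
    {
        services.AddScoped<IOrderService, OrderService>();  // contract -> implementation
        return services;
    }
}

public static class CompositionRoot
{
    public static ServiceProvider Build()
    {
        var services = new ServiceCollection();
        services.AddOrderModule();
        // Other vertical slices (inventory, payments, ...) would register the
        // same way, each through its own AddXxxModule entry point.
        return services.BuildServiceProvider();
    }
}

The point is not the dependency injection container itself but the shape: every module is reachable through one well-known entry point, and everything else stays hidden behind its interface.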

By Nizam Abdul Khadar
How to Achieve SOC 2 Compliance in AWS Cloud Environments

Did you know that cloud security was one of the biggest challenges of using cloud solutions in 2023? As businesses increasingly depend on cloud services like Amazon Web Services (AWS) to host their applications, securing sensitive data in the cloud becomes non-negotiable. Organizations must ensure their technology infrastructure meets the highest security standards. One such standard is SOC 2 (System and Organization Controls 2) compliance.

SOC 2 is more than a regulatory checkbox. It represents a business's commitment to robust security measures and instills trust in customers and stakeholders. SOC 2 compliance for AWS evaluates how securely an organization's technology setup manages data storage, processing, and transfer. Let's discuss SOC 2 compliance, its importance in AWS, and how organizations can achieve SOC 2 compliance for their AWS environments.

What Is SOC 2 Compliance?

SOC 2 is an auditing standard developed by the American Institute of CPAs (AICPA). It ensures organizations protect sensitive customer data by securing their systems, processes, and controls. SOC 2 is based on five Trust Services Criteria (TSC), and achieving SOC 2 compliance involves rigorous evaluation against these criteria:

- Security: Ensures an organization's systems and data are protected against unauthorized access, breaches, and cyber threats. It involves implementing physical and logical security measures such as access controls, encryption, and firewalls.
- Availability: Assesses the organization's ability to ensure that its systems and services are accessible and operational whenever needed by users or stakeholders. This includes measures to prevent and mitigate downtime, such as redundancy, failover mechanisms, disaster recovery plans, and proactive monitoring.
- Processing integrity: Evaluates the accuracy, completeness, and reliability of the organization's processes and operations. This involves implementing checks and balances to validate the accuracy of data, as well as mechanisms to monitor data integrity.
- Confidentiality: Involves protecting sensitive information from unauthorized access, disclosure, or exposure. This includes implementing encryption, data masking, and other measures to prevent unauthorized users or entities from accessing or viewing confidential data.
- Privacy: Ensures customers' personal information is handled in compliance with relevant privacy regulations and standards. This involves implementing policies, procedures, and controls to protect individuals' privacy rights.

SOC 1 vs. SOC 2 vs. SOC 3: Head-to-Head Comparison

Understanding the key differences between SOC 1, SOC 2, and SOC 3 is essential for organizations looking to demonstrate their commitment to security and compliance. Below is a comparison highlighting various aspects of these controls.
Aspect | SOC 1 | SOC 2 | SOC 3
Scope | Financial controls | Operational and security controls | High-level operational controls
Target audience | Auditors, regulators | Customers, business partners | General audience
Focus area | Controls impacting the financial reporting of service organizations | Trust Services Criteria (security, availability, processing integrity, confidentiality, privacy) | Trust Services Criteria (security, availability, processing integrity, confidentiality, privacy)
Evaluation timeline | 6-12 months | 6-12 months | 3-6 months
Who needs to comply | Collection agencies, payroll providers, payment processing companies, etc. | SaaS companies, data hosting or processing providers, and cloud storage providers | Organizations that require SOC 2 compliance certification and want to use it to market to a general audience

Importance of SOC 2 Compliance in AWS

Understanding AWS's shared responsibility model is important when navigating SOC 2 compliance within AWS. This model outlines the respective responsibilities of AWS and its customers: AWS is responsible for securing the cloud infrastructure, while customers manage security in the cloud. This means customers are accountable for securing their data, applications, and services hosted on AWS. The model has crucial implications for SOC 2 compliance:

- Data security: As a customer, it's your responsibility to secure your data. This involves ensuring secure data transmission, implementing encryption, and controlling data access.
- Compliance management: You must ensure that your applications, services, and processes comply with SOC 2 requirements, which necessitates continuous monitoring and management.
- User access management: You are responsible for configuring AWS services to meet SOC 2 requirements, including permissions and security settings.
- Staff training: Ensure your team is adequately trained to follow AWS security best practices and SOC 2 requirements. This is necessary to prevent non-compliance caused by misunderstanding or misuse of AWS services.

Challenges of Achieving SOC 2 Compliance in AWS

Here are some challenges businesses face when working toward SOC 2 compliance on AWS:

- Complexity of AWS environments: Understanding the complex architecture of AWS setups requires in-depth knowledge and expertise. It can be challenging for businesses to ensure that all components are configured securely.
- Data protection and privacy: The dynamic nature of cyber threats and the need for comprehensive measures to prevent unauthorized access can make securing sensitive data in the AWS environment challenging.
- Evolving compliance requirements: Adapting to changing compliance standards requires constant monitoring and updating of policies and procedures, which can strain resources and expertise.
- Training and awareness: Ensuring that all personnel are adequately trained and aware of their roles and responsibilities in maintaining compliance can be difficult, particularly in large organizations with diverse teams and skill sets.
- Scalability: As AWS environments grow, ensuring security measures can scale effectively to meet increasing demands becomes complex. Scaling security measures with business growth while staying compliant adds another layer of complexity.

How Organizations Can Achieve SOC 2 Compliance for Their AWS Cloud Environments

Achieving SOC 2 compliance in AWS involves a structured approach built on security best practices. Here's a step-by-step guide:

1.
Assess Your Current Landscape Start by conducting a comprehensive assessment of your current AWS environment. Examine existing security processes and controls and identify potential vulnerabilities and compliance gaps against SOC 2 requirements. This stage includes conducting internal audits, risk assessments, and evaluating existing policies and procedures. 2. Identify Required Security Controls Develop a thorough security program detailing all security controls required to meet SOC 2 compliance. This includes measures for data protection, access controls, system monitoring, and more. You can also access the AWS SOC report via the AWS Artifact tool, which provides a comprehensive list of security controls. 3. Use AWS Tools for SOC 2 Compliance Leverage the suite of security tools AWS offers to facilitate SOC 2 compliance. These include: AWS Identity and Access Management (IAM): Administers access to AWS services and resources.AWS Config: Enables you to review, audit, and analyze the configurations of your AWS resources.AWS Key Management Service (KMS): Simplifies the creation and administration of cryptographic keys, allowing control over their usage across various AWS services and within your applications.AWS CloudTrail: Offers a record of AWS API calls made within your account. This includes activities executed via AWS SDKs, AWS Management Console, Command Line tools, and additional AWS services. 4. Develop Documentation of Security Policies Document your organization's security policies and procedures in alignment with SOC 2 requirements. This includes creating detailed documentation outlining security controls, processes, and responsibilities. 5. Enable Continuous Monitoring Implement continuous monitoring mechanisms to track security events and compliance status in real time. Use AWS services like Amazon GuardDuty, AWS Config, and AWS Security Hub to automate monitoring and ensure ongoing compliance with SOC 2 standards. Typical SOC 2 Compliance Process Timeline The SOC 2 compliance process usually spans 6 to 12 months. It consists of several phases, starting from preparation to achieving compliance: Preparation (1-2 months): This initial phase involves assessing current security practices and identifying gaps. Afterward, you can develop a plan to address the identified gaps while configuring AWS services and updating policies. Implementation (3-6 months): Execute the planned AWS configurations outlined in the preparation phase. Implement necessary security controls and measures to align with SOC 2 standards.Documentation (1-2 months): Gather documentation of the AWS environment, cataloging policies, procedures, and operational practices. Conduct an internal review to ensure documentation completeness and alignment with SOC-2 requirements. Auditing (1-2 months): Engage a qualified auditor with expertise in evaluating AWS environments for SOC-2 compliance. Collaborate with the chosen auditor to execute the audit process. After the audit, the auditor will provide a detailed SOC 2 report. Conclusion Achieving SOC 2 compliance in AWS requires planning, rigorous implementation, and an ongoing commitment to security best practices. Organizations can effortlessly navigate SOC 2 compliance by complying with the shared responsibility model, using AWS tools, and maintaining continuous vigilance. As cloud-hosted applications take over the digital space, prioritizing security and compliance becomes crucial. 
With the right approach and dedication, organizations can attain SOC 2 compliance and strengthen their position as a trusted party.
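As a small, concrete companion to the IAM point in step 3: least-privilege access policies are one of the building blocks auditors look for under the Security and Confidentiality criteria. The policy below is a hypothetical example (the bucket name, prefix, and statement IDs are placeholders) granting an application role read-only access to a single S3 prefix instead of blanket S3 permissions:

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyAppDataObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-app-bucket/app-data/*"
    },
    {
      "Sid": "ListAppDataPrefix",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::example-app-bucket",
      "Condition": { "StringLike": { "s3:prefix": ["app-data/*"] } }
    }
  ]
}

Scoping every role this narrowly, and reviewing those scopes regularly, produces exactly the kind of evidence a SOC 2 audit expects around access control.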

By Chase Bolt
Code of Shadows: Master Shifu and Po Use Functional Java to Solve the Decorator Pattern Mystery

It was a cold, misty morning at the Jade Palace. The silence was broken not by combat… but by a mysterious glitch in the logs. Po (rushing in): "Shifu! The logs… they're missing timestamps!" Shifu (narrowing his eyes): "This is no accident, Po. This is a breach in the sacred code path. The timekeeper has been silenced." Traditional OOP Decorator Shifu unfurled an old Java scroll: Java //Interface package com.javaonfly.designpatterns.decorator.oops; public interface Loggable { public void logMessage(String message); } //Implementation package com.javaonfly.designpatterns.decorator.oops.impl; import com.javaonfly.designpatterns.decorator.oops.Loggable; public class SimpleLogger implements Loggable { @Override public void logMessage(String message) { System.out.println(message); } } //Implementation class TimestampLogger implements Loggable { private Loggable wrapped; public TimestampLogger(Loggable wrapped) { this.wrapped = wrapped; } public void logMessage(String message) { String timestamped = "[" + System.currentTimeMillis() + "] " + message; wrapped.logMessage(timestamped); } } //Calling the decorator public class Logger { public static void main(String[] args){ Loggable simpleLogger = new SimpleLogger(); simpleLogger.logMessage("This is a simple log message."); Loggable timestampedLogger = new TimestampLogger(simpleLogger); timestampedLogger.logMessage("This is a timestamped log message."); } } //Output This is a simple log message. [1748594769477] This is a timestamped log message. Po: "Wait, we’re creating all these classes just to add a timestamp?" Shifu: "That is the illusion of control. Each wrapper adds bulk. True elegance lies in Functional Programming." Functional Decorator Pattern With Lambdas Shifu waved his staff and rewrote the scroll: Java package com.javaonfly.designpatterns.decorator.fp; import java.time.LocalDateTime; import java.util.function.Function; public class Logger { //higer order function public void decoratedLogMessage(Function<String, String> simpleLogger, Function<String, String> timestampLogger) { String message = simpleLogger.andThen(timestampLogger).apply("This is a log message."); System.out.println(message); } public static void main(String[] args){ Logger logger = new Logger(); Function<String, String> simpleLogger = message -> { System.out.println(message); return message; }; Function<String, String> timestampLogger = message -> { String timestampedMessage = "[" + System.currentTimeMillis() + "] " + ": " + message; return timestampedMessage; }; logger.decoratedLogMessage(simpleLogger, timestampLogger); } } //Output This is a log message. [1748595357335] This is a log message. Po (blinking): "So... no more wrappers, just function transformers?" Shifu (nodding wisely): "Yes, Po. In Functional Programming, functions are first-class citizens. The Function<T, R> interface lets us compose behavior. Each transformation can be chained using andThen, like stacking skills in Kung Fu." Breaking Down the Code – Functional Wisdom Explained Po (scratching his head): "Shifu, what exactly is this Function<T, R> thing? Is it some kind of scroll?" Shifu (gently): "Ah, Po. It is not a scroll. It is a powerful interface from the java.util.function package—a tool forged in the fires of Java 8." "Function<T, R> represents a function that accepts an input of type T and produces a result of type R." 
In our case: Java Function<String, String> simpleLogger This means: “Take a String message, and return a modified String message.” Each logger lambda—like simpleLogger and timestampLogger—does exactly that. The Art of Composition — andThen Po (eyes wide): "But how do they all work together? Like… kung fu moves in a combo?" Shifu (smiling): "Yes. That combo is called composition. And the technique is called andThen." Java simpleLogger.andThen(timestampLogger) This means: First, execute simpleLogger, which prints the message and passes it on.Then, take the result and pass it to timestampLogger, which adds the timestamp. This is function chaining—the essence of functional design. Java String message = simpleLogger .andThen(timestampLogger) .apply("This is a log message."); Like chaining martial arts techniques, each function passes its result to the next—clean, fluid, precise. Po: "So the message flows through each function like a river through stones?" Shifu: "Exactly. That is the way of the Stream." Functional Flow vs OOP Structure Shifu (serenely): "Po, unlike the OOP approach where you must wrap one class inside another—creating bulky layers—the functional approach lets you decorate behavior on the fly, without classes or inheritance." No need to create SimpleLogger, TimestampLogger, or interfaces.Just use Function<String, String> lambdas and compose them. The Secret to Clean Code “A true master does not add weight to power. He adds precision to purpose.” – Master Shifu This approach: Eliminates boilerplate.Encourages reusability.Enables testability (each function can be unit-tested in isolation).Supports dynamic behavior chaining. Po's New Move: Making the Logger Generic After mastering the basics, Po's eyes sparkled with curiosity. Po: "Shifu, what if I want this technique to work with any type—not just strings?" Shifu (with a deep breath): "Yes of course you can! Try to write it, Dragon warrior." Po meditated for a moment, and then rewrote the logger: Java public <T> void decoratedLogMessage(Function<T, T>... loggers) { Function<T, T> pipeline= Arrays.stream(loggers).sequential().reduce(Function.identity(), Function::andThen); T message = pipeline.apply((T) "This is a log message."); System.out.println(message); } Po (bowing): "Master Shifu, after learning to compose logging functions using Function<String, String>, I asked myself — what if I could decorate not just strings, but any type of data? Numbers, objects, anything! So I used generics and built this move..." Java public <T> void decoratedLogMessage(Function<T, T>... loggers) { "This declares a generic method where T can be any type — String, Integer, or even a custom User object. The method takes a varargs of Function<T, T> — that means a flexible number of functions that take and return the same type." Java Function<T, T> pipeline= Arrays.stream(loggers).sequential().reduce(Function.identity(), Function::andThen); "I stream all the logger functions and reduce them into a single pipeline function using Function::andThen. Function.identity() is the neutral starting point — like standing still before striking.Function::andThen chains each logger — like chaining combos in kung fu!" Java T message = pipeline.apply((T) "This is a log message."); I apply the final pipeline function to a sample input. Since this time I tested it with a String, I cast it as (T). But this method can now accept any type!" Shifu (smiling, eyes narrowing with pride): "You’ve taken the form beyond its scroll, Po. 
You have learned not just to use functions—but to respect their essence. This generic version... is the true Dragon Scroll of the Decorator."

Modified Code by Po

Java
package com.javaonfly.designpatterns.decorator.fp;

import java.util.Arrays;
import java.util.function.Function;

public class Logger {

    public <T> void decoratedLogMessage(Function<T, T>... loggers) {
        Function<T, T> pipeline = Arrays.stream(loggers)
                .sequential()
                .reduce(Function.identity(), Function::andThen);
        T message = pipeline.apply((T) "This is a log message.");
        System.out.println(message);
    }

    public static void main(String[] args) {
        Logger logger = new Logger();
        Function<String, String> simpleLogger = message -> {
            System.out.println(message);
            return message;
        };
        Function<String, String> timestampLogger = message -> {
            String timestampedMessage = "[" + System.currentTimeMillis() + "] " + message;
            return timestampedMessage;
        };
        Function<String, String> jadeLogger = message -> {
            String jadeLoggedMessage = "[jadelog] " + message;
            return jadeLoggedMessage;
        };
        logger.decoratedLogMessage(simpleLogger, timestampLogger, jadeLogger);
    }
}

//Output
This is a log message.
[jadelog] [1748598136677] This is a log message.

Wisdom Scroll: OOP vs. Functional Decorator

Feature | OOP Decorator | Functional Decorator
Needs class | Yes | No
Uses interface | Yes | Optional
Composability | Rigid | Elegant
Boilerplate | High | Minimal
Flexibility | Moderate | High (thanks to lambdas)

Final Words from Master Shifu

"Po, the world of code is full of distractions—designs that look powerful but slow us down. A true Kung Fu developer learns to adapt. To decorate without weight. To enhance without inheritance. To flow with functions, not fight the structure."
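One further refinement worth noting (it is not part of the original scroll): Po's generic version still needs an unchecked cast because the starting message is hard-coded inside the method. Passing the initial value in as a parameter removes the cast entirely. A minimal sketch, with hypothetical class and method names:

Java
import java.util.Arrays;
import java.util.function.Function;

public class SafeLogger {

    // Compose any number of same-type decorators into one pipeline and
    // apply it to a caller-supplied starting value; no cast required.
    @SafeVarargs
    public static <T> T decorate(T input, Function<T, T>... decorators) {
        Function<T, T> pipeline = Arrays.stream(decorators)
                .reduce(Function.identity(), Function::andThen);
        return pipeline.apply(input);
    }

    public static void main(String[] args) {
        Function<String, String> timestampLogger = m -> "[" + System.currentTimeMillis() + "] " + m;
        Function<String, String> jadeLogger = m -> "[jadelog] " + m;
        System.out.println(decorate("This is a log message.", timestampLogger, jadeLogger));
        // Prints something like: [jadelog] [1748598136677] This is a log message.
    }
}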

By Shamik Mitra
Want to Become a Senior Software Engineer? Do These Things

In my experience working with and leading software engineers, I have seen mid-level engineers produce outcomes worthy of a Senior, and Seniors who are only so in title. High-performing mid-levels eventually overtook under-performing Seniors. How you become a Senior Software Engineer matters. If you become a Senior because you're the last person standing or the one with the longest tenure, I am afraid that future upward movement may be challenging, especially if you decide to go elsewhere.

I have been fortunate to directly mentor a couple of engineers to become Senior, and to witness the journey of others. In this article, I am going to discuss the day-to-day activities that distinguish the best and how you can join their ranks.

Know the Basics

Bruce Lee famously said: "I fear not the man who has practised 10,000 kicks once, but I fear the man who has practised one kick 10,000 times." This is a homage to the importance of getting the basics right.

1. Write Clean Code

If you want to become a Senior Software Engineer, you have to write clean and reliable code. The pull request you authored should not look like a Twitter thread because of the myriad of corrections. Your code contributions should completely address the assigned task. If the task is to create a function that sums two numbers, don't stop at the + operation: add validations, take care of null cases, and use the correct data types in the function parameters. Think about number overflow and other edge cases. That is what it means to have your code contribution address the task at hand completely. Pay attention to the coding standard and ensure your code changes adhere to it.

When you create pull requests that do not require too many corrections, work as expected, and more, you'll be able to complete more tasks per sprint and become one of the top contributors on the team. You see where this is going already.

You should pay attention to the smallest details in your code. Perform null checks, and use the appropriate data types. For example, in Java, do not use Integer everywhere just because you can; it takes more memory and may impair the performance of your application in production. Instead of writing multiple nested if...else constructs, use early return.

Don't do this:

Java
public boolean sendEmailToUser(User user) {
    if(user != null && !user.getEmail().isEmpty()) {
        String template = "src/path/to/email/template";
        template = template
            .replace("username", user.getFirstName() + " " + user.getLastName())
            .replace("link", "https://password-reset.example.com");
        emailService.sendEmail(template);
        return true;
    }
    return false;
}

Do this instead. It's cleaner and more readable:

Java
public boolean sendEmailToUser(User user) {
    if(user == null || user.getEmail().isEmpty()) {
        return false;
    }
    String template = "src/path/to/email/template";
    template = template
        .replace("username", user.getFirstName() + " " + user.getLastName())
        .replace("link", "https://password-reset.example.com");
    emailService.sendEmail(template);
    return true;
}

Ensure you handle different scenarios in your logic. If you are making external HTTP calls, make sure there's exception handling that caters to 5XX and 4XX responses. Validate that the return payload has the expected data points, and implement retry logic where applicable. Write the simplest and most performant version of your logic. Needlessly fanciful and complicated code only impresses one person: your current self. Your future self will wonder what on earth you had to drink the day you wrote that code.
Say less about how other people will perceive it down the line. What typically happens to such complicated code, which is not maintainable, is that it gets rewritten and deprecated. So, if your goal is to leave a legacy behind, needlessly complicated, non-performant, hard-to-maintain code will not help. If you're using reactive Java programming, please do not write deeply nested code - the callback hell of JavaScript. Use functional programming to separate the different aspects and have a single clean pipeline. 2. Write Readable Code In addition to writing clean code, your code should be readable. Don't write code as if you're a minifier of some sort. Use white-space properly. Coding is akin to creating art. Write beautiful code that others want to read. Use the right variable names. var a = 1 + 2; might make sense now, until you need to troubleshoot and then begin to wonder what on earth is a. Now, you have to run the application in debug mode and observe the values to decode what it means. This extra step (read extra minutes or hours) could have been avoided from the outset. Write meaningful comments and Javadoc. Please don't do this and call it a Javadoc: Java /** * @author smatt */ We will be able to tell you're the author of the code when we do Git Blame. Therefore, kindly stop adding a Javadoc to a method or class just to put your name there. You're contributing to the company's codebase and not an open-source repo on GitHub. Moreover, if your contribution is substantial enough, we will definitely remember you wrote it. Writing meaningful comments and Javadoc is all the more necessary when you're writing special business logic. Your comment or Javadoc can be the saving grace for your future self or colleague when that business logic needs to be improved. I once spent about 2 weeks trying to understand the logic for generating "identifiers". It wasn't funny. Brilliant logic, but it took me weeks to appreciate it. A well-written Javadoc and documentation could have saved me some hours. Avoid spelling mistakes in variable names, comments, function names, etc. Unless your codebase is not in English, please use comprehensible English variable names. We should not need Alan Turing to infer the function of a variable, a method, or a class from its name. Think about it, this is why the Java ecosystem seems to have long method names. We would rather have long names with explicit meaning than require a codex to translate one. Deepen Your Knowledge Software engineering is a scientific and knowledge-based profession. What you know counts a lot towards your growth. What you know how to do is the currency of the trade. If you want to become a Senior Software Engineer, you need to know how to use the tools and platforms employed in your organization. I have interviewed great candidates who did not get the maximum available offer because they only knew as far as committing to the production branch. When it comes to how the deployment pipeline works, how the logs, alerts, and other observability component works, they don't know; "The DevOps team handles that one." As a Senior Software Engineer, you need to be able to follow your code everywhere from the product requirement, technical specification, slicing, refinement, writing, code reviews, deployment, monitoring, and support. This is when you establish your knowledge and become a "Senior". Your organization uses Kibana or Grafana for log visualization, New Relic, Datadog, etc. Do you know how to filter for all the logs for a single service? 
Do you know how to view the logs for a single HTTP request? Let's say you have an APM platform, such as Datadog, New Relic, or Grafana. Do you know how to set up alerts? Can you interpret an alert, or do you believe your work is limited to writing code and merging to master? While every other thing should be handled by "other people." If you want to become a Senior Software Engineer, you have to learn how these things are set up, how they work, be able to fix them if they break, and improve them too. Currently, you're not a Senior Software Engineer, but have you ever wondered what your "Senior Engineer" or "Tech Lead" had to do before assigning a task to you? There are important steps that happen before and after the writing of the code. It is expected that a Senior Software Engineer should know them and be able to do them well. If your company writes technical specifications, observe refinement sessions, poker planning, or ticket slicing. Don't be satisfied just being in attendance. Attach yourself to someone who's already leading these and ask to help them out. When given the opportunity, pour your heart into it. Get feedback, and you become better over time. If you want to become a Senior Software Engineer, be the embodiment of your organization's coding standard. If there's none — jackpot! Research and implement one. In this process, you'll move from someone who ONLY executes to someone who's involved in the execution planning and thus ready for more responsibility, a.k.a. the Senior Software Engineer role. Still on deepening your knowledge, you should know how the application in your custody works. One rule of thumb I have for myself is this: "Seun, if you leave this place today and someone asks you to come and build similar things, will you be able to do it?". It's a simple concept but powerful. If your team has special implementations and logic somewhere that's hard to understand, make it your job to understand them. Be the guy who knows the hard stuff. There's a special authentication mechanism, and you're the one who knows all about it. There's a special user flow that gets confusing, be the one who knows about it off-hand. Be the guy in Engineering who knows how the CI/CD pipeline works and is able to fix issues. Ask yourself this: Do you know how your organization's deployment pipeline works, or do you just write your code and pass it on to someone else to deploy? Without deepening your knowledge, you will not be equipped to take on more responsibilities, help during the time of an incident, or proffer solutions to major pain points. Be the resident expert, and you can be guaranteed that your ascension will be swift. Be Proactive and Responsible I have once interviewed someone who seems to have a good grasp of the basics and the coding aspect. However, we were able to infer from their submissions that they've never led any project before. While they may be a good executor, they're not yet ready for the Senior role. Volunteer and actively seek opportunities to lead initiatives and do hard things in your organization. Your team is about to start a new project or build a new feature? Volunteer to be the technical owner. When you are given the opportunity, give it your best. Write the best technical specification there is, and review pull requests from other people in time. Submit the most code contributions, organize and carry out water-tight user acceptance tests. 
When the feature/project is in production, follow up with it to ensure it is doing exactly what it is supposed to do and delivering value for the business. Do these, and now you have things to reference when your name comes up for a promotion. Furthermore, take responsibility for team and organizational challenges. For example, no one wants to update a certain codebase because the test cases are flaky and boring. Be the guy who fixes that without being asked. Of course, solving a problem of that magnitude shows you as a dependable team member who is hungry for more. Another example, your CI/CD pipeline takes 60 minutes to run. Why not be the person who takes some time out to optimize it? If you get it from 60 minutes to 45 minutes, that's a 25% improvement in the pipeline. If we compute the number of times the job has to run per day, and multiply that by 5 days a week. We are looking at saving 375 minutes of man-hours per day. Holy karamba! That's on a single initiative. Now, that's a Senior-worthy outcome. I'm sure, if you look at your Engineering organization, there are countless issues to fix and things to improve. You just need to do them. Another practical thing you can do to demonstrate proactivity and responsibility is simply being available. You most likely have heard something about "the available becomes the desire." There's an ongoing production incident and a bridge call. Join the call. Try to contribute as much as possible to help solve the problem. Attend the post-incident review calls and contribute. You see, by joining these calls, you'll see and learn how the current Seniors troubleshoot issues. You'll see how they use the tools and you'll learn a thing or two. It may not be a production incident, it may be a customer support ticket, or an alert on Slack about something going wrong. Don't just "look and pass" or suddenly go offline. Show some care, attempt to fix it, and you can only get better at it. The best thing about getting better at it is that you become a critical asset to the company. While it is true that anyone is expendable, provided the company is ready to bear the cost. I have also been in a salary review session where some people get a little bit more than others, on the same level, because they're considered a "critical asset." It's a thing, and either you know it or not, it applies. Be proactive (do things without being told to do so) and be responsible (take ownership), and see yourself grow beyond your imagination. Be a Great Communicator Being a great communicator is crucial to your career progression to the Senior Software Engineer role. The reason is that you work with other human beings, and they are not mind readers. Are you responding to the customer support ticket or responding to an alert on Slack? Say so in that Slack thread so other people can do something else. Are you blocked? Please say so. Mention what exactly you have tried and ask for ideas. We don't want to find out on the eve of the delivery date that you've been stuck for the past 1 week, and so the team won't be able to meet their deadline. When other people ask for help, and you are able to, please unblock them. You only get better by sharing with your team. Adopt open communication as much as possible. It will save you from having to private message 10 different people, on the same subject, to reach a decision. Just start a Slack thread in the channel, and everyone can contribute and reach a decision faster. It also helps with accountability and responsibility. What If? 
Seun Matt, "What if I do all these things and still do not get promoted? I am the resident expert, I know my stuff, I am proactive, and I am a great communicator. I have the basics covered, and I even set the standards. Despite all of this, I have not been promoted in years." I hear you loud and clear, my friend. We have all been there at some point. There are times when an organization is not able to give pay raises or promotions due to economic challenges, lack of profitability, and other prevailing market forces. Remember, the companies we work for do not print money; it is from their profit that they fund promotions, raises, and bonuses. For you, it is a win-win situation no matter how you look at it. The skill sets you now have and the path you have taken are all yours, and they are applicable at the next company. In this profession, your next salary is influenced by what you're doing at your current place and what you've done in the past. So, even if it does not work out where you are now, when you put yourself out there, you'll get a better offer, all things being equal.
Conclusion
No matter where you work, ensure you gain good experience. "I have done it before" trumps the number of years. In Software Engineering, experience is not just about the number of years you've worked in a company or have been coding. Experience is about the number of challenges you have solved yourself and how many "I have seen this before" moments you have under your belt. Being passive or just clocking in 9–5 will not get you that level of experience. You need to participate and lead. The interesting part is that your next salary in your next job will be determined by the experience you're garnering in your current work. Doing all of the above goes beyond the call of duty. It requires extra work, and it has extra rewards for those able to pursue it. Stay curious, see you at the top, and happy coding! Want to learn how to effectively evaluate the performance of Engineers? Watch the video below. Video

By Seun Matt DZone Core CORE
Understanding the Circuit Breaker: A Key Design Pattern for Resilient Systems

Reliability is critical, especially when services are interconnected and a failure in one component can cascade to other services. The Circuit Breaker Pattern is an important design pattern for building fault-tolerant and resilient systems, particularly in microservices architectures. This article explains the fundamentals of the circuit breaker pattern, its benefits, and how to implement it to protect your systems from failure.
What Is the Circuit Breaker Pattern?
The Circuit Breaker Pattern is inspired by the electrical circuit breakers in your home, which prevent damage by detecting faults and stopping the flow of electricity when problems occur. In software, this pattern monitors service interactions and prevents continuous calls and retries to a failing service, which would otherwise overload an already struggling service. By "breaking" the circuit between services, the pattern allows a system to handle failures gracefully and avoid cascading problems.
How Does It Actually Work?
State diagram showing the different states of the circuit breaker pattern
The circuit breaker has three distinct states: Closed, Open, and Half-Open.
Closed State: Normally, the circuit breaker is "closed," meaning requests flow as usual between services (in electrical terms, the wires are connected and electricity flows). If the failure rate crosses a configured threshold, the breaker trips to the Open state.
Open State: When the circuit breaker is open, it immediately rejects requests to the failing service, preventing further stress on the service and giving it time to recover. During this time, fallback mechanisms can be triggered, such as returning cached data or default responses.
Half-Open State: After a defined timeout, the circuit breaker switches to the half-open state and allows a limited number of requests through to determine whether the service has recovered. If those requests succeed, the circuit breaker is closed again; otherwise, it goes back to the open state.
The main idea behind this design pattern is to prevent a failing service from pulling down the entire system and to provide a path to recovery once the service becomes healthy again.
Electrical analogy to remember the open and closed states
Why Use the Circuit Breaker Pattern?
In complex distributed systems, failures are unavoidable. Here are some practical reasons why the circuit breaker pattern is essential:
Preventing Cascading Failures: In a microservices architecture, if one service fails and others depend on it, the failure can spread across the entire system. The circuit breaker stops this by isolating the faulty service.
Improving System Stability: By stopping requests to a failing service, the circuit breaker prevents resource exhaustion and lowers the load on dependent services, helping to stabilize the system.
Better UX: Instead of letting requests hang for too long or return unhandled errors, the circuit breaker allows for graceful degradation by serving fallback responses, improving the user experience even during failures.
Automated Recovery: The half-open state allows the system to automatically test the health of a service and recover without manual intervention.
How to Implement the Circuit Breaker Pattern
The implementation of the circuit breaker pattern depends on the specific stack you're using, but the standard approach remains the same. Below is a high-level overview of how to implement it:
Set Failure Thresholds: Define the conditions under which the circuit breaker should open.
This can be based on consecutive failures, error rates, or timeouts.
Monitor Requests: Continuously track the success or failure of requests to a service. If the failure threshold is reached, trip the circuit breaker.
Handle Open State: When the circuit breaker is open, reject further requests to the service and trigger fallback mechanisms.
Implement Half-Open State: After a timeout, let a limited number of requests hit the service to test whether it has recovered. If they succeed, close the circuit breaker.
Provide Fallback Mechanisms: During failures, fallback mechanisms can provide default responses, use cached data, or switch to alternate services.
The following example demonstrates how to implement a circuit breaker in Java using the widely adopted Resilience4j library. Resilience4j is a powerful Java library designed to help you implement resilience patterns, such as the Circuit Breaker, Rate Limiter, Retry, Bulkhead, and Time Limiter patterns. One of the main advantages of Resilience4j is its flexibility and easy configuration. Correct configuration of these resilience patterns allows developers to fine-tune their systems for maximum fault tolerance, improved stability, and better performance in the face of errors.
Java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try;               // Vavr's Try (io.vavr:vavr), used for the fallback handling below
import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {
    public static void main(String[] args) {
        // Create a custom configuration for the Circuit Breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofSeconds(5))
            .ringBufferSizeInHalfOpenState(5)
            .ringBufferSizeInClosedState(20)
            .build();

        // Create a CircuitBreakerRegistry with a custom global configuration
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);

        // Get or create a CircuitBreaker from the CircuitBreakerRegistry
        CircuitBreaker circuitBreaker = registry.circuitBreaker("myService");

        // Decorate the service call with the circuit breaker
        // (myService represents the client of the remote service being protected; its definition is not shown)
        Supplier<String> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, myService::call);

        // Execute the decorated supplier and handle the result with a fallback
        Try<String> result = Try.ofSupplier(decoratedSupplier)
            .recover(throwable -> "Fallback response");

        System.out.println(result.get());
    }
}
In this example, the circuit breaker is configured to open if 50% of the requests fail. It stays open for 5 seconds before entering the half-open state, during which it allows 5 requests through to test the service. If those requests succeed, the circuit breaker closes, allowing normal operation to resume.
Important Configuration Options for Circuit Breaker in Resilience4j
Resilience4j provides a flexible and robust implementation of the Circuit Breaker Pattern, allowing developers to configure various aspects to tailor the behavior to their application's needs. Correct configuration is crucial to balancing fault tolerance, system stability, and recovery mechanisms. Below are the key configuration options for Resilience4j's Circuit Breaker:
1. Failure Rate Threshold: This is the percentage of failed requests that will cause the circuit breaker to transition from the Closed state (normal operation) to the Open state (where requests are blocked). Its purpose is to control when the circuit breaker should stop forwarding requests to a failing service.
For example, a threshold of 50% means the circuit breaker will open after half of the requests fail.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Open the circuit when 50% of requests fail
    .build();
2. Wait Duration in Open State: The time the circuit breaker remains in the Open state before it transitions to the Half-Open state, where it starts allowing a limited number of requests to test whether the service has recovered. This prevents retrying failed services immediately, giving the downstream service time to recover before it is tested again.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Wait for 30 seconds before transitioning to Half-Open
    .build();
3. Ring Buffer Size in Closed State: The number of requests that the circuit breaker records while in the Closed state (before failure rates are evaluated). This acts as a sliding window for error monitoring and helps the circuit breaker determine the failure rate based on recent requests. A larger ring buffer size means more data points are considered when deciding whether to open the circuit.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .ringBufferSizeInClosedState(50) // Consider the last 50 requests to calculate the failure rate
    .build();
4. Ring Buffer Size in Half-Open State: The number of permitted requests in the Half-Open state before deciding whether to close the circuit or revert to the Open state based on success or failure rates. It determines how many requests will be tested in the half-open state to decide whether the service is stable enough to close the circuit or is still failing.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .ringBufferSizeInHalfOpenState(5) // Test with 5 requests in Half-Open state
    .build();
5. Sliding Window Type and Size: Defines how failure rates are measured: either with a count-based sliding window or a time-based sliding window. This provides flexibility in how failure rates are computed. A count-based window is useful in high-traffic systems, whereas a time-based window works well in low-traffic environments.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100) // Use a count-based window with the last 100 requests
    .build();
6. Minimum Number of Calls: Specifies the minimum number of requests required before the failure rate is evaluated. This prevents the circuit breaker from opening prematurely when there isn't enough data to calculate a meaningful failure rate, especially during low traffic.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .minimumNumberOfCalls(20) // Require at least 20 calls before evaluating failure rate
    .build();
7. Permitted Number of Calls in Half-Open State: The number of requests allowed to pass through in the Half-Open state to check whether the service has recovered. After transitioning to the half-open state, this setting controls how many requests are allowed through to evaluate service recovery. A smaller value catches issues faster, while a larger value helps ensure that temporary blips don't result in reopening the circuit.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .permittedNumberOfCallsInHalfOpenState(5) // Test recovery with 5 requests
    .build();
8. Slow Call Duration Threshold: Defines the threshold for a slow call. Calls taking longer than this threshold are considered "slow" and can contribute to the failure rate.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slowCallDurationThreshold(Duration.ofSeconds(2)) // Any call over 2 seconds is considered slow
    .build();
9. Slow Call Rate Threshold: The percentage of "slow" calls that will trigger the circuit breaker to open, similar to the failure rate threshold. This detects services that are degrading in performance before they fail outright, allowing systems to respond to performance issues early.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slowCallRateThreshold(50) // Open the circuit when 50% of calls are slow
    .build();
10. Automatic Transition from Open to Half-Open: Controls whether the circuit breaker automatically transitions from the Open state to the Half-Open state after the configured wait duration. This enables the system to recover automatically by testing the service periodically, avoiding the need for manual intervention.
Java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .automaticTransitionFromOpenToHalfOpenEnabled(true) // Enable automatic transition
    .build();
11. Fallback Mechanism: Configures fallback actions for when the circuit breaker is open and requests are blocked. This prevents cascading failures and improves UX by serving cached data or default responses.
Java
Try<String> result = Try.ofSupplier(
    CircuitBreaker.decorateSupplier(circuitBreaker, service::call)
).recover(throwable -> "Fallback response");
Conclusion
The Circuit Breaker Pattern is a vital tool in building resilient, fault-tolerant systems. By preventing cascading failures, improving system stability, and enabling graceful recovery, it plays a crucial role in modern software architecture, especially in microservices environments. Whether you're building a large-scale enterprise application or a smaller distributed system, the circuit breaker can be a game-changer in maintaining reliable operations under failure conditions.
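For readers who want to see the three-state life cycle without any library, here is a minimal, illustrative sketch of the state machine described above. It is not Resilience4j and is not production-ready (it is single-threaded and counts only consecutive failures); the class and field names such as SimpleCircuitBreaker and failureThreshold are invented for illustration only.
Java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Illustrative only: a single-threaded circuit breaker with Closed, Open, and Half-Open states.
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final Duration openTimeout;    // how long to stay open before probing
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openTimeout) {
        this.failureThreshold = failureThreshold;
        this.openTimeout = openTimeout;
    }

    public <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            // After the timeout, allow a single probe request (Half-Open); otherwise fail fast.
            if (Duration.between(openedAt, Instant.now()).compareTo(openTimeout) >= 0) {
                state = State.HALF_OPEN;
            } else {
                return fallback.get();
            }
        }
        try {
            T result = protectedCall.get();
            // Success: close the circuit and reset the failure counter.
            state = State.CLOSED;
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            return fallback.get();
        }
    }
}
A caller would wrap the remote call, for example breaker.call(() -> client.fetch(url), () -> cachedResponse), which mirrors what Resilience4j's decorateSupplier does with far more robustness (thread safety, sliding windows, metrics, and events).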

By Narendra Lakshmana gowda
AI Agent Architectures: Patterns, Applications, and Implementation Guide

Architecture is something I am very much interested in. As I was exploring AI agents, I was curious to understand the agentic architectures. That led me to this awesome resource, The 2025 Guide to AI Agents, published by IBM on their Think page. One of the sections of the guide is around architecture. The architecture section explains that agentic architecture refers to the design and structure enabling AI agents to automate workflows, reason through tasks, and utilize tools to achieve their objectives. This architecture is built to support autonomous, goal-driven behavior by allowing agents to perceive their environment, process information, and act independently within defined rules and constraints. It often incorporates frameworks that facilitate collaboration between multiple agents, known as multi-agent systems, and provide the necessary infrastructure for integrating with external tools, APIs, and data sources. By leveraging agentic architecture, organizations can create scalable, flexible AI solutions that automate complex business processes and adapt to changing requirements.
Introduction to AI Agent Architectures
AI agent architectures provide structural blueprints for designing intelligent systems that perceive environments, process information, and execute actions. These frameworks define how components interact, manage data flow, and make decisions, critically impacting performance, scalability, and adaptability. As AI systems evolve from narrow applications to complex reasoning engines, architectural choices determine their ability to handle uncertainty, integrate new capabilities, and operate in dynamic environments. This guide explores essential patterns with practical implementation insights. Here are some core architecture patterns:
1. Orchestrator-Worker Architecture
The orchestrator-worker pattern represents a centralized approach to task management where a single intelligent controller (orchestrator) maintains global oversight of system operations. This architecture excels at decomposing complex problems into manageable subtasks, distributing them to specialized worker agents, and synthesizing partial results into complete solutions. The orchestrator serves as the system's "brain," making strategic decisions about task allocation, monitoring worker performance, and implementing fallback strategies when errors occur. Workers operate as domain-specific experts, focusing solely on executing their assigned tasks with maximum efficiency. This separation of concerns enables parallel processing while maintaining centralized control, particularly valuable when auditability, reproducibility, or coordinated error recovery are required.
Orchestrator worker
Concept
The central coordinator decomposes tasks, assigns subtasks to specialized workers, and synthesizes results.
Key Components
Orchestrator (task decomposition/assignment)
Worker pool (specialized capabilities)
Task queue (work distribution)
Result aggregator
When to Use
Complex workflows requiring multiple capabilities
Systems needing centralized monitoring
Applications with parallelizable tasks
Real-World Case
Banking Fraud Detection: Orchestrator routes transactions to workers analyzing patterns, location data, and behavior history. Suspicious cases trigger human review.
2. Hierarchical Architecture
Hierarchical architectures model organizational command structures by arranging decision-making into multiple layers of abstraction.
At the highest level, strategic planners operate with long-term horizons and broad objectives, while successive layers handle progressively more immediate concerns until reaching real-time actuators at the base level. This architecture naturally handles systems where different time scales of decision-making coexist; for example, an autonomous vehicle simultaneously plans a multi-day route (strategic), navigates city blocks (tactical), and adjusts wheel torque (execution). Information flows bi-directionally: sensor data aggregates upward through abstraction layers while commands propagate downward with increasing specificity. The hierarchy provides inherent fail-safes, as lower layers can implement emergency behaviors when higher-level planning becomes unresponsive.
Concept
Multi-layered control with increasing abstraction levels (strategic → tactical → execution).
Key Components
Strategic layer (long-term goals)
Tactical layer (resource allocation)
Execution layer (real-time control)
Feedback loops between layers
Hierarchical
When to Use
Systems with natural command chains
Problems requiring different time-scale decisions
Safety-critical applications
Real-World Case
Smart Factory: The strategic layer optimizes quarterly production, the tactical layer schedules weekly shifts, and the execution layer controls robotic arms in real time.
3. Blackboard Architecture
The Blackboard pattern mimics human expert panels solving complex problems through collaborative contribution. At its core lies a shared data space (the blackboard), where knowledge sources — independent specialists such as image recognizers, database query engines, or statistical analyzers — post partial solutions and read others' contributions. Unlike orchestrated systems, no central controller directs the problem-solving; instead, knowledge sources activate opportunistically when their expertise becomes relevant to the evolving solution. This emergent behavior makes blackboard systems uniquely suited for ill-defined problems where solution paths are unpredictable, such as medical diagnosis or scientific discovery. The architecture naturally accommodates contradictory hypotheses (represented as competing entries on the blackboard) and converges toward consensus through evidence accumulation.
Concept
Independent specialists contribute to a shared data space ("blackboard"), collaboratively evolving solutions.
Key Components
Blackboard (shared data repository)
Knowledge sources (specialized agents)
Control mechanism (activation coordinator)
Blackboard architecture
When to Use
Ill-defined problems with multiple approaches
Diagnostic systems requiring expert collaboration
Research environments
Real-World Case
Oil Rig Monitoring: Geologists, engineers, and equipment sensors contribute data to predict maintenance needs and drilling risks.
4. Event-Driven Architecture
Event-driven architectures treat system state changes as first-class citizens, with components reacting to asynchronous notifications rather than polling for updates. This paradigm shift enables highly responsive systems that scale efficiently under variable loads. Producers (sensors, user interfaces, or other agents) emit events when state changes occur — a temperature threshold breach, a new chat message arrival, or a stock price movement. Consumers subscribe to relevant events through a message broker, which handles routing, persistence, and delivery guarantees.
The architecture's inherent decoupling allows components to evolve independently, making it ideal for distributed systems and microservices. Event sourcing variants maintain complete system state as an ordered log of events, enabling time-travel debugging and audit capabilities unmatched by traditional architectures.
Concept
Agents communicate through asynchronous events triggered by state changes.
Key Components
Event producers (sensors/user inputs)
Message broker (event routing)
Event consumers (processing agents)
State stores
Event-driven architecture
When to Use
Real-time reactive systems
Decoupled components with independent scaling
IoT and monitoring applications
Real-World Case
Smart Building: Motion detectors trigger lighting adjustments, energy price changes activate HVAC optimization, and smoke sensors initiate evacuation protocols.
5. Multi-Agent Systems (MAS)
Multi-agent systems distribute intelligence across autonomous entities that collaborate through negotiation rather than central command. Each agent maintains its own goals, knowledge base, and decision-making processes, interacting with peers through standardized protocols like contract net (task auctions) or voting mechanisms. This architecture excels in environments where central control is impractical, such as disaster response robots exploring rubble, blockchain oracles providing decentralized data feeds, or competing traders in financial markets. MAS implementations carefully balance local autonomy against global coordination needs through incentive structures and communication protocols. The architecture's resilience comes from redundancy — agent failures rarely cripple the system — while emergent behaviors can produce innovative solutions unpredictable from individual agent designs.
Multi-agent systems
Concept
Autonomous agents collaborate through negotiation to achieve individual or collective goals.
Key Components
Autonomous agents
Communication protocols (FIPA/ACL)
Coordination mechanisms (auctions/voting)
Environment model
When to Use
Distributed problems without a central authority
Systems requiring high fault tolerance
Competitive or collaborative environments
Real-World Case
Port Logistics: Cranes, trucks, and ships negotiate berthing schedules and container transfers using contract-net protocols.
6. Reflexive vs. Deliberative Architectures
These contrasting paradigms represent two fundamental approaches to agent decision-making. Reflexive architectures implement direct stimulus-response mappings through condition-action rules ("if temperature > 100°C then shutdown"), providing ultra-fast reactions at the cost of contextual awareness. They excel in safety-critical applications like industrial emergency stops or network intrusion prevention. Deliberative architectures instead maintain internal world models, using planning algorithms to sequence actions toward goals while considering constraints. Though computationally heavier, they enable sophisticated behaviors like supply chain optimization or clinical treatment planning. Hybrid implementations often layer reflexive systems atop deliberative bases — autonomous vehicles use deliberative route planning but rely on reflexive collision avoidance when milliseconds matter.
Reflexive
Concept
Direct stimulus-response mapping without internal state.
Structure: Condition-action rules
Use: Time-critical reactions
Case: Industrial E-Stop - Immediately cuts power when a safety breach is detected
Deliberative
Concept
Internal world model with planning/reasoning.
Structure: Perception → Model Update → Planning → Action
Use: Complex decision-making
Case: Supply Chain Optimization - Simulates multiple scenarios before committing resources
Hybrid Approach
Autonomous Vehicles: Reflexive layer handles collision avoidance while the deliberative layer plans routes.
7. Memory-Augmented Architectures
Memory-augmented architectures explicitly separate processing from knowledge retention, overcoming the context window limitations of stateless systems. These designs incorporate multiple memory systems: working memory for immediate task context, episodic memory for experience recording, and semantic memory for factual knowledge. Retrieval mechanisms range from simple keyword lookup to sophisticated vector similarity searches across embedding spaces. The architecture enables continuous learning, as new experiences update memory content without requiring model retraining, and supports reasoning across extended timelines. Modern implementations combine neural networks with symbolic knowledge graphs, allowing both pattern recognition and logical inference over memorized content. This proves invaluable for applications like medical diagnosis systems that must recall patient histories while staying current with the latest research.
Concept
Agents with explicit memory systems for long-term context.
Key Components
Short-term memory (working context)
Long-term memory (vector databases/knowledge graphs)
Retrieval mechanisms (semantic search)
Memory update policies
When to Use
Conversational agents require context
Systems needing continuous learning
Applications leveraging historical data
Real-World Case
Medical Assistant: Recalls patient history, researches latest treatments, and maintains consultation context across sessions.
Architecture Selection Table
Architecture | Best For | Strengths | Limitations | Implementation Complexity
Orchestrator-Worker | Complex task coordination | Centralized control, auditability | Single point of failure | Medium
Hierarchical | Large-scale systems | Clear responsibility chains | Communication bottlenecks | High
Blackboard | Collaborative problem-solving | Flexible expertise integration | Unpredictable timing | High
Event-Driven | Real-time reactive systems | Loose coupling, scalability | Event tracing difficulties | Medium
Multi-Agent | Distributed environments | High fault tolerance | Coordination complexity | High
Reflexive | Time-critical responses | Low latency, simplicity | Limited intelligence | Low
Deliberative | Strategic planning | Sophisticated reasoning | Computational overhead | High
Memory-Augmented | Contextual applications | Long-term knowledge retention | Memory management costs | Medium-High
Conclusion
The most effective implementations combine patterns strategically, such as using hierarchical organization for enterprise-scale systems with event-driven components for real-time responsiveness, or memory-augmented orchestrators that manage specialized workers. As AI systems advance, architectures will increasingly incorporate self-monitoring and dynamic reconfiguration capabilities, enabling systems that evolve their own organization based on performance requirements. Selecting the right architectural foundation remains the most critical determinant of an AI system's long-term viability and effectiveness. For AI Developer tools, check my article here.
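To ground the first of these patterns in code, here is a deliberately tiny, illustrative orchestrator-worker sketch. It is not from the IBM guide, and every name in it (FraudOrchestrator, checkPattern, and so on) is invented; it simply shows an orchestrator decomposing a task, fanning the subtasks out to specialized workers in parallel, and aggregating their partial results, echoing the banking fraud example above.
Java
import java.util.List;
import java.util.concurrent.*;

// Illustrative orchestrator-worker sketch: the orchestrator fans a transaction out to
// specialized workers in parallel and aggregates their verdicts.
public class FraudOrchestrator {

    record Verdict(String worker, boolean suspicious) {}

    private final ExecutorService workerPool = Executors.newFixedThreadPool(3);

    public boolean review(String transactionId) throws Exception {
        // Task decomposition: each worker checks one aspect of the transaction.
        List<Callable<Verdict>> subtasks = List.of(
            () -> new Verdict("pattern-analysis", checkPattern(transactionId)),
            () -> new Verdict("location-check", checkLocation(transactionId)),
            () -> new Verdict("behavior-history", checkHistory(transactionId))
        );

        // Distribute work to the worker pool and wait for all partial results.
        List<Future<Verdict>> results = workerPool.invokeAll(subtasks);

        // Result aggregation: flag the transaction if any worker raises a concern.
        boolean flagged = false;
        for (Future<Verdict> f : results) {
            Verdict v = f.get();
            System.out.println(v.worker() + " -> suspicious=" + v.suspicious());
            flagged |= v.suspicious();
        }
        return flagged; // true would trigger human review in the banking example
    }

    // Stub workers; real implementations would call models, rules engines, or data stores.
    private boolean checkPattern(String txId)  { return txId.hashCode() % 7 == 0; }
    private boolean checkLocation(String txId) { return false; }
    private boolean checkHistory(String txId)  { return false; }

    public static void main(String[] args) throws Exception {
        FraudOrchestrator orchestrator = new FraudOrchestrator();
        System.out.println("Needs human review: " + orchestrator.review("txn-42"));
        orchestrator.workerPool.shutdown();
    }
}
In a real agentic system, the orchestrator would also own retries, timeouts, and fallback strategies for misbehaving workers, which is exactly the centralized control the pattern is chosen for.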

By Vidyasagar (Sarath Chandra) Machupalli FBCS DZone Core CORE
The Missing Infrastructure Layer: Why AI's Next Evolution Requires Distributed Systems Thinking

The recent announcement of KubeMQ-Aiway caught my attention not as another AI platform launch, but as validation of a trend I've been tracking across the industry. After spending the last two decades building distributed systems and the past three years deep in AI infrastructure consulting, the patterns are becoming unmistakable: we're at the same inflection point that microservices faced a decade ago. The Distributed Systems Crisis in AI We've been here before. In the early 2010s, as monolithic architectures crumbled under scale pressures, we frantically cobbled together microservices with HTTP calls and prayed our systems wouldn't collapse. It took years to develop proper service meshes, message brokers, and orchestration layers that made distributed systems reliable rather than just functional. The same crisis is unfolding with AI systems, but the timeline is compressed. Organizations that started with single-purpose AI models are rapidly discovering they need multiple specialized agents working in concert, and their existing infrastructure simply wasn't designed for this level of coordination complexity. Why Traditional Infrastructure Fails AI Agents Across my consulting engagements, I'm seeing consistent patterns of infrastructure failure when organizations try to scale AI beyond proof-of-concepts: HTTP communication breaks down: Traditional request-response patterns work for stateless operations but fail when AI agents need to maintain context across extended workflows, coordinate parallel processing, or handle operations that take minutes rather than milliseconds. The synchronous nature of HTTP creates cascading failures that bring down entire AI workflows.Context fragmentation destroys intelligence: AI agents aren't just processing data — they're maintaining conversational state and building accumulated knowledge. When that context gets lost at service boundaries or fragmented across sessions, the system's collective intelligence degrades dramatically.Security models are fundamentally flawed: Most AI implementations share credentials through environment variables or configuration files. This creates lateral movement risks and privilege escalation vulnerabilities that traditional security models weren't designed to handle.Architectural constraints force bad decisions: Tool limitations in current AI systems force teams into anti-patterns, such as building meta-tools, fragmenting capabilities, or implementing complex dynamic loading mechanisms. Each workaround introduces new failure modes and operational complexity. Evaluating the KubeMQ-Aiway Technical Solution KubeMQ-Aiway is “the industry’s first purpose-built connectivity hub for AI agents and Model-Context-Protocol (MCP) servers. It enables seamless routing, security, and scaling of all interactions — whether synchronous RPC calls or asynchronous streaming — through a unified, multi-tenant-ready infrastructure layer.” In other words, it’s the hub that manages and routes messages between systems, services, and AI agents. Through their early access program, I recently explored KubeMQ-Aiway's architecture. Several aspects stood out as particularly well-designed for these challenges: Unified aggregation layer: Rather than forcing point-to-point connections between agents, they've created a single integration hub that all agents and MCP servers connect through. This is architecturally sound — it eliminates the N-squared connection problem that kills system reliability at scale. 
More importantly, it provides a single point of control for monitoring, security, and operational management.Multi-pattern communication architecture: The platform supports both synchronous and asynchronous messaging natively, with pub/sub patterns and message queuing built in. This is crucial because AI workflows aren't purely request-response — they're event-driven processes that need fire-and-forget capabilities, parallel processing, and long-running operations. The architecture includes automatic retry mechanisms, load balancing, and connection pooling that are essential for production reliability.Virtual MCP implementation: This is particularly clever — instead of trying to increase tool limits within existing LLM constraints, they've abstracted tool organization at the infrastructure layer. Virtual MCPs allow logical grouping of tools by domain or function while presenting a unified interface to the AI system. It's the same abstraction pattern that made container orchestration successful.Role-based security model: The built-in moderation system implements proper separation of concerns with consumer and administrator roles. More importantly, it handles credential management at the infrastructure level rather than forcing applications to manage secrets. This includes end-to-end encryption, certificate-based authentication, and comprehensive audit logging — security patterns that are proven in distributed systems but rarely implemented correctly in AI platforms. Technical Architecture Deep Dive What also impresses me is their attention to distributed systems fundamentals: Event sourcing and message durability: The platform maintains a complete audit trail of agent interactions, which is essential for debugging complex multi-agent workflows. Unlike HTTP-based systems, where you lose interaction history, this enables replay and analysis capabilities that are crucial for production systems.Circuit breaker and backpressure patterns: Built-in failure isolation prevents cascade failures when individual agents malfunction or become overloaded. The backpressure mechanisms ensure that fast-producing agents don't overwhelm slower downstream systems — a critical capability when dealing with AI agents that can generate work at unpredictable rates.Service discovery and health checking: Agents can discover and connect to other agents dynamically without hardcoded endpoints. The health checking ensures that failed agents are automatically removed from routing tables, maintaining system reliability.Context preservation architecture: Perhaps most importantly, they've solved the context management problem that plagues most AI orchestration attempts. The platform maintains conversational state and working memory across agent interactions, ensuring that the collective intelligence of the system doesn't degrade due to infrastructure limitations. Production Readiness Indicators From an operational perspective, KubeMQ-Aiway demonstrates several characteristics that distinguish production-ready infrastructure from experimental tooling: Observability: Comprehensive monitoring, metrics, and distributed tracing for multi-agent workflows. This is essential for operating AI systems at scale, where debugging requires understanding complex interaction patterns.Scalability design: The architecture supports horizontal scaling of both the infrastructure layer and individual agents without requiring system redesign. 
This is crucial as AI workloads are inherently unpredictable and bursty.Operational simplicity: Despite the sophisticated capabilities, the operational model is straightforward — agents connect to a single aggregation point rather than requiring complex service mesh configurations. Market Timing and Competitive Analysis The timing of this launch is significant. Most organizations are hitting the infrastructure wall with their AI implementations right now, but existing solutions are either too simplistic (basic HTTP APIs) or too complex (trying to adapt traditional service meshes for AI workloads). KubeMQ-Aiway appears to have found the right abstraction level — sophisticated enough to handle complex AI orchestration requirements, but simple enough for development teams to adopt without becoming distributed systems experts. Compared to building similar capabilities internally, the engineering effort would be substantial. The distributed systems expertise required, combined with AI-specific requirements, represents months or years of infrastructure engineering work that most organizations can't justify when production AI solutions are available. Strategic Implications For technology leaders, the emergence of production-ready AI infrastructure platforms changes the strategic calculation around AI implementation. The question shifts from "should we build AI infrastructure?" to "which platform enables our AI strategy most effectively?" Early adopters of proper AI infrastructure are successfully running complex multi-agent systems at production scale while their competitors struggle with basic agent coordination. This gap will only widen as AI implementations become more sophisticated. The distributed systems problems in AI won't solve themselves, and application-layer workarounds don't scale. Infrastructure solutions like KubeMQ-Aiway represent how AI transitions from experimental projects to production systems that deliver sustainable business value. Organizations that recognize this pattern and invest in proven AI infrastructure will maintain a competitive advantage over those that continue trying to solve infrastructure problems at the application layer. Have a really great day!

By John Vester DZone Core CORE
Memory Leak Due to Uncleared ThreadLocal Variables

In Java, we commonly use static, instance (member), and local variables. Occasionally, we use ThreadLocal variables. When a variable is declared as ThreadLocal, it is only visible to that particular thread. ThreadLocal variables are extensively used in frameworks such as Log4J and Hibernate. If these ThreadLocal variables aren't removed after their use, they accumulate in memory and have the potential to trigger an OutOfMemoryError. In this post, let's learn how to troubleshoot memory leaks that are caused by ThreadLocal variables.
ThreadLocal Memory Leak
Here is a sample program that simulates a ThreadLocal memory leak.
Plain Text
01: public class ThreadLocalOOMDemo {
02:
03:     private static final ThreadLocal<String> threadString = new ThreadLocal<>();
04:
05:     private static final String text = generateLargeString();
06:
07:     private static int count = 0;
08:
09:     public static void main(String[] args) throws Exception {
10:         while (true) {
11:
12:             Thread thread = new Thread(() -> {
13:                 threadString.set("String-" + count + text);
14:                 try {
15:                     Thread.sleep(Long.MAX_VALUE); // Keep thread alive
16:                 } catch (InterruptedException e) {
17:                     Thread.currentThread().interrupt();
18:                 }
19:             });
20:
21:             thread.start();
22:             count++;
23:             System.out.println("Started thread #" + count);
24:         }
25:     }
26:
27:     private static String generateLargeString() {
28:         StringBuilder sb = new StringBuilder(5 * 1024 * 1024);
29:         while (sb.length() < 5 * 1024 * 1024) {
30:             sb.append("X");
31:         }
32:         return sb.toString();
33:     }
34: }
35:
Before continuing to read, please take a moment to review the above program closely. In line #3, 'threadString' is declared as a 'ThreadLocal' variable. In line #10, the program infinitely (i.e., the 'while (true)' condition) creates new threads. In line #13, it sets a large string (i.e., 'String-1XXXXXXXXXXXXXXXXXXXXXXX…') as a ThreadLocal variable on each created thread. The program never removes the ThreadLocal variable once it's set. So, in a nutshell, the program creates new threads infinitely, slaps each new thread with a large string as its ThreadLocal variable, and never removes it. Thus, when the program is executed, ThreadLocal variables will continuously accumulate in memory and finally result in 'java.lang.OutOfMemoryError: Java heap space'.
How to Diagnose ThreadLocal Memory Leak?
You can follow the steps highlighted in this post to diagnose the OutOfMemoryError: Java Heap Space. In a nutshell, you need to do the following:
1. Capture Heap Dump
You need to capture a heap dump from the application right before the JVM throws an OutOfMemoryError. In this post, eight options for capturing a heap dump are discussed. You may choose the option that best suits your needs. My favorite option is to pass the '-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<FILE_PATH_LOCATION>' JVM arguments to your application at the time of startup. Example:
Shell
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/tmp/heapdump.hprof
When you pass the above arguments, the JVM will generate a heap dump and write it to the '/opt/tmp/heapdump.hprof' file whenever an OutOfMemoryError is thrown.
2. Analyze Heap Dump
Once a heap dump is captured, you need to analyze the dump. In the next section, we will discuss how to do heap dump analysis.
Heap Dump Analysis: ThreadLocal Memory Leak
Heap dumps can be analyzed using various heap dump analysis tools, such as HeapHero, JHat, and JVisualVM.
Here, let's analyze the heap dump captured from this program using the HeapHero tool.
HeapHero flags memory leak using ML algorithm
The HeapHero tool utilizes machine learning algorithms internally to detect whether any memory leak patterns are present in the heap dump. Above is the screenshot from the heap dump analysis report, flagging a warning that there are 66 instances of 'java.lang.Thread' objects, which together occupy 97.13% of overall memory. It's a strong indication that the application is suffering from a memory leak originating from the 'java.lang.Thread' objects.
Largest Objects section highlights Threads consuming majority of heap space
The 'Largest Objects' section in the HeapHero analysis report shows the top memory-consuming objects, as shown in the above figure. Here you can clearly notice that all of these objects are of type 'java.lang.Thread' and that each of them occupies ~10MB of memory. This clearly identifies the culprit objects that are responsible for the memory leak.
Outgoing Reference section shows the ThreadLocal strings
The tool also gives you the ability to drill down into an object to investigate its contents. When you drill down into any one of the Threads reported in the 'Largest Objects' section, you can see all its child objects. In the above figure, you can see the actual ThreadLocal string 'String-1XXXXXXXXXXXXXXXXXXXXXXX…' being reported. This is the string that was set in line #13 of the program above. Thus, the tool helps you pinpoint the memory-leaking object and its source with ease.
How to Prevent ThreadLocal Memory Leak
Once a ThreadLocal variable has served its purpose, always call:
Java
threadString.remove();
This clears the ThreadLocal value from the current thread and avoids potential memory leaks.
Conclusion
Uncleared ThreadLocal variables are a subtle issue; however, when left unnoticed, they can accumulate over a period of time and have the potential to bring down the entire application. By being disciplined about removing ThreadLocal variables after use, and by using tools like HeapHero for faster root cause analysis, you can protect your applications from hard-to-detect outages.
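To make the cleanup advice concrete, a common idiom is to pair every set() with a remove() in a finally block. This matters especially on pooled threads (for example, in a servlet container or an ExecutorService), where threads are reused and would otherwise carry stale values between tasks. The sketch below is illustrative; the REQUEST_ID name and handleRequest method are hypothetical and not from the demo program above.
Java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalCleanupDemo {

    // Illustrative per-thread context value
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    static void handleRequest(String requestId) {
        REQUEST_ID.set(requestId);
        try {
            // ... business logic that reads REQUEST_ID.get() ...
            System.out.println("Processing " + REQUEST_ID.get());
        } finally {
            // Always clear the value so pooled/reused threads don't retain it
            REQUEST_ID.remove();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // threads are reused here
        for (int i = 0; i < 5; i++) {
            final String id = "req-" + i;
            pool.submit(() -> handleRequest(id));
        }
        pool.shutdown();
    }
}
Unlike the demo program in this article, which leaks by design, this pattern keeps each thread's ThreadLocal entry short-lived, so reused threads never accumulate stale values.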

By Ram Lakshmanan DZone Core CORE
Mastering Fluent Bit: Controlling Logs With Fluent Bit on Kubernetes (Part 4)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article: using Fluent Bit to get control of logs on a Kubernetes cluster. In case you missed the previous article, I'm providing a short introduction to Fluent Bit before sharing how to use a Fluent Bit telemetry pipeline on a Kubernetes cluster to take control of all the logs being generated.
What Is Fluent Bit?
Before diving into Fluent Bit, let's step back and look at the position of this project within the Fluent organization. If we look at the Fluent organization on GitHub, we find the Fluentd and Fluent Bit projects hosted there. The backstory is that the effort began as a log parsing project, Fluentd, which joined the CNCF in 2016 and achieved Graduated status in 2019. Once it became apparent that the world was heading towards cloud-native Kubernetes environments, it also became clear that Fluentd was not designed to meet the flexible and lightweight requirements that Kubernetes solutions demanded. Fluent Bit was born from the need for a low-resource, high-throughput, and highly scalable log management solution for cloud native Kubernetes environments. The project was started within the Fluent organization as a sub-project in 2017, and the rest is now a decade of history, culminating in the release of v4 last week. Fluent Bit has become much more than a flexible and lightweight log pipeline solution: it can now process metrics and traces as well, and it has become a telemetry pipeline collection tool of choice for those looking to put control over their telemetry data right at the source where it's being collected. Let's get started with Fluent Bit and see what we can do for ourselves!
Why Control Logs on a Kubernetes Cluster?
Diving into the cloud native world means deploying containers on Kubernetes. The complexities increase dramatically as your applications and microservices interact in this complex and dynamic infrastructure landscape. Deployments can auto-scale, pods spin up and are taken down as the need arises, and underlying all of this are the various Kubernetes controlling components. All of these things generate telemetry data, and Fluent Bit is a wonderfully simple way to take control of it across a Kubernetes cluster. It collects everything through a central telemetry pipeline while giving you the ability to parse, filter, and route all your telemetry data. For developers, this article will demonstrate using Fluent Bit as a single point of log collection on a development Kubernetes cluster with a deployed workload. Finally, all examples in this article have been done on OSX, and it is assumed the reader is able to convert the actions shown here to their own local machines.
Where to Get Started
To ensure you are ready to start controlling your Kubernetes cluster logs, the rest of this article assumes you have completed the previous article.
This ensures you are running a two-node Kubernetes cluster with a workload running in the form of Ghost CMS, and Fluent Bit is installed to collect all container logs. If you did not work through the previous article, I've provided a Logs Control Easy Install project repository that you can download, unzip, and run with one command to spin up the Kubernetes cluster with the above setup on your local machine. Using either path, once set up, you are able to see the logs from Fluent Bit containing everything generated on this running cluster. This would be the logs across three namespaces: kube-system, ghost, and logging. You can verify that they are up and running by browsing those namespaces, shown here on my local machine: Go $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace kube-system NAME READY STATUS RESTARTS AGE coredns-668d6bf9bc-jrvrx 1/1 Running 0 69m coredns-668d6bf9bc-wbqjk 1/1 Running 0 69m etcd-2node-control-plane 1/1 Running 0 69m kindnet-fmf8l 1/1 Running 0 69m kindnet-rhlp6 1/1 Running 0 69m kube-apiserver-2node-control-plane 1/1 Running 0 69m kube-controller-manager-2node-control-plane 1/1 Running 0 69m kube-proxy-b5vjr 1/1 Running 0 69m kube-proxy-jxpqc 1/1 Running 0 69m kube-scheduler-2node-control-plane 1/1 Running 0 69m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace ghost NAME READY STATUS RESTARTS AGE ghost-dep-8d59966f4-87jsf 1/1 Running 0 77m ghost-dep-mysql-0 1/1 Running 0 77m $ kubectl --kubeconfig target/2nodeconfig.yaml get pods --namespace logging NAME READY STATUS RESTARTS AGE fluent-bit-7qjmx 1/1 Running 0 41m The initial configuration for the Fluent Bit instance is to collect all container logs, from all namespaces, shown in the fluent-bit-helm.yaml configuration file used in our setup, highlighted in bold below: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri outputs: - name: stdout match: '*' To see all the logs collected, we can dump the Fluent Bit log file as follows, using the pod name we found above: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-7qjmx --nanmespace logging [OUTPUT-CUT-DUE-TO-LOG-VOLUME] ... You will notice if you browse that you have error messages, info messages, if you look hard enough, some logs from Ghost's MySQL workload, the Ghost CMS workload, and even your Fluent Bit instance. As a developer working on your cluster, how can you find anything useful in this flood of logging? The good thing is you do have a single place to look for them! Another point to mention is that by using the Fluent Bit tail input plugin and setting it to read from the beginning of each log file, we have ensured that our log telemetry data is taken from all our logs. If we didn't set this to collect from the beginning of the log file, our telemetry pipeline would miss everything that was generated before the Fluent Bit instance started. This ensures we have the workload startup messages and can test on standard log telemetry events each time we modify our pipeline configuration. Let's start taking control of our logs and see how we, as developers, can make some use of the log data we want to see during our local development testing. 
Taking Back Control The first thing we can do is to focus our log collection efforts on just the workload we are interested in, and in this example, we are looking to find problems with our Ghost CMS deployment. As you are not interested in the logs from anything happening in the kube-system namespace, you can narrow the focus of your Fluent Bit input plugin to only examine Ghost log files. This can be done by making a new configuration file called myfluent-bit-heml.yaml file and changing the default path as follows in bold: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' The next step is to update the Fluent Bit instance with a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-mzktk 1/1 Running 0 28s Now, explore the logs being collected by Fluent Bit and notice that all the kube-system namespace logs are no longer there, and we can focus on our deployed workload. Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-mzktk --nanmespace logging ... [11] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.278137067, {}], {"time"=>"2025-05-18T15:51:26.278137067Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.27 INFO ==> Configuring database"}] [12] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583486.318427288, {}], {"time"=>"2025-05-18T15:51:26.318427288Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:26.31 INFO ==> Setting up Ghost"}] [13] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.211337893, {}], {"time"=>"2025-05-18T15:51:31.211337893Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.21 INFO ==> Configuring Ghost URL to http://127.0.0.1:2368"}] [14] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.234609188, {}], {"time"=>"2025-05-18T15:51:31.234609188Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.23 INFO ==> Passing admin user creation wizard"}] [15] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583491.243222300, {}], {"time"=>"2025-05-18T15:51:31.2432223Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:31.24 INFO ==> Starting Ghost in background"}] [16] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583519.424206501, {}], {"time"=>"2025-05-18T15:51:59.424206501Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:51:59.42 INFO ==> Stopping Ghost"}] [17] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583520.921096963, {}], 
{"time"=>"2025-05-18T15:52:00.921096963Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:00.92 INFO ==> Persisting Ghost installation"}] [18] kube.var.log.containers.ghost-dep-8d59966f4-87jsf_ghost_ghost-dep-c8ee31893743a1ce781f6f43ea3d0bfb93412623a721a2248e842936dc567089.log: [[1747583521.008567054, {}], {"time"=>"2025-05-18T15:52:01.008567054Z", "stream"=>"stderr", "_p"=>"F", "log"=>"ghost 15:52:01.00 INFO ==> ** Ghost setup finished! **"}] ... This is just a selection of log lines from the total output. If you look closer, you see these logs have their own sort of format, so let's standardize them so that JSON is the output format and make the various timestamps a bit more readable by changing your Fluent Bit output plugin configuration as follows: Go args: - --workdir=/fluent-bit/etc - --config=/fluent-bit/etc/conf/fluent-bit.yaml config: extraFiles: fluent-bit.yaml: | service: flush: 1 log_level: info http_server: true http_listen: 0.0.0.0 http_port: 2020 pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*ghost* multiline.parser: docker, cri outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Update the Fluent Bit instance using a helm update command as follows: Go $ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml NAME READY STATUS RESTARTS AGE fluent-bit-gqsc8 1/1 Running 0 42s Now, explore the logs being collected by Fluent Bit and notice the output changes: Go $ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-gqsc8 --nanmespace logging ... {"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"} {"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"} {"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"} {"date":"2025-06-05 13:49:59.387736","time":"2025-06-05T13:49:59.387736981Z","stream":"stdout","_p":"F","log":""} {"date":"2025-06-05 13:49:59.451176","time":"2025-06-05T13:49:59.451176821Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.45 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Starting Ghost **"} {"date":"2025-06-05 13:50:00.171207","time":"2025-06-05T13:50:00.171207812Z","stream":"stdout","_p":"F","log":""} ... Now, if we look closer at the array of messages and being the developer we are, we've noticed a mix of stderr and stdout log lines. Let's take control and trim out all the lines that do not contain stderr, as we are only interested in what is broken. 
We need to add a filters section to our Fluent Bit configuration, using the grep filter with a regular expression that keeps only records whose stream key matches stderr, as follows:

YAML

args:
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.yaml
config:
  extraFiles:
    fluent-bit.yaml: |
      service:
        flush: 1
        log_level: info
        http_server: true
        http_listen: 0.0.0.0
        http_port: 2020
      pipeline:
        inputs:
          - name: tail
            tag: kube.*
            read_from_head: true
            path: /var/log/containers/*ghost*
            multiline.parser: docker, cri
        filters:
          - name: grep
            match: '*'
            regex: stream stderr
        outputs:
          - name: stdout
            match: '*'
            format: json_lines
            json_date_format: java_sql_timestamp

Update the Fluent Bit instance using a helm upgrade command as follows:

Shell

$ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml

NAME             READY   STATUS    RESTARTS   AGE
fluent-bit-npn8n 1/1     Running   0          12s

Now, explore the logs being collected by Fluent Bit and notice the output changes:

Shell

$ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-npn8n --namespace logging
...
{"date":"2025-06-05 13:49:34.807524","time":"2025-06-05T13:49:34.807524266Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.80 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring database"}
{"date":"2025-06-05 13:49:34.860722","time":"2025-06-05T13:49:34.860722188Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:34.86 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Setting up Ghost"}
{"date":"2025-06-05 13:49:36.289847","time":"2025-06-05T13:49:36.289847086Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.28 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Configuring Ghost URL to http://127.0.0.1:2368"}
{"date":"2025-06-05 13:49:36.373376","time":"2025-06-05T13:49:36.373376803Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Passing admin user creation wizard"}
{"date":"2025-06-05 13:49:36.379461","time":"2025-06-05T13:49:36.379461971Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:36.37 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Starting Ghost in background"}
{"date":"2025-06-05 13:49:58.001603","time":"2025-06-05T13:49:58.001603337Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:58.00 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Stopping Ghost"}
{"date":"2025-06-05 13:49:59.291618","time":"2025-06-05T13:49:59.291618721Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.29 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> Persisting Ghost installation"}
{"date":"2025-06-05 13:49:59.387701","time":"2025-06-05T13:49:59.38770119Z","stream":"stderr","_p":"F","log":"\u001b[38;5;6mghost \u001b[38;5;5m13:49:59.38 \u001b[0m\u001b[38;5;2mINFO \u001b[0m ==> ** Ghost setup finished! **"}
...

We are no longer seeing standard output log events, as our telemetry pipeline is now filtering to only show standard error-tagged logs! This exercise has shown how to format and prune our logs using our Fluent Bit telemetry pipeline on a Kubernetes cluster. Now let's look at how to enrich our log telemetry data. We are going to add tags to every standard error line pointing the on-call developer to the SRE they need to contact.
To do this, we expand the filters section of the Fluent Bit configuration with the modify filter: for every record whose stream key equals stderr, it removes that key and adds two new keys, STATUS and ACTION, as follows:

YAML

args:
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.yaml
config:
  extraFiles:
    fluent-bit.yaml: |
      service:
        flush: 1
        log_level: info
        http_server: true
        http_listen: 0.0.0.0
        http_port: 2020
      pipeline:
        inputs:
          - name: tail
            tag: kube.*
            read_from_head: true
            path: /var/log/containers/*ghost*
            multiline.parser: docker, cri
        filters:
          - name: grep
            match: '*'
            regex: stream stderr
          - name: modify
            match: '*'
            condition: Key_Value_Equals stream stderr
            remove: stream
            add:
              - STATUS REALLY_BAD
              - ACTION CALL_SRE
        outputs:
          - name: stdout
            match: '*'
            format: json_lines
            json_date_format: java_sql_timestamp

Update the Fluent Bit instance using a helm upgrade command as follows:

Shell

$ helm upgrade --kubeconfig target/2nodeconfig.yaml --install fluent-bit fluent/fluent-bit --set image.tag=4.0.0 --namespace=logging --create-namespace --values=myfluent-bit-helm.yaml

NAME             READY   STATUS    RESTARTS   AGE
fluent-bit-ftfs4 1/1     Running   0          32s

Now, explore the logs being collected by Fluent Bit and notice that the stream key is gone and two new keys have been added at the end of each error log event:

Shell

$ kubectl --kubeconfig target/2nodeconfig.yaml logs fluent-bit-ftfs4 --namespace logging
...
[CUT-LINE-FOR-VIEWING] Configuring database"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Setting up Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Configuring Ghost URL to http://127.0.0.1:2368"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Passing admin user creation wizard"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Starting Ghost in background"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Stopping Ghost"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] Persisting Ghost installation"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
[CUT-LINE-FOR-VIEWING] ** Ghost setup finished! **"},"STATUS":"REALLY_BAD","ACTION":"CALL_SRE"}
...

Now we have a running Kubernetes cluster with two nodes generating logs, a workload in the form of a Ghost CMS generating logs, and a Fluent Bit telemetry pipeline to gather and take control of our log telemetry data. Initially, we found that gathering all log telemetry data flooded us with too much information to sift out the events important for our development needs. We then took control of our log telemetry data by narrowing our collection strategy, then by filtering, and finally by enriching our telemetry data.

More in the Series

In this article, you learned how to use Fluent Bit on a Kubernetes cluster to take control of your telemetry data. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, integrating Fluent Bit telemetry pipelines with OpenTelemetry.

By Eric D. Schabell DZone Core CORE
HTAP Using a Star Query on MongoDB Atlas Search Index

MongoDB is often chosen for online transaction processing (OLTP) due to its flexible document model, which can align with domain-specific data structures and access patterns. Beyond basic transactional workloads, MongoDB also supports search capabilities through Atlas Search, built on Apache Lucene. When combined with the aggregation pipeline, this enables limited online analytical processing (OLAP) functionality suitable for near-real-time analytics. Because MongoDB uses a unified document model, these analytical queries can run without restructuring the data, allowing for certain hybrid transactional and analytical (HTAP) workloads. This article explores such a use case in the context of healthcare.

Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, to support efficient ad-hoc queries. This typically requires a dimensional schema, or star schema, which is distinct from the normalized operational schema used for transactional updates. MongoDB can support a similar querying approach using its document schema, which is often designed for operational use. By adding an Atlas Search index to the collection storing transactional data, certain analytical queries can be supported without restructuring the schema.

To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilized the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning.

Import the Dataset

The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records. I downloaded the CSV files and uncompressed them:

curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw"
unzip medsynora-dw.zip

I imported each file into a collection, using mongoimport from the MongoDB Database Tools:

for i in "MedSynora DW"/*.csv
do
  mongoimport -d "MedSynoraDW" --file="$i" --type=csv --headerline -c "$(basename "$i" .csv)" -j 8
done

For this demo, I'm interested in two fact tables: FactEncounter and FactLabTests.
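Before going further, it can be worth a quick sanity check that each CSV file landed in its own collection. Below is a minimal mongosh sketch (assuming the collection names follow the CSV file names, as produced by the mongoimport loop above):

// Print each imported collection with its document count.
const dw = db.getSiblingDB("MedSynoraDW");
dw.getCollectionNames().sort().forEach((name) => {
  print(name, dw.getCollection(name).countDocuments());
});

The counts should line up with the CSV row counts; in particular, FactLabTests should be by far the largest collection.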
Here are the fields described in the file headers:

# head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv

==> MedSynora DW/FactEncounter.csv <==
Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent

==> MedSynora DW/FactLabTests.csv <==
Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue

The fact tables referenced the following dimensions:

# head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv

==> MedSynora DW/DimDisease.csv <==
Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit

==> MedSynora DW/DimDoctor.csv <==
Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count

==> MedSynora DW/DimInsurance.csv <==
InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments

==> MedSynora DW/DimPatient.csv <==
Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type

==> MedSynora DW/DimRoom.csv <==
RoomKey,Care_Level,Room Type

Here is the dimensional model, often referred to as a "star schema" because the fact tables are located at the center, referencing the dimensions. Because of normalization, when facts contain a one-to-many composition, it is described in two CSV files to fit into two SQL tables:

Star schema with facts and dimensions.

The facts are stored in two tables in CSV files or a SQL database, but in a single collection in MongoDB. It holds the fact measures and dimension keys, which reference the keys of the dimension collections. MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship.
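To make that last point concrete, here is a minimal sketch of the fan-out pitfall, using the two imported fact collections and the embedded collection built in the next step (names as used elsewhere in this article). Summing an encounter-level measure after a $lookup and $unwind repeats the value once per lab test, whereas the embedded model keeps one document per encounter:

// Inflated: Patient_Severity_Score is repeated for every joined lab test row.
db.FactEncounter.aggregate([
  { $lookup: { from: "FactLabTests", localField: "Encounter_ID",
               foreignField: "Encounter_ID", as: "LabTests" } },
  { $unwind: "$LabTests" },
  { $group: { _id: null, total: { $sum: "$Patient_Severity_Score" } } }
]);

// Correct: one document per encounter, so the measure is summed once.
db.FactEncounterLabTests.aggregate([
  { $group: { _id: null, total: { $sum: "$Patient_Severity_Score" } } }
]);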
Since this represents the appropriate data model for an operational database with such data, I created a new collection using an aggregation pipeline to replace the two imported from the normalized CSV: db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 }); db.FactEncounter.aggregate([ { $lookup: { from: "FactLabTests", localField: "Encounter_ID", foreignField: "Encounter_ID", as: "LabTests" } }, { $addFields: { LabTests: { $map: { input: "$LabTests", as: "test", in: { Phase: "$$test.Phase", LabType: "$$test.LabType", TestName: "$$test.TestName", TestValue: "$$test.TestValue" } } } } }, { $out: "FactEncounterLabTests" } ]); Here is how one document looks: AtlasLocalDev atlas [direct: primary] MedSynoraDW> db.FactEncounterLabTests.find().limit(1) [ { _id: ObjectId('67fc3d2f40d2b3c843949c97'), Encounter_ID: 2158, Patient_ID: 'TR479', Disease_ID: 1632, ResponsibleDoctorID: 905, InsuranceKey: 82, RoomKey: 203, CheckinDate: '2024-01-23 11:09:00', CheckoutDate: '2024-03-29 17:00:00', CheckinDateKey: 20240123, CheckoutDateKey: 20240329, Patient_Severity_Score: 63.2, RadiologyType: 'None', RadiologyProcedureCount: 0, EndoscopyType: 'None', EndoscopyProcedureCount: 0, CompanionPresent: 'True', LabTests: [ { Phase: 'Admission', LabType: 'CBC', TestName: 'Lymphocytes_abs (10^3/µl)', TestValue: 1.34 }, { Phase: 'Admission', LabType: 'Chem', TestName: 'ALT (U/l)', TestValue: 20.5 }, { Phase: 'Admission', LabType: 'Lipids', TestName: 'Triglycerides (mg/dl)', TestValue: 129.1 }, { Phase: 'Discharge', LabType: 'CBC', TestName: 'RBC (10^6/µl)', TestValue: 4.08 }, ... In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections. Atlas Search Index Search indexes are distinct from regular indexes, which rely on a single composite key, as they can index multiple fields without requiring a specific order to establish a key. This feature makes them perfect for ad-hoc queries, where the filtering dimensions are not predetermined. I created a single Atlas Search index encompassing all dimensions and measures I intended to use in predicates, including those in embedded documents. db.FactEncounterLabTests.createSearchIndex( "SearchFactEncounterLabTests", { mappings: { dynamic: false, fields: { "Encounter_ID": { "type": "number" }, "Patient_ID": { "type": "token" }, "Disease_ID": { "type": "number" }, "InsuranceKey": { "type": "number" }, "RoomKey": { "type": "number" }, "ResponsibleDoctorID": { "type": "number" }, "CheckinDate": { "type": "token" }, "CheckoutDate": { "type": "token" }, "LabTests": { "type": "document" , fields: { "Phase": { "type": "token" }, "LabType": { "type": "token" }, "TestName": { "type": "token" }, "TestValue": { "type": "number" } } } } } } ); Since I don't need extra text searching on the keys, I designated the character string ones as token. I labeled the integer keys as number. Generally, the keys are utilized for equality predicates. However, some can be employed for ranges when the format permits, such as check-in and check-out dates formatted as YYYY-MM-DD. In relational databases, the star schema approach involves limiting the number of columns in fact tables due to their typically large number of rows. 
Dimension tables, which are generally smaller, can include more columns and are often denormalized in SQL databases, making the star schema more common than the snowflake schema. Similarly, in document modeling, embedding all dimension fields can increase the size of fact documents unnecessarily, so referencing dimension collections is often preferred. MongoDB’s data modeling principles allow it to be queried similarly to a star schema without additional complexity, as its design aligns with common application access patterns.

Star Query

A star schema allows queries that filter on fields of the dimension collections to be processed in several stages:

- In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.
- In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.
- A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection.

This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted. For an example query, I aimed to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions.

Search Aggregation Pipeline

To optimize access to the fact collection and apply all filters, I began with a simple aggregation pipeline that started with a search on the search index. This enabled filters to be applied directly to the fields in the fact collection, while additional filters were incorporated in the first stage of the star query. I used a local variable with a compound operator to facilitate adding more filters for each dimension during this stage. Before proceeding through the star query stages to add filters on dimensions, my query included a filter on the lab type, which was part of the fact collection and indexed.

const search = {
  "$search": {
    "index": "SearchFactEncounterLabTests",
    "compound": {
      "must": [
        { "in": { "path": "LabTests.LabType", "value": "Lipids" } },
      ]
    },
    "sort": { CheckoutDate: -1 }
  }
}

I added a sort operation to order the results by check-out date in descending order. This illustrated the advantage of sorting during the index search rather than in later stages of the aggregation pipeline, especially when a limit was applied. I used this local variable to add more filters in Stage 1 of the star query, so that it could be executed for Stage 2 and collect documents for Stage 3.

Stage 1: Query the Dimension Collections

In the first phase of the star query, I obtained the dimension keys from the dimension collections. For every dimension with a filter, I retrieved the dimension keys using a find() on the dimension collection and appended a must condition to the compound of the fact index search.
The following added the conditions on the Patient (female patients over 170 cm):

search["$search"]["compound"]["must"].push(
  { in: {
      path: "Patient_ID",                              // Foreign Key in Fact
      value: db.DimPatient.find(                       // Dimension collection
        { Gender: "Female", Height: { "$gt": 170 } }   // filter on Dimension
      ).map(doc => doc["Patient_ID"]).toArray() }      // Primary Key in Dimension
  })

The following added the conditions on the Doctor (Japanese):

search["$search"]["compound"]["must"].push(
  { in: {
      path: "ResponsibleDoctorID",                     // Foreign Key in Fact
      value: db.DimDoctor.find(                        // Dimension collection
        { "Doctor Nationality": "Japanese" }           // filter on Dimension
      ).map(doc => doc["Doctor_ID"]).toArray() }       // Primary Key in Dimension
  })

The following added the condition on the Room (Deluxe):

search["$search"]["compound"]["must"].push(
  { in: {
      path: "RoomKey",                                 // Foreign Key in Fact
      value: db.DimRoom.find(                          // Dimension collection
        { "Room Type": "Deluxe" }                      // filter on Dimension
      ).map(doc => doc["RoomKey"]).toArray() }         // Primary Key in Dimension
  })

The following added the condition on the Disease (Hematology):

search["$search"]["compound"]["must"].push(
  { in: {
      path: "Disease_ID",                              // Foreign Key in Fact
      value: db.DimDisease.find(                       // Dimension collection
        { "Disease Type": "Hematology" }               // filter on Dimension
      ).map(doc => doc["Disease_ID"]).toArray() }      // Primary Key in Dimension
  })

Finally, here's the condition on the Insurance coverage (greater than 80%):

search["$search"]["compound"]["must"].push(
  { in: {
      path: "InsuranceKey",                            // Foreign Key in Fact
      value: db.DimInsurance.find(                     // Dimension collection
        { "Coverage Limit": { "$gt": 0.8 } }           // filter on Dimension
      ).map(doc => doc["InsuranceKey"]).toArray() }    // Primary Key in Dimension
  })

All these search criteria had the same structure: a find() on the dimension collection with the filters from the query, resulting in an array of dimension keys (similar to primary keys in a dimension table) that were used to search the fact documents by referencing them (like foreign keys in a fact table). Each of these steps queried the dimension collection to obtain a simple array of dimension keys, which were then added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions were pushed down into the query on the fact collection.
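Since each of these pushes follows the same structure, they can be wrapped in a small helper. This is only a convenience sketch; the addDimensionFilter function is my own naming, not part of the original code:

// Sketch: look up the dimension keys matching a filter and push them
// as an "in" clause on the fact search (hypothetical helper).
function addDimensionFilter(search, factPath, dimCollection, dimFilter, dimKey) {
  const keys = db.getCollection(dimCollection).find(dimFilter)
                 .map(doc => doc[dimKey]).toArray();
  search["$search"]["compound"]["must"].push({ in: { path: factPath, value: keys } });
}

// Equivalent to the five pushes above:
addDimensionFilter(search, "Patient_ID", "DimPatient", { Gender: "Female", Height: { "$gt": 170 } }, "Patient_ID");
addDimensionFilter(search, "ResponsibleDoctorID", "DimDoctor", { "Doctor Nationality": "Japanese" }, "Doctor_ID");
addDimensionFilter(search, "RoomKey", "DimRoom", { "Room Type": "Deluxe" }, "RoomKey");
addDimensionFilter(search, "Disease_ID", "DimDisease", { "Disease Type": "Hematology" }, "Disease_ID");
addDimensionFilter(search, "InsuranceKey", "DimInsurance", { "Coverage Limit": { "$gt": 0.8 } }, "InsuranceKey");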
Stage 2: Query the Fact Search Index Using the results from the dimension queries, I built the following pipeline search step: AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search) { '$search': { index: 'SearchFactEncounterLabTests', compound: { must: [ { in: { path: 'LabTests.LabType', value: 'Lipids' } }, { in: { path: 'Patient_ID', value: [ 'TR551', 'TR751', 'TR897', 'TRGT201', 'TRJB261', 'TRQG448', 'TRSQ510', 'TRTP535', 'TRUC548', 'TRVT591', 'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896', 'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672', 'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003', 'TRE004', 'TRMN351', 'TRRY492', 'TRTI528', 'TRAKA962', 'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384', 'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419', 'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079', 'TR663', 'TR808', 'TR849', 'TRKA286', 'TRLC314', 'TRMG344', 'TRPT435', 'TRVZ597', 'TRXC626', 'TRACT773', 'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167', 'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514', 'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990', 'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374', 'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026', 'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354', 'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589', 'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987', 'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121', ... 6184 more items ] } }, { in: { path: 'ResponsibleDoctorID', value: [ 830, 844, 862, 921 ] } }, { in: { path: 'RoomKey', value: [ 203 ] } }, { in: { path: 'Disease_ID', value: [ 1519, 1506, 1504, 1510, 1515, 1507, 1503, 1502, 1518, 1517, 1508, 1513, 1509, 1512, 1516, 1511, 1505, 1514 ] } }, { in: { path: 'InsuranceKey', value: [ 83, 84 ] } } ] }, sort: { CheckoutDate: -1 } } MongoDB Atlas Search indexes, which are built on Apache Lucene, handle queries with multiple conditions and long arrays of values. In this example, a search operation uses the compound operator with the must clause to apply filters across attributes. This approach applies filters after resolving complex conditions into lists of dimension keys. Using the search operation defined above, I ran an aggregation pipeline to retrieve the document of interest: db.FactEncounterLabTests.aggregate([ search, ]) With my example, nine documents were returned in 50 milliseconds. Estimate the Count This approach works well for queries with multiple filters where individual conditions are not very selective but their combination is. Querying dimensions and using a search index on facts helps avoid scanning unnecessary documents. However, depending on additional operations in the aggregation pipeline, it is advisable to estimate the number of records returned by the search index to prevent expensive queries. In applications that allow multi-criteria queries, it is common to set a threshold and return an error or warning if the estimated number of documents exceeds it, prompting users to add more filters. To support this, you can run a $searchMeta operation on the index before a $search. 
For example, the following checks that the number of documents returned by the filter is less than 10,000:

MedSynoraDW> db.FactEncounterLabTests.aggregate([
  { "$searchMeta": {
      index: search["$search"].index,
      compound: search["$search"].compound,
      count: { "type": "lowerBound", threshold: 10000 }
  } }
])
[ { count: { lowerBound: Long('9') } } ]

In my case, with nine documents, I can add more operations to the aggregation pipeline without expecting a long response time. If there are more documents than expected, additional steps in the aggregation pipeline may take longer. If tens or hundreds of thousands of documents are expected as input to a complex aggregation pipeline, the application may warn the user that the query execution will not be instantaneous, and may offer the choice to run it as a background job with a notification when done. With such a warning, the user may decide to add more filters, or a limit to work on a Top-n result, which will be added to the aggregation pipeline after a sorted search.

Stage 3: Join Back to Dimensions for Projection

The first step of the aggregation pipeline fetches all the documents needed for the result, and only those documents, using efficient access through the search index. Once filtering is complete, the smaller set of documents is used for aggregation or projection in the later stages of the aggregation pipeline. In the third stage of the star query, the pipeline performs lookups on the dimensions to retrieve additional attributes needed for aggregation or projection. It might re-examine some collections used for filtering, which is not a problem since the dimensions remain small. For larger dimensions, the initial stage could save this information in a temporary array to avoid extra lookups, although this is often unnecessary. For example, when I wanted to display additional information about the patient and the doctor, I added two lookup stages to my aggregation pipeline:

  { "$lookup": {
      "from": "DimDoctor",
      "localField": "ResponsibleDoctorID",
      "foreignField": "Doctor_ID",
      "as": "ResponsibleDoctor"
  } },
  { "$lookup": {
      "from": "DimPatient",
      "localField": "Patient_ID",
      "foreignField": "Patient_ID",
      "as": "Patient"
  } },

For the simplicity of this demo, I imported the dimensions directly from the CSV files. In a well-designed database, the primary key for dimensions should be the document's _id field, and the collection ought to be established as a clustered collection (a minimal sketch follows below). This design ensures efficient joins from fact documents. Most of the dimensions are compact and stay in memory. I added a final projection to fetch only the fields I needed.
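Here is a minimal sketch of that clustered-collection design (the DimDoctorClustered name is hypothetical, and clustered collections require MongoDB 5.3 or later); the idea is to store the dimension key as _id so fact documents join on the clustered key directly:

// Hypothetical sketch: rebuild a dimension as a clustered collection keyed on _id.
db.createCollection("DimDoctorClustered", {
  clusteredIndex: { key: { _id: 1 }, unique: true, name: "DimDoctor clustered on _id" }
});
// Copy the dimension, promoting Doctor_ID to _id (other fields are kept as-is).
db.DimDoctor.aggregate([
  { $set: { _id: "$Doctor_ID" } },
  { $merge: { into: "DimDoctorClustered", whenMatched: "replace", whenNotMatched: "insert" } }
]);

With such a dimension, the $lookup above would use foreignField: "_id" instead of Doctor_ID.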
The full aggregation pipeline, using the search defined above with filters and arrays of dimension keys, is: db.FactEncounterLabTests.aggregate([ search, { "$lookup": { "from": "DimDoctor", "localField": "ResponsibleDoctorID", "foreignField": "Doctor_ID", "as": "ResponsibleDoctor" } }, { "$lookup": { "from": "DimPatient", "localField": "Patient_ID", "foreignField": "Patient_ID", "as": "Patient" } }, { "$project": { "Patient_Severity_Score": 1, "CheckinDate": 1, "CheckoutDate": 1, "Patient.name": { "$concat": [ { "$arrayElemAt": ["$Patient.First Name", 0] }, " ", { "$arrayElemAt": ["$Patient.Last Name", 0] } ] }, "ResponsibleDoctor.name": { "$concat": [ { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Name", 0] }, " ", { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Surname", 0] } ] } } } ]) On a small instance, it returned the following result in 50 milliseconds: [ { _id: ObjectId('67fc3d2f40d2b3c843949a97'), CheckinDate: '2024-02-12 17:00:00', CheckoutDate: '2024-03-30 13:04:00', Patient_Severity_Score: 61.4, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Niina Johanson' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f5c'), CheckinDate: '2024-04-29 06:44:00', CheckoutDate: '2024-05-30 19:53:00', Patient_Severity_Score: 57.7, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Cindy Wibisono' } ] }, { _id: ObjectId('67fc3d2f40d2b3c843949f0e'), CheckinDate: '2024-10-06 13:43:00', CheckoutDate: '2024-11-29 09:37:00', Patient_Severity_Score: 55.1, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Asta Koch' } ] }, { _id: ObjectId('67fc3d2f40d2b3c8439523de'), CheckinDate: '2024-08-24 22:40:00', CheckoutDate: '2024-10-09 12:18:00', Patient_Severity_Score: 66, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Paloma Aguero' } ] }, { _id: ObjectId('67fc3d3040d2b3c843956f7e'), CheckinDate: '2024-11-04 14:50:00', CheckoutDate: '2024-12-31 22:59:59', Patient_Severity_Score: 51.5, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Aulikki Johansson' } ] }, { _id: ObjectId('67fc3d3040d2b3c84395e0ff'), CheckinDate: '2024-01-14 19:09:00', CheckoutDate: '2024-02-07 15:43:00', Patient_Severity_Score: 47.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Laura Potter' } ] }, { _id: ObjectId('67fc3d3140d2b3c843965ed2'), CheckinDate: '2024-01-03 09:39:00', CheckoutDate: '2024-02-09 12:55:00', Patient_Severity_Score: 57.6, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Gabriela Cassiano' } ] }, { _id: ObjectId('67fc3d3140d2b3c843966ba1'), CheckinDate: '2024-07-03 13:38:00', CheckoutDate: '2024-07-17 07:46:00', Patient_Severity_Score: 60.3, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Monica Zuniga' } ] }, { _id: ObjectId('67fc3d3140d2b3c843969226'), CheckinDate: '2024-04-06 11:36:00', CheckoutDate: '2024-04-26 07:02:00', Patient_Severity_Score: 62.9, ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ], Patient: [ { name: 'Stanislava Beranova' } ] } ] The star query approach focuses solely on filtering to obtain the input for further processing, while retaining the full power of aggregation pipelines. Additional Aggregation after Filtering When I have the set of documents efficiently filtered upfront, I can apply some aggregations before the projection. 
For example, the following grouped results per doctor, counting the number of patients and computing the range of severity scores:

db.FactEncounterLabTests.aggregate([
  search,
  { "$lookup": {
      "from": "DimDoctor",
      "localField": "ResponsibleDoctorID",
      "foreignField": "Doctor_ID",
      "as": "ResponsibleDoctor"
  } },
  { "$unwind": "$ResponsibleDoctor" },
  { "$group": {
      "_id": {
        "doctor_id": "$ResponsibleDoctor.Doctor_ID",
        "doctor_name": { "$concat": [ "$ResponsibleDoctor.Doctor Name", " ", "$ResponsibleDoctor.Doctor Surname" ] }
      },
      "min_severity_score": { "$min": "$Patient_Severity_Score" },
      "max_severity_score": { "$max": "$Patient_Severity_Score" },
      "patient_count": { "$sum": 1 }  // Count the number of patients
  } },
  { "$project": {
      "doctor_name": "$_id.doctor_name",
      "min_severity_score": 1,
      "max_severity_score": 1,
      "patient_count": 1
  } }
])

My filters got documents from only one doctor and nine patients:

[
  {
    _id: { doctor_id: 862, doctor_name: 'Sayuri Shan Kou' },
    min_severity_score: 47.6,
    max_severity_score: 66,
    patient_count: 9,
    doctor_name: 'Sayuri Shan Kou'
  }
]

Using a MongoDB document model, this method enables direct analytical queries on the operational database, removing the need for a separate analytical database. The search index operates as the analytical component for the operational database and works with the MongoDB aggregation pipeline. Since the search index runs as a separate process, it can be deployed on a dedicated search node to isolate resource usage. When running analytics on an operational database, queries should be designed to minimize impact on the operational workload.

Conclusion

MongoDB’s document model with Atlas Search indexes supports managing and querying data following a star schema approach. By using a single search index on the fact collection and querying dimension collections for filters, it is possible to perform ad-hoc queries without replicating data into a separate analytical schema as typically done in relational databases. In SQL databases, by contrast, a star schema data mart is typically maintained apart from the normalized operational database. In MongoDB, the document model uses embedding and referencing patterns similar to a star schema and is structured for operational transactions. Search indexes provide similar functionality without moving data to a separate system. The method, implemented as a three-stage star query, can be integrated into client applications to optimize query execution and enable near-real-time analytics on complex data. This approach supports hybrid transactional and analytical processing (HTAP) workloads.

By Franck Pachot
