DZone Spotlight

Building a Platform Abstraction for EKS Cluster Using Crossplane

By Ramesh Sinha
Building on what we started in an earlier article, here we're going to learn how to extend our platform and create a platform abstraction for provisioning an AWS EKS cluster. EKS is AWS's managed Kubernetes offering.

Quick Refresher

Crossplane is a Kubernetes CRD-based add-on that abstracts cloud implementations and lets us manage infrastructure as code.

Prerequisites

- Set up Docker and Kubernetes.
- Follow the Crossplane installation from the previous article.
- Follow the provider configuration from the previous article.
- Apply all the network YAMLs from the previous article (including the updated network composition discussed later). This will create the necessary network resources for the EKS cluster.

Some Plumbing

When creating an EKS cluster, AWS needs to:

- Spin up the control plane (managed by AWS)
- Attach security groups
- Configure networking (ENIs, etc.)
- Access the VPC and subnets
- Manage API endpoints
- Interact with other AWS services (e.g., CloudWatch for logging, Route 53)

To do this securely, AWS requires an IAM role that it can assume. We create that role here and reference it during cluster creation; details are provided below. Without this role, you'll get errors like "access denied" when creating the cluster.

Steps to Create the AWS IAM Role

1. Log in to the AWS Console and go to the IAM page.
2. In the left sidebar, click Roles.
3. Click Create Role.
4. Choose AWS service as the trusted entity type.
5. Select the EKS use case and choose EKS Cluster.
6. Attach the following policies:
   - AmazonEKSClusterPolicy
   - AmazonEKSServicePolicy
   - AmazonEC2FullAccess
   - AmazonEKSWorkerNodePolicy
   - AmazonEC2ContainerRegistryReadOnly
   - AmazonEKS_CNI_Policy
7. Provide the name eks-crossplane-cluster and optionally add tags.

Since we'll also create NodeGroups, which require additional permissions, for simplicity I'm granting the Crossplane user (created in the previous article) permission to PassRole for the Crossplane cluster role. This permission allows the user to tell AWS services (EKS) to assume the Crossplane cluster role on its behalf. Basically, this user can say, "Hey, EKS service, create a node group and use this role when doing it." To accomplish this, add the following inline policy to the Crossplane user:

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::914797696655:role/eks-crossplane-cluster"
    }
  ]
}

Note: Typically, to follow the principle of least privilege, you should separate roles and policies: a control plane role with EKS admin permissions, and a node role with permissions for node group creation.
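If you prefer the CLI over the console, the same inline policy can be attached with the AWS CLI. This is a hedged sketch: the user name crossplane-user and the file name passrole-policy.json are assumptions, so substitute the user and policy document you actually created.

Shell

# Attach the PassRole inline policy (saved above as passrole-policy.json) to the Crossplane IAM user
aws iam put-user-policy \
  --user-name crossplane-user \
  --policy-name eks-crossplane-passrole \
  --policy-document file://passrole-policy.json

# Confirm the inline policy is in place
aws iam get-user-policy --user-name crossplane-user --policy-name eks-crossplane-passrole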
In the previous article, I had created only one subnet in the network composition, but the EKS control plane requires at least two AZs, with one subnet per AZ. You should modify the network composition from the previous article to add another subnet. To do so, just add the following to the network composition YAML, and don't forget to apply the composition and claim to re-create the network.

YAML

- name: subnet-b
  base:
    apiVersion: ec2.aws.upbound.io/v1beta1
    kind: Subnet
    spec:
      forProvider:
        cidrBlock: 10.0.2.0/24
        availabilityZone: us-east-1b
        mapPublicIpOnLaunch: true
        region: us-east-1
      providerConfigRef:
        name: default
  patches:
    - fromFieldPath: status.vpcId
      toFieldPath: spec.forProvider.vpcId
      type: FromCompositeFieldPath
    - fromFieldPath: spec.claimRef.name
      toFieldPath: spec.forProvider.tags.Name
      type: FromCompositeFieldPath
      transforms:
        - type: string
          string:
            fmt: "%s-subnet-b"
    - fromFieldPath: status.atProvider.id
      toFieldPath: status.subnetIds[1]
      type: ToCompositeFieldPath

We will also need a provider to support EKS resource creation. To create the necessary provider, save the following content into a .yaml file:

YAML

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-aws:v0.54.2
  controllerConfigRef:
    name: default

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

Crossplane Composite Resource Definition (XRD)

Below, we're going to build a Composite Resource Definition for the EKS cluster. Before diving in, one thing to note: if you've already created the network resources using the previous article, you may have noticed that the network composition includes a field that places the subnet ID into the composite resource's status, specifically under status.subnetIds[0]. This value comes from the cloud's Subnet resource and is needed by other XCluster compositions. By placing it in the status field, the network composition makes it possible for other Crossplane compositions to reference and use it.

Similar to what we did for network creation in the previous article, we're going to create a Crossplane XRD, a Crossplane Composition, and finally a Claim that will result in the creation of an EKS cluster. At the end, I've included a table that serves as an analogy to help illustrate the relationship between the Composite Resource Definition (XRD), Composite Resource (XR), Composition, and Claim. To create an EKS XRD, save the following content into a .yaml file:

YAML

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xclusters.aws.platformref.crossplane.io
spec:
  group: aws.platformref.crossplane.io
  names:
    kind: XCluster
    plural: xclusters
  claimNames:
    kind: Cluster
    plural: clusters
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          required:
            - spec
          properties:
            spec:
              type: object
              required:
                - parameters
              properties:
                parameters:
                  type: object
                  required:
                    - region
                    - roleArn
                    - networkRef
                  properties:
                    region:
                      type: string
                      description: AWS region to deploy the EKS cluster in.
                    roleArn:
                      type: string
                      description: IAM role ARN for the EKS control plane.
                    networkRef:
                      type: object
                      description: Reference to a pre-created XNetwork.
                      required:
                        - name
                      properties:
                        name:
                          type: string
            status:
              type: object
              properties:
                network:
                  type: object
                  required:
                    - subnetIds
                  properties:
                    subnetIds:
                      type: array
                      items:
                        type: string

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml
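Before moving on, it's worth confirming that Crossplane accepted the provider and the XRD. A quick check, assuming the names used above:

Shell

# The provider should eventually report INSTALLED and HEALTHY as True
kubectl get providers

# The XRD should report ESTABLISHED and OFFERED as True
kubectl get compositeresourcedefinitions xclusters.aws.platformref.crossplane.io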
Crossplane Composition

The Composition is the implementation; it tells Crossplane how to build all the underlying resources (control plane, node group). To create an EKS composition, save the content below into a .yaml file:

YAML

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: cluster.aws.platformref.crossplane.io
spec:
  compositeTypeRef:
    apiVersion: aws.platformref.crossplane.io/v1alpha1
    kind: XCluster
  resources:
    - name: network
      base:
        apiVersion: aws.platformref.crossplane.io/v1alpha1
        kind: XNetwork
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.networkRef.name
          toFieldPath: metadata.name
        - type: ToCompositeFieldPath
          fromFieldPath: status.subnetIds
          toFieldPath: status.network.subnetIds
        - type: ToCompositeFieldPath
          fromFieldPath: status.subnetIds[0]
          toFieldPath: status.network.subnetIds[0]
      readinessChecks:
        - type: None
    - name: eks
      base:
        apiVersion: eks.aws.crossplane.io/v1beta1
        kind: Cluster
        spec:
          forProvider:
            region: us-east-1
            roleArn: ""
            resourcesVpcConfig:
              subnetIds: []
              endpointPrivateAccess: true
              endpointPublicAccess: true
          providerConfigRef:
            name: default
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.region
          toFieldPath: spec.forProvider.region
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.roleArn
          toFieldPath: spec.forProvider.roleArn
        - type: FromCompositeFieldPath
          fromFieldPath: status.network.subnetIds
          toFieldPath: spec.forProvider.resourcesVpcConfig.subnetIds
    - name: nodegroup
      base:
        apiVersion: eks.aws.crossplane.io/v1alpha1
        kind: NodeGroup
        spec:
          forProvider:
            region: us-east-1
            clusterNameSelector:
              matchControllerRef: true
            nodeRole: ""
            subnets: []
            scalingConfig:
              desiredSize: 2
              maxSize: 3
              minSize: 1
            instanceTypes:
              - t3.medium
            amiType: AL2_x86_64
            diskSize: 20
          providerConfigRef:
            name: default
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.region
          toFieldPath: spec.forProvider.region
        - type: FromCompositeFieldPath
          fromFieldPath: spec.parameters.roleArn
          toFieldPath: spec.forProvider.nodeRole
        - type: FromCompositeFieldPath
          fromFieldPath: status.network.subnetIds
          toFieldPath: spec.forProvider.subnets

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

Claim

I'm taking the liberty to explain the claim in more detail here. First, it's important to note that a claim is an entirely optional entity in Crossplane. It is essentially a Kubernetes Custom Resource Definition (CRD) that the platform team can expose to application developers as a self-service interface for requesting infrastructure, such as an EKS cluster. Think of it as an API payload: a lightweight, developer-friendly abstraction layer. In the earlier CompositeResourceDefinition (XRD), we created the kind XCluster. By using a claim, application developers can interact with a much simpler and more intuitive CRD like Cluster instead of XCluster.

For simplicity, I have referenced the XNetwork composite resource name directly instead of the Network claim name. Crossplane creates the XNetwork resource and appends random characters to the claim name when naming it. As an additional step, you'll need to retrieve the actual XNetwork name from the Kubernetes API and use it here. While there are ways to automate this process, I'm keeping it simple here; let me know in the comments if there is interest and I will write more about how to automate that.
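To look up the generated XNetwork name before filling in the claim, a minimal sketch, assuming the XNetwork XRD from the previous article is installed:

Shell

# List composite network resources; the NAME column shows the claim name
# plus a random suffix, e.g. crossplane-demo-network-jpv49
kubectl get xnetworks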
To create a claim, save the content below into a .yaml file. Please note the roleArn referenced in it; that is the role I mentioned earlier, which AWS uses to create other resources.

YAML

apiVersion: aws.platformref.crossplane.io/v1alpha1
kind: Cluster
metadata:
  name: demo-cluster
  namespace: default
spec:
  parameters:
    region: us-east-1
    roleArn: arn:aws:iam::914797696655:role/eks-crossplane-cluster
    networkRef:
      name: crossplane-demo-network-jpv49 # <important> this is how the EKS composition refers to the network created earlier; note the random suffix "jpv49" from the XNetwork name

And apply using:

Shell

kubectl apply -f <your-file-name>.yaml

After this, you should see an EKS cluster in your AWS console; make sure you are looking in the correct region. If there are any issues, look for error logs in the composite and managed resources. You can view them using:

Shell

# Get XCluster detail; look for reconciliation errors or messages.
# You will also find references to the managed resources.
k get XCluster demo-cluster -o yaml

# Check the status of a managed resource, for example:
k get Cluster.eks.aws.crossplane.io

As I mentioned before, below is a table where I attempt to provide another analogy for the various components used in Crossplane:

Component | Analogy
XRD | The interface, or blueprint, for a product; defines what knobs users can turn
XR (XCluster) | A specific product instance with user-provided values
Composition | The function that implements all the details of the product
Claim | A customer-friendly interface for ordering the product, or an API payload

Patch

I also want to explain an important concept we've used in our Composition: patching. You may have noticed the patches field in the YAML blocks. In Crossplane, a composite resource is the high-level abstraction we define; in our case, that's XCluster. Managed resources are the actual cloud resources Crossplane provisions on our behalf, such as the AWS EKS Cluster and NodeGroup.

A patch in a Crossplane Composition is a way to copy or transform data from/to the composite resource (XCluster) to/from the managed resources (Cluster, NodeGroup, etc.). Patching allows us to map values like region, roleArn, and names from the high-level composite to the actual underlying infrastructure, ensuring that developer inputs (or platform-defined parameters) flow all the way down to the cloud resources.

Conclusion

Using Crossplane, you can build powerful abstractions that shield developers from the complexities of infrastructure, allowing them to focus on writing application code. These abstractions can also be made cloud-agnostic, enabling benefits like portability, cost optimization, resilience and redundancy, and greater standardization.
From Data Growth to Data Responsibility: Building Secure Data Systems in AWS

By Junaith Haja
Enterprise data solutions are growing across data warehouses, data lakes, data lakehouses, and hybrid platforms in cloud services. As data grows exponentially across these services, it is the data practitioners' responsibility to secure the environment with guardrails and privacy boundaries. In this article, we will walk through a framework for implementing security controls in AWS and learn how to apply it across the Redshift, Glue, DynamoDB, and Aurora database services.

The Security Framework for Modern Data Infrastructure

When building scalable and secure AWS-native data platforms (Glue, Redshift, DynamoDB, Aurora), I recommend thinking of security in terms of seven pillars. Each pillar comes with practical checkpoints you can implement and audit against.

Pillar 1: Identity and Access Control

The identity and access control pillar ensures only the right people and systems can touch your data. This starts with centralizing identities with IAM Identity Center/SSO. Enforce the principle of least privilege with IAM roles (not long-lived users), granting identities only the access they need to perform their job duties. We can also leverage attribute-based access control, which uses tags such as department=finance or data_classification=pii. By starting with identity as the first pillar of a secure data solution, we establish clear boundaries, with each database object having an owning principal.

Pillar 2: Data Classification and Catalog Governance

The second step is to go a level deeper and classify the datasets attached to identities. In a data lake, we can label datasets with tags like pii=high or pii=highly-confidential. Once classified, these tags drive tag-based access control (TBAC) across services such as Glue and Redshift, ensuring only the right people see the right data. Along with this, maintaining column-level metadata like region or compliance domain in the Glue Data Catalog makes governance consistent and transparent. With proper classification and catalog governance, policies can be applied uniformly across the enterprise instead of in silos.

Pillar 3: Network and Perimeter Security

Keep your data safe by making sure it only travels over private, secure paths. Put your databases in private networks, use dedicated connections (like VPC endpoints) to reach services, and make sure all data leaving the system is encrypted and checked.

Pillar 4: Encryption as Needed

We should not treat all data the same way; the treatment has to be based on the data classification from Pillar 2. For example, some data is red (very sensitive, like financial or health records) and should be tightly secured in AWS at rest using KMS and CMKs with rotation turned on. A good practice is not to store red data in open or persistent storage. Orange data is important but less sensitive, like business logs; for it, we should ensure proper bucket policies are applied. Green is general data that can be shared more freely, like logs, where encryption is not needed.

Pillar 5: Secrets and Credential Management

Never store your passwords in a code base or in any queries. In AWS, you can keep them safe in Secrets Manager, which locks them up and rotates them periodically. Instead of giving every app a fixed password, let it borrow temporary credentials through IAM roles, which is safer and harder to misuse. For databases like Aurora, you don't even need a password at all; you can log in with a short-lived token. The rule is simple: don't use permanent keys; always use rotating or temporary ones.
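To make the short-lived token idea concrete, here is a hedged sketch of requesting an IAM authentication token for an Aurora PostgreSQL endpoint. The host name, database, and user are placeholders, and IAM database authentication must already be enabled on the cluster.

Shell

# Request a temporary auth token (valid for 15 minutes) instead of a stored password
TOKEN=$(aws rds generate-db-auth-token \
  --hostname my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username app_user \
  --region us-east-1)

# Use the token as the password when connecting; TLS is required for IAM auth
psql "host=my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com port=5432 dbname=appdb user=app_user password=$TOKEN sslmode=require"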
Pillar 6: Monitoring, Detection, and Audit

Think of monitoring like a CCTV camera for your data. You should always know who touched what, when, and why. In AWS, you can turn on CloudTrail to record all actions and save these records safely in CloudWatch Logs. Tools like GuardDuty act like guards watching for unusual activity, while Security Hub gathers all warnings in one place. For stricter checks, databases like Aurora and Redshift have their own audit logs, and tools like Macie scan S3 to catch sensitive files that are exposed. The idea is simple: if something goes wrong, you should be able to trace it back quickly.

Pillar 7: Policy as Code

For scalability, manage cloud policies as infrastructure as code rather than through manual deployments. In AWS, you can define things like KMS keys, IAM roles, or Lake Formation policies in CloudFormation, CDK, or Terraform. Before changes go live, tools like cfn-nag or tfsec check whether something looks unsafe. For risky actions (like changing IAM roles or encryption keys), you can set up approval steps so no one sneaks in a bad change.

Example #1: AWS Glue + Lake Formation (Catalog, ETL, Data Perimeter)

AWS Glue works like the factory that moves and transforms your data, while Lake Formation is the guardrail that makes sure only the right people and systems can see the right parts of that data. Together, they help centralize governance, protect sensitive fields, and ensure ETL jobs run safely without leaking information.

Steps to Implement Security

1. Classify your data with tags: Define tags such as pii={none, low, high}, pii={true, false}, or region={us, eu}. Apply these tags to databases, tables, and even columns in the Glue Data Catalog.
2. Control access with tag-based policies (TBAC): Create Lake Formation permissions using tags: analyst role: pii!=high; ops role: pii in {none, low}; compliance role: full access and audit rights.
3. Apply row-level filters and column masking: Use LF-governed tables to filter rows (e.g., only show region=session_region). Mask sensitive columns like email and date of birth with hash values.
4. Secure your Glue jobs: Turn on encryption for S3, CloudWatch, and job bookmarks with KMS CMKs. Run Glue jobs inside a VPC, with S3 routed through gateway/interface endpoints, not the public internet. Assign a minimal IAM role per job, keeping dev and prod roles separate and scoped to exact resources.
5. Keep catalog and ETL hygiene strong: Block public access to S3 buckets (disable ACLs/policies). Require encryption on all writes (aws:SecureTransport=true, x-amz-server-side-encryption). Enable continuous logging of Glue jobs into CloudWatch for audit and troubleshooting.
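As a hedged starting point for step 1, here is a sketch of defining an LF-tag and attaching it to a table with the AWS CLI; the database and table names are placeholders, and the caller needs Lake Formation admin permissions:

Shell

# Define an LF-tag and its allowed values
aws lakeformation create-lf-tag --tag-key pii --tag-values none low high

# Attach the tag to an existing Glue Data Catalog table
aws lakeformation add-lf-tags-to-resource \
  --resource '{"Table": {"DatabaseName": "sales_db", "Name": "orders"}}' \
  --lf-tags '[{"TagKey": "pii", "TagValues": ["high"]}]'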
Example #2: Amazon Redshift (Warehouse Analytics)

Amazon Redshift is your data warehouse; it's powerful for analytics, but also home to a lot of sensitive data. Protecting it means enforcing who can see which rows or columns, isolating traffic so nothing leaks, and making sure every action is logged.

Steps to Implement Security

1. Network and encryption: Place Redshift clusters or serverless workgroups in private subnets (no public endpoints). Turn on encryption at rest with a customer-managed KMS key. Force SSL connections (reject non-TLS). Use Enhanced VPC Routing so COPY/UNLOAD only moves data via VPC endpoints.
2. Identity and SSO: Use IAM Identity Center or SAML for single sign-on. Avoid static keys; rely on role chaining for COPY/UNLOAD to S3.
3. Fine-grained controls: Enable row-level security (RLS) and column-level security (CLS). Use dynamic data masking for fields like SSNs, showing only partial data unless the role allows full access.
4. Audit and logging: Enable database audit logging to S3/CloudWatch. Integrate with CloudTrail for management events.

Example #3: Amazon DynamoDB (Operational Data)

Amazon DynamoDB powers fast apps at scale, but governance here is about restricting who can touch which items, keeping traffic private, and ensuring logs exist for compliance.

Steps to Implement Security

1. Item-level permissions: Use IAM conditions like dynamodb:LeadingKeys to tie access to a user's partition key (e.g., only see their own orders). For example, bind customer_id in the request to the caller's IAM tag.
2. Private access and encryption: Use gateway VPC endpoints for DynamoDB; block non-VPC traffic if possible (via SCP). Require encryption at rest with customer-managed KMS keys.
3. Resilience and lifecycle: Turn on Point-in-Time Recovery (PITR) and on-demand backups. Use TTL for short-lived items to reduce exposure (but don't rely on TTL alone for compliance deletion).
4. Audit: Enable CloudTrail data events for sensitive tables where you need full visibility (note: extra cost).
5. Streams and integrations: If using DynamoDB Streams for CDC, ensure consumer apps (Lambda, Glue) run inside a VPC with least-privilege roles. Force them to write only into encrypted destinations.

Example #4: Amazon Aurora (Relational Data)

Amazon Aurora is a managed relational database (compatible with PostgreSQL and MySQL) that runs mission-critical workloads. Because it often stores highly sensitive transactional data, the governance model here must combine AWS controls (encryption, network) with native SQL features (roles, RLS, auditing).

Steps to Implement Security

1. Network and endpoints: Deploy Aurora clusters in private subnets; never expose public endpoints. Restrict inbound rules to application security groups only, not wide CIDRs.
2. Encryption and TLS: Enable KMS CMK encryption at cluster creation. Enforce TLS connections: set rds.force_ssl=1 (Postgres) to reject non-SSL clients.
3. Identity and credentials: Store master and user credentials in AWS Secrets Manager with automatic rotation (Lambda). Use IAM database authentication for short-lived, token-based access; it integrates neatly with CloudTrail for auditing.
4. Database-level governance: Define roles with least privilege:

SQL

CREATE ROLE analyst NOINHERIT;
GRANT USAGE ON SCHEMA sales TO analyst;
GRANT SELECT (order_id, amount, region) ON sales.orders TO analyst;

Enable row-level security (RLS):

SQL

ALTER TABLE sales.orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY region_isolation ON sales.orders
  USING (region = current_setting('app.user_region', true));

5. Auditing: Enable pgaudit to log SELECT, DDL, and DML events as needed. Stream Aurora/Postgres logs to CloudWatch Logs and set appropriate retention policies.
6. Backups, PITR, and disaster recovery: Turn on automated backups and Point-in-Time Recovery (PITR). Regularly test restores to verify recovery SLAs. For stronger assurance, create cross-region read replicas and protect them with replicated CMKs.
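Step 2 above mentions rds.force_ssl; as a hedged sketch, this is one way the setting could be enforced through a cluster parameter group with the AWS CLI. The parameter group name is a placeholder, and the group must already be attached to your Aurora PostgreSQL cluster.

Shell

# Reject non-SSL client connections on Aurora PostgreSQL
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name aurora-pg-secure \
  --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=immediate"

# Confirm the parameter took effect
aws rds describe-db-cluster-parameters \
  --db-cluster-parameter-group-name aurora-pg-secure \
  --query "Parameters[?ParameterName=='rds.force_ssl']"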
AWS Security Framework Cheatsheet

Control | Glue | Redshift | DynamoDB | Aurora
Network isolation | VPC jobs, endpoints | Private subnets, no public endpoint, Enhanced VPC Routing | Gateway VPC endpoint | Private subnets, SG-only ingress
Encryption at rest | KMS on catalog, logs, job I/O | KMS CMK cluster/workgroup | KMS CMK table | KMS CMK cluster
TLS in transit | VPC endpoints | Require SSL | TLS to endpoint (SigV4) | Enforce SSL (rds.force_ssl)
Fine-grained access | LF TBAC, row/cell masking | RLS/CLS + masking policies + late-binding views | IAM + LeadingKeys ABAC | GRANTs + RLS + views/pgcrypto
Secrets & auth | Job role least privilege | SSO/SAML + IAM roles for COPY/UNLOAD | IAM roles, no static keys | Secrets Manager + rotation, optional IAM DB Auth
Audit & detection | Catalog access logs, Glue job logs | User activity log, CloudTrail, QMRs | CloudTrail data events | pgaudit + CloudWatch Logs
Backup/Recovery | ETL is stateless | Snapshots, cross-region as needed | PITR + on-demand backups | Automated backups, PITR, cross-region replica

By grounding security in seven pillars (identity, classification, network, encryption, secrets management, monitoring, and policy as code), organizations gain more than guardrails; they gain a framework for sustainable and secure growth.

Trend Report

Data Engineering

Across the globe, companies aren't just collecting data; they are rethinking how it's stored, accessed, processed, and trusted by both internal and external users and stakeholders. And with the growing adoption of generative and agentic AI tools, there is a renewed focus on data hygiene, security, and observability.

Engineering teams are also under constant pressure to streamline complexity, build scalable pipelines, and ensure that their data is high quality, AI ready, available, auditable, and actionable at every step. This means making a shift from fragmented tooling to more unified, automated tech stacks driven by open-source innovation and real-time capabilities.

In DZone's 2025 Data Engineering Trend Report, we explore how data engineers and adjacent teams are leveling up. Our original research and community-written articles cover topics including evolving data capabilities and modern use cases, data engineering for AI-native architectures, how to scale real-time data systems, and data quality techniques. Whether you're entrenched in CI/CD data workflows, wrangling schema drift, or scaling up real-time analytics, this report connects the dots between strategy, tooling, and velocity in a landscape that is only becoming more intelligent (and more demanding).

Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta

Refcard #216

Java Caching Essentials

By Granville Barnett

More Articles

Enable AWS Budget Notifications With SNS Using AWS CDK

Keeping track of AWS spend is very important. Since it's so easy to create resources, you might forget to turn off an EC2 instance or container you started, or to remove a CDK stack for a specific experiment. Costs can creep up fast if you don't put guardrails in place. Recently, I had to set up budgets across multiple AWS accounts for my team. Along the way, I learned a few gotchas (especially around SNS and KMS policies) that weren't immediately clear to me when I started writing AWS CDK code.

In this post, we'll go through how to:

- Create AWS Budgets with AWS CDK
- Send notifications via email and SNS
- Handle cases like encrypted topics and configuring resource policies

If you're setting up AWS Budgets for the first time, I hope this post will save you some trial and error.

What Are AWS Budgets?

AWS Budgets is part of AWS Billing and Cost Management. It lets you set guardrails for spend and usage limits. You can define a budget around cost, usage, or even commitment plans (like Reserved Instances and Savings Plans) and trigger alerts when you cross a threshold. You can think of Budgets as your planned spend tracker. Budgets are great for:

- Alerting when costs hit predefined thresholds (e.g., 80% of your budgeted spend)
- Driving team accountability by tying alerts to product or account owners
- Enforcing a cap on monthly spend, triggering an action, and shutting down compute (EC2) if you go over budget (be careful with this)

Keep in mind that budgets and their notifications are not instant. AWS billing data is processed multiple times a day, but your budget might trigger a couple of hours after you've passed your threshold. This is clearly stated in the AWS documentation: "AWS billing data, which Budgets uses to monitor resources, is updated at least once per day. Keep in mind that budget information and associated alerts are updated and sent according to this data refresh cadence."

Defining Budgets With AWS CDK

You can create different kinds of budgets, depending on your requirements. Some examples are:

- Fixed budgets: Set one amount to monitor every budget period.
- Planned budgets: Set different amounts to monitor each budget period.
- Auto-adjusting budgets: Set a budget amount that is adjusted automatically based on the spending pattern over a time range that you specify.

We'll start with a simple example of how you can create a budget in the CDK. We'll go for a fixed budget of $100. The AWS CDK currently only has Level 1 constructs available for budgets, which means that the classes in the CDK are a 1-to-1 mapping to the CloudFormation resources. Because of this, you have to explicitly define all required properties (constructs, IAM policies, resource policies, etc.) that could otherwise be taken care of by a CDK L2 construct. It also means your CDK code will be a bit more verbose. We'll start by using the CfnBudget construct.

TypeScript

new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', {
  budget: {
    budgetType: 'COST',
    budgetLimit: { amount: 100, unit: 'USD' },
    budgetName: 'Monthly Costs Budget',
    timeUnit: 'MONTHLY'
  }
});

In the above example, we've created a budget with a limit of $100 per month. A budget alone isn't very useful; you'd still have to check the AWS console manually to see what your spend is compared to your budget. The important thing is that we get notified when we reach our budget, or when our forecasted spend reaches our threshold, so let's add a notification and a subscriber.
TypeScript

new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', {
  budget: {
    budgetType: 'COST',
    budgetLimit: { amount: 100, unit: 'USD' },
    budgetName: 'Monthly Costs Budget',
    timeUnit: 'MONTHLY'
  },
  notificationsWithSubscribers: [{
    notification: {
      comparisonOperator: 'GREATER_THAN',
      notificationType: 'FORECASTED',
      threshold: 100,
      thresholdType: 'PERCENTAGE'
    },
    subscribers: [{
      subscriptionType: 'EMAIL',
      address: '<your-email-address>'
    }]
  }]
});

Based on the notification settings, interested parties are notified when the spend is forecasted to exceed 100% of our defined budget limit. You can put a notification on forecasted or actual percentages. When that happens, an email is sent to the designated email address.

Subscribers, at the time of writing, can be either email recipients or a Simple Notification Service (SNS) topic. In the above code example, we use email subscribers, for which you can add up to 10 recipients. Depending on your team or organization, it might be beneficial to switch to using an SNS topic. The advantage of using an SNS topic over a set of email subscribers is that you can add different kinds of subscribers (email, chat, custom Lambda functions) to your SNS topic. With an SNS topic, you have a single place to configure subscribers, and if you change your mind, you can do so in one place instead of updating all budgets. Using an SNS topic also allows you to push budget notifications to, for instance, a chat client like MS Teams or Slack. In this case, we will make use of SNS in combination with email subscribers. Let's start by defining an SNS topic with the AWS CDK.

TypeScript

// Create a topic for email notifications
let topic = new Topic(this, 'budget-notifications-topic', {
  topicName: 'budget-notifications-topic'
});

Now, let's add an email subscriber, as this is the simplest way to receive budget notifications.

TypeScript

// Add email subscription
topic.addSubscription(new EmailSubscription("your-email-address"));

This looks pretty straightforward, and you might think you're done, but there is one important step to take next, which I initially forgot. The AWS Budgets service needs to be granted permission to publish messages to the topic. To do this, we add a resource policy to the topic that allows the Budgets service to call the SNS:Publish action for our topic.

TypeScript

// Add resource policy to allow the budgets service to publish to the SNS topic
topic.addToResourcePolicy(new PolicyStatement({
  actions: ["SNS:Publish"],
  effect: Effect.ALLOW,
  principals: [new ServicePrincipal("budgets.amazonaws.com")],
  resources: [topic.topicArn],
  conditions: {
    ArnEquals: {
      'aws:SourceArn': `arn:aws:budgets::${Stack.of(this).account}:*`,
    },
    StringEquals: {
      'aws:SourceAccount': Stack.of(this).account,
    },
  },
}))

Now, let's assign the SNS topic as a subscriber in our CDK code.

TypeScript

// Define a fixed budget with SNS as subscriber
new cdk.aws_budgets.CfnBudget(this, 'fixed-monthly-cost-budget', {
  budget: {
    budgetType: 'COST',
    budgetLimit: { amount: 100, unit: 'USD' },
    budgetName: 'Monthly Costs Budget',
    timeUnit: 'MONTHLY'
  },
  notificationsWithSubscribers: [{
    notification: {
      comparisonOperator: 'GREATER_THAN',
      notificationType: 'FORECASTED',
      threshold: 100,
      thresholdType: 'PERCENTAGE'
    },
    subscribers: [{
      subscriptionType: 'SNS',
      address: topic.topicArn
    }]
  }]
});
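After deploying, it can be reassuring to confirm that the Budgets service principal actually made it into the topic's resource policy; a hedged check with the AWS CLI, where the topic ARN is a placeholder for your own:

Shell

# Inspect the topic's resource policy and look for the budgets.amazonaws.com principal
aws sns get-topic-attributes \
  --topic-arn arn:aws:sns:us-east-1:123456789012:budget-notifications-topic \
  --query "Attributes.Policy" \
  --output text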
Working With Encrypted Topics

If you have an SNS topic with encryption enabled (via KMS), you will need to make sure that the corresponding service has access to the KMS key. If you don't, you will not get any messages, and as far as I could tell, you will see no errors (at least I could find none in CloudTrail). I actually wasted a couple of hours trying to figure this part out. I should have read the documentation, which states this explicitly. I guess I should start with the docs instead of diving right into the AWS CDK code.

TypeScript

// Create KMS key used for encryption
let key = new Key(this, 'sns-kms-key', {
  alias: 'sns-kms-key',
  enabled: true,
  description: 'Key used for SNS topic encryption'
});

// Create topic and assign the KMS key
let topic = new Topic(this, 'budget-notifications-topic', {
  topicName: 'budget-notifications-topic',
  masterKey: key
});

Now, let's add the resource policy to the key and trim down the permissions as much as possible.

TypeScript

// Allow access from budgets service
key.addToResourcePolicy(new PolicyStatement({
  effect: Effect.ALLOW,
  actions: ["kms:GenerateDataKey*", "kms:Decrypt"],
  principals: [new ServicePrincipal("budgets.amazonaws.com")],
  resources: ["*"],
  conditions: {
    StringEquals: {
      'aws:SourceAccount': Stack.of(this).account,
    },
    ArnLike: {
      "aws:SourceArn": "arn:aws:budgets::" + Stack.of(this).account + ":*"
    }
  }
}));

Putting It All Together

If you've configured everything correctly and deployed your stack to your target account, you should be good to go. Once you cross your threshold, you should be notified by email that your budget is exceeding one of your thresholds (depending on the threshold set).

Summary

In this post, we explored how to create AWS Budgets with AWS CDK and send notifications through email or SNS. Along the way, we covered some important points:

- Budgets alone aren't useful until you add notifications.
- SNS topics need a resource policy so the Budgets service can publish.
- Encrypted topics require KMS permissions for the Budgets service.

With these pieces in place, you'll have a setup that alerts your team when costs exceed thresholds via email, chat, or custom integrations. A fully working CDK application with the code mentioned in this blog post can be found in the following GitHub repo.

By Jeroen Reijn
*You* Can Shape Trend Reports: Join DZone's Observability Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you choose); readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Observability Research

Observability has grown from basic monitoring into a discipline that shapes how teams plan, build, and maintain software. As teams move beyond tracking uptime alone, it all boils down to knowing which practices, tools, and metrics truly drive performance, reliability, and value. Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Did we mention that anyone who takes the survey will be eligible to enter a raffle to win an e-gift card of their choosing?

We're exploring key topics such as:

- Observability strategy, maturity, and tooling
- Key performance metrics, bottlenecks, and root causes
- Monitoring, logging, tracing, and open standards
- AIOps and AI for anomaly detection and incident response

Join the Observability Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team

By DZone Editorial
Creating a Distributed Computing Cluster for a Data Base Management System: Part 1

The idea of creating a distributed computing cluster (DCC) for database management systems (DBMS) has been with me for quite a long time. Simplified, the DCC software makes it possible to combine many servers into one super server (cluster), evenly balancing all queries between individual servers. To the application running on the DCC, everything appears as if it were running against one server and one database (DB). It is not a set of dispersed databases on distributed servers; it works as one virtual database. All network protocols, replication exchanges, and proxy redirections are concealed inside the DCC. At the same time, all resources of the distributed servers, in particular RAM and CPU time, are utilized evenly and efficiently.

For example, in a cloud data processing center (DPC), it is possible to take one physical super server and divide it into a number of virtual DBMS servers. But the reverse procedure was not possible until now; that is, it has not been possible to take a number of physical servers and merge them into a single virtual DBMS super server. In a certain sense, DCC is a technology that makes it possible to merge physical servers into one virtual DBMS super server. I will take the liberty to make another comparison: DCC is much like coherent non-uniform memory access (NUMA) technology, except that it is used to merge SQL servers. But unlike NUMA, in DCC, software handles the synchronization (coherence) of the data and partly of the RAM, not a controller.

For the sake of clarity, below is a diagram of the well-known connection of a client application to a DBMS server, and the DCC diagram immediately below that. Both diagrams are simplified, just for easy understanding. The idea behind the cluster is a decentralized model. In the figure above there is only one proxy server, but in general there can be more than one. This solution makes it possible to increase DBMS scalability by a substantial margin relative to a typical single-server solution with the most powerful server available. No such solution currently exists, or, at least, no one in my vast professional community is aware of one. After five years of research, I worked out the logical architecture and interaction protocols in detail and, with the assistance of a handful of developers, created a working prototype that is undergoing load tests on a popular 1C 8.x IT system running on the PostgreSQL DBMS. MS SQL or Oracle could also be the DBMS; fundamentally, the choice of DBMS does not affect the ideas I will bring up.

With this article, I am starting a series of articles on DCC, where I will gradually examine individual issues and offer solutions to them. I came up with this structure after speaking at an IT conference, where the topic was found to be quite difficult to understand. The first article will be introductory: I will hit the peaks, skip the valleys (emphasizing non-obvious assertions), and outline what's to come in the following publications.

For Which IT Systems DCC Is Effective

The idea of DCC is to create a special software shell that performs all write requests simultaneously and synchronously on all servers, while read requests are performed on a specific node (server) with user binding. In other words, users are evenly distributed among the servers of the cluster: read requests are executed locally on the server to which the user is bound, and change requests are synchronously executed simultaneously on all servers (no logic violations occur as a result). Therefore, provided that read requests significantly exceed write requests in terms of load, we get a roughly uniform load distribution among DCC servers (nodes).

Let's first review this question: is the statement that the load of read requests far outweighs the load of write requests correct? To answer it, it helps to look back a bit at the history of the SQL language: what the goal was and what eventually came to fruition.

A Quick Dive Into SQL

SQL was originally planned as a language that could be used without programming or math skills. Here's an excerpt from Wikipedia: "Codd used symbolic notation with mathematical notation of operations, but Chamberlin and Boyce wanted to design the language so that it could be used by any user, even those without programming skills or knowledge of math."[5] For now, it can be argued that programming skills for SQL are still needed, but they are definitely minimal. Most programmers have studied some basics of query optimization and have never heard of SQL Tuning by Dan Tow. A lot of the logic for optimizing queries is concealed inside the DBMS. In the past, for example, MS SQL had a limit of 256 table joins; now, in modern IT systems, it is common to have thousands of joins in a query. Dynamic SQL, where a query is constructed dynamically, is used widely and sometimes without much thought. The truth is there is no mathematically accurate model for plotting the optimal execution plan for a complex query. This problem is somewhat similar to the traveling salesman problem, and it is believed to have no exact mathematical solution.

The conclusion is as follows: SQL queries have evolutionarily proven their effectiveness, and almost all reporting is generated with SQL queries, which is not the case with business logic and transactional logic. SQL turned out to be not very convenient for programming complex transactional logic: it does not support object-oriented programming and has very clumsy logic constructs. Therefore, it is safe to say that programming has split into two components: writing a variety of SQL query reports and getting data to a client or application server, and implementing the rest of the logic in the application-oriented language of the application (no matter whether it's a two-tier or three-tier architecture). In terms of load on the DBMS, this looks like hefty SQL constructs for reading and then lots of small ones for changing.

Let us now consider the distribution of read-write query load on the DBMS server over time. First, we need to define what load means and how it can be measured. Load will mean (in order of priority): CPU (processor load), utilized RAM, and load on the disk subsystem. CPU will be the main resource in terms of load. Let's consider an abstract OLTP system and divide all SQL calls from a set of parallel threads into two categories: Read and Write. Next, based on performance monitoring tools, plot an integral value such as CPU on a diagram. If the value is averaged over at least 30 seconds, we see that the value of the "Read" diagram is tens or even hundreds of times higher than the value of the "Write" diagram. This is because more users per unit of time can execute reports or macro transactions that use hefty SQL constructs for reading. Sure, there may be tasks where the system regularly loads data from replication queues and external systems, period-end routine procedures are started, and backup procedures are started. But based on long-term statistics for an overwhelming number of IT systems, the load of SQL constructs on Read exceeds the load on Write by roughly tenfold. Certainly, there may be exceptions, for example, billing systems where the fact of a change is recorded without any complex logic or reporting, but it is easy to check this with special-purpose software and understand how effective DCC will be for the IT system.

Strategic Area of Application

Currently, DCC will be useful and perhaps vital for major companies with extensive information flows and a strong analytical component. For example, major banks may profit. With the help of relatively small servers, it is possible to compose a DCC that will be far ahead of all existing super servers in terms of power. Needless to say, it won't be all pros. The downside is the more complex administration of a distributed system and a definite transaction slowdown. Unfortunately, it is true that the network protocols and logic circuits that DCC utilizes cause transactions to slow down. Currently, the target parameter is a transaction slowdown of no more than 15% in terms of time. But once again I repeat that in this case the system becomes much more scalable, all peak loads are handled without problems, and on average the transaction time will be less than in the case of using a single server. Therefore, if the system faces no problems with peak loads and none are strategically expected, DCC will not be effective.

In the future, after the automation of administrative processes and further optimization, DCC will probably also be effective for medium-sized companies, because it will be possible to build a powerful cluster even using PCs with SSD (fast, unreliable, and cheap) disks. Its distributed structure will make it possible to easily disconnect a failed PC and connect a new one on the fly. DCC's transaction control system will prevent data from being incorrectly recorded. Also, geopolitics cannot be ignored. For example, in case of lack of access to powerful servers, DCC will make it possible to build a powerful cluster using servers produced by domestic manufacturers.

Why Transactional Replication Cannot Be Used for DCC

This section requires a detailed description, and I will cover it in a separate article. Here I will point out only these problems. Many application developers, when using a DBMS, do not even think about what data access conflicts the system resolves within the engine. For example, it is impossible to set up transactional replication, achieve data synchronization across multiple servers, and call it a DBMS cluster. This solution will not resolve the conflict of simultaneous access to a record of the writer-writer type. Such collisions will certainly lead to a violation of the logic of the system's behavior. Existing transactional replication protocols are also costly, and such a system will be very much inferior to the single-server option. In total, transactional replication is not suitable for DCC for the following reasons.

1. Excessive Costs of Typical Synchronous Transaction Replication Protocols

Typical distributed transaction protocols have too many additional, primarily time-related, network costs.
For one network call, up to three additional calls are received. In such a form, the simplest atomic operations degrade dramatically. 2. The Writer-Writer Conflict Is Not Resolved A conflict happens when the same data is changed simultaneously in different transactions. In terms of past change, the system only “remembers” the absolute last change (or history). The point of the SQL construct for sequential application gets lost. Such replication conflicts sometimes have no solutions at all. In a separate article, I will give an example of different replication types for PostgreSQL and Microsoft SQL, and I will explain: Why they cannot solve the transactional load balancing problem architecturallyWhy it is not solved architecturally at the hardware level The writer-writer problem is fundamentally unsolvable without a proxy service at the logical level of analyzing the application's SQL traffic. Exchange Mechanisms (Protocols) A full architectural description of DCC will be provided in a separate article. For now, let's confine ourselves to a brief summary to outline the issue at hand. All queries to the DBMS in DCC go through a proxy service. For example, on 1C systems, it can be installed on the application server. The proxy service recognizes whether the query type is Read or Write. And if Read, it sends it to the server bound to the user (session). If the query type is Change, it sends it to all servers asynchronously. It does not proceed to the next query until it receives a positive response from all servers. If an error occurs, it is propagated to the client application and the transaction is rolled back on all servers. If all servers have confirmed successful execution of the SQL construct, only then does the proxy process the next client SQL query. This is the kind of logic that does not result in logical contradictions. As can be seen, this arrangement incurs additional network and logical costs, although with proper optimization, they are minimal (we seek to achieve no more than 15% of the time delay of transactions). The algorithm described above is the basic protocol, and it is what we will call mirror-parallel. Unfortunately, this protocol is not logically capable of implementing mirrored data replication for all IT systems. In some cases, the data might for sure differ due to the specific nature of the system, another protocol is implemented for this purpose — “centralized asynchronous” — which will resolve synchronous information transfer for sure. The next section will cover it. Why a Centralized Protocol Is Needed in DCC Unfortunately, in some cases, sending the same structure to different servers gets assuredly different results. For example, when inserting a new record into a table, the primary key is generated based on the GUID on the server part. In this case, based on the definition alone, we will for sure get different results on different servers. As an option, it is possible to train the expert system of the proxy service to recognize that this field is formed on the server, form it explicitly on the proxy, and insert it into the query text. What if it is impossible to do so for some reason? To resolve such problems, another protocol and server is introduced. Let's call it Conditionally Central. Next, it will be clear that it is not actually a central server. The protocol algorithm is as follows. The proxy service recognizes that a SQL construct for a change is highly likely to produce different results on different servers. 
Therefore, it immediately redirects the query to the Conditionally Central server. Then after it is executed, using replication triggers, retrieve the changes that the query resulted in and send asynchronously all those changes to the remaining servers. And then proceed to execute the next command. Similar to the mirror-parallel protocol, if at least one of the servers encounters an error, it is redirected to the client and the transaction is rolled back. In this protocol, any collisions are completely prevented, data will always be guaranteed to be synchronous, and there will be almost no distributed deadlocks. But there is an essential downside: Due to its specific nature, the protocol imposes the highest runtime costs. Therefore, it will only be used in exceptional cases, otherwise no target delay parameters of no more than 15% will be even possible. Mechanisms for Ensuring Integrity and Synchrony of Distributed Data at the Transaction Level in DCC As we discussed in the previous section, there are logical (e.g., NEWGUID) SQL operations on change that, when executed simultaneously on different servers, will for sure take different values. Let us rule out all sorts of random functions and fluctuations. Let's assume we have explicit arithmetic procedures, e.g. UPDATE Summary Table SET Total = Total+Delta WHERE ProductID = Y. Certainly, such an arrangement in a single-thread option will lead to the same result and the data will be synchronous, because there are always laws of mathematics. But, if such constructs are executed in multithread mode by varying the Delta value, thread tangling may occur due to violation of the chronology of query execution. Which will lead to either deadlocks or data synchronization violations. In fact, it may turn out that the results of transactions on different servers may differ. Sure, it will be a rare occurrence, and it can be reduced by certain actions, but it cannot be completely ruled out, as well as it cannot be completely resolved without significant performance degradation. Such algorithms do not exist as a matter of fact, just as there is no such thing as fully synchronous time for multiple servers or network queries that are executed for sure for a certain amount of time. Therefore, DCC has a distributed transaction management service and, in particular, a transaction hash-sum check is mandatory. Why hash-sum? Because it is possible to quickly check the content of these changes on all servers. If everything matches, the transaction is confirmed, and if not, it is rolled back with a corresponding error. More details will follow in a separate article. In terms of mathematics, there are some interesting similarities with quantum mechanics, in particular with the transactional-loop theory (there is such a marginal theory). The Issue of Distributed Deadlocks in DCC This problem is that one of the key problems in DCC and in terms of risks of DCC implementation is the most dangerous. This is due to the fact that the occurrence of distributed deadlocks in DCC is a consequence of thread confusion due to the change in the chronology of SQL queries execution on different servers. This situation occurs due to uneven load on servers and network interfaces. In this case, unlike local deadlocks, which require at least two locking objects to occur, there can be only one object in a distributed deadlock. 
To reduce distributed deadlocks, several process challenges need to be addressed, one of them being the allocation of different physical network interfaces for writing and reading. After all, if we consider the ratio of CPU operations like Read to Write, there will be a ratio of one order, but for network traffic, the ratio will start from two orders of magnitude, more than hundreds of times. Therefore, by splitting these operations (Read-Write) on physically different channels of network communications, we can guarantee a certain time of delivery of Write-type SQL queries to all servers. Also, the fewer locks there are in the system, the less likely distributed deadlocks are in data. However, using DCCs as an additional benefit, it is possible to expand such bottlenecks, if any, in the system at the level of settings. If distributed deadlocks still occasionally occur, there is a special DCC service that monitors all blocking processes on all servers and resolves them by rolling back one of the transactions. More details will follow in a separate article. Special Features of DCC Administration Administration of a distributed system is certainly more complicated than that of a local system, especially with the requirements of operation 24/7. And all potential DCC users are just the proud owners of IT systems with 24/7 operation mode. Immediate problems include distributed database recovery and hot plugging of a new server to the DCC. Prompt data reconciliation in distributed databases is also necessary, despite transaction reconciliation mechanisms. Performance monitoring tasks and, in particular, the aggregation of counter data across related transactions and cluster servers in general begin to emerge. There are some security issues with setting up a proxy service. A full list of problems and proposed solutions will be in a separate article. Parallel Computing Technologies as a Solution to Increase the Efficiency of DCC Use For a scalable IT system, high parallelism of processes within the database is essential. For parallel reporting, as a rule, this issue does not occur. For transactional workloads, due to historical vestiges of suboptimal architecture, locks at the level of changing the same records (writer-writer conflict) are possible. If the IT system can be changed and there is open-source code, then the system can be optimized. And if the system is closed, what shall we do in this case? In case of using DCC, there are opportunities at the level of administration to circumvent such restrictions. Or at least expand the possibilities. In particular, through customizations, we can enable changing the same record without waiting for the transaction to be committed — if a dirty read is possible, of course. At the same time, if the transaction is rolled back, the change data in the chronological sequence are also rolled back. This situation is exactly appropriate, for example, for tables with aggregation of totals. I already have solutions for this problem, and I believe that regardless of using DCC, it is necessary to expand administrative settings of the DBMS, both Postgres and MSSQL (haven't investigated the issue on Oracle). More details will follow in a separate article. It is also necessary to disclose the topic of dirty reading in DCC and possible minor improvements taking this into account, such as the introduction of virtual locks. Plan for the Following Publications on the Topic of DCC Article 2. DCC load-testing results Article 3. 
Why transactional replication can't be used for DCC Article 4. Brief architectural description of the DCC Article 5. The purpose of a centralized protocol in DCC Article 6. Mechanisms for ensuring integrity and synchronization of distributed data at the transaction level in DCC Article 7. The problem of distributed deadlocks in DCC Article 8. Special features of DCC administration Article 9. Parallel computing technologies as a tool to increase the efficiency of DCC utilization Article 10. Example of DCC integration with 1C 8.x
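The transaction hash-sum check described above can be illustrated with a short, purely schematic Python sketch. This is not DCC's actual implementation: the change-set format, the canonical JSON serialization, and the SHA-256 digest are all assumptions chosen for the example.
Python
import hashlib
import json

def change_set_hash(changes):
    """Hash a transaction's change set in a canonical, order-independent form.

    `changes` is a list of (table, primary_key, column, new_value) tuples;
    this format is an assumption for the sketch, not DCC's wire format.
    """
    canonical = json.dumps(sorted(changes), sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def confirm_transaction(per_server_changes):
    """Confirm the distributed transaction only if every server produced an identical change set."""
    digests = {server: change_set_hash(ch) for server, ch in per_server_changes.items()}
    decision = "COMMIT" if len(set(digests.values())) == 1 else "ROLLBACK"
    return decision, digests

# Example: server B applied a different Delta, so the hashes diverge and the
# distributed transaction must be rolled back.
changes_a = [("SummaryTable", "Y", "Total", 110)]
changes_b = [("SummaryTable", "Y", "Total", 120)]
print(confirm_transaction({"A": changes_a, "B": changes_b}))  # -> ('ROLLBACK', {...})
In DCC itself this comparison runs inside the distributed transaction management service; the sketch only shows why a hash sum is enough to detect divergent results cheaply.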

By Vladimir Serdyuk
Mastering Fluent Bit: Top 3 Telemetry Pipeline Input Plugins for Developers (Part 6)
Mastering Fluent Bit: Top 3 Telemetry Pipeline Input Plugins for Developers (Part 6)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit tips for developers. In case you missed the previous article, check out the developer's guide to service section configuration, where you get tips on making the most of your developer inner loop with Fluent Bit. This article will be a hands-on tour of the things that help you as a developer testing out your Fluent Bit pipelines. We'll take a look at the input plugin section of a developer's telemetry pipeline configuration. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Where to Get Started You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below: Shell # For source installation. $ fluent-bit -i dummy -o stdout # For container installation. $ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout ... [0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}] ... Let's look at a few tips and tricks to help you with your local development testing with regard to Fluent Bit input plugins. Pipeline Input Tricks See the previous article for details about the service section of the configurations used in the rest of this article, but for now, we plan to focus on our Fluent Bit pipeline and specifically the input plugins that can be of great help in our developer inner loop during testing. 1. Dummy Input Plugin When trying to narrow down a problem in our telemetry pipeline filtering or parsing actions, a developer's best friend is the ability to reduce the environment down to a controllable log input to process. Once the filter or parsing action has been validated to work, we can then proceed in our development environment with further testing of our project. The best pipeline input plugin helping us when trying to sort out that complex, pesky regular expression or Lua script in a filter is the dummy input plugin. It's also a quick and easy plugin to use for exploring a chain of filters that you want to verify before setting loose on your organization's telemetry data. We see below the ability to generate any sort of telemetry data using the key field dummy and tagging it with the key field tag. 
It can be as simple or complex as we need, including pasting in a copy of our production telemetry data if needed to complete our testing. Note that this is a simple example configuration and contains a simple modify filter to illustrate the input plugin usage. YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: # This entry generates a successful message. - name: dummy tag: event.success dummy: '{"message":"true 200 success"}' # This entry generates a failure message. - name: dummy tag: event.error dummy: '{"message":"false 500 error"}' filters: # Example testing filter to modify events. - name: modify match: '*' condition: - Key_Value_Equals message 'true 200 success' remove: message add: - valid_message true - code 200 - type success outputs: - name: stdout match: '*' format: json_lines Running the above configuration file allows us to tinker with the filter until we are satisfied that it works before we install it on our developer testing environments. For completeness, we run this configuration to see the output as follows: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.0.8 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":1756813283.411961,"valid_message":"true","code":"200","type":"success"} {"date":1756813283.413117,"message":"false 500 error"} {"date":1756813284.410205,"valid_message":"true","code":"200","type":"success"} {"date":1756813284.41048,"message":"false 500 error"} {"date":1756813285.410716,"valid_message":"true","code":"200","type":"success"} {"date":1756813285.410987,"message":"false 500 error"} ... As you can see, for developers, the options here are endless to quickly verify or tune your telemetry pipelines in a testing environment. Let's look at another handy plugin for developers, the tail input plugin. 2. Tail Input Plugin Much of our development work and testing takes place in cloud native environments; therefore, this means we are dealing with dynamic logging and streams of log telemetry data. In UNIX-based operating systems, there is a very handy tool for looking at large files, specifically connecting to expanding files while displaying the incoming data. This tool is called tail, and with a '-f' flag added, it provides a file name to attach to while dumping all incoming data to the standard output. The tail input plugin has been developed with that same idea in mind. In a cloud native environment where Kubernetes is creating dynamic pods and containers, the log telemetry data is collected in a specific location. By using wildcard path names, you can ensure that you are connecting and sending incoming telemetry data to your Fluent Bit pipeline. YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: tail tag: kube.* read_from_head: true path: /var/log/containers/*.log multiline.parser: docker, cri outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp This was detailed in a previous article from this series, Controlling Logs with Fluent Bit on Kubernetes, where logs were collected and sent to Fluent Bit. 
Instead of rehashing the details of this article, we will reiterate a few of the important tips for developers here: When attaching to a location, such as the log collection path for a Kubernetes cluster, the tail plugin only collects telemetry data from that moment forward. It's missing previously logged telemetry.Using this adjustment to the tail input plugin configuration ensures we see the entire log telemetry data from your testing application: read_from_head: trueNarrowing the log telemetry data down from all running containers on a cluster is shown below, the first item being all container logs, the second being your specific app only. This tail input plugin configuration modification is very handy when testing our applications: path: /var/log/containers/*.logpath: /var/log/containers/*APP_BEING_TESTED*Finally, we need to ensure we are making use of filters in our pipeline configuration to reduce the telemetry noise when testing applications. This helps us to narrow the focus to the debugging telemetry data that is relevant. Another plugin worth mentioning in the same breath is the head input plugin. This works the same as the UNIX command head, where we are giving it a number of lines to read out of the top of a file (the first N number of lines). Below is an example configuration: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: head tag: head.kube.* path: /var/log/containers/*.log lines: 30 multiline.parser: docker, cri outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp With these two input plugins, we now have the flexibility to collect the telemetry data from large and dynamic sets of files. Our final plugin for developers, covered in the next section, gives us the ability to run almost anything while capturing its telemetry data. 3. Exec Input Plugin The final input plugin to be mentioned in our top three listing is the exec input plugin. This powerful plugin gives us the ability to execute any command and process the telemetry data output into our pipeline. Below is a simple example of the configuration executing a shell command, and that command is the input for our Fluent Bit pipeline to process: YAML service: flush: 1 log_level: info hot_reload: on pipeline: inputs: - name: exec tag: exec_demo command: 'for s in $(seq 1 10); do echo "The count is: $s"; done;' oneshot: true exit_after_oneshot: true outputs: - name: stdout match: '*' Now let's run this and see the output as follows: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.0.8 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... 
[0] exec_demo: [[1757023157.090932000, {}], {"exec"=>"The count is: 1"}] [1] exec_demo: [[1757023157.090968000, {}], {"exec"=>"The count is: 2"}] [2] exec_demo: [[1757023157.090974000, {}], {"exec"=>"The count is: 3"}] [3] exec_demo: [[1757023157.090978000, {}], {"exec"=>"The count is: 4"}] [4] exec_demo: [[1757023157.090982000, {}], {"exec"=>"The count is: 5"}] [5] exec_demo: [[1757023157.090986000, {}], {"exec"=>"The count is: 6"}] [6] exec_demo: [[1757023157.090990000, {}], {"exec"=>"The count is: 7"}] [7] exec_demo: [[1757023157.090993000, {}], {"exec"=>"The count is: 8"}] [8] exec_demo: [[1757023157.090997000, {}], {"exec"=>"The count is: 9"}] [9] exec_demo: [[1757023157.091001000, {}], {"exec"=>"The count is: 10"}] [2025/09/04 14:59:18] [ info] [engine] service has stopped (0 pending tasks) [2025/09/04 14:59:18] [ info] [output:stdout:stdout.0] thread worker #0 stopping... [2025/09/04 14:59:18] [ info] [output:stdout:stdout.0] thread worker #0 stopped ... Another developer example: while testing our Java service metrics instrumentation, we'd like to capture the exposed metrics as input to our telemetry pipeline to verify in one location that all is well. With the Java service running, the following exec input plugin configuration will do just that: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: exec tag: exec_metrics_demo command: 'curl http://localhost:7777/metrics' oneshot: true exit_after_oneshot: true propagate_exit_code: true outputs: - name: stdout match: '*' When we run this configuration, we see that the metrics exposed at that URL are dumped just once (for this example, but you can remove the oneshot part of the configuration once verified to work) to the telemetry pipeline for processing: Shell # For source installation.
$ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.0.8 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb [0] exec_demo: [[1757023682.745158000, {}], {"exec"=>"# HELP java_app_c_total example counter"}] [1] exec_demo: [[1757023682.745209000, {}], {"exec"=>"# TYPE java_app_c_total counter"}] [2] exec_demo: [[1757023682.745215000, {}], {"exec"=>"java_app_c_total{status="error"} 29.0"}] [3] exec_demo: [[1757023682.745219000, {}], {"exec"=>"java_app_c_total{status="ok"} 58.0"}] [4] exec_demo: [[1757023682.745223000, {}], {"exec"=>"# HELP java_app_g_seconds is a gauge metric"}] [5] exec_demo: [[1757023682.745227000, {}], {"exec"=>"# TYPE java_app_g_seconds gauge"}] [6] exec_demo: [[1757023682.745230000, {}], {"exec"=>"java_app_g_seconds{value="value"} -0.04967858477018794"}] [7] exec_demo: [[1757023682.745234000, {}], {"exec"=>"# HELP java_app_h_seconds is a histogram metric"}] [8] exec_demo: [[1757023682.745238000, {}], {"exec"=>"# TYPE java_app_h_seconds histogram"}] [9] exec_demo: [[1757023682.745242000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.005"} 0"}] [10] exec_demo: [[1757023682.745246000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.01"} 1"}] [11] exec_demo: [[1757023682.745250000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.025"} 1"}] [12] exec_demo: [[1757023682.745253000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.05"} 1"}] [13] exec_demo: [[1757023682.745257000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.1"} 1"}] [14] exec_demo: [[1757023682.745261000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.25"} 1"}] [15] exec_demo: [[1757023682.745265000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="0.5"} 1"}] [16] exec_demo: [[1757023682.745269000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="1.0"} 1"}] [17] exec_demo: [[1757023682.745273000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="2.5"} 3"}] [18] exec_demo: [[1757023682.745277000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="5.0"} 5"}] [19] exec_demo: [[1757023682.745280000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="10.0"} 10"}] [20] exec_demo: [[1757023682.745284000, {}], {"exec"=>"java_app_h_seconds_bucket{method="GET",path="/",status_code="200",le="+Inf"} 29"}] [21] exec_demo: [[1757023682.745288000, {}], {"exec"=>"java_app_h_seconds_count{method="GET",path="/",status_code="200"} 29"}] [22] exec_demo: [[1757023682.745292000, {}], {"exec"=>"java_app_h_seconds_sum{method="GET",path="/",status_code="200"} 407.937109325"}] [23] exec_demo: [[1757023682.745296000, {}], {"exec"=>"# HELP java_app_s_seconds is summary metric (request latency in seconds)"}] [24] exec_demo: [[1757023682.745299000, {}], {"exec"=>"# TYPE java_app_s_seconds summary"}] [25] exec_demo: [[1757023682.745303000, {}], {"exec"=>"java_app_s_seconds{status="ok",quantile="0.5"} 2.3566848312406656"}] [26] exec_demo: [[1757023682.745307000, {}], 
{"exec"=>"java_app_s_seconds{status="ok",quantile="0.95"} 4.522227972308204"}] [27] exec_demo: [[1757023682.745311000, {}], {"exec"=>"java_app_s_seconds{status="ok",quantile="0.99"} 4.781636377835897"}] [28] exec_demo: [[1757023682.745315000, {}], {"exec"=>"java_app_s_seconds_count{status="ok"} 29"}] [29] exec_demo: [[1757023682.745318000, {}], {"exec"=>"java_app_s_seconds_sum{status="ok"} 70.511430859743"}] ... This has ingested our external Java service metrics instrumentation into our telemetry pipeline for automation and processing during our development and testing. This wraps up a few handy tips and tricks for developers getting started with Fluent Bit input plugins. The ability to set up and leverage these top input plugins is a big help in speeding up your inner development loop experience. More in the Series In this article, you learned a few handy tricks for using the Fluent Bit service section in the configuration to improve the inner developer loop experience. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring some of the more interesting Fluent Bit output plugins for developers.

By Eric D. Schabell DZone Core CORE
The AI FOMO Paradox
The AI FOMO Paradox

TL;DR: AI FOMO — A Paradox AI FOMO comes from seeing everyone's polished AI achievements while you see all your own experiments, failures, and confusion. The constant drumbeat of AI breakthroughs triggers legitimate anxiety for Scrum Masters, Product Owners, Business Analysts, and Product Managers: "Am I falling behind? Will my role be diminished?" But here's the truth: You are not late. Most teams are still in their early stages and uneven. There are no "AI experts" in agile yet — only pioneers and experimenters treating AI as a drafting partner that accelerates exploration while they keep judgment, ethics, and accountability. Disclaimer: I used a Deep Research report by Gemini 2.5 Pro to research sources for this article. The Reality Behind the AI Success Stories Signals are distorted: Leaders declare AI-first while data hygiene lags. Shadow AI usage inflates progress without creating stable practices. Generative AI has officially entered what Gartner calls the "Trough of Disillusionment" in 2024-2025. MIT Sloan's research reveals only 5% of business AI initiatives generate meaningful value (Note: The MIT Sloan report needs to be handled with care due to its design.) Companies spend an average of $1.9 million on generative AI initiatives. Yet, less than 30% of AI leaders report CEO satisfaction. Meanwhile, individual workers report saving 2.2–2.5 hours weekly — quiet, durable gains beneath the noise generated by the AI hype. The "AI Shame" phenomenon proves the dysfunction: 62% of Gen Z workers hide their AI usage, 55% pretend to understand tools they don't, and only a small fraction receive adequate guidance. This isn't progress; it's organizational theater. Good-Enough Agile Is Ending AI is not replacing Agile. It's replacing the parts that never created differentiated value. "Good-Enough Agile," where teams go through Scrum events without understanding the principles, is being exposed. Ritualized status work, generic Product Backlog clerking, and meeting transcription: all are becoming cheaper, better, and more plentiful. Research confirms AI as a "cybernetic teammate" amplifying genuine agile principles. The Agile Manifesto's first value, "Individuals and interactions over processes and tools," becomes clearer. AI is the tool. Your judgment remains irreplaceable. The AI for Agile anti-patterns revealing shallow practice include: Tool tourism: Constant switching that hides weak positioning. Hero prompts: One person becomes the AI bottleneck instead of distributing knowledge. Vanity dashboards: Counting prompts instead of tracking outcome-linked metrics. Automation overreach: Brittle auto-actions that save seconds but cost days. These patterns expose teams practicing cargo cult Agile. Career insecurity triggers documented fears of exclusion, but the real threat isn't being excluded from AI knowledge. It's being exposed as having practiced shallow Agile all along. (Throwing AI at a failed approach to "Agile" won't remedy the main issue.) The Blunt Litmus Test If you can turn messy inputs into falsifiable hypotheses, define the smallest decisive test, and defend an ethical error budget, AI gives you lift. If you cannot, AI will do your visible tasks faster while exposing the absence of added value and your diminished utility from an organization's point of view. Your expertise moves upstream to framing questions and downstream to evaluating evidence. AI handles low-leverage generation; you decide what matters, what's safe, and what ships.
Practical Leverage Points There are plenty of beneficial approaches to employing AI for Agile, for example: Product Teams: Convert qualitative inputs into competing hypotheses. AI processes customer transcripts in minutes, but you determine which insights align with the product goal. Then, validate or falsify hypotheses with AI-built prototypes faster than ever before. Scrum Masters: Auto-compile WIP ages, handoffs, interrupting flow, and PR latency to move Retrospectives from opinions to evidence. AI surfaces patterns; you guide systemic improvements. Seriously, talking to management becomes significantly easier once you transition from “we feel that…” to “we have data on…” Developers: Generate option sketches, then design discriminating experiments. PepsiCo ran thousands of virtual trials; Wayfair evolved its tool through rapid feedback — AI accelerating empirical discovery. Stanford and World Bank research shows a 60% time reduction on cognitive tasks. But time saved means nothing without judgment about which tasks matter. Building useless things more efficiently won’t prove your value as an agile practitioner to the organization when a serious voice questions your effectiveness. Conclusion: From Anxiety to Outcome Literacy The path forward isn’t frantically learning every tool. Start with one recurring problem. Form a hypothesis. Run a small experiment. Inspect results. Adapt. This is AI for Agile applied to your development. The value for the organization shifts from execution to strategic orchestration. Your experience building self-managing teams becomes more valuable as AI exposes the difference between genuine practice and cargo cult Agile. Durable wins come from workflow redesign and sharper questions, not model tricks. If you can frame decisions crisply, choose discriminating tests, and hold ethical lines, you’re ahead where it counts. AI FOMO recedes when you trade comparison for learning velocity. Choose an outcome that matters, add one AI-assisted step that reduces uncertainty, measure honestly, and keep what proves worth. AI won’t replace Agile; it will replace Good-Enough Agile, and outcome-literate practitioners will enjoy a massive compound advantage. It helps if you know what you are doing for what purpose. Food for Thought on AI FOMO How might recognizing AI as exposing “Good-Enough Agile” rather than threatening genuine practice change your approach to both AI adoption and agile coaching in organizations that have been going through the motions?Given that AI makes shallow practice obvious by automating ritual work, what specific anti-patterns in your organization would become immediately visible, and how would you address the human dynamics of that exposure?If the differentiator is “boring excellence”—clean operational data, evaluation harnesses, and reproducible workflows—rather than AI tricks, what foundational practices need strengthening in your context before AI can actually accelerate value delivery? Sources Gartner AI Hype Cycle ReportAI FOMO, Shadow AI, and Other Business ProblemsAI Hype Cycle – Gartner Charts the Rise of Agents, HPCwireImpact of Generative AI on Work Productivity, Federal Reserve Bank of St. 
LouisAI Shame Grips the Present Generation, Times of IndiaGenerative AI & Agile: A Strategic Career Decision, Scrum.orgFear of Missing Out at Work, Frontiers in Organizational Psychology (link downloads as an EPUB)Artificial Intelligence in Agile, SprightbulbAI & Agile Product Teams, Scrum.orgProductivity Gains from Using AI, Visual CapitalistHuman + AI: Rethinking the Roles and Skills of Knowledge Workers, AI Accelerator Institute

By Stefan Wolpers DZone Core CORE
Beyond Retrieval: How Knowledge Graphs Supercharge RAG
Beyond Retrieval: How Knowledge Graphs Supercharge RAG

Retrieval-augmented generation (RAG) enhances the factual accuracy and contextual relevance of large language models (LLMs) by connecting them to external data sources. RAG systems use semantic similarity to identify text relevant to a user query. However, they often fail to explain how the query and retrieved pieces of information are related, which limits their reasoning capability. Graph RAG addresses this limitation by leveraging knowledge graphs, which represent entities (nodes) and their relationships (edges) in a structured, machine-readable format. This framework enables AI systems to link related facts and draw coherent, explainable conclusions, moving closer to human-like reasoning (Hogan et al., 2021). In this article and the accompanying tutorial, we explore how Graph RAG compares to traditional RAG by answering questions drawn from A Study in Scarlet, the first novel in the Sherlock Holmes series, demonstrating how structured knowledge supports more accurate, nuanced, and explainable insights. Constructing Knowledge Graphs: From Expert Ontologies to Large-Scale Extraction The first step in implementing graph RAG is building a knowledge graph. Traditionally, domain experts manually define ontologies and map entities and relationships, producing high-quality structures. However, this approach does not scale well for large volumes of text. To handle bigger datasets, LLMs and NLP techniques automatically extract entities and relationships from unstructured content. In the tutorial accompanying this article, we demonstrate this process by breaking it down into three practical stages: 1. Entity and Relationship Extraction The first step in building the graph is to extract entities and relationships from raw text. In our tutorial, this is done using Cohere’s command-a model, guided by a structured JSON schema and a carefully engineered prompt. This approach enables the LLM to systematically identify and extract the relevant components. For example, the sentence: “I was dispatched, accordingly, in the troopship ‘Orontes,’ and landed a month later on Portsmouth jetty.” can be converted into the following graph triple: (Watson) – [travelled_on] → (Orontes) Instead of processing the entire novel in one pass, we split it into chunks of about 4,000 characters. Each chunk is processed with two pieces of global context: Current entity list: all entities discovered so far.Current relation types: the set of relations already in use. This approach prevents the model from creating duplicate nodes and keeps relation labels consistent across the book. An incremental merging step then assigns stable IDs to new entities and reuses them whenever known entities reappear. For instance, once the relation type meets is established, later chunks will consistently reuse it rather than inventing variants such as encounters or introduced_to. This keeps the graph coherent as it grows across chunks. 2. Entity Resolution Even with a global context, the same entity often appears in different forms; a character, location, or organization may be mentioned in multiple ways. For example, “Stamford” might also appear as “young Stamford”, while “Dr Watson” is later shortened to “Watson.” To handle this, we apply a dedicated entity resolution step after extraction. 
This process: Identify and merge duplicate nodes based on fuzzy string similarity (e.g., Watson ≈ Dr Watson). Expand and unify alias lists so that each canonical entity maintains a complete record of its variants. Normalize relationships so that all edges consistently point to the resolved nodes. This ensures that queries such as "Who introduced Watson to Holmes?" always return the correct canonical entity, regardless of which alias appears in the text. 3. Graph Population and Visualization With resolved entities and cleaned relationships, the data is stored in a graph structure. Nodes can be enriched with further attributes such as type (Person, Location, Object), aliases, and optionally traits like profession or role. Edges are explicitly typed (e.g., meets, travels_to, lives_at), enabling structured queries and traversal. Together, these stages transform unstructured narrative text into a queryable, explainable Knowledge Graph, laying the foundation for Graph RAG to deliver richer insights than traditional retrieval methods. Once the Knowledge Graph is constructed, it is stored in a graph database designed to handle interconnected data efficiently. Databases like Neo4j allow nodes and relationships to be queried and traversed intuitively using declarative languages such as Cypher. Graph RAG vs. Traditional RAG The main advantage of Graph RAG over traditional RAG lies in its ability to leverage structured relationships between entities. Traditional RAG retrieves semantically similar text fragments but struggles to combine information from multiple sources, making it difficult to answer questions that require multi-step reasoning. In the tutorial, for example, we can ask: "Who helped Watson after the battle of Maiwand, and where did this occur?" Graph RAG answers this by traversing the subgraph: Dr Watson → [HELPED_BY] → Murray → [LOCATED_AT] → Peshawar. Traditional RAG would only retrieve sentences mentioning Watson and Murray without connecting the location or providing a reasoning chain. This shows that graph RAG produces richer, more accurate, and explainable answers by combining entity connections, attributes, and relationships captured in the knowledge graph. A key benefit of graph RAG is transparency. Knowledge graphs provide explicit reasoning chains rather than opaque, black-box outputs. Each node and edge is traceable to its source, and attributes such as timestamps, provenance, and confidence scores can be included. This level of explainability is particularly important in high-stakes domains like healthcare or finance, but it also enhances educational value in literary analysis, allowing students to follow narrative reasoning in a structured, visual way. Enhancing Retrieval With Graph Embeddings Graph data can be enriched through graph embeddings, which transform nodes and their relationships into a vector space. This representation captures both semantic meaning and structural context, making it possible to identify similarities beyond surface-level text. Embedding algorithms such as FastRP and Node2Vec enable the retrieval of relationally similar nodes, even when their textual descriptions differ. By integrating LLMs with knowledge graphs, graph RAG transforms retrieval from a simple text lookup into a structured reasoning engine. It enables AI systems to link facts, answer complex questions, and provide explainable, verifiable insights.
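To make the entity resolution and multi-hop traversal described above more concrete, here is a minimal, self-contained Python sketch. It is not the tutorial's actual code: difflib's SequenceMatcher stands in for whatever fuzzy matcher is used, the 0.7 similarity threshold is an arbitrary assumption, and the tiny in-memory edge list simply reuses the HELPED_BY and LOCATED_AT relations from the Maiwand example rather than querying a real Neo4j instance.
Python
from difflib import SequenceMatcher

# --- Entity resolution: merge alias mentions into canonical entities (illustrative only) ---
def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Fuzzy string similarity; difflib stands in for the tutorial's matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def resolve_entities(mentions):
    """Group raw mentions (e.g., 'Watson', 'Dr Watson') under one canonical name."""
    canonical = {}  # canonical name -> set of aliases
    for mention in mentions:
        match = next(
            (c for c in canonical
             if similar(mention, c) or any(similar(mention, a) for a in canonical[c])),
            None,
        )
        if match:
            canonical[match].add(mention)
        else:
            canonical[mention] = {mention}
    return canonical

# --- Multi-hop traversal over a tiny in-memory graph (edges taken from the example above) ---
edges = [
    ("Dr Watson", "HELPED_BY", "Murray"),
    ("Murray", "LOCATED_AT", "Peshawar"),
]

def one_hop(node, relation):
    """Follow a single typed edge from `node`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

if __name__ == "__main__":
    print(resolve_entities(["Watson", "Dr Watson", "Stamford", "young Stamford", "Murray"]))
    helper = one_hop("Dr Watson", "HELPED_BY")[0]   # -> Murray
    place = one_hop(helper, "LOCATED_AT")[0]        # -> Peshawar
    print(f"{helper} helped Watson; this occurred at {place}")
A production version would express the same two-hop pattern as a Cypher MATCH query against the populated Neo4j graph rather than scanning an in-memory list.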
Graph RAG represents a step toward AI systems that are not only larger and more capable but also smarter, more transparent, and more trustworthy, capable of structured reasoning over rich information sources. References Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., de Melo, G., Gutierrez, C., Gayo, J.E.L., Kirrane, S., Neumaier, S., Polleres, A., Navigli, R., Ngomo, A.C.N., Rashid, S.M., Rula, A., Schmelzeisen, L., Sequeda, J., Staab, S., and Zimmermann, A. (2021) ‘Knowledge Graphs’, ACM Computing Surveys, 54(4), pp. 1–37.Larson, J. and Truitt, A. (2024) ‘Beyond RAG: Graph-based retrieval for multi-step reasoning’, Microsoft Research Blog. Available at: https://www.microsoft.com/en-us/research/blog/beyond-rag-graph-based-retrieval-for-multi-step-reasoning/ (Accessed: 20 July 2025).Neo4j (2024) ‘Graph data science documentation’, Neo4j. Available at: https://neo4j.com/docs/graph-data-science/current/ (Accessed: 20 July 2025).

By Salman Khan DZone Core CORE
How TBMQ Uses Redis for Persistent Message Storage
How TBMQ Uses Redis for Persistent Message Storage

TBMQ was primarily designed to aggregate data from IoT devices and reliably deliver it to backend applications. Applications subscribe to data from tens or even hundreds of thousands of devices and require reliable message delivery. Additionally, applications often experience periods of downtime due to system maintenance, upgrades, failover scenarios, or temporary network disruptions. IoT devices typically publish data frequently but subscribe to relatively few topics or updates. To address these differences, TBMQ classifies MQTT clients as either application clients or standard IoT devices. Application clients are always persistent and rely on Kafka for session persistence and message delivery. In contrast, standard IoT devices — referred to as DEVICE clients in TBMQ — can be configured as persistent depending on the use case. This article provides a technical overview of how Redis is used within TBMQ to manage persistent MQTT sessions for DEVICE clients. The goal is to provide practical insights for software engineers looking to offload database workloads to persistent caching layers like Redis, ultimately improving the scalability and performance of their systems. Why Redis? In TBMQ 1.x, DEVICE clients relied on PostgreSQL for message persistence and retrieval, ensuring that messages were delivered when a client reconnected. While PostgreSQL performed well initially, it had a fundamental limitation — it could only scale vertically. We anticipated that as the number of persistent MQTT sessions grew, PostgreSQL’s architecture would eventually become a bottleneck. For a deeper look at how PostgreSQL was used and the architectural limitations we encountered, see our blog post. To address this, we evaluated alternatives that could scale more effectively with increasing load. Redis was quickly chosen as the best fit due to its horizontal scalability, native clustering support, and widespread adoption. Migration to Redis With these benefits in mind, we started our migration process with an evaluation of data structures that could preserve the functionality of the PostgreSQL approach while aligning with Redis Cluster constraints to enable efficient horizontal scaling. Redis Cluster Constraints While working on a migration, we recognized that replicating the existing data model would require multiple Redis data structures to efficiently handle message persistence and ordering. This, in turn, meant using multiple keys for each persistent MQTT client session. Redis Cluster distributes data across multiple slots to enable horizontal scaling. However, multi-key operations must access keys within the same slot. If the keys reside in different slots, the operation triggers a cross-slot error, preventing the command from executing. We used the persistent MQTT client ID as a hash tag in our key names to address this. By enclosing the client ID in curly braces {}, Redis ensures that all keys for the same client are hashed to the same slot. This guarantees that related data for each client stays together, allowing multi-key operations to proceed without errors. Atomic Operations via Lua Scripts Consistency is critical in high-throughput environments where many messages may arrive concurrently for the same MQTT client. Hashtagging helps to avoid cross-slot errors, but without atomic operations, there is a risk of race conditions or partial updates. This could lead to message loss or incorrect ordering. It is important to make sure that operations updating the keys for the same MQTT client are atomic. 
Redis is designed to execute individual commands atomically. However, in our case, we need to update multiple data structures as part of a single operation for each MQTT client. Executing these sequentially without atomicity opens the door to inconsistencies if another process modifies the same data in between commands. That’s where Lua scripting comes in. Lua script executes as a single, isolated unit. During script execution, no other commands can run concurrently, ensuring that the operations inside the script happen atomically. Based on this information, we decided that for any operation, such as saving messages or retrieving undelivered messages upon reconnection, we will execute a separate Lua script. This ensures that all operations within a single Lua script reside in the same hash slot, maintaining atomicity and consistency. Choosing the Right Redis Data Structures One of the key requirements for persistent session handling in an MQTT broker is maintaining message order across client reconnects. After evaluating various Redis data structures, we found that sorted sets (ZSETs) provided an efficient solution to this requirement. Redis sorted sets naturally organize data by score, enabling quick retrieval of messages in ascending or descending order. While sorted sets provided an efficient way to maintain message order, storing full message payloads directly in sorted sets led to excessive memory usage. Redis does not support per-member TTL within sorted sets. As a result, messages persisted indefinitely unless explicitly removed. Similar to PostgreSQL, we had to perform periodic cleanups using ZREMRANGEBYSCORE to delete expired messages. This operation carries a complexity of O(log N + M), where M is the number of elements removed. To overcome this limitation, we decided to store message payloads using strings data structure while storing in the sorted set references to these keys. client_id is a placeholder for the actual client ID, while the curly braces {} around it are added to create a hash tag. In the image above, you can see that the score continues to grow even when the MQTT packet ID wraps around. Let’s take a closer look at the details illustrated in this image. At first, the reference for the message with the MQTT packet ID equal to 65534 was added to the sorted set: Shell ZADD {client_id}_messages 65534 {client_id}_messages_65534 Here, {client_id}_messages is the sorted set key name, where {client_id} acts as a hash tag derived from the persistent MQTT client’s unique ID. The suffix _messages is a constant added to each sorted set key name for consistency. Following the sorted set key name, the score value 65534 corresponds to the MQTT packet ID of the message received by the client. Finally, we see the reference key that points to the actual payload of the MQTT message. Similar to the sorted set key, the message reference key uses the MQTT client’s ID as a hash tag, followed by the _messages suffix and the MQTT packet ID value. In the next iteration, we add the message reference for the MQTT message with a packet ID equal to 65535 into the sorted set. This is the maximum packet ID, as the range is limited to 65535. Shell ZADD {client_id}_messages 65535 {client_id}_messages_65535 So, at the next iteration MQTT packet ID should be equal to 1, while the score should continue to grow and be equal to 65536. 
Shell ZADD {client_id}_messages 65536 {client_id}_messages_1 This approach ensures that the message’s references will be properly ordered in the sorted set regardless of the packet ID’s limited range. Message payloads are stored as string values with SET commands that support expiration (EX), providing O(1) complexity for writes and TTL applications: Shell SET {client_id}_messages_1 "{ \"packetType\":\"PUBLISH\", \"payload\":\"eyJkYXRhIjoidGJtcWlzYXdlc29tZSJ9\", \"time\":1736333110026, \"clientId\":\"client\", \"retained\":false, \"packetId\":1, \"topicName\":\"europe/ua/kyiv/client/0\", \"qos\":1 }" EX 600 Another benefit, aside from efficient updates and TTL applications, is that the message payloads can be retrieved: Shell GET {client_id}_messages_1 Or removed: Shell DEL {client_id}_messages_1 with constant complexity O(1) without affecting the sorted set structure. Another very important element of our Redis architecture is the use of a string key to store the last MQTT packet ID processed: Shell GET {client_id}_last_packet_id "1" This approach serves the same purpose as in the PostgreSQL solution. When a client reconnects, the server must determine the correct packet ID to assign to the next message that will be saved in Redis. Initially, we considered using the sorted set’s highest score as a reference. However, since there are scenarios where the sorted set could be empty or completely removed, we concluded that the most reliable solution is to store the last packet ID separately. Managing Sorted Set Size Dynamically This hybrid approach, leveraging sorted sets and string data structures, eliminates the need for periodic cleanups based on time, as per-message TTLs are now applied. In addition, following the PostgreSQL design, we needed to somehow address the cleanup of the sorted set based on the message limit set in the configuration. YAML # Maximum number of PUBLISH messages stored for each persisted DEVICE client limit: "${MQTT_PERSISTENT_SESSION_DEVICE_PERSISTED_MESSAGES_LIMIT:10000}" This limit is an important part of our design, allowing us to control and predict the memory allocation required for each persistent MQTT client. For example, a client might connect, triggering the registration of a persistent session, and then rapidly disconnect. In such scenarios, it is essential to ensure that the number of messages stored for the client (while waiting for a potential reconnection) remains within the defined limit, preventing unbounded memory usage. Java if (messagesLimit > 0xffff) { throw new IllegalArgumentException("Persisted messages limit can't be greater than 65535!"); } To reflect the natural constraints of the MQTT protocol, the maximum number of persisted messages for individual clients is set to 65535. To handle this within the Redis solution, we implemented dynamic management of the sorted set’s size. When new messages are added, the sorted set is trimmed to ensure the total number of messages remains within the desired limit, and the associated strings are also cleaned up to free up memory. 
Lua -- Get the number of elements to be removed local numElementsToRemove = redis.call('ZCARD', messagesKey) - maxMessagesSize -- Check if trimming is needed if numElementsToRemove > 0 then -- Get the elements to be removed (oldest ones) local trimmedElements = redis.call('ZRANGE', messagesKey, 0, numElementsToRemove - 1) -- Iterate over the elements and remove them for _, key in ipairs(trimmedElements) do -- Remove the message from the string data structure redis.call('DEL', key) -- Remove the message reference from the sorted set redis.call('ZREM', messagesKey, key) end end Message Retrieval and Cleanup Our design not only ensures dynamic size management during the persistence of new messages but also supports cleanup during message retrieval, which occurs when a device reconnects to process undelivered messages. This approach keeps the sorted set clean by removing references to expired messages. Lua -- Define the sorted set key local messagesKey = KEYS[1] -- Define the maximum allowed number of messages local maxMessagesSize = tonumber(ARGV[1]) -- Get all elements from the sorted set local elements = redis.call('ZRANGE', messagesKey, 0, -1) -- Initialize a table to store retrieved messages local messages = {} -- Iterate over each element in the sorted set for _, key in ipairs(elements) do -- Check if the message key still exists in Redis if redis.call('EXISTS', key) == 1 then -- Retrieve the message value from Redis local msgJson = redis.call('GET', key) -- Store the retrieved message in the result table table.insert(messages, msgJson) else -- Remove the reference from the sorted set if the key does not exist redis.call('ZREM', messagesKey, key) end end -- Return the retrieved messages return messages By leveraging Redis’ sorted sets and strings, along with Lua scripting for atomic operations, our new design achieves efficient message persistence and retrieval, as well as dynamic cleanup. This design addresses the scalability limitations of the PostgreSQL-based solution. Migration from Jedis to Lettuce To validate the scalability of the new Redis-based architecture for persistent message storage, we selected a point-to-point (P2P) MQTT communication pattern as a performance testing scenario. Unlike fan-in (many-to-one) or fan-out (one-to-many) scenarios, the P2P pattern typically involves one-to-one communication and creates a new persistent session for each communicating pair. This makes it well-suited for evaluating how the system scales as the number of sessions grows. Before starting large-scale tests, we conducted a prototype test that revealed the limit of 30k msg/s throughput when using PostgreSQL for persistence message storage. At the moment of migration to Redis, we used the Jedis library for Redis interactions, primarily for cache management. As a result, we initially decided to extend Jedis to handle message persistence for persistent MQTT clients. However, the initial results of the Redis implementation with Jedis were unexpected. While we anticipated Redis would significantly outperform PostgreSQL, the performance improvement was modest, reaching only 40k msg/s throughput compared to the 30k msg/s limit with PostgreSQL. This led us to investigate the bottlenecks, where we discovered that Jedis was a limiting factor. While reliable, Jedis operates synchronously, processing each Redis command sequentially. This forces the system to wait for one operation to complete before executing the next. 
In high-throughput environments, this approach significantly limited Redis’s potential, preventing the full utilization of system resources. RedisInsight shows ~66k commands/s per node, aligning with TBMQ’s 40k msg/s, as Lua scripts trigger multiple Redis operations per message. To overcome this limitation, we migrated to Lettuce, an asynchronous Redis client built on top of Netty. With Lettuce, our throughput increased to 60k msg/s, demonstrating the benefits of non-blocking operations and improved parallelism. At 60k msg/s, RedisInsight shows ~100k commands/s per node, aligning with the expected increase from 40k msg/s, which produced ~66k commands/s per node. Lettuce allows multiple commands to be sent and processed in parallel, fully exploiting Redis’s capacity for concurrent workloads. Ultimately, the migration unlocked the performance gains we expected from Redis, paving the way for successful P2P testing at scale. For a deep dive into the testing architecture, methodology, and results, check out our detailed performance testing article. Conclusion In distributed systems, scalability bottlenecks often emerge when vertically scaled components, like traditional databases, are used to manage high-volume, session-based workloads. Our experience with persistent MQTT sessions for DEVICE clients demonstrated the importance of designing around horizontally scalable solutions from the start. By offloading session storage to Redis and implementing key architectural improvements during the migration, TBMQ 2.x built a persistence layer capable of supporting a high number of concurrent sessions with exceptional performance and guaranteed message delivery. We hope our experience provides practical guidance for engineers designing scalable, session-aware systems in distributed environments.

By Dmytro Shvaika
From Laptop to Cloud: Building and Scaling AI Agents With Docker Compose and Offload
From Laptop to Cloud: Building and Scaling AI Agents With Docker Compose and Offload

Running AI agents locally feels simple until you try it: dependencies break, configs drift, and your laptop slows to a crawl. An agent isn’t one process — it’s usually a mix of a language model, a database, and a frontend. Managing these by hand means juggling installs, versions, and ports. Docker Compose changes that. You can now define these services in a single YAML file and run them together as one app. Compose even supports declaring AI models directly with the models element. With one command — docker compose up — your full agent stack runs locally. But local machines hit limits fast. Small models like DistilGPT-2 run on CPUs, but bigger ones like LLaMA-2 need GPUs. Most laptops don’t have that kind of power. Docker Offload bridges this gap. It runs the same stack in the cloud on GPU-backed hosts, using the same YAML file and the same commands. This tutorial walks through: Defining an AI agent with ComposeRunning it locally for fast iterationOffloading the same setup to cloud GPUs for scale The result: local iteration, cloud execution — without rewriting configs. Why Agents + Docker AI agents aren’t monoliths. They’re composite apps that bundle services such as: Language model (LLM or fine-tuned API)Vector database for long-term memory and embeddingsFrontend/UI for user interactionOptional monitoring, cache, or file storage Traditionally, you’d set these up manually: Postgres installed locally, Python for the LLM, Node.js for the UI. Each piece required configs, version checks, and separate commands. When one broke, the whole system failed. Docker Compose fixes this. Instead of manual installs, you describe services in a single YAML file. Compose launches containers, wires them together, and keeps your stack reproducible. There are also options such as Kubernetes, HashiCorp Nomad, or even raw Docker commands, but all options have a trade-off. Kubernetes can scale to support large-scale production applications, providing sophisticated scheduling, autoscaling, and service discovery capabilities. Nomad is a more basic alternative to Kubernetes that is very friendly to multi-cloud deployments. Raw Docker commands provide a level of control that is hard to manage when managing more than a few services. Conversely, Docker Compose targets developers expressing the need to iterate fast and have a lightweight orchestration. It balances the requirements of just containers with full Kubernetes, and thus it is suitable for local development and early prototyping. Still, laptops have limits. CPUs can handle small models but not the heavier workloads. That’s where Docker Offload enters. It extends the same Compose workflow into the cloud, moving the heavy lifting to GPU servers. Figure 1: Local vs. Cloud workflow with Docker Offload AI agent services (LLM, database, frontend) run locally with Docker Compose. With docker offload up, the same services move to GPU-backed cloud servers, using the same YAML file. Define the Agent With Compose Step 1: Create a compose.yaml File YAML services: llm: image: ghcr.io/langchain/langgraph:latest ports: - "8080:8080" db: image: postgres:15 environment: POSTGRES_PASSWORD: secret ui: build: ./frontend ports: - "3000:3000" This file describes three services: llm: Runs a language model server on port 8080. You could replace this with another image, such as Hugging Face’s text-generation-inference.db: Runs Postgres 15 with an environment variable for the password. 
Using environment variables avoids hardcoding sensitive data.ui: Builds a custom frontend from your local ./frontend directory. It exposes port 3000 for web access. For more advanced setups, your compose.yaml can include features like multi-stage builds, health checks, or GPU requirements. Here’s an example: YAML services: llm: build: context: ./llm-service dockerfile: Dockerfile deploy: resources: reservations: devices: - driver: nvidia count: 1 capabilities: [gpu] ports: - "8080:8080" healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/health"] interval: 30s retries: 3 db: image: postgres:15 environment: POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} ui: build: ./frontend ports: - "3000:3000" In this configuration: Multi-stage builds reduce image size by separating build tools from the final runtime.GPU requirements ensure the service runs on a node with NVIDIA GPUs when offloaded.Health checks allow Docker (and Offload) to detect when a service is ready. Step 2: Run the Stack PowerShell docker compose up Compose builds and starts all three services. Containers are networked together automatically. Expected output from docker compose ps: PowerShell NAME IMAGE PORTS agent-llm ghcr.io/langchain/langgraph 0.0.0.0:8080->8080/tcp agent-db postgres:15 5432/tcp agent-ui frontend:latest 0.0.0.0:3000->3000/tcp Now open http://localhost:3000 to see the UI talking to the LLM and database. You can use docker compose ps to check running services and Docker Compose logs to see real-time logs for debugging. Figure 2: Compose stack for AI agent (LLM + DB + UI) A compose.yaml defines all agent components: LLM, database, and frontend. Docker Compose connects them automatically, making the stack reproducible across laptops and the cloud. Offload to the Cloud Once your local laptop hits its limit, shift to the cloud with Docker Offload. Step 1: Install the Extension PowerShell docker extension install offload Step 2: Start the Stack in the Cloud PowerShell docker offload up That’s it. Your YAML doesn’t change. Your commands don’t change. Only the runtime location does. Step 3: Verify PowerShell docker offload ps This shows which services are running remotely. Meanwhile, your local terminal still streams logs so you can debug without switching tools. Other useful commands: docker offload status – Check if your deployment is healthy.docker offload stop – Shut down cloud containers when done.docker offload logs <service> – View logs for a specific container. You can use .dockerignore to reduce build context, especially when sending files to the cloud. Figure 3: Dev → Cloud GPU Offload → Full agent workflow The workflow for scaling AI agents is straightforward. A developer tests locally with docker compose up. When more power is needed, docker offload up sends the same stack to the cloud. Containers run remotely on GPUs, but logs and results stream back to the local machine for debugging. Real-World Scaling Example Let’s say you’re building a research assistant chatbot. Local testing: Model: DistilGPT-2 (lightweight, CPU-friendly)Database: PostgresUI: simple React appRun with docker compose up This setup is fine for testing flows, building the frontend, and validating prompts. Scaling to cloud: Replace the model service with LLaMA-2-13B or Falcon for better answers.Add a vector database like Weaviate or Chroma for semantic memory.Run with docker offload up Now your agent can handle larger queries and store context efficiently. 
The frontend doesn’t care if the model is local or cloud-based — it just connects to the same service port. This workflow matches how most teams build: fast iteration locally, scale in the cloud when ready for heavier testing or deployment. Advantages and Trade-Offs Figure 4: Visual comparison of Local Docker Compose vs. Docker Offload The same compose.yaml defines both environments. Locally, agents run on CPUs with minimal cost and latency. With Offload, the same config shifts to GPU-backed cloud servers, enabling scale but adding cost and latency. Advantages One config: Same YAML works everywhere. Simple commands: docker compose up vs. docker offload up. Cloud GPUs: Access powerful hardware without setting up infra. Unified debugging: Logs stream to the local terminal for easy monitoring. Trade-Offs Latency: Cloud adds round trips. A 50ms local API call may take 150–200ms remotely, depending on network conditions. This matters for latency-sensitive apps like chatbots. Cost: GPU time is expensive. A standard AWS P4d.24xlarge (8×A100) costs about $32.77/hour, or $4.10 per GPU/hour. On GCP, an A100-80 GB instance is approximately $6.25/hour, while high-end H100-equipped VMs can reach $88.49/hour. Spot instances, when available, can offer 60–91% discounts, cutting costs significantly for batch jobs or CI pipelines. Coverage: Offload supports limited backends today, though integrations are expanding. Enterprises should check which providers are supported. Security implications: Offloading workloads means that your model, data, and configs execute on remote infrastructure. Businesses must consider encryption in transit (TLS), encryption at rest, and access controls. Some industries may also be subject to HIPAA, PCI DSS, or GDPR compliance requirements before offloading workloads. Network and firewall settings: Offload requires outbound access to Docker’s cloud endpoints. In enterprises with restricted egress policies or firewalls, security teams may need to open specific ports or allowlist Offload domains. Best Practices To get the most out of Compose + Offload: Properly manage secrets: Avoid hardcoding sensitive values in compose.yaml; use .env files or Docker secrets instead. This prevents inadvertent leaks in version control. Pin image versions: Avoid using :latest tags, as they can pull unexpected updates. Pin versions like :1.2.0 for stability and reproducibility. Scan images for vulnerabilities: Use docker scout cves to scan images before offloading. Catching issues early helps avoid deploying insecure builds. Optimize builds with multi-stage: Multi-stage builds and .dockerignore files keep images slim, saving both storage and bandwidth during cloud offload. Add health checks: Health checks let Docker and Offload know when a service is ready, improving resilience in larger stacks. YAML healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/health"] interval: 30s retries: 3 Monitor usage: Use docker offload status and logs to track GPU consumption and stop idle workloads to avoid unnecessary costs. Version control your YAML: Commit your Compose files to Git so the entire team runs the same stack consistently. These practices reduce surprises and make scaling smoother. Conclusion AI agents are multi-service apps. Running them locally works for small tests, but scaling requires more power. Docker Compose defines the stack once. Docker Offload runs the same setup on GPUs in the cloud.
Conclusion

AI agents are multi-service apps. Running them locally works for small tests, but scaling requires more power. Docker Compose defines the stack once. Docker Offload runs the same setup on GPUs in the cloud. This workflow — local iteration, cloud execution — means you can build and test quickly, then scale up without friction. As Docker expands its AI features, Compose and Offload are becoming the natural choice for developers building AI-native apps.

If you're experimenting with agents, start with Compose on your laptop, then offload when you need more processing power. The switch is smooth, and the payoff is quicker builds with fewer iterations.

By Pragya Keshap
Automating RCA and Decision Support Using AI Agents

With the AI boom over the past couple of years, almost every business is trying to innovate and automate its internal business processes or front-end consumer experiences. Traditional business intelligence tools require manual intervention for querying and interpreting data, leading to inefficiencies. AI agents are changing this paradigm by automating data analysis, delivering prescriptive insights, and even taking autonomous actions based on real-time data. Humans still set the goals, but it is the AI agent that autonomously decides on the best actions to achieve them.

What makes an AI agent special is that it can perceive both physical and software interfaces to make a rational decision. For instance, a robotic agent gathers information from its sensors, while a chatbot takes in customer prompts or queries as input. The AI agent then processes this data, evaluates it, and determines the most suitable course of action that aligns with its objectives.

Why AI Agents for Decision Support?

AI agents can empower non-technical teams by uncovering insights instantly through NLP querying or prompt engineering, eliminating the need for manual data wrangling — hence enabling true no-code, self-serve data visualization. For example:

A user asks: "What are the top-selling products in Q2 2024?"
The AI agent converts it into:

SQL
SELECT product_name, SUM(sales)
FROM sales_data
WHERE quarter = 'Q2 2024'
GROUP BY product_name
ORDER BY SUM(sales) DESC;

Moreover, an agent can scan voluminous data files and provide real-time insights without delays. This gives business owners more time to think about innovative strategies rather than juggling data sources.

AI Agent Architecture for Decision Support

The architecture of an AI agent for decision automation consists of multiple layers.

1. Data Sources and Events

The events that teams add during feature development help capture data and interactions. These act as data sources. The table below depicts the key data sources and event types captured during user and product interaction, which AI agents use to extract insights and inform decisions.

Source | Examples | Nature
Customer Interactions | App usage, page views, clicks | Real-time events
Transactions | Orders, refunds, payment failures | Batch + real-time events
Marketing and Promotions | Coupon redemptions, campaign codes | Batch
Product | Price changes, inventory | Transactional (batch/real-time)
Support Logs | Tickets, call transcripts | Unstructured
User Prompts | User input queries | NLP/Conversational

2. Data Ingestion Layer

This layer collects the data from the different sources. As highlighted above, these sources can be structured (SQL, NoSQL databases), semi-structured (APIs, logs), or unstructured (CSV, Parquet files). Streaming pipelines or real-time data processing systems like Kafka, AWS Kinesis, and Google Pub/Sub handle continuous data flows from various sources.

3. Data Storage Layer

It has four main layers:

Raw data layer: Stores the data in its original, unprocessed form.
Cleaned layer: Data transformation occurs here; raw files are parsed to handle nulls, missing values, and so on. The final data is stored in a partitioned, queryable format.
Feature engineering layer: The cleaned data is used to extract reusable features. All these reusable features are stored in a centralized repository like Databricks or BigQuery to be used for ML training.
Modeled layer: Presents high-level business KPIs like engagement score, LTV, and churn rate.

4. AI-Agent Layers

Prompt interface: For users to input prompts.
LangChain orchestration engine: The central controller of the AI agent (RAG → calls APIs → applies reasoning rules).
RAG: Retrieves data chunks from feature stores, SQL models, etc.
Reasoning engine: Encodes business logic and heuristics. For example: "Trigger a churn alert if score > 0.8 and last login > 30 days" or "Recommend the bundle if a user shows interest in A and B" (a minimal sketch of the churn rule appears after this list).
Explainable AI: Before automating key business decisions, it is important to review the explanation. For example, after model training, SHAP analyzes how each feature impacts predictions and produces a visual explanation.
Execution layer: This is the last layer, which delivers insights to users. Reporting can happen in two ways:
  Dashboards: AI-generated/automated visual reports via Power BI, Tableau, or Looker.
  Slack or email via text summarization: Auto-execute decisions like campaign pausing and CX alert notifications.
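As a concrete illustration of the churn rule encoded in the reasoning engine, the same condition could be expressed as a query over the feature store. This is only a sketch; the table and column names (user_features, churn_score, days_since_last_login) are hypothetical, not part of any schema described here.

SQL
-- Flag users for a churn alert when the model score exceeds 0.8
-- and the user has not logged in for more than 30 days.
SELECT user_id
FROM user_features
WHERE churn_score > 0.8
  AND days_since_last_login > 30;

In a LangChain-style setup, the orchestration engine would evaluate rules like this (or their equivalent in code) against the modeled and feature layers and hand the matches to the execution layer.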
5. Feedback Loop

User actions and feedback are logged and fed back to retrain the models and agents. Alternatively, reinforcement learning can be applied for rule tuning.

Industry Application

Let's walk through a case study of how AI agents can help product teams detect a metric drop.

Problem Scenario

The conversion rate (payment successes / sessions) has suddenly dropped by 23% across the Android app in the last 6 hours.

Detection Flow

Kafka detects a drop in the payment_success event.
The anomaly agent confirms a statistical deviation.
The root cause analysis (RCA) agent traces the issue to a recent feature release, which increased the latency of loading card payment options.
The insight agent converts the RCA findings into a human-readable explanation: "Conversion rate dropped 23% at the post-checkout screen on the V8.1 Android release, due to higher latency."
Action agents pull historical remediation data, analyze similar past events, and suggest a rollback. (If AI confidence is high, say above 90%, a rollback is recommended. If actions are auto-configured, the rollback can be triggered autonomously based on a confidence threshold.)
Finally, the system keeps everyone in the loop by triggering Slack notifications to the respective product team, and the newly generated insights are reflected on daily dashboards, ensuring full visibility.

Architecture Diagram

The diagram illustrates the end-to-end architecture of an agentic AI-driven decision support system, tracing the flow from data producers to the teams that ultimately use the extracted insights and visualizations.

By weaving AI agents into the decision-making fabric, organizations can improve decision accuracy, reduce manual effort, and respond faster to ever-changing business conditions. However, implementing such systems demands thoughtful setup of data pipelines, careful model training, and a robust mechanism for safe decision execution. To stay ahead, it's crucial to regularly update the AI agents based on user feedback and to embed AutoML capabilities that allow rapid experimentation and improvement without long development cycles. In a world changing at lightning speed, these AI agents are not just assistants; they are the copilots driving faster, smarter, and more resilient business outcomes.

By Ishita Choudhary
Protecting Non-Human Identities: Why Workload MFA and Dynamic Identity Matter Now

We've normalized multi-factor authentication (MFA) for human users. In any secure environment, we expect login workflows to require more than just a password — something you know, something you have, and sometimes something you are. This layered approach is now foundational to protecting human identities.

But today, the majority of interactions in our infrastructure aren't human-driven. They're initiated by non-human entities — services, microservices, containerized workloads, serverless functions, background jobs, and AI agents. Despite this shift, most of these systems still authenticate using a single factor: a secret.

According to CyberArk's 2025 Identity Security Landscape, machine identities now outnumber humans 82 to 1, and 42% of them have privileged access to critical systems. Identity-based attacks on cloud infrastructure grew 200% in 2024, with 87% of organizations experiencing multiple identity-centric breaches, most involving compromised credentials.

It's no longer viable to treat non-human identities as second-class citizens in our security architecture. These entities perform high-impact tasks and need the same protection we give to people. That starts with moving beyond static secrets.

The Limits of Static Secrets in Microservices

In a modern microservices architecture, dozens or even hundreds of independently deployed services each need access to sensitive resources — databases, APIs, message queues, and more. Each microservice typically requires its own set of secrets: API keys, database credentials, encryption keys, and certificates. As organizations scale, so does the number of secrets, leading to what's known as secrets sprawl.

Secrets sprawl occurs when secrets are scattered across countless repositories, configuration files, deployment templates, and cloud environments. Without centralized management, secrets are often duplicated, orphaned across environments, or embedded in legacy systems. This dramatically increases the attack surface and makes it nearly impossible to track, scope, or revoke them in real time.

Recent industry reports highlight the scale of the problem:

GitGuardian's 2025 State of Secrets Sprawl Report found nearly 23.8 million new hardcoded secrets in public GitHub commits in 2024 — a 25% increase from the previous year.
ByteHide notes that over 10 million secrets were identified in public code repositories in 2023, a 67% year-over-year increase.
Akeyless reports that 96% of organizations have secrets sprawled in code, configuration files, and infrastructure tooling — and 70% have experienced a secret leak in the past two years.

When a secret is leaked — even accidentally — it can be reused by anyone who finds it, often with no expiration, no attribution, and no context. Rotating secrets becomes a manual, error-prone process, requiring teams to hunt down every instance across environments and codebases. The result is operational friction and a vastly expanded attack surface.

Redefining Workload Authentication: MFA for Systems

What would it look like if workloads were treated as first-class identities and authenticated like humans? Workload MFA doesn't mean giving systems phones or fingerprints. It means verifying multiple contextual signals to assert identity:

Identity: What is this system? (e.g., spiffe://internal.myorg.dev/ns/payment/sa/service-a)
Attestation: Where is it running? (e.g., namespace, node fingerprint, cloud instance ID)
Provenance: What launched it? (e.g., container image hash, verified build pipeline)

These three factors — who, where, and what — combine to create a cryptographically verifiable trust boundary. Instead of relying on possession of a secret, you validate the origin, location, and role of the system. This model aligns with zero-trust principles: trust nothing by default and verify everything continuously. It also supports the emerging practice of Workload Identity and Access Management (WIAM), which treats system identity with the same rigor we apply to user identity (CybersecurityTribe, WIAM Primer).

Real-World Example: Uber's SPIFFE/SPIRE Deployment

Uber uses SPIRE, the reference implementation of SPIFFE, to issue workload identity across multiple environments, including GCP, AWS, OCI, and on-premise data centers. These identities apply to stateless services, storage systems, batch workloads, and infrastructure services.

SPIRE issues short-lived, cryptographically verifiable identities (SVIDs) to each workload. These identities are then used for mutual TLS authentication, replacing traditional secrets-based access. Certificates are rotated frequently and automatically. This model allows Uber to enforce zero-trust security at a massive scale without relying on any cloud-specific identity mechanism.

How SPIFFE Identity Works

SPIFFE, or Secure Production Identity Framework for Everyone, defines a uniform identity format for non-human systems. These identities are issued by SPIRE and scoped based on workload attributes. Instead of injecting secrets into environments, workloads are issued SPIFFE IDs and certificates at runtime:

Shell
spiffe://internal.myorg.dev/ns/payment/sa/service-a

This identity is tied to a specific namespace and service account, and is issued only after SPIRE verifies environmental attributes like:

Container image signature
Pod metadata or node labels
Cloud instance fingerprint
Runtime selectors

The issued certificate is valid for a few minutes, rotates automatically, and is revoked if the workload no longer matches the trust policy.

Identity Issuance Flow:

Plain Text
[Workload/Software Process Starts]
        ↓
[SPIRE agent attests environment]
        ↓
[Workload receives SPIFFE ID and certificate]
        ↓
[Certificate used for mutual TLS or API authentication]
        ↓
[Certificate expires and rotates automatically]

This model removes static secrets entirely. Access is now based on verified identity.

Why Attestation Is the Second Factor

Secrets offer no context. Attestation enforces it. SPIRE supports attestation mechanisms for both the host (node) and the workload (container, process, or function). Before an identity is issued, SPIRE checks:

Is the workload running in a trusted environment?
Does it originate from a trusted image or deployment?
Is it operating under a trusted identity scope?

These determinations are made using attestation signals such as Kubernetes namespaces, node fingerprints, container image hashes, or metadata selectors defined in policy. If any of these checks fail, identity is not granted. This is the second factor — context as proof.
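To show how identity and attestation come together in practice, here is a rough sketch of what registering such a workload with SPIRE could look like. The trust domain, parent ID, and selectors are illustrative assumptions that mirror the SPIFFE ID used above, not configuration taken from Uber or this article.

Shell
# Hypothetical registration entry: the SPIRE server will issue the payment
# service identity only to workloads that attest as the 'service-a' service
# account in the 'payment' namespace, under an already-attested SPIRE agent.
spire-server entry create \
  -parentID spiffe://internal.myorg.dev/ns/spire/sa/spire-agent \
  -spiffeID spiffe://internal.myorg.dev/ns/payment/sa/service-a \
  -selector k8s:ns:payment \
  -selector k8s:sa:service-a

If the Kubernetes workload attestor reports a different namespace or service account than the selectors above, no SVID is issued, which is exactly the "context as proof" factor described earlier.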
Industry Standards Are Converging on Workload Identity

The IETF WIMSE Working Group is formalizing SPIFFE-style identity for global adoption:

Token exchange: Mapping SPIFFE IDs to OAuth claims
Transaction chaining: Enforcing JWT-based call-chain integrity
Federation: Supporting cross-cloud identity federation

In parallel, OAuth 2.0 Attestation-Based Client Authentication allows trusted clients like serverless functions to authenticate without shared secrets. This future replaces bearer tokens with environment-bound, signed assertions of identity. Together, these efforts are creating an ecosystem where non-human identities are first-class, portable, and verifiable.

Final Thought: Time to Protect Non-Human Identities Like We Protect People

In 2025, the majority of systems operating in cloud and enterprise environments are not human — they're services, functions, microservices, batch jobs, and ephemeral containers. These non-human identities perform critical tasks: processing payments, managing infrastructure, handling sensitive data, and securing communications. Yet most are still authenticated with long-lived secrets — copied across environments, hardcoded into configurations, or rotated on a manual schedule. That's no longer sustainable.

If humans require MFA, real-time verification, and identity-aware access, the systems acting on their behalf must meet the same bar. Identity — not static secrets — must become the foundation of trust. With SPIFFE, SPIRE, and global standards like WIMSE now in place, we have the foundation to treat non-human identities as first-class entities in our infrastructure.

Non-human identity is no longer a niche concept. It is the new baseline for trust.

By Surya Avirneni
