DZone

DZone Spotlight

Wednesday, December 17
Streaming vs In-Memory DataWeave: Designing for 1M+ Records Without Crashing

By Sree Harsha Meka
The Real Problem With Scaling DataWeave

MuleSoft is built to handle enterprise integrations — but most developers test with small payloads. Everything looks fine in dev, until one day a real file with 1 million records hits your flow. Suddenly, your worker crashes with an OutOfMemoryError, and the job fails halfway through.

The truth is, DataWeave works in-memory by default. That's acceptable for small datasets, but in production, we often deal with:

• Banking: daily ACH transaction exports, sometimes hundreds of MBs.
• Healthcare: claims data with millions of rows and deeply nested fields.
• Retail: product catalogs or clickstream logs from thousands of stores.

If you're not designing for streaming, your flow will eventually hit the wall.

In-Memory vs. Streaming — What's Actually Happening?

In-Memory (Default Behavior)

• Mule loads the entire payload into memory before transformations.
• Fast when the payload is small (<50k records).
• Breaks down once files grow into hundreds of MBs or GBs.

Think of it like opening a giant Excel file. It works fine with a few thousand rows, but try opening 2 million, and Excel freezes. If a file is 1 GB, Mule will attempt to hold that 1 GB in memory, plus the transformed copy. That's a recipe for a crash.

Streaming (The Right Way for Big Data)

• Mule reads the file record by record (or in small chunks).
• Each record is transformed and discarded before moving to the next.
• Memory usage stays flat and predictable.

Think of it as a conveyor belt — records come in, get processed, and move out. You never hold the entire dataset at once.

DataWeave
%dw 2.0
output application/json
var file = readUrl("classpath://data.json", "application/json", {streaming: true})
---
file pluck ((value, key) -> {
  id: value.id,
  amount: value.amount
})

This approach safely scales to millions of records.

Why Use pluck Instead of map?

Both map and pluck transform collections, but they differ in how they handle memory.

• map – creates a new array with all results in memory. If you have 1M records, Mule holds 1M transformed objects.
• pluck – iterates through key-value pairs and streams results more efficiently.

In practice, pluck is the safer option for massive CSV or JSON datasets.

Performance Comparison: 1.2M Records on a 0.2 vCore Worker

In-Memory
• Memory usage: scales with file size, spikes near heap limits.
• Processing speed: ~8–10 minutes.
• Notes: prone to crashes on large payloads.

Streaming (streaming: true)
• Memory usage: flat, ~300 MB steady.
• Processing speed: ~10–12 minutes.
• Notes: stable, no crashes.

Takeaway: In-memory seems faster at a small scale, but streaming wins in real-world production because stability always beats small performance gains.

Real Case: Processing ACH Transactions

At a credit union, we processed 1.2M ACH transactions daily from a Symitar core system.

• In-memory: a 0.2 vCore worker (500 MB heap) crashed midway.
• Streaming: the same worker processed the entire dataset without issues.

DataWeave
%dw 2.0
output application/json
var file = readUrl("classpath://ACH_Transactions.csv", "application/csv", {streaming: true})
---
file pluck ((txn, i) -> {
  transactionId: txn.txnId,
  amount: txn.amount,
  date: (txn.date as Date {format: "MM/dd/yyyy"}) as String
})

This ran smoothly in production — no heap errors, no worker restarts.
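The same memory tradeoff is easy to see outside DataWeave, on the JVM that Mule runs on. The sketch below is plain Java, not MuleSoft API code, and the file name is purely hypothetical; it simply contrasts loading a whole file onto the heap with streaming it line by line.

Java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class StreamingVsInMemory {

    // In-memory: the entire file is materialized as a List before any work starts,
    // so heap usage grows with the file size.
    static long countInMemory(Path file) throws IOException {
        List<String> allLines = Files.readAllLines(file); // whole file on the heap
        return allLines.stream()
                       .filter(line -> line.contains("amount"))
                       .count();
    }

    // Streaming: lines are read, inspected, and discarded one at a time,
    // so heap usage stays flat no matter how large the file grows.
    static long countStreaming(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            return reader.lines()
                         .filter(line -> line.contains("amount"))
                         .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of("ACH_Transactions.csv"); // hypothetical local file for the demo
        System.out.println("records with an amount: " + countStreaming(file));
    }
}

The first method's memory footprint grows with the file, while the second stays flat, which is the same distinction DataWeave's streaming mode makes at the record level.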
Common Mistakes That Break Streaming

Even seasoned developers sometimes disable streaming unintentionally:

• Assigning the payload to a variable (var bigData = payload) forces in-memory storage.
• Using map instead of pluck for huge collections.
• Forgetting to test with production-scale data and relying only on 1k-row samples.
• Adding deep nesting in transformations without early filters, which multiplies memory use.

Avoiding these mistakes can save hours of debugging and prevent costly production outages.

Why This Matters for Enterprises

At enterprise scale, memory errors aren't just technical headaches — they turn into business risks. In banking, a failed ACH batch can delay payroll for thousands of employees. In healthcare, a rejected claims file can stall reimbursements for hospitals. Every retry not only burns time but also wastes compute resources, driving up CloudHub costs. By enabling streaming from the start, enterprises minimize downtime, avoid SLA penalties, and reduce operational costs. This small design decision translates into measurable savings and customer trust.

Performance Tuning Tips for Large Data Sets

Streaming alone isn't a silver bullet. To get the most out of it:

• Right-size workers: A 0.2 vCore can handle millions of rows with streaming, but complex joins may need 1 vCore.
• Filter early: Shrink datasets upfront before expensive transformations.
• Batch where possible: Split very large files into smaller chunks for parallelism.
• Use smart logging: Log samples, not every record, to prevent disk bloat.
• Property-driven configs: Parameterize {streaming: true} to avoid accidental overrides.
• Automated testing: Add performance tests with large payloads in CI/CD pipelines to catch scaling issues early.

Future-Proofing With Event-Driven Architectures

Streaming aligns perfectly with modern event-driven patterns. Whether MuleSoft consumes from Kafka, Azure Event Hubs, or AWS Kinesis, the principle is the same: don't load everything into memory. As organizations adopt real-time analytics and data pipelines, MuleSoft's streaming capabilities become the bridge between batch-oriented flat files and event-driven systems. Building with streaming today sets the stage for tomorrow's real-time enterprise integrations.

Wrapping Up

MuleSoft's DataWeave is powerful, but its default in-memory mode wasn't designed for 1M+ record datasets. To build resilient, production-ready flows:

• Enable streaming for large files.
• Prefer pluck over map.
• Avoid holding payloads in variables.
• Test with real-world datasets, not just dev samples.

It's a small design choice that makes the difference between a flow that fails in production and one that scales effortlessly to millions of records every single day.
DZone's 2025 Community Survey

By Carisse Dumaua
Another year passed right under our noses, and software development trends moved along with it. The steady rise of AI, the introduction of vibe coding — these are just among the many impactful shifts, and you've helped us understand them better. Now, as we move on to another exciting year, we would like to continue to learn more about you as software developers, your tech habits and preferences, and the topics you wish to know more about. With that comes our annual community survey — a great opportunity for you to give us more insights into your interests and priorities. We ask this because we want DZone to work for you. Click below to participate ⬇️ And as a small token, you will have a chance to win up to $300 in gift cards and exclusive DZone swag! All it will take is just 10–15 minutes of your time. Now, how cool is that?

Over the years, DZone has remained an ever-growing avenue for exploring technology trends, looking for solutions to technical problems, and engaging in peer discussions — and we aim to keep it that way. We're going to need your help to create a more relevant and inclusive space for the DZone community. This year, we want to hear your thoughts on:

• Who you are as a developer: your experience and how you use tools
• What you want to learn: your preferred learning formats and topics of interest
• Your DZone engagement: how often you visit DZone, which content areas pique your interest, and how you interact with the DZone community

You are what drives DZone, so we want you to get the most out of every click and scroll. Every opinion is valuable to us, and we use it to equip you with the right resources to support your software development journey. And that will only be possible with your help — so thank you in advance!

— Your DZone Content and Community team and our little friend, Cardy

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human-driven and machine-assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal of helping practitioners and leaders alike reorient our collective understanding of how old models and new paradigms are converging to define what's next for data management and storage.

Refcard #397

Secrets Management Core Practices

By Apostolos Giannakidis

Refcard #375

Cloud-Native Application Security Patterns and Anti-Patterns

By Samir Behara

More Articles

From Metrics to Action: Adding AI Recommendations to Your SaaS App

You log into your DevOps portal and are confronted with 300 different metrics: CPU, latency, errors, all lighting up red on your dashboard. But what should you prioritize? That is exactly the question an AI-based recommendation tool can answer. Every SaaS platform managing cloud operations records an incredible amount of telemetry data. Most products, however, simply provide visualization: interesting graphics, yet no actionable information. What if your product could provide automated suggestions for config, scaling, or alerts based on tenant behavior? In this article, you will see how to incorporate AI recommendation functionality into your SaaS platform, turning data into meaningful information, and how AI can transform DevOps dashboards into optimization engines.

Why This Matters Now

It is now common for multi-cloud platforms to produce data points in the tens of millions each day. Engineers simply can't process these results. AI-powered assistance is no longer a "nice to have"; it's the next frontier in dashboard usability for new SaaS interfaces. Including context for actionable, customized suggestions encourages faster decision-making, reduces unnecessary expenditure, and improves adoption.

Background

Most DevOps tools are built to monitor, not to advise. Dashboards and alerts tell you "what's happening," not "what to do next." Some typical problems are:

• Decision fatigue: too many charts with no actionable conclusions.
• Compute waste: 40–60% of servers are below 40% utilization.
• Low feature adoption: less than 25% of alert templates are in use.

Conventional rules-based automation simply cannot be scaled to individual workloads or tenants. It follows naturally to move to a model that learns from usage patterns.

Solution Overview

To resolve these problems, something more than dashboards with static rules is needed: intelligence that learns from user activity and guides engineers, rather than leaving them to interpret many different graphs on their own.

It starts with behavioral learning. Every engagement, whether it's opening a metric, muting an alert, or modifying an autoscaling rule, leaves traces, or signals, about what's useful, confusing, or volatile. To illustrate, if multiple groups of users are consistently modifying the settings of a volatile alert, that suggests changes toward a more stable configuration.

On this foundation, collaborative intelligence finds similarities among tenants. Even for tenants whose data isn't shared, there's apt to be similarity in their workloads. If one set of tenants has successfully reduced CPU variability by turning on scheduled autoscaling, collaborative intelligence notices this and suggests an analogous solution for other tenants with similar workloads.

Generalization to new, unseen situations is enabled by vector embeddings, which represent the properties of resources, such as instance types, alerting settings, or metric behaviors. A new instance type, for instance, could then be automatically matched with tenants likely to benefit from it.

Every finding passes through a low-latency scoring layer, which instantly recommends actions in the SaaS interface. With each new batch of telemetry, the engine suggests ranked actions, such as "Enable autoscaling: similar groups with analogous workloads observed a 15% cost savings."
Since inference times are below 100 ms, these suggestions feel native to the platform. The system improves with feedback: successful predictions strengthen the model, while failed predictions decrease model confidence. It is retrained nightly or hourly to keep the model fresh, eliminating costly real-time retraining. Trials with such a model have shown performance variability reductions of 22%, with inference latency below 80 ms. With such embedded intelligence, your SaaS offering is no longer just a monitoring solution but a proactive collaborator that helps teams take faster actions, optimize expenditures, and make more informed operational choices.

Architecture and Flow

The complete AI recommendation system architecture used in these types of SaaS/DevOps applications follows the flow of data from telemetry, behavioral signals, and workload metadata through ingestion, the model layer, and real-time scoring, which surfaces actionable recommendations in the product UI.

Before vs. After: Operational Intelligence Transformation

Before AI Recommendations

DevOps professionals spend inordinate amounts of their time trying to understand their dashboards, navigating between metric views, or manually setting thresholds for alerts and scaling rules. Most choices are based on tribal knowledge or iterative testing. Inconsistent settings, sluggish responses, and continuous over- or under-provisioning plague their systems. Highly useful functions, such as anomaly detection or predictive scaling, go unused simply due to unfamiliarity with their appropriate use.

After AI Recommendations

The platform transforms from a passive observer into an active decision-maker. Rather than looking at several hundred graphs, engineers get their most pressing answers in the context of what they are working on. It shows them what they care about most, why, and what they should do next, often with a single click. It's dynamic because it adapts to feedback received from users, improving with time.

Implementation Steps

A DevOps SaaS platform employed AI to offer cost-efficient settings based on cross-tenant activity.

1. Aggregate workload data with user-driven operational activity. Resource optimization begins with aggregating workload data together with operational activity driven by users. This captures patterns in machine activity, such as over-provisioned machines, noisy pages, misconfigured scaling rules, and inefficient workloads, while also taking human intention into consideration.

SQL: Aggregate multi-tenant resource signals

SQL
SELECT
  tenant_id,
  user_id,
  resource_id,
  AVG(cpu_util) AS avg_cpu,
  AVG(mem_util) AS avg_mem,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_util) AS p95_cpu,
  COUNT(*) FILTER (WHERE action = 'resize') AS resize_events,
  MAX(event_ts) AS last_seen
FROM telemetry_user_actions
GROUP BY tenant_id, user_id, resource_id;

Apply decay weighting to prioritize current workloads.

Python
# np is numpy; now is the current timestamp
df['weight'] = df['resize_events'] * np.exp(-(now - df['last_seen']).dt.days / 7)

Why this matters for optimization:

• Utilization patterns shift weekly.
• Users adjust scaling rules when workloads misbehave.
• Decay prevents old sizing decisions from influencing new recommendations.

Compliance: Per-tenant matrices keep resource/workload behavior isolated for SOC 2/HIPAA environments.

2. Build a hybrid behavioral/semantic model for optimization. DevOps workloads vary widely — there are APIs, constant batch processing, and spiky ML inference work.
Consequently, the model needs to understand both the nature of similar workloads and the adjustments engineers make to them.

Collaborative filtering learns human decisions.

Python
model_cf = AlternatingLeastSquares(factors=64, regularization=0.1)
model_cf.fit(user_resource_matrix)

Embeddings capture workload semantics.

Python
resource_desc = workload["instance_type"] + " " + workload["traffic_pattern"]
item_embeddings = model_emb.encode(resource_desc)

Blend both into a unified optimization score.

Python
score = 0.65 * cf_score + 0.35 * cosine_similarity(item_embeddings)

Why this hybrid approach works:

• CF mirrors real optimization patterns ("teams like you downsize this VM after consistently low CPU").
• Embeddings capture resource characteristics, enabling recommendations for new instance families or unknown workloads.
• The hybrid ensures stability during high-variance and low-data periods.

3. Deploy a low-latency optimization API with SLOs and observability. These optimization suggestions have to be made in real time, within the dashboards where engineers are analyzing performance.

FastAPI microservice (P99 < 120 ms target):

Python
@app.get("/optimize/{tenant}/{resource}")
async def optimize(tenant, resource):
    rec = hybrid_engine.recommend(tenant, resource)
    return {
        "resource": resource,
        "suggested_actions": rec,
        "explainability": rec.weights  # CPU%, peer similarity, cost delta
    }

Add system observability. Use Prometheus counters:

• recsys_latency_ms
• recsys_success_total
• recsys_failure_total
• recsys_cache_hit_ratio

Example recommendations:

• "Resize c5.large → c5.medium (avg CPU: 22%, p95 CPU: 41%)."
• "Enable HPA (high variance workload detected)."
• "Adopt anomaly alerting (noise reduced 40% for similar apps)."

4. Add feedback learning, drift detection, and retraining. Lasting optimization depends on recognizing what actually drives performance and cost improvements.

Log impact-aware feedback:

Python
feedback.append({
    "tenant": tenant,
    "resource": resource,
    "action": action,
    "accepted": accepted,
    "cpu_delta": post_cpu - pre_cpu,
    "cost_delta": post_cost - pre_cost
})

Drift detection (seasonality, traffic spikes, new deployments):

Python
drift_detector = drift.ADWIN()
if drift_detector.update(current_avg_cpu):
    trigger_early_retrain()

Nightly retraining via Airflow:

Python
with DAG("retrain_optimizer", schedule_interval="@daily") as dag:
    PythonOperator(task_id="retrain", python_callable=retrain_pipeline)

5. Validate the engine using optimization-centric metrics. Unlike consumer recommenders, success here is not measured by click-throughs.

Offline:
• Right-sizing prediction accuracy
• Improvement in CPU/memory balance
• Recall@5 for optimization suggestions

Online (A/B tests):
• % cost reduction per tenant
• Reduction in manual resize/edit operations
• Alert noise reduction
• Improved scaling stability (fewer OOMs, fewer restarts)

Key Takeaways and Future Directions

AI-driven optimization represents a paradigm shift for SaaS/DevOps platforms, since it translates raw infrastructure data into actionable insights that eliminate cloud inefficiencies and variability. By combining behavioral learning with workload semantics, the model provides meaningful suggestions for rightsizing, scaling, and alarm tuning with significantly better accuracy than traditional rule-based systems. These capabilities matter most in multi-tenant scenarios, where tenant isolation, compliance, and real-time inference requirements shape how the suggestions are constructed.
As the model learns from accepted and rejected suggestions, it aligns more closely with each tenant's operational needs, addressing inefficiency, alert noise, and performance variability. These foundations also lay the groundwork for next-generation optimization engines. Future engines will move beyond recommendations to safe, autonomous changes made within given guardrails. Tighter CI/CD integration will let changes be tracked against deployments and new service launches, while seasonal forecasts and time-series analysis will let platforms anticipate demand peaks in advance. With increasing multi-cloud adoption, engines will also map analogous instance types across AWS, GCP, and Azure to provide truly cloud-agnostic workload optimization. All of this points to a next chapter in which SaaS platforms evolve into intelligent partners that enable self-improving, self-optimizing, and faster decision-making for operationally confident teams.

By Vasanthi Jangala Naga
AI Data Storage: Challenges, Capabilities, and Comparative Analysis

The explosion in the popularity of ChatGPT has once again ignited a surge of excitement in the AI world. Over the past five years, AI has advanced rapidly and has found applications in a wide range of industries. As a storage company, we've had a front-row seat to this expansion, watching more and more AI startups and established players emerge across fields like autonomous driving, protein structure prediction, and quantitative investment. AI scenarios have introduced new challenges to the field of data storage, and existing storage solutions are often inadequate to fully meet these demands. In this article, we'll dive deep into the storage challenges in AI scenarios, critical storage capabilities, and a comparative analysis of storage products. I hope this post will help you make informed choices in AI and data storage.

Storage Challenges for AI

AI scenarios have brought new data patterns:

High-Throughput Data Access Challenges

In AI scenarios, the growing use of GPUs by enterprises has outpaced the I/O capabilities of underlying storage systems. Enterprises require storage solutions that can provide high-throughput data access to fully leverage the computing power of GPUs. For instance, in smart manufacturing, where high-precision cameras capture images for defect detection models, the training dataset may consist of only 10,000 to 20,000 high-resolution images. Each image is several gigabytes in size, resulting in a total dataset size of about 10 TB. If the storage system lacks the required throughput, it becomes a bottleneck during GPU training.

Managing Storage for Billions of Files

AI scenarios need storage solutions that can handle and provide quick access to datasets with billions of files. For example, in autonomous driving, the training dataset consists of small images, each about several hundred kilobytes in size and each treated as an individual file. A single training set comprises tens of millions of such images, and the total training data amounts to billions or even 10 billion files. This creates a major challenge in effectively managing large numbers of small files.

Scalable Throughput for Hot Data

In areas like quantitative investing, financial market data is smaller compared to computer vision datasets. However, this data must be shared among many research teams, leading to hotspots where disk throughput is fully used but still cannot satisfy the application's needs. This shows that we need storage solutions that can serve a lot of hot data quickly.

The basic computing environment has also changed a lot. These days, with cloud computing and Kubernetes getting so popular, more and more AI companies are setting up their data pipelines on Kubernetes-based platforms. Algorithm engineers request resources on the platform, write code in a notebook to debug algorithms, use workflow engines like Argo and Airflow to plan data processing workflows, use Fluid to manage datasets, and use BentoML to deploy models into apps. Cloud-native technologies have become a standard consideration when building storage platforms. As cloud computing matures, AI businesses are increasingly relying on large-scale distributed clusters. With a significant increase in the number of nodes in these clusters, storage systems face new challenges related to handling concurrent access from tens of thousands of pods within Kubernetes clusters.
IT professionals managing the underlying infrastructure face significant changes brought about by the evolving business scenarios and computing environments. Existing hardware-software-coupled storage solutions often suffer from several pain points, including a lack of elasticity, a lack of distributed high availability, and constraints on cluster scalability. Distributed file systems like GlusterFS and CephFS, and those designed for HPC, such as Lustre, BeeGFS, and GPFS, are typically designed for physical machines and bare-metal disks. While they can be deployed as large-capacity clusters, they cannot provide elastic capacity and flexible throughput, especially when dealing with storage demands on the order of tens of billions of files.

Key Capabilities for AI Data Storage

Considering these challenges, we'll outline essential storage capabilities critical for AI scenarios, helping enterprises make informed decisions when selecting storage products.

POSIX Compatibility and Data Consistency

In the AI/ML domain, POSIX is the most common API for data access. Previous-generation distributed file systems, except HDFS, are also POSIX-compatible, but products on the cloud in recent years have not been consistent in their POSIX support:

• Compatibility: Users should not rely solely on the description "POSIX-compatible product" to assess compatibility. You can use pjdfstest and the Linux Test Project (LTP) framework for testing.
• Strong guarantee of data consistency: Storage systems have different ways of ensuring consistency. File systems usually use strong consistency, while object storage systems often use eventual consistency. Choosing the right storage system therefore requires careful thought.

User Mode or Kernel Mode

Early developers preferred kernel mode because it could optimize I/O operations. However, in recent years, more developers have been moving away from kernel mode for several reasons:

• Using kernel mode ties the file system client to specific kernel versions. GPU and high-performance network card drivers often need compatibility with certain kernel versions. This combination of factors places a significant burden on kernel version selection and maintenance.
• Exceptions in kernel-mode clients can potentially freeze the host operating system. This is highly unfavorable for Kubernetes platforms.
• The user-mode FUSE library has undergone continuous iteration, resulting in significant performance improvements. It has served JuiceFS customers well across various business needs, such as autonomous driving perception model training and quantitative investment strategy training. This demonstrates that in AI scenarios, the user-mode FUSE library is no longer a performance bottleneck.

Linear Scalability of Throughput

Different file systems employ different principles for scaling throughput. Previous-generation distributed storage systems like GlusterFS, CephFS, and the HPC-oriented Lustre, BeeGFS, and GPFS primarily use all-flash solutions to build their clusters. In these systems, peak throughput equals the total performance of the disks in the cluster. To increase cluster throughput, users must scale the cluster by adding more disks. However, when users have imbalanced needs for capacity and throughput, traditional file systems require scaling the entire cluster, leading to capacity wastage. For example, a 500 TB capacity cluster using 8 TB hard drives with two replicas needs 126 drives with a throughput of 150 MB/s each (see the quick sizing sketch below).
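To make the capacity-versus-throughput tradeoff concrete, here is a small back-of-the-envelope calculation in Java using the article's example numbers (500 TB usable, two replicas, 150 MB/s HDDs, 500 MB/s SATA SSDs, 60 GB/s target); the class and method names are illustrative only.

Java
// Back-of-the-envelope sizing sketch using the article's example numbers.
public class ClusterSizing {

    // Drives needed to hold the data (capacity constraint).
    static long drivesForCapacity(double usableTb, int replicas, double driveTb) {
        return (long) Math.ceil(usableTb * replicas / driveTb);
    }

    // Drives needed to reach a throughput target (performance constraint).
    static long drivesForThroughput(double targetGbPerSec, double driveMbPerSec) {
        return (long) Math.ceil(targetGbPerSec * 1000 / driveMbPerSec);
    }

    public static void main(String[] args) {
        // 500 TB usable, two replicas, 8 TB HDDs at 150 MB/s each.
        long hdd8Tb = drivesForCapacity(500, 2, 8);              // ~125 drives (the article rounds to 126)
        double peakGbPerSec = hdd8Tb * 150 / 1000.0;             // ~18.8 GB/s peak

        // To reach 60 GB/s, both constraints must hold, so take the larger drive count.
        long smallHdds = Math.max(drivesForCapacity(500, 2, 2),
                                  drivesForThroughput(60, 150)); // ~500 x 2 TB HDDs
        long sataSsds  = Math.max(drivesForCapacity(500, 2, 8),
                                  drivesForThroughput(60, 500)); // 8 TB SSDs: capacity still dominates (~125)

        System.out.printf("8 TB HDD cluster: %d drives, ~%.1f GB/s peak%n", hdd8Tb, peakGbPerSec);
        System.out.printf("60 GB/s target: %d small HDDs or %d SATA SSDs%n", smallHdds, sataSsds);
    }
}

Capacity alone calls for roughly 125 drives, but a 60 GB/s target forces either several times as many small HDDs or a move to SSDs, which is exactly the imbalance between capacity, performance, and cost described in the surrounding text.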
The theoretical maximum throughput of the cluster is roughly 18–19 GB/s (126 × 150 MB/s). If the application demands 60 GB/s throughput, there are two options:

• Switching to 2 TB HDDs (with 150 MB/s throughput), which requires 504 drives
• Switching to 8 TB SATA SSDs (with 500 MB/s throughput) while maintaining 126 drives

The first solution increases the number of drives by four times, necessitating a corresponding increase in the number of cluster nodes. The second solution, upgrading from HDDs to SSDs, results in a significant cost increase. As you can see, it's difficult to balance capacity, performance, and cost. Capacity planning across these three dimensions becomes a challenge because we cannot predict the development, changes, and details of the real business. Therefore, decoupling storage capacity from performance scaling is a more effective approach for businesses to address these challenges.

In addition, handling hot data is a common problem in AI scenarios. An effective approach is to employ a cache grouping mechanism that automatically distributes hot data to different cache groups. This means the system automatically creates multiple copies of hot data during computation to achieve higher disk throughput, and these cache spaces are automatically reclaimed after computation.

Managing Massive Amounts of Files

Efficiently managing a large number of files, such as 10 billion files, places three demands on the storage system:

• Elastic scalability: The real-world scenario for JuiceFS users is to expand from tens of millions of files to hundreds of millions and then to billions of files. This is not possible by adding just a few nodes; storage clusters need to add nodes to achieve horizontal scaling, enabling them to support business growth effectively.
• Data distribution during horizontal scaling: During system scaling, data distribution rules based on directory name prefixes may lead to uneven data distribution.
• Scaling complexity: As the number of files increases, the ease of system scaling, stability, and the availability of tools for managing storage clusters become vital considerations. Some systems become more fragile as file counts reach billions. Ease of management and high stability are crucial for business growth.

Concurrent Load Capacity and Feature Support in Kubernetes Environments

When we look at storage system specifications, some specify the maximum number of concurrent accesses. Users need to conduct stress testing based on their business. When there are more clients, quality of service (QoS) management is required, including traffic control for each client and temporary read/write blocking policies. We must also note the design and supported features of CSI in Kubernetes. For instance: the deployment method of the mounting process, and whether it supports ReadWriteMany, subPath mounting, quotas, and hot updates.

Cost Analysis

Cost analysis is a complex topic. It covers both hardware and software purchases, while often being overshadowed by operational and maintenance costs. As AI companies grow, the volume of data increases significantly. Storage systems must exhibit both capacity and throughput scalability, offering ease of adjustment. In the past, the procurement and scaling of systems like Ceph, Lustre, and BeeGFS in data centers involved lengthy planning cycles. It took months for hardware to arrive, be configured, and become operational. Time costs, often ignored, were frequently the most significant expenditure.
Storage systems that enable elastic capacity and performance adjustments equate to faster time-to-market. Another often overlooked cost is efficiency. In AI workflows, the data pipeline is extensive, involving many interactions with the storage system. Each step, from data collection, cleaning and conversion, labeling, feature extraction, training, and backtesting, to production deployment, is influenced by how efficient the storage system is. However, businesses typically utilize only a fraction (often less than 20%) of the entire dataset actively. This subset of hot data needs to be served quickly, while warm or cold data may not be accessed very often or at all. It's hard to meet both needs in systems like Ceph, Lustre, and BeeGFS. As a result, many teams use more than one storage system to meet different needs. To get a lot of space at low cost, a common strategy is to use an object storage system for archiving. But object storage isn't exactly known for its speed, yet it often ends up handling data ingestion, preprocessing, and cleaning in the data pipeline. While this may not be the best way to preprocess data, it's often the pragmatic choice due to the sheer volume of data. Engineers then have to wait for a substantial period to transfer the data to the file storage system used for model training. Therefore, in addition to the hardware and software costs of storage systems, total cost considerations should account for time costs invested in cluster operations (including procurement and supply chain management) and time spent managing data across multiple storage systems.

Storage System Comparison

Here's a comparative analysis of the storage products mentioned earlier for your reference (POSIX compatibility, elastic capacity, maximum supported file count, performance, and cost in USD):

• Amazon S3 — POSIX: partially compatible through S3FS; Elastic capacity: yes; Max file count: hundreds of billions; Performance: medium to low; Cost: about $0.02/GB/month.
• Alluxio — POSIX: partial compatibility; Elastic capacity: N/A; Max file count: 1 billion; Performance: depends on cache capacity; Cost: N/A.
• Amazon EFS (cloud file storage service) — POSIX: NFSv4.1 compatible; Elastic capacity: yes; Max file count: N/A; Performance: depends on data size, throughput up to 3 GB/s, maximum 500 MB/s per client; Cost: $0.043–0.30/GB/month.
• Azure (cloud file storage service) — POSIX: SMB, and NFS for Premium; Elastic capacity: yes; Max file count: 100 million; Performance: scales with data capacity; Cost: $0.16/GiB/month.
• GCP Filestore (cloud file storage service) — POSIX: NFSv3 compatible; Elastic capacity: maximum 63.9 TB; Max file count: up to 67,108,864 files per 1 TiB of capacity; Performance: scales with data capacity; Cost: $0.36/GiB/month.
• Lustre — POSIX: compatible; Elastic capacity: no; Max file count: N/A; Performance: depends on cluster disk count and performance; Cost: N/A.
• Amazon FSx for Lustre — POSIX: compatible; Elastic capacity: manual scaling in 1,200 GiB increments; Max file count: N/A; Performance: multiple performance tiers of 50–200 MB/s per 1 TB of capacity; Cost: $0.073–0.6/GB/month.
• GPFS — POSIX: compatible; Elastic capacity: no; Max file count: 10 billion; Performance: depends on cluster disk count and performance; Cost: N/A.
• BeeGFS — POSIX: compatible; Elastic capacity: no; Max file count: billions; Performance: depends on cluster disk count and performance; Cost: N/A.
• JuiceFS Cloud Service — POSIX: compatible; Elastic capacity: elastic, no maximum limit; Max file count: 10 billion; Performance: depends on cache capacity; Cost: JuiceFS $0.02/GiB/month + AWS S3 $0.023/GiB/month.

Conclusion

Over the last decade, cloud computing has rapidly evolved. Previous-generation storage systems designed for data centers couldn't harness the advantages brought by the cloud, notably elasticity. Object storage, a newcomer, offers unparalleled scalability, availability, and cost-efficiency. Still, it exhibits limitations in AI scenarios. File storage, on the other hand, presents invaluable benefits for AI and other computational use cases. So, what does it take to build a file system ready for the AI era? It starts with the cloud.
The key is to marry the virtually limitless scale of object storage with a high-performance file interface. The result? We get systems that are both powerful and practical. Storage is no longer a bottleneck — it's becoming a real driver of AI innovation.

By Rui Su
2026 IaC Predictions: The Year Infrastructure Finally Grows Up

The industry spent the last decade racing to automate the cloud, and 2026 will be the year we find out what happens when automation actually wins. AI is writing Terraform and OpenTofu faster than teams can review it. Cloud providers are shipping higher-level services every month. Business units want new environments on demand. The IaC footprint inside large enterprises is exploding. Anyone operating cloud infrastructure at scale feels the tension this creates: all that velocity without control is really just chaos. So our 2026 predictions aren't about what's trendy or all over LinkedIn. They're about the essential points for teams that want to keep moving fast without breaking everything behind them.

1. Remediation Is the New Minimum Standard

Detection-only tooling fades out in 2026. Teams won't accept alerts that stack up in Jira while drift and misconfigurations quietly add risk. Platforms will be expected to automatically correct drift, reverse unauthorized console changes, and maintain the desired state continuously. Remediation engines will grow far more context-aware (understanding dependencies, policies, and intent) and will act on them without waiting for human approval. If your tooling can't fix what it finds, it won't be considered enterprise-grade.

2. AI IaC Generation Explodes… and Overruns Legacy Governance

AI becomes the fastest junior engineer on every team. It will generate modules, baselines, and full environments in seconds. Some of that output will be brilliant; some will be dangerous. ControlMonkey's GenAI Infrastructure Survey makes the shift unavoidable: 71% of cloud teams say GenAI is increasing their IaC volume, and 63% say GenAI-generated infrastructure is harder to govern than what engineers produce manually. Most importantly, 58% have already seen misconfigurations introduced directly by GenAI tools. The real shift is volume. AI won't replace IaC — it will create far more of it. That means more deployments, more drift potential, more governance surface area, and far more chances for unsafe defaults or subtle mistakes to slip through. And governance teams know it: 81% say manual review simply cannot scale with GenAI-driven change velocity. This puts massive pressure on governance layers still relying on human review or ticket-driven processes. The only sustainable answer is automated policy enforcement directly in the merge and deployment path.

3. Instant Environment Recovery Becomes Mandatory

After major cloud outages in 2025 (including the October 20 AWS meltdown and the November 18 Cloudflare network failure), the expectation inside the enterprise has changed. Executives no longer consider multi-hour restores acceptable. They expect recovery in minutes, and they expect it to be testable. In 2026, disaster recovery becomes entirely pipeline-driven. Configuration recovery patterns move from documentation to code. Teams adopt deterministic, IaC-based snapshots and full-environment recreation as part of normal operations, not as once-a-year exercises. The new standard: environments that can be restored or rebuilt as fast as they can be deployed.

4. Environment Duplication Becomes a Competitive Weapon

Teams that can clone production in minutes, for testing, debugging, onboarding, or riskier launches, will simply move faster than those that can't. Deterministic environment duplication becomes central to how high-velocity engineering organizations operate. Surrounding tooling and IaC automation make this cheap, consistent, and safe.
The organizations without this capability will move more slowly by design, because their environments are too fragile or too expensive to replicate.

5. AI Introduces More Chaos… Unless Governance Operates at Deployment Speed

AI-generated infrastructure is fast, but not inherently safe. Expect more identity misconfigurations, more exposure risks, and far more sprawl. AI accelerates the number of changes, but also the number of mistakes that can slip through if governance isn't fully automated. Manual review cycles simply won't scale with AI-driven change volume. Policy-as-code and automated guardrails evaluated on every commit become the only reliable way to manage risk at AI speed. Governance becomes part of the deployment path, not a checkpoint outside it.

6. Cloud Resilience Shifts From the CISO Office to DevOps and Platform Teams

Security will still define guardrails, but resilience is becoming a DevOps and platform responsibility. IaC is now the real system of record for how infrastructure is built, restored, and secured. Block's platform team captured this shift well in our case study: once they moved recovery into the same automated pipeline as deployment, rolling back an entire region stopped being a special event. In their words, "We treat infrastructure like code and recovery the same way." This is the new model: reproducible environments, consistent security baselines, fast rollback paths, and automated restores. RTO and RPO become engineering KPIs, not annual compliance exercises.

7. OpenTofu Takes a Meaningful Bite Out of Terraform

OpenTofu's adoption will accelerate for practical reasons: neutrality, auditability, long-term control, and regulatory comfort. (All the way back in June, the community hit 10 million downloads.) Large enterprises (especially in regulated industries) will increasingly treat OpenTofu as a strategic hedge and, in some cases, the preferred engine. Terraform isn't disappearing. But serious organizations will expect dual-engine support, and the industry will move toward treating Terraform and OpenTofu as interchangeable components inside a larger governance and automation ecosystem.

8. GitOps and Policy-as-Code Cross the Line into Mandatory

In 2026, GitOps and Policy-as-Code stop being "best practices" and become basic operational hygiene. If your cloud can be changed meaningfully outside Git, you do not have governance. You've got drift. Enterprises will increasingly rely on Terraform IaC service platforms that merge automation, governance, and remediation into a single delivery pipeline, ensuring infrastructure can evolve rapidly without spiraling into chaos. With AI accelerating change volume and multi-cloud environments expanding, Git becomes the definitive source of truth for infrastructure, and policy-as-code becomes the enforcement layer that makes velocity safe. This is the only sustainable operating model for modern infrastructure.

The Throughline: Control Is the New Velocity

The story of the last decade was getting infrastructure into code. The story of 2026 is whether you can control everything that code can now do. The organizations that win won't be the ones generating the most IaC or adopting the most AI. They'll be the ones that can fix, restore, duplicate, and govern infrastructure automatically, all at the speed modern engineering demands. IaC was the starting point. Full-lifecycle automation is the destination. 2026 is the year infrastructure finally grows up.
If your organization is rethinking its IaC and resilience strategy for 2026, we’re happy to share what we’re seeing across some of the world’s most complex cloud environments. See how teams are doing it at controlmonkey.io. Read more articles in the ControlMonkey collection.

By Aharon Twizer
How to Test POST Requests With REST Assured Java for API Testing: Part II

In the previous article, we learnt the basics, setup, and configuration of the REST Assured framework for API test automation. We also learnt to test a POST request with REST Assured by sending the request body as:

• A String
• A JSON Array/JSON Object
• Java Collections
• A POJO

In this tutorial article, we will learn the following:

• How to use JSON files as a request body for API testing.
• How to implement the Builder Design Pattern in Java to create request data dynamically.
• How to integrate the Datafaker library to generate realistic test data at runtime.
• How to perform assertions with the dynamic request data generated using the Builder design pattern and the Datafaker library.

Writing a POST API Test With a Request Body as a JSON File

JSON files can be used as a request body to test POST API requests. This approach comes in handy in the following scenarios:

• Multiple test scenarios with different payloads, where you need to maintain test data separately from test code.
• Large or complex payloads that need to be reused across multiple tests.
• Frequently changing request payloads that are easier to update in JSON files rather than using other approaches, like dynamically updating the request body using JSON Objects/Arrays or POJOs.

Apart from the above, JSON files can also be used when non-technical team members need to modify the test data before running the tests, without modifying the automation code.

With the pros, this approach has some drawbacks as well. The JSON files must be updated with unique data before each test run to avoid duplicate data errors. If you prefer not to modify the JSON files before every execution, you'll need to implement data cleanup procedures, which adds additional maintenance overhead.

We will be using the POST /addOrder API from the RESTful e-commerce demo application to write the POST API request tests. Let's add a new Java class, TestPostRequestWithJsonFile, and add a new method, getOrdersFromJson(), to it.

Java
public class TestPostRequestWithJsonFile {

    public List<Orders> getOrdersFromJson(String fileName) {
        InputStream inputStream = this.getClass()
            .getClassLoader()
            .getResourceAsStream(fileName);
        if (inputStream == null) {
            throw new IllegalArgumentException("File not found!!");
        }

        Gson gson = new Gson();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
            Type listType = new TypeToken<List<Orders>>() {
            }.getType();
            return gson.fromJson(reader, listType);
        } catch (IOException e) {
            throw new RuntimeException("Error Reading the JSON file" + fileName, e);
        }
    }
    //...
}

Code Walkthrough

The getOrdersFromJson() method accepts the JSON file as a parameter and returns a list of orders. This method functions as explained below:

• Locates the JSON file: The JSON file is placed in the src/test/resources folder; the method searches for it on the classpath using the getResourceAsStream() method. If the file is not found, it throws an IllegalArgumentException.
• Deserialises the JSON to Java objects: A BufferedReader is used for efficiently reading the file. Google's Gson library uses the TypeToken to specify the target type (List<Orders>) for proper generic type handling and converts the JSON array into a typed list of order objects. The try-with-resources block auto-closes the resources to prevent memory leaks.
The following test method, testCreateOrders(), tests the POST /addOrder API request:

Java
@Test
public void testCreateOrders() {
    List<Orders> orders = getOrdersFromJson("new_orders.json");

    given().contentType(ContentType.JSON)
        .when()
        .log()
        .all()
        .body(orders)
        .post("http://localhost:3004/addOrder")
        .then()
        .log()
        .all()
        .statusCode(201)
        .and()
        .assertThat()
        .body("message", equalTo("Orders added successfully!"));
}

The following line of code reads the file new_orders.json and uses its content as the request body to create new orders.

Java
List<Orders> orders = getOrdersFromJson("new_orders.json");

The rest of the test method remains the same as explained in the previous tutorial: it sets the content type to JSON and sends the POST request. It verifies that the status code is 201 and also asserts the message field in the response body.

Writing a POST API Test With a Request Body Using the Builder Pattern and Datafaker

The recommended approach for real-world projects is to use the Builder Pattern with the Datafaker library, as it generates dynamic data at runtime, allowing random and fresh test data generation every time the tests are executed. The key advantages of this approach are as follows:

• It provides a faster test setup, as there are no I/O operations involved in searching, locating, and reading JSON files.
• It can easily handle parallel test execution, as there is no conflict of test data between concurrent tests.
• It helps with easy maintenance, as there is no need to manually update the test data.

The Builder Pattern with Datafaker can be implemented using the following steps:

Step 1: Generate a POJO for the Request Body

The following is the schema of the request body of the POST /addOrder API:

JSON
[
  {
    "user_id": "string",
    "product_id": "string",
    "product_name": "string",
    "product_amount": 0,
    "qty": 0,
    "tax_amt": 0,
    "total_amt": 0
  }
]

Let's create a new Java class for the POJO and name it OrderData. We will use Lombok in this POJO as it helps in reducing boilerplate code, such as getters, setters, and builders. By using annotations like @Builder, @Getter, and @Setter, the class can be made concise, readable, and easier to maintain.

Java
@Getter
@Setter
@Builder
@JsonPropertyOrder({ "user_id", "product_id", "product_name", "product_amount", "qty", "tax_amt", "total_amt" })
public class OrderData {

    @JsonProperty("user_id")
    private String userId;

    @JsonProperty("product_id")
    private String productId;

    @JsonProperty("product_name")
    private String productName;

    @JsonProperty("product_amount")
    private int productAmount;

    private int qty;

    @JsonProperty("tax_amt")
    private int taxAmt;

    @JsonProperty("total_amt")
    private int totalAmt;
}

The field names of the JSON request body contain an "_", while Java's standard convention is the camelCase pattern. To bridge this mismatch, we can use the @JsonProperty annotation from the Jackson Databind library and provide the actual field name in the annotation above the respective Java variable. The order of the JSON fields can be preserved by using the @JsonPropertyOrder annotation and passing the field names in the required order.

Step 2: Create a Builder Class for Generating Data at Runtime With Datafaker

In this step, we will create a new Java class, OrderDataBuilder, for generating test data at runtime using the Datafaker library.
Java
public class OrderDataBuilder {

    public static OrderData getOrderData() {
        Faker faker = new Faker();
        int productAmount = faker.number().numberBetween(1, 1999);
        int qty = faker.number().numberBetween(1, 10);
        int grossAmt = qty * productAmount;
        int taxAmt = (int) (grossAmt * 0.10);
        int totalAmt = grossAmt + taxAmt;

        return OrderData.builder()
            .userId(String.valueOf(faker.number().numberBetween(301, 499)))
            .productId(String.valueOf(faker.number().numberBetween(201, 533)))
            .productName(faker.commerce().productName())
            .productAmount(productAmount)
            .qty(qty)
            .taxAmt(taxAmt)
            .totalAmt(totalAmt)
            .build();
    }
}

A static method, getOrderData(), has been created inside the class that uses the Datafaker library and builds the OrderData for generating the request body in JSON format at runtime. The Faker class from the Datafaker library is instantiated first and is then used to create fake data at runtime. It provides various methods to generate the required data, such as names, numbers, company names, product names, addresses, etc. Using the OrderData POJO, we can populate the required fields through Java's Builder design pattern. Since we have already applied the @Builder annotation from Lombok, it automatically enables an easy and clean way to construct OrderData objects.

Step 3: Write the POST API Request Test

Let's create a new Java class, TestPostRequestWithBuilderPattern, for implementing the test.

Java
public class TestPostRequestWithBuilderPattern {

    @Test
    public void testCreateOrders() {
        List<OrderData> orderDataList = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            orderDataList.add(getOrderData());
        }

        given().contentType(ContentType.JSON)
            .when()
            .log()
            .all()
            .body(orderDataList)
            .post("http://localhost:3004/addOrder")
            .then()
            .statusCode(201)
            .and()
            .assertThat()
            .body("message", equalTo("Orders added successfully!"));
    }
}

The request body requires the data to be sent as a JSON Array containing multiple JSON objects. The OrderDataBuilder class generates the JSON objects; the JSON Array can be assembled in the test.

Java
List<OrderData> orderDataList = new ArrayList<>();
for (int i = 0; i < 4; i++) {
    orderDataList.add(getOrderData());
}

This code generates four unique order records using the getOrderData() method and adds them to a list named orderDataList. Once the loop completes, the list holds four unique OrderData objects, each representing a new order ready to be included in the test request. The POST request is finally sent to the server, and the test checks for a status code of 201 and asserts that the response body contains the text "Orders added successfully!"

Performing Assertions With the Builder Pattern

When the request body and its data are generated dynamically, a common question arises: "Can we perform assertions on this dynamically created data?" The answer is "Yes." In fact, it is much easier and quicker to perform the assertions with the request data generated using the Builder pattern and the Datafaker library.
The following is the response body generated after successful order creation using the POST /addOrder API:

JSON
{
  "message": "Orders fetched successfully!",
  "orders": [
    {
      "id": 1,
      "user_id": "412",
      "product_id": "506",
      "product_name": "Enormous Wooden Watch",
      "product_amount": 323,
      "qty": 7,
      "tax_amt": 226,
      "total_amt": 2487
    },
    {
      "id": 2,
      "user_id": "422",
      "product_id": "447",
      "product_name": "Ergonomic Marble Shoes",
      "product_amount": 673,
      "qty": 2,
      "tax_amt": 134,
      "total_amt": 1480
    },
    {
      "id": 3,
      "user_id": "393",
      "product_id": "347",
      "product_name": "Fantastic Bronze Plate",
      "product_amount": 135,
      "qty": 9,
      "tax_amt": 121,
      "total_amt": 1336
    },
    {
      "id": 4,
      "user_id": "398",
      "product_id": "526",
      "product_name": "Incredible Leather Bottle",
      "product_amount": 1799,
      "qty": 4,
      "tax_amt": 719,
      "total_amt": 7915
    }
  ]
}

Let's say we need to perform the assertion for the user_id field in the second order and the total_amt field of the fourth order in the response. We can write the assertions with REST Assured as follows:

Java
given().contentType(ContentType.JSON)
    .when()
    .log()
    .all()
    .body(orderDataList)
    .post("http://localhost:3004/addOrder")
    .then()
    .statusCode(201)
    .and()
    .assertThat()
    .body("message", equalTo("Orders added successfully!"))
    .and()
    .assertThat()
    .body("orders[1].user_id", equalTo(orderDataList.get(1).getUserId()),
          "orders[3].total_amt", equalTo(orderDataList.get(3).getTotalAmt()));

The orders array in the response holds all the data related to the orders. Using the JSONPath "orders[1].user_id", the user_id of the second order is retrieved. Similarly, the total amount of the fourth order can be fetched using the JSONPath "orders[3].total_amt". The Builder design pattern comes in handy for comparing the expected values: orderDataList.get(1).getUserId() and orderDataList.get(3).getTotalAmt() return the dynamic values of user_id (second order) and total_amt (fourth order) that were generated and used in the request body when creating the orders at runtime.

Summary

The REST Assured framework provides flexibility in posting the request body of POST API requests. The request body can be posted using a String, a JSON Object or JSON Array, Java Collections such as List and Map, JSON files, and POJOs. The Builder design pattern in Java can be combined with the Datafaker library to generate a dynamic request body at runtime. Based on my experience, using the Builder Pattern in Java provides several advantages over other approaches for creating request bodies. It allows dynamic values to be easily generated and asserted, making test verification and validation more efficient and reliable.

By Faisal Khatri
Mastering Fluent Bit: Top 3 Telemetry Pipeline Filters for Developers (Part 11)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit filters for developers.

In case you missed the previous article, check out three tips for using telemetry pipeline multiline parsers, where you explore how to handle complex multiline log messages. This article will be a hands-on exploration of filters that help you, as a developer, test out your Fluent Bit pipelines. We'll take a look at the top three filters you'll want to know about when building your telemetry pipeline configurations in Fluent Bit. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines.

Where to Get Started

You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below:

Shell
# For source installation.
$ fluent-bit -i dummy -o stdout

# For container installation.
$ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout
...
[0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}]
...

Let's look at the top three filters that will help you with your local development testing of Fluent Bit pipelines.

Filtering in a Telemetry Pipeline

See this article for details about the service section of the configurations used in the rest of this article; for now, we plan to focus on our Fluent Bit pipeline and specifically the filters that can be of great help in managing our telemetry data during testing in our inner developer loop. The phases of a telemetry pipeline include a third phase, filter, which is where we can modify, enrich, or drop records based on specific criteria.

Filters in Fluent Bit are powerful tools that operate on records after they've been parsed but before they reach their destination. Unlike processors that work on raw data streams, filters work on structured records, giving you the ability to manipulate individual fields, add metadata, remove sensitive information, or exclude records entirely based on conditions. In production environments, you need full control of the data you're collecting. Filtering lets you alter the collected data before delivering it to a destination. Each available filter can be used to match, exclude, or enrich your logs with specific metadata.
Fluent Bit supports many filters, and understanding the most useful ones will dramatically improve your development experience. Now, let's look at the most interesting filters that developers will want to know more about. 1. Modify Filter One of the most versatile filters for telemetry pipelines that developers will encounter is the Modify filter. The Modify filter allows you to change records using rules and conditions, giving you the power to add new fields, rename existing ones, remove unwanted data, and conditionally manipulate your telemetry based on specific criteria. To provide an example, we start by creating a test configuration file called fluent-bit.yaml that demonstrates the Modify filter's capabilities: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"environment":"dev","level":"info","message":"Application started","memory_mb":512}' filters: - name: modify match: '*' add: - service_name my-application - version 1.2.3 - processed true rename: - environment env - memory_mb mem_usage remove: - level outputs: - name: stdout match: '*' format: json_lines Our configuration uses the modify filter with several different operations. The add operation inserts new fields into the record. This is extremely useful for adding metadata that your observability backend expects, such as service names, versions, or deployment information. The rename operation changes field names to match your preferred naming conventions or to comply with backend requirements. The remove operation strips out fields you don't want to send to your destination, which can reduce storage costs and improve query performance. Let's run this configuration to see the Modify filter in action: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 14:23:45.678901","env":"dev","message":"Application started","mem_usage":512,"service_name":"my-application","version":"1.2.3","processed":"true"} {"date":"2025-12-05 14:23:46.789012","env":"dev","message":"Application started","mem_usage":512,"service_name":"my-application","version":"1.2.3","processed":"true"} ... Notice how the output has been transformed? The original environment field is now env, memory_mb is now mem_usage, the level field has been removed entirely, and we've added three new fields: service_name, version, and processed. This kind of transformation is essential when you're working with multiple services that produce logs in different formats but need to be standardized before sending to your observability backend. The Modify filter also supports conditional operations using the Condition parameter. This allows you to apply modifications only when specific criteria are met. 
Let's extend our example to demonstrate conditional modifications: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"environment":"production","level":"error","message":"Database connection failed","response_time":5000}' - name: dummy tag: app.logs dummy: '{"environment":"dev","level":"info","message":"Request processed","response_time":150}' filters: - name: modify match: '*' condition: - key_value_equals environment production add: - priority high - alert true - name: modify match: '*' condition: - key_value_equals level error add: - severity critical outputs: - name: stdout match: '*' format: json_lines Let's run this configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 14:30:12.345678","environment":"production","level":"error","message":"Database connection failed","response_time":5000,"priority":"high","alert":"true","severity":"critical"} {"date":"2025-12-05 14:30:13.456789","environment":"dev","level":"info","message":"Request processed","response_time":150} ... The first record matches both conditions (production environment AND error level), so it gets priority, alert, and severity fields added. The second record doesn't match any conditions, so it passes through unchanged. This conditional logic is incredibly powerful for implementing routing rules, prioritizing certain types of logs, or adding context based on the content of your telemetry data. 2. Grep Filter Another essential filter that developers need in their telemetry toolkit is the Grep filter. The Grep filter allows you to match or exclude specific records based on regular expression patterns, giving you fine-grained control over which events flow through your pipeline. This is particularly useful during development when you want to focus on specific types of logs or exclude noisy events that aren't relevant to your current debugging session. To demonstrate the power of the Grep filter, let's create a configuration that filters application logs: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"level":"DEBUG","message":"Processing request 12345","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"ERROR","message":"Failed to connect to database","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"INFO","message":"Request completed successfully","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"WARN","message":"High memory usage detected","service":"api"}' filters: - name: grep match: '*' regex: - level ERROR|WARN outputs: - name: stdout match: '*' format: json_lines Our configuration uses the grep filter with a regex parameter to keep only records where the level field matches either ERROR or WARN. This kind of filtering is invaluable when you're troubleshooting production issues and need to focus on problematic events while ignoring routine informational logs. Let's run this configuration: YAML # For source installation. 
$ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 15:10:23.456789","level":"ERROR","message":"Failed to connect to database","service":"api"} {"date":"2025-12-05 15:10:24.567890","level":"WARN","message":"High memory usage detected","service":"api"} ... Notice that only the ERROR and WARN level logs appear in the output. The DEBUG and INFO logs have been filtered out completely. This dramatically reduces the volume of logs you need to process during development and testing. The Grep filter also supports excluding records using the exclude parameter. Let's modify our configuration to demonstrate this: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"User login successful","user":"[email protected]"}' - name: dummy tag: app.logs dummy: '{"message":"Health check passed","endpoint":"/health"}' - name: dummy tag: app.logs dummy: '{"message":"Database query executed","query":"SELECT * FROM users"}' - name: dummy tag: app.logs dummy: '{"message":"Metrics endpoint called","endpoint":"/metrics"}' filters: - name: grep match: '*' exclude: - message /health|/metrics outputs: - name: stdout match: '*' format: json_lines Let's run this updated configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 15:20:34.567890","message":"User login successful","user":"[email protected]"} {"date":"2025-12-05 15:20:35.678901","message":"Database query executed","query":"SELECT * FROM users"} ... The health check and metrics endpoint logs have been excluded from the output. This is extremely useful for filtering out routine monitoring traffic that generates high volumes of logs but provides little value during debugging. By combining regex to include specific patterns and exclude to filter out unwanted patterns, you can create sophisticated filtering rules that give you exactly the logs you need. An important note about the Grep filter is that it supports matching nested fields using the record accessor format. For example, if you have JSON logs with nested structures like {"kubernetes":{"pod_name":"my-app-123"}}, you can use $kubernetes['pod_name'] as the key to match against nested values. 3. Record Modifier Filter The third essential filter for developers is the Record Modifier filter. While the Modify filter focuses on adding, renaming, and removing fields using static values, the Record Modifier filter excels at appending fields with dynamic values, such as environment variables, and removing or allowing specific keys using pattern matching. This makes it ideal for injecting runtime context into your logs.
Let's create a configuration that demonstrates the Record Modifier filter: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"Application event","request_id":"req-12345","response_time":250,"internal_debug":"sensitive data","trace_id":"trace-abc"}' filters: - name: record_modifier match: '*' record: - hostname ${HOSTNAME} - pod_name ${POD_NAME} - namespace ${NAMESPACE} remove_key: - internal_debug outputs: - name: stdout match: '*' format: json_lines Our configuration uses the record_modifier filter with several powerful features. The record parameter adds new fields with values from environment variables. This is incredibly useful in containerized environments where hostname, pod names, and namespace information are available as environment variables but need to be injected into your logs for proper correlation and filtering in your observability backend. The remove_key parameter strips out sensitive fields that shouldn't be sent to your logging destination. Let's run this configuration with some environment variables set: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm \ -e HOSTNAME=dev-server-01 \ -e POD_NAME=my-app-pod-abc123 \ -e NAMESPACE=production \ fb ... {"date":"2025-12-05 16:15:45.678901","message":"Application event","request_id":"req-12345","response_time":250,"trace_id":"trace-abc","hostname":"dev-server-01","pod_name":"my-app-pod-abc123","namespace":"production"} ... Notice how the environment variables have been injected into the log record, and the internal_debug field has been removed. This pattern is essential for enriching your logs with contextual information that helps you understand where the logs originated in your distributed system. The Record Modifier filter also supports the allowlist_key parameter (and its legacy alias whitelist_key), which works inversely to remove_key. Instead of specifying which fields to remove, you specify which fields to keep, and all others are removed: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"User action","user_id":"12345","email":"[email protected]","password_hash":"abc123","session_token":"xyz789","action":"login","timestamp":"2025-12-05T16:20:00Z"}' filters: - name: record_modifier match: '*' allowlist_key: - message - user_id - action - timestamp outputs: - name: stdout match: '*' format: json_lines Let's run this configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 16:20:01.234567","message":"User action","user_id":"12345","action":"login","timestamp":"2025-12-05T16:20:00Z"} ... 
The sensitive fields (email, password_hash, session_token) have been completely stripped out, leaving only the allowlisted fields. This approach is particularly useful when you're dealing with logs that might contain sensitive information, and you want to take a cautious approach by explicitly defining what's safe to send to your logging backend. Another powerful feature of the Record Modifier filter is the ability to generate UUIDs for each record. This is invaluable for tracking and correlating individual log entries across your distributed system: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"Processing request","service":"api"}' filters: - name: record_modifier match: '*' uuid_key: event_id outputs: - name: stdout match: '*' format: json_lines When you run this configuration, each record will have a unique event_id field added automatically, making it easy to reference specific log entries in your observability tools. This covers the top three filters for developers getting started with Fluent Bit while trying to transform and filter their telemetry data effectively and speed up their inner development loop. More in the Series In this article, you learned about three powerful Fluent Bit filters that improve the inner developer loop experience. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring Fluent Bit routing, as there are new ways for developers to leverage this feature.

By Eric D. Schabell
Secrets in Code: Understanding Secret Detection and Its Blind Spots

In a world where attackers routinely scan public repositories for leaked credentials, secrets in source code represent a high-value target. But even with the growth of secret detection tools, many valid secrets still go unnoticed. It’s not because the secrets are hidden, but because the detection rules are too narrow or overcorrect in an attempt to avoid false positives. This creates a trade-off between wasting development time investigating false signals and risking a compromised account. This article highlights research that uncovered hundreds of valid secrets from various third-party services publicly leaked on GitHub. Responsible disclosure of the specific findings is important, but the broader learnings include which types of secrets are common, the patterns in their formatting that cause them to be missed, and how scanners work so that their failure points can be improved. Further, for platforms that are accessed with secrets, there are actionable improvements that can better protect developer communities. What Are “Secrets” in Source Code? When we say “secrets,” we’re not only talking about API tokens. Secrets include any sensitive value that, if exposed, could lead to unauthorized access, account compromise, or data leakage. This includes: API Keys: Tokens issued by services like OpenAI, GitHub, Stripe, or Gemini.Cloud Credentials: Access keys for managing AWS cloud resources or infrastructure.JWT Signing Keys: Secrets used to sign or verify JSON Web Tokens, often used in authentication logic.Session Tokens or OAuth Tokens: Temporary credentials for session continuity or authorization.One-Time Use Tokens: Password reset tokens, email verification codes, or webhook secrets.Sensitive User Data: Passwords or user attributes included in authentication payloads. Secrets can be hardcoded, generated dynamically, or embedded in token structures like JWTs. Regardless of the specific form, the goal is always to keep them out of source control management systems. How Secret Scanners Work Secret scanners generally detect secrets using patterns. For example, a GitHub Personal Access Token (PAT) like: JavaScript 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n might be matched by a regex rule such as: JavaScript ghp_[A-Za-z0-9]{36} To reduce false positives that string literal matching alone might flag, scanners often rely on: Validation: Once a match is found, some tools will try to validate that the secret is in fact a secret and not a placeholder example. This can be done by contacting its respective service. Making an authentication request to an API and interpreting the response code would let the scanner know if it is an active credential.Word Boundaries: Ensure the pattern is surrounded by non-alphanumeric characters (e.g., \bghp_...\b) to avoid matching base64 blobs or gibberish.Keywords: Contextual terms nearby (e.g., “github” or “openai”) can help infer the token’s source or use. This works well for many credential-like secrets, but in some tools the matching isn’t much more clever than running grep. Take another example: JavaScript const s = "h@rdc0ded-s3cr3t"; const t = jwt.sign(payload, s); There’s no unique prefix in cases like this. No format. But it’s still a secret, and if leaked, it could let an attacker forge authentication tokens. Secret scanners that only look for credential-shaped strings would miss this entirely.
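To make the boundary problem concrete before walking through specific blind spots, here is a minimal Node.js sketch. It uses the simplified ghp_ rule from above together with a fabricated, 36-character placeholder token (not a real credential) to show how the same valid-looking token is caught in a clean assignment but missed when it is pasted into other text or URL-encoded.

JavaScript
// A simplified detection rule modeled on the ghp_ pattern shown above.
const rule = /\bghp_[A-Za-z0-9]{36}\b/;

// The token value is a fabricated placeholder, not a real credential.
const samples = {
  clean: 'token = "ghp_0123456789abcdefghijklmnopqrstuvwxyz"',
  embeddedInMarkup: '{/* <Cardghp_0123456789abcdefghijklmnopqrstuvwxyzComponent> */}',
  urlEncoded: '%22Bearer%20ghp_0123456789abcdefghijklmnopqrstuvwxyz%22',
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, rule.test(text));
}
// clean            true  -- the surrounding quotes give the regex clean word boundaries.
// embeddedInMarkup false -- the letters on either side defeat both \b anchors.
// urlEncoded       false -- the digit in %20 sits directly against ghp_, so the leading \b fails.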
A Few Common Secret Blind Spots 1. Hardcoded JWT Secrets In a review of over 2,000 Node.js modules using popular JWT libraries, many hardcoded JWT secrets were found: JavaScript const opts = { secretOrKey: "hardcoded-secret-here" }; passport.use(new JwtStrategy(opts, verify)); These are not always caught by conventional secret scanners, because they don’t follow known token formats. If committed to source control, they can be exploited to sign or verify forged JWTs. Tracking the semantic data flow from a hardcoded string to an authorization function, rather than matching token shapes alone, can lead to much better results. 2. JWTs With Sensitive Payloads A subtle but serious risk occurs when JWTs are constructed with entire user objects, including passwords or admin flags: JavaScript const token = jwt.sign(user, obj); This often happens when working with ORM objects like Mongoose or Sequelize. If the model evolves over time to include sensitive fields, they may inadvertently end up inside issued tokens. The result: passwords, emails, or admin flags get leaked in every authentication response. 3. Secrets Hidden by Word Boundaries In a separate research survey project, hundreds of leaks were detected that stemmed from overfitting word boundaries. Word boundaries (\b) in regex patterns are used to reduce noise by preventing matches inside longer strings. But they also miss secrets embedded in HTML, comments, or a misplaced paste: JavaScript {/* <CardComponent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nents> */} Scanners requiring clean boundaries around the token will miss this even if the secret is valid. Similarly, URL-encoded secrets (like in logs or scripts) are frequently overlooked: JavaScript %22Bearer%20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n%22 Scanning GitHub Repos and Finding Missed Secrets We wanted to learn how to better tune a tool and make adjustments for non-word boundary checks, so we tested the best secret scanning tools on the market for strengths and weaknesses: GitHub, GitGuardian, Kingfisher, Semgrep, and Trufflehog. The main tokens discovered across a wide number of open-source projects were GitHub classic and fine-grained PATs, in addition to AI services such as OpenAI, Anthropic, Gemini, Perplexity, Huggingface, xAI, and Langsmith. Less common but also discovered were email providers and developer platform keys. We found that few of the providers we tested detected the valid GitHub tokens. GitHub’s default secret scanning did not detect OpenAI tokens embedded within word boundaries; this held both for push protection and for tokens already leaked within a repository. The other tokens varied per provider; some detected or missed Anthropic, Gemini, Perplexity, Huggingface, xAI, Deepseek, and others. The keys were missed due to either overly strict non-word boundaries or looking for specific keywords that either were in the wrong place or did not exist in the file. Some of the common problem classes with non-word boundaries include: unintentional placement, terminal output, encodings and escape formats, non-word character end-lines, unnecessary boundaries, or generalized regex. Common Token Prefixes and Pattern Examples Here's a sampling of secret token formats that scanners might detect or miss. The reasons include the word boundary problems described above, but also non-unique prefixes that prevent a scanner from validating a match against an authorization endpoint as a true leaked secret.

| Service Provider | Patterns | Risk Factors |
| --- | --- | --- |
| GitHub | ghp_, github_pat_, gho_, ghu_, ghr_, ghs_ | Multiple formats to look for. Often can be missed if embedded in strings or URL-encoded. |
| OpenAI | sk- | Using a hyphen can break some boundary-based detection methods. Ambiguity due to overlap with DeepSeek; inclusion of the T3BlbkFJ pattern in some formats can be a signal, but it is not consistently used. |
| DeepSeek | sk- | Using a hyphen can break some boundary-based detection methods. Easily misclassified as OpenAI without additional hints. |
| Anthropic | sk-ant- | Using a hyphen can break some boundary-based detection methods. End pattern of AA and ant- helps with unique identification. |
| Stripe | sk_live_, sk_test_ | Shares prefix with other service providers, creating collisions for auth validation when discovered. |
| APIDeck | sk_live_, sk_test_ | Shares prefixes with Stripe, which makes validation difficult. |
| Groq | gsk_ | Similar format but has a slightly different identifier, which can help with uniqueness. |
| Notion | secret_ | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| ConvertAPI | secret_ | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| LaunchDarkly | api- | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| Robinhood | api- | Common prefix for many services increases prevalence of false positives by not being able to validate authentication. |
| Nvidia | nvapi- | Allows the string to end in a hyphen (-), which can break some boundary-based detection methods. |

This is just a sample of the many platforms that issue secrets. To help safeguard them, it is important to distinguish an example placeholder from the real thing, which becomes challenging when a token format cannot uniquely identify its source. Improving Secret Detection To improve the accuracy and completeness of secret detection, consider the following strategies: For Development Teams Avoid hardcoded secrets. Use environment variables or secret managers even for placeholder examples, because hardcoded placeholders fire false positives and increase the risk of missing true positives when they occur. Use static analysis. Catch patterns like string literals passed to crypto functions, as well as inter-file data flow patterns that expose secrets in unexpected ways. Automate checking your codebase. Use tools that continuously monitor source code check-ins, preferably through pre-commit hooks, to identify whenever secrets are accidentally introduced into the code base. Relying on your SCM provider to do this is often not enough. For Service Providers Use unique, identifiable prefixes for secrets. It helps with detection. Document exact token formats; the transparency makes it easier for tools to catch them. Offer validation endpoints so that development teams can be confident in any findings being true positives. Expire tokens or encourage automatic rotation to minimize damage. Conclusion Secrets aren’t always easy to spot. They’re not always wrapped in clear delimiters, and they don’t always look like credentials. Sometimes they hide in authentication logic, get passed into token payloads, or are hardcoded during development. We explained how secret detection works, where it falls short, and how real-world leaks occur in ways many scanners don’t expect. From hardcoded JWT secrets to misplaced token strings, the cost of undetected secrets is high but preventable.

By Jayson DeLancey
Mastering Fluent Bit: 3 Tips for Telemetry Pipeline Multiline Parsers for Developers (Part 10)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit multiline parsers for developers. In case you missed the previous article, check out using telemetry pipeline processors, where you explore the top three telemetry data processors for developers. This article will be a dive into parsers that help developers test Fluent Bit pipelines when dealing with difficult and long multiline log messages. We'll take a look at using multiline parsers for your telemetry pipeline configuration in Fluent Bit. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Where to Get Started You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below: Shell # For source installation. $ fluent-bit -i dummy -o stdout # For container installation. $ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout ... [0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}] ... Let's look at the three tips for multiline parsers and how they help you manage complex log entries during your local development testing. Multiline Parsing in a Telemetry Pipeline See this article for details about the service section of the configurations used in the rest of this article, but for now, we plan to focus on our Fluent Bit pipeline and specifically the multiline parsers that can be of great help in managing our telemetry data during testing in our inner developer loop. Below, in the figure, you see the phases of a telemetry pipeline. The second phase is the parser, which is where unstructured input data is turned into structured data. Note that in this article, we explore Fluent Bit using multiline parsers that we can configure to process data in the input of our telemetry pipeline, but this is shown here as a separate phase. The challenge developers often face is that real-world applications don't always log messages on a single line. Stack traces, error messages, and debug output frequently span multiple lines. These multiline messages need to be concatenated before they can be properly parsed and processed. Fluent Bit provides multiline parsers to solve this exact problem. 
A multiline parser can recognize when multiple lines of log data belong together and concatenate them into a single event before further processing. An example of multiline log data that developers encounter daily would be a Java stack trace: Shell Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) Without multiline parsing, each line would be treated as a separate log entry. With multiline parsing, all these lines are correctly concatenated into a single structured event that maintains the complete context of the error. The Fluent Bit multiline parser engine exposes two ways to configure the feature: Built-in multiline parsersConfigurable multiline parsers Fluent Bit provides pre-configured built-in parsers for common use cases such as: docker – Process log entries generated by Docker container engine.cri – Process log entries generated by CRI-O container engine.go – Process log entries from Go applications.python – Process log entries from Python applications.ruby – Process log entries from Ruby applications.java – Process log entries from Java applications. For cases where the built-in parsers don't fit your needs, you can define custom multiline parsers. These custom parsers use regular expressions and state machines to identify the start and continuation of multiline messages. Let's look at how to configure a custom multiline parser that developers will want to know more about. Now, let's look at the most interesting tips for multiline parsers that developers will want to know more about. 1. Configurable Multiline Parser One of the more common use cases for telemetry pipelines that developers will encounter is dealing with stack traces and error messages that span multiple lines. These multiline messages need special handling to ensure they are concatenated properly before being sent to their destination. To provide an example, we start by creating a test log file called test.log with multiline Java stack trace data: Shell single line... Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) another line... Next, let's create a multiline parser configuration. We create a new file called parsers_multiline.yaml in our favorite editor and add the following configuration: Shell parsers: - name: multiline-regex-test type: regex flush_timeout: 1000 rules: - state: start_state regex: '/([a-zA-Z]+ \d+ \d+\:\d+\:\d+)(.*)/' next_state: cont - state: cont regex: '/^\s+at.*/' next_state: cont Let's break down what this multiline parser does: name – We give our parser a unique name, multiline-regex-test.type – We specify the type as regex for regular expression-based parsing.flush_timeout – After 1000ms of no new matching lines, the buffer is flushed.rules – We define the state machine rules that control multiline detection. 
The rules section is where the magic happens. A multiline parser uses states to determine which lines belong together: The start_state rule matches lines that begin a new multiline message. In our case, the pattern matches a timestamp followed by any text, which identifies the first line of our Java exception.The cont (continuation) rule matches lines that are part of the multiline message. Our pattern matches lines starting with whitespace followed by "at", which identifies the stack trace lines.Each rule specifies a next_state, which tells Fluent Bit what state to transition to after matching. This creates a state machine that can handle complex multiline patterns. When the parser sees a line matching start_state, it begins a new multiline buffer. Any subsequent lines matching the cont pattern are appended to that buffer. When a line doesn't match either pattern, or when the flush timeout expires, the complete multiline message is emitted as a single event. Now let's create our main Fluent Bit configuration file, fluent-bit.yaml, that uses this multiline parser: Shell service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on parsers_file: parsers_multiline.yaml pipeline: inputs: - name: tail path: test.log read_from_head: true multiline.parser: multiline-regex-test outputs: - name: stdout match: '*' Note several important configuration points here: We include the parsers_file in the service section to load our multiline parser definitionsWe use the tail input plugin to read from our test log fileWe set read_from_head: true to read the entire file from the beginningMost importantly, we specify multiline.parser: multiline-regex-test to apply our multiline parser The multiline parser is applied at the input stage, which is the recommended approach. This ensures that lines are concatenated before any other processing occurs. Let's run this configuration to see the multiline parser in action: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... [0] tail.0: [[1750332967.679671000, {}], {"log"=>"single line... "}] [1] tail.0: [[1750332967.679677000, {}], {"log"=>"Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) "}] [2] tail.0: [[1750332967.679677000, {}], {"log"=>"another line... ... Notice how the output shows three distinct events: The single-line message passes through unchanged.The entire stack trace is concatenated into one event, preserving the complete error context.The final single-line message passes through unchanged. This is exactly what we want. The multiline parser successfully identified the start of the Java exception and concatenated all the stack trace lines into a single structured event. 2. 
Extracting Structured Data From Multiline Messages Once you have your multiline messages properly concatenated, you'll often want to extract specific fields from them. Fluent Bit supports this through the parser filter, which can be applied after multiline parsing. Let's extend our example to extract the date and message components from the concatenated stack trace. First, we'll add a regular expression parser to our parsers_multiline.yaml file: Shell parsers: - name: multiline-regex-test type: regex flush_timeout: 1000 rules: - state: start_state regex: '/([a-zA-Z]+ \d+ \d+\:\d+\:\d+)(.*)/' next_state: cont - state: cont regex: '/^\s+at.*/' next_state: cont - name: named-capture-test format: regex regex: '/^(?<date>[a-zA-Z]+ \d+ \d+\:\d+\:\d+)\s+(?<message>(.|\n)*)$/m' The new named-capture-test parser uses named capture groups to extract: date - The timestamp at the start of the messagemessage - The remaining content, including all newlines Note the /m modifier at the end of the regex, which enables multiline mode where . (dot) can match newline characters. Now we update our main configuration to apply this parser using the parser filter: Shell service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on parsers_file: parsers_multiline.yaml pipeline: inputs: - name: tail path: test.log read_from_head: true multiline.parser: multiline-regex-test filters: - name: parser match: '*' key_name: log parser: named-capture-test outputs: - name: stdout match: '*' We've added a parser filter that: Matches all events with match: '*'Looks at the log field with key_name: logApplies the named-capture-test parser to extract structured fields Running this enhanced configuration produces: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... [0] tail.0: [[1750333602.460984000, {}], {"log"=>"single line... "}] [1] tail.0: [[1750333602.460998000, {}], {"date"=>"Dec 14 06:41:08", "message"=>"Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) "}] [2] tail.0: [[1750333602.460998000, {}], {"log"=>"another line... "}] ... Now the multiline Java exception event contains structured fields: date contains the timestampmessage contains the complete exception and stack trace This structured format makes it much easier to query, analyze, and alert on these error events in your observability backend. 3. Important Considerations for Multiline Parsers When working with multiline parsers, keep these important points in mind: Apply multiline parsing at the input stage. While you can apply multiline parsing using the multiline filter, the recommended approach is to configure it directly on the input plugin using multiline.parser. This ensures lines are concatenated before any other processing.Understand flush timeout behavior. 
The flush_timeout parameter determines how long Fluent Bit waits for additional matching lines before emitting the multiline buffer. Set this value based on your application's logging patterns. Too short and you might break up valid multiline messages. Too long and you'll introduce unnecessary latency.Use specific state patterns. Make your regular expressions as specific as possible to avoid false matches. The start_state pattern should uniquely identify the beginning of a multiline message, and continuation patterns should only match valid continuation lines.Be aware of resource implications. Multiline parsers buffer lines in memory until the complete message is ready. For applications with very large multiline messages (like huge stack traces), this can consume significant memory. The multiline parser bypasses the buffer_max_size limit to ensure complete messages are captured.Test with real data. Always test your multiline parser configurations with actual log data from your applications. Edge cases in log formatting can cause unexpected parsing behavior. This covers the three tips for developers getting started with Fluent Bit multiline parsers while trying to handle complex multiline log messages and speed up their inner development loop. More in the Series In this article, you learned how to use Fluent Bit multiline parsers to properly handle log messages that span multiple lines. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring some of the more interesting Fluent Bit filters for developers.

By Eric D. Schabell
A Guide for Deploying .NET 10 Applications Using Docker's New Workflow

Container deployment has become the cornerstone of scalable, repeatable application delivery. .NET 10 represents the latest evolution of Microsoft's cloud-native framework, offering exceptional performance, deep cross-platform support, and tight integration with modern DevOps practices. Developing with .NET 10 offers incredible performance and cross-platform capability. When paired with Docker, .NET 10 applications become truly portable artifacts that run identically across development laptops, CI/CD pipelines, staging environments, and production infrastructure — whether on-premises, cloud-hosted, or hybrid. This comprehensive guide walks you through a professional-grade containerization workflow using the .NET CLI and Docker's automated tooling, taking you from a fresh project scaffold to a production-ready, optimized container image. The next logical step is to deploy that application using Docker, which ensures that your code runs identically everywhere — from your local machine to any cloud environment. This guide outlines the most efficient process for containerizing any new .NET 10 web application using the integrated docker init tool. Why Docker and .NET 10 Are the Perfect Match The promise of containerization is straightforward in theory but demanding in practice: write once, deploy everywhere. .NET 10 and Docker together fulfill this promise with remarkable elegance. Reproducibility is the first pillar. Every developer, CI agent, and production server running your Docker image is executing identical bytecode in an identical runtime environment. No more "works on my machine" frustrations. Configuration drift — where servers gradually diverge due to manual patches, version mismatches, or environment-specific tweaks — becomes moot when your entire runtime is packaged as code. Portability extends beyond reproducibility. A .NET 10 Docker image can run anywhere Docker is supported: Linux and Windows containers, on-premises data centers, every major cloud provider (AWS ECS, Azure Container Instances, Google Cloud Run), Kubernetes clusters, edge devices, or developer workstations. Your investment in containerization unlocks unprecedented deployment flexibility. You're no longer locked into a single platform or hosting provider. Performance is where .NET 10 shines. The latest framework includes performance improvements across the runtime, IL compiler, and garbage collector. Combining this with Docker's efficient resource isolation means your containerized .NET 10 applications run lean and fast, scaling efficiently under load. Security and isolation are architectural benefits of containerization. Your application runs in a lightweight, isolated sandbox. Changes to one container don't cascade to others. Updates to your base image can be published centrally and adopted across your entire fleet without rewriting application code. This decoupling of application and infrastructure is essential for modern security practices. From a team perspective, Docker provides a shared contract between developers and the Operations team. Developers focus on code and dependencies within the Dockerfile; infrastructure teams focus on orchestration, networking, and resource allocation at the container level. This separation of concerns accelerates both development velocity and operational reliability. Setting Up Your Development Environment Prerequisites 1. Install .NET 10 SDK Download and install the .NET 10 SDK from dotnet.microsoft.com. 
Choose the installer for your operating system (Windows, macOS, or Linux). Verify installation: Shell dotnet --version dotnet --list-sdks You should see version 10.0.x listed. 2. Install Docker Desktop Download Docker Desktop from docker.com and run the installer for your operating system. Start Docker Desktop after installation. Verify installation: Shell docker --version Creating a New Web Application Using the CLI, create a new web application and make sure to set the target framework to .NET 10.0. Shell dotnet new webapp -f net10.0 -o webapplication1 The -f net10.0 flag explicitly sets .NET 10.0 as the target framework for the new project, as shown in the figure below. Once scaffolded, your project contains: Program.cs: The entry point, where you configure services and middlewareWebApplication1.csproj: The project file defining dependencies and build configurationProperties/launchSettings.json: Development launch profiles, including port mappings and loggingStandard folders like Pages, wwwroot, and others depending on your template choice Build and Test Your Application Locally Before moving to containers, verify the application runs correctly on your host: Shell dotnet run The CLI compiles your project, restores NuGet packages (if necessary), and starts the Kestrel web server. You should see output similar to the following. Shell info: Microsoft.Hosting.Lifetime[14] Now listening on: http://localhost:5172 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down. Open a browser and navigate to the listed URL (in this example, http://localhost:5172). You should see the default template page. If you browse to the HTTPS endpoint instead and are using a self-signed development certificate, your browser will warn you about the certificate; this is expected and safe to bypass during local development. This smoke test confirms that your application compiles, the Kestrel server starts correctly, and the basic request/response cycle works. Any configuration issues, missing dependencies, or logic errors will surface immediately. Catching these now saves time later in the Docker build pipeline. Containerizing With Docker Init Docker's init command is a game-changer for .NET developers. It analyzes your project structure and generates a production-grade Docker configuration tailored to your tech stack, eliminating tedious manual Dockerfile authoring for the common case. Make sure you complete the prerequisites above and ensure Docker Desktop is running. From your project root folder, run the command below: Shell docker init The command prompts you with a series of questions: Application platform: Select .NET (or ASP.NET Core if more specific)Version: It will auto-detect .NET 10 from your project filePort: Enter the port your application should listen on (default is often 8080) After responding to the prompts, docker init generates four key files, as shown in the figure below. Dockerfile The Dockerfile is the recipe for building your container image. For .NET 10, Docker Init typically generates a multi-stage build file as shown below. Dockerfile # syntax=docker/dockerfile:1 # Comments are provided throughout this file to help you get started. # If you need more help, visit the Dockerfile reference guide at # https://docs.docker.com/go/dockerfile-reference/ # Want to help us make this template better?
Share your feedback here: https://forms.gle/ybq9Krt8jtBL3iCk7 ################################################################################ # Learn about building .NET container images: # https://github.com/dotnet/dotnet-docker/blob/main/samples/README.md # Create a stage for building the application. FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:10.0-alpine AS build COPY . /source WORKDIR /source # This is the architecture you're building for, which is passed in by the builder. # Placing it here allows the previous steps to be cached across architectures. ARG TARGETARCH # Build the application. # Leverage a cache mount to /root/.nuget/packages so that subsequent builds don't have to re-download packages. # If TARGETARCH is "amd64", replace it with "x64" - "x64" is .NET's canonical name for this and "amd64" doesn't # work in .NET 6.0. RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \ dotnet publish -a ${TARGETARCH/amd64/x64} --use-current-runtime --self-contained false -o /app # If you need to enable globalization and time zones: # https://github.com/dotnet/dotnet-docker/blob/main/samples/enable-globalization.md ################################################################################ # Create a new stage for running the application that contains the minimal # runtime dependencies for the application. This often uses a different base # image from the build stage where the necessary files are copied from the build # stage. # # The example below uses an aspnet alpine image as the foundation for running the app. # It will also use whatever happens to be the most recent version of that tag when you # build your Dockerfile. If reproducibility is important, consider using a more specific # version (e.g., aspnet:7.0.10-alpine-3.18), # or SHA (e.g., mcr.microsoft.com/dotnet/aspnet@sha256:f3d99f54d504a21d38e4cc2f13ff47d67235efeeb85c109d3d1ff1808b38d034). FROM mcr.microsoft.com/dotnet/aspnet:10.0-alpine AS final WORKDIR /app # Copy everything needed to run the app from the "build" stage. COPY --from=build /app . # Switch to a non-privileged user (defined in the base image) that the app will run under. # See https://docs.docker.com/go/dockerfile-user-best-practices/ # and https://github.com/dotnet/dotnet-docker/discussions/4764 USER $APP_UID ENTRYPOINT ["dotnet", "WebApplication1.dll"] Multi-stage builds are the cornerstone of this Dockerfile. They solve a critical problem: if you built your image using only the SDK stage, the final image would be over 2 GB, containing the entire .NET SDK, build tools, source code, and intermediate artifacts. None of these are needed at runtime; they're build-time concerns only. The generated file separates these concerns into two stages: The build stage: Starts from the full .NET SDK image (mcr.microsoft.com/dotnet/sdk:10.0-alpine), which includes the compilers, build tools, and everything needed to compile C#. It runs dotnet publish, which compiles the application in Release mode and packages only the runtime-necessary binaries into the /app folder; source code is not included in the published output. The final stage: Starts from a lean ASP.NET Core runtime image (mcr.microsoft.com/dotnet/aspnet:10.0-alpine), which contains only the .NET runtime, without the SDK or build tools. The COPY --from=build instruction brings only the published binaries from the build stage. The result: a final image of roughly 150–300 MB (depending on your application), down from over 2 GB — an 80%+ reduction.
This has cascading benefits: faster builds, quicker deployments, lower storage and bandwidth costs, and a smaller attack surface for security. Layer caching is another critical optimization baked into this structure. Docker caches each layer (each instruction in the Dockerfile). When you change your C# code, Docker rebuilds only the layers after the change, reusing earlier cached layers. The generated Dockerfile also mounts a NuGet package cache during dotnet publish (hand-written Dockerfiles often achieve a similar effect by copying *.csproj and running dotnet restore early), so if only your code changes and not your dependencies, packages are not re-downloaded and the build is much faster. .dockerignore This file tells Docker which files to exclude when building the image context. Excluding the bin and obj folders is important, as these folders contain compiled binaries from your host machine and are not needed within the Docker context; the build happens inside the container to generate new binaries. Similarly, other irrelevant files or folders are not needed and are added to the .dockerignore file. Dockerfile **/.git **/.gitignore **/.vs **/.vscode **/bin **/obj **/node_modules ...... Compose.yaml This file orchestrates the local containerized development and is shown below. YAML services: server: build: context: . target: final ports: - 8080:8080 Visual Studio Code and Visual Studio are smart enough to provide an easy way to run these services by automatically creating a "Run all Services" button. Let's look at each section: services: Defines services in your stack.build: Specifies how to build the image. context: . means "use the current directory as the build context."ports: Maps container ports to host ports. "8080:8080" means "forward host port 8080 to container port 8080." When you access localhost:8080 on your development machine, traffic is routed to port 8080 inside the container. Compose.yaml is your main starting point to run your application inside Docker as a container. Depending on your application, you can make adjustments to the compose.yaml file, and there are clear comments provided in the auto-generated file to give you more knowledge about how to add other services, like PostgreSQL or any other dependencies that your application uses. Readme.Docker.md This file provides detailed instructions on how to build and run your application, as shown below, as well as guidance on how to deploy it to the cloud. Let's use these instructions to build and run the application. Building and Running Your Application as a Container Inside Docker Once you have adjusted the configuration as per your project needs, you can build and run your application by running the following Docker command from the terminal: Shell docker compose up --build The docker compose up command starts all services defined in your compose.yaml. Instead of executing the command, you can also click on the "Run All Services" button within your Visual Studio editor. Depending on the size of the base images in your Dockerfile, it can take a few minutes to build and run your application, as shown in the image below. Once it's completed, you can navigate to your application by opening the URL http://localhost:8080 in your browser, and you can also verify the Docker image by navigating to the Images tab in Docker Desktop, as shown below. You can also view application logs directly within Docker Desktop. Before pushing this to the Docker Hub repository or running it in production, scan for any vulnerabilities by using the Docker Scout command, which is built into the Docker CLI.
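As a quick sketch of that scan step (the image name below is a placeholder; use whatever tag appears for your build in docker images or the Docker Desktop Images tab):

Shell
# Summarize the image: base image, available updates, and vulnerability counts.
docker scout quickview webapplication1-server:latest

# List individual CVEs, narrowed to the most severe findings.
docker scout cves --only-severity critical,high webapplication1-server:latest

If Scout reports issues that come from the base layers rather than your code, updating the aspnet base image tag in the Dockerfile and rebuilding is usually the quickest fix.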
Conclusion Containerizing .NET 10 applications with Docker transforms development workflow and deployment reliability. The docker init tool streamlines the process, generating multi-stage Dockerfiles that produce lean, efficient images. Combined with Docker Compose for local development or managed container services for production, this workflow delivers reproducibility, portability, and operational excellence. From local development to global deployment, your .NET 10 application now runs consistently, scales elastically, and integrates seamlessly with modern cloud-native infrastructure. The investment in containerization pays dividends in deployment velocity, infrastructure cost, and team productivity.

By Naga Santhosh Reddy Vootukuri
How Migrating to Hardened Container Images Strengthens the Secure Software Development Lifecycle

Container images are the key components of the software supply chain. If they are vulnerable, the whole chain is at risk. This is why container image security should be at the core of any Secure Software Development Lifecycle (SSDLC) program. The problem is that studies show most vulnerabilities originate in the base image, not the application code. And yet, many teams still build their containers on top of random base images, undermining the security practices they already have in place. The result is hundreds of CVEs in security scans, failed audits, delayed deployments, and reactive firefighting instead of a clear vulnerability-management process. To establish reliable and efficient SSDLC processes, you need a solid foundation. This is where hardened base images enter the picture. This article explores the concept of hardened container images; how they promote SSDLC by helping teams reduce the attack surface, shift security left, and turn CVE management into a repeatable, SLA-backed workflow; and what measurable outcomes you can expect after switching to a hardened base. How the Container Security Issue Spirals Out of Control Across SSDLC Just as the life of an application starts with its programming language, the life of a container begins with its base image. Hence, the problem starts here and can be traced back as early as the requirements analysis stage of the SSDLC. This is because the requirements for selecting a base image — if they exist at all — rarely include security considerations. As a result, it is common for teams to pick a random base image. Such images often contain a full OS with numerous unnecessary components and may harbor up to 600 known vulnerabilities (CVEs) at once. Later, when the containerized application undergoes a security scan at the deployment stage, the results show hundreds of vulnerabilities. Most of them originate from the base image, not the application code, framework, or libraries. And yet, the security team must waste time addressing these flaws instead of focusing on application security. As a result: Vulnerabilities are ignored and make their way to production, orDeployments are delayed because of critical vulnerabilities, orThe team spends hours trying to patch the image. Sometimes, all three happen — if you are especially ‘lucky.’ When the container image finally reaches production, the risks associated with the existing CVEs grow as new critical CVEs appear. The team then scrambles to patch the base image, rebuild, and redeploy, hoping nothing breaks. But the problem doesn’t stop there. During preparation for a security audit, it may turn out that the base image lacks provenance data required by regulations, such as a software bill of materials (SBOM), a digital signature, or a strict update schedule. This makes it difficult for the team to meet audit requirements and may result in more than a fine for noncompliance. The presence of a package manager in the base image can worsen the problem, because the image may contain not only essential packages but many others. It is easy to add additional packages, but not as easy to trace their origin or determine whether they are required — especially when a package contains a critical CVE and you must act quickly. To summarize: a base image is not the only container security concern. However, it is the foundation of the container image — and often contains more security flaws than the application itself. 
Hardened Container Images as an SSDLC Control Point

If the foundation is rotten, the building won't last long. Therefore, you fix the foundation. In the case of container images, you replace the underlying base image. What the team needs is not just another base image but a hardened container image that prevents the issues described above.

So, what is a hardened container image? It is a strictly defined, minimal set of components required to run the application, which cannot be changed or inspected externally due to the absence of a package manager. This set of components is:

  • Free from known CVEs from the start, guaranteeing a minimal attack surface throughout the lifecycle
  • Inventoried in an SBOM and signed with a digital signature, providing comprehensive security metadata
  • Continuously monitored and patched by the vendor under an SLA, so the SRE and security teams can rely on a defined patch cadence

Free from unnecessary packages and known vulnerabilities, a hardened container image immediately reduces the attack surface of production containers. But image hardening is not just about removing components — it is about helping teams establish a clear CVE management process in which all components are listed, tracked, and continuously patched. As a result, hardened container images integrate naturally into the SSDLC program.

Enhancing Secure SDLC Workflow with Hardened Images

Thanks to the features described above, hardened container images can be smoothly integrated into SSDLC processes, allowing teams to shift security left without slowing down the release cadence or increasing developers' workload. If teams previously used random base images and dealt with patches and security audits reactively, hardened container images change the game from the start. Under the new workflow:

  • The platform team selects a set of hardened container images as the only allowed bases at the planning stage.
  • These hardened images are enforced during the build stage with CI templates and policies (see the sketch after this list).
  • Security scanners don't choke on hundreds of CVEs during the testing stage; instead, scan results show only issues that matter.
  • Immutable containers with a drastically reduced attack surface run in production; rolling updates are driven by business needs and base image updates, not manual patching.
  • SBOMs, digital signatures, and SLA-backed patch timelines ensure compliance and simplify security audits.
  • When a critical CVE appears, the vendor updates the hardened image, you rebuild your image on top of it, and the security team closes the ticket — now in days instead of weeks.

At the same time, the developers' workflow barely changes: they simply switch the base image and stop wasting time patching code that isn't theirs.

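To make the build-stage enforcement concrete, here is a minimal sketch of the kind of guardrail a platform team might wire into CI: a script that fails the pipeline when a Dockerfile builds from anything other than an approved hardened base. The registry path and image names are hypothetical placeholders, and real setups would more likely rely on an admission controller or a policy engine; this is only an illustration of the idea.

Python

import re
import sys
from pathlib import Path

# Hypothetical allowlist of hardened base images maintained by the platform team.
ALLOWED_BASES = {
    "registry.example.com/hardened/jre",
    "registry.example.com/hardened/python",
    "registry.example.com/hardened/nginx",
}

# Matches: FROM [--platform=...] <image> [AS <stage>]
FROM_LINE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)(?:\s+AS\s+(?P<stage>\S+))?",
    re.IGNORECASE,
)

def disallowed_bases(dockerfile: Path) -> list[str]:
    """Return images referenced in FROM lines that are neither earlier build
    stages nor on the hardened-base allowlist."""
    stages: set[str] = set()
    violations: list[str] = []
    for line in dockerfile.read_text().splitlines():
        match = FROM_LINE.match(line)
        if not match:
            continue
        image = match.group("image")
        if image in stages:  # e.g. "FROM builder" in a multi-stage build
            continue
        name = image.split("@")[0]  # drop a pinned digest, if any
        if ":" in name.rsplit("/", 1)[-1]:  # a tag on the final path segment
            name = name.rsplit(":", 1)[0]
        if name not in ALLOWED_BASES:
            violations.append(image)
        if match.group("stage"):
            stages.add(match.group("stage"))
    return violations

if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "Dockerfile")
    bad = disallowed_bases(path)
    if bad:
        print("Disallowed base image(s):", ", ".join(bad))
        sys.exit(1)
    print("All base images come from the hardened allowlist.")

A check like this typically runs as an early CI step, before the image is built, so violations fail fast; stricter setups push the same rule into a deployment policy so that non-compliant images never reach production.
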
DIY vs. Vendor-Backed Hardened Images

Creating and maintaining your own hardened container images is theoretically possible, but it imposes a tremendous operational burden on your team, effectively requiring them to become Linux and runtime maintainers. This requires:

  • Deep knowledge of OS and runtime internals
  • Continuous CVE monitoring and triage
  • Signing, versioning, and SBOM policies

But building a hardened base image is only part of the task. You must also patch it continuously, which requires:

  • Monitoring security advisories for your distribution and runtime(s)
  • Determining which CVEs matter to your environment
  • Rebuilding images, running tests, and coordinating rollouts
  • Communicating breaking changes to all teams

Maintaining your own hardened base therefore carries high costs: engineering time goes into maintaining the foundation instead of improving the product. Metaphorically, you must run an ultramarathon at sprinter speed.

Fortunately, there is no need to hire a dedicated team solely for base images. Several reliable vendors — including BellSoft, Chainguard, and Docker — provide ready-made hardened container images for various runtimes. This means you can outsource the hard work of maintaining secure base images to experts who do it full-time.

When selecting a vendor that ships hardened container images, make sure they provide:

  • Teams focused on OS security, packaging, and compliance
  • Signed images and standard attestations
  • SBOMs out of the box
  • Regularly updated images with tested patches
  • An SLA for patches
  • OS and runtime built from source in every image, guaranteeing that no third-party binaries with unknown CVEs or irregular update schedules are included

The full set of features depends on the vendor, so study their offerings carefully and select the base images that best fit your needs. This enables a centralized vulnerability-management process built around a trusted solution and allows engineers to focus on the product.

Measurable Outcomes of Migrating to Hardened Container Images

Migrating to hardened container images is not just about the abstract notion of "improved security." It's about transforming the chaos of unmanaged base images and unmanageable CVEs into something measurable and controllable. The table below summarizes key areas where you can track improvements driven by hardened container images:

  Area/metric          | Result
  CVEs per image       | Low to zero
  Scanner integration  | Major vulnerability scanners support the base images; the base OS package ecosystem provides a scanner package
  Scanner noise        | Meaningful results, no false-positive alerts
  Package management   | Reliable ecosystem of verified packages
  Mean time to patch   | Days
  Compliance and audit | SBOMs, standardized images, documented patch flow and SLA
  Operational burden   | Low; base image patching is handled by the vendor

Conclusion

A secure software development lifecycle depends on the integrity of every layer in the stack. Hardened container images form the foundation of this stack and represent one of its key control points. Studies show that the majority of vulnerabilities in containerized workloads originate in the base image. Standardizing on hardened, minimal, vendor-supported base images reduces this risk, improves the signal quality of security scanners, and helps create a clear and auditable patching process. Importantly, migrating to hardened images is not difficult — and some hardened images are even available for free.

Therefore, migrating to hardened container images aligns day-to-day engineering practices with security and compliance objectives, shortens response times to critical vulnerabilities, and reduces the operational overhead of managing CVEs at scale — all without affecting product delivery timelines.

By Catherine Edelveis
Why Senior Developers Are Actually Less Productive with AI Copilot (And What That Tells Us)

I watched the tech lead spend forty-five minutes wrestling with GitHub Copilot suggestions for an API endpoint. The same task would have taken fifteen minutes without the AI assistant. That situation was not an isolated case. Across the organization, we started to notice a pattern: experienced developers were slower with AI coding assistants, while junior developers maintained their momentum. This pattern made us rethink how we use these tools.

Data from multiple organizations confirms what many of us are experiencing firsthand. While junior developers see productivity gains of 30-40% with AI assistants, senior developers often experience productivity decreases of 10-15%. This counterintuitive finding reveals something profound about expertise, trust, and the future of software development.

The Trust Tax: When Verification Costs More Than Creation

The main problem is not a technical one; it is psychological. Senior developers spend years building mental models of how systems work, gathering hard-earned knowledge about edge cases, performance implications, and architecture tradeoffs. When AI Copilot suggests code, they cannot simply accept it. Their expertise forces them to verify every line.

A junior developer looks at AI-generated code and asks: "Does this work?" A senior developer looks at the same code and asks:

  • "Does this work?"
  • "Is it optimal?"
  • "Are there edge cases?"
  • "What are the security implications?"
  • "How does this scale?"
  • "What's the memory footprint?"
  • "Are we introducing technical debt?"

This verification tax is substantial. In a recent study of 250 developers across five organizations, senior developers spent an average of 4.3 minutes reviewing each AI suggestion compared to 1.2 minutes for junior developers. When you're reviewing dozens of suggestions per day, this adds hours to your workload.

The Pattern Recognition Problem

Here's where it gets interesting. Senior developers have honed their pattern recognition through years of debugging production incidents, seeing firsthand the consequences of code that looks harmless. When Copilot suggests using a simple map operation on a large dataset, a junior developer sees elegant functional code. A senior developer sees a potential memory spike during peak traffic because they've been paged at 3 AM for exactly this kind of issue before. The AI doesn't know about the time your service crashed because someone mapped over a million-item array. You do.

Real-World Example: At a company I consulted with, a junior developer accepted an AI-generated authentication function that looked clean and passed all tests. A senior developer caught that it was vulnerable to timing attacks — a subtle security flaw that wouldn't show up in standard testing but could leak information about valid usernames. The junior developer didn't know to look for this. The senior developer couldn't not see it.

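To illustrate the class of flaw being described — this is a hedged sketch, not the actual code from that engagement, and the function names and token store are hypothetical — here is how a timing leak can hide in an otherwise clean-looking credential check, and how a constant-time comparison avoids it:

Python

import hmac

# Hypothetical store of expected API tokens; stands in for whatever secret
# the AI-generated authentication function was comparing against.
EXPECTED_TOKENS = {"alice": "3f7a9c1d5e8b42f0a6d1c9e4b7f20a53"}

def verify_token_naive(username: str, token: str) -> bool:
    """Looks clean and passes functional tests, but leaks timing information:
    an ordinary equality check tends to bail out as soon as the inputs differ,
    and the early return for unknown users makes valid usernames measurably
    faster to probe."""
    expected = EXPECTED_TOKENS.get(username)
    return expected is not None and token == expected

def verify_token_safer(username: str, token: str) -> bool:
    """Compares in near-constant time. hmac.compare_digest is designed so the
    comparison time does not depend on where the inputs differ."""
    expected = EXPECTED_TOKENS.get(username, "")
    valid = hmac.compare_digest(token.encode(), expected.encode())
    return valid and username in EXPECTED_TOKENS

The point is not this specific API but that the difference between the two versions is invisible to a functional test suite — exactly the kind of thing experienced reviewers check for and most AI suggestions don't surface.
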
The False Positive Burden

I've watched senior developers struggle with a higher rate of false positives because of their heightened skepticism. They actively look for potential problems and sometimes find issues that aren't actually problems in the specific context. This often leads to unnecessary refactoring and over-engineering of AI-generated code. Senior developers sometimes reject AI suggestions because the code feels wrong based on patterns that don't match the current use case. They trust their gut-level instincts, which sometimes help but can slow down work when applied indiscriminately.

Context Windows and Architectural Thinking

The second major factor is how senior developers think about code. They don't focus solely on the immediate problem; instead, they consider broader system design, maintainability, and future extensibility. AI coding assistants excel at local optimization. They're remarkably good at solving the specific problem right in front of them, but they struggle to understand the architectural implications of their suggestions.

A senior developer looks at AI-generated code and asks questions the AI cannot answer:

  • "How does this fit with our service mesh architecture?"
  • "Does it follow our team's coding standards?"
  • "Will the next developer who touches this code understand the intent?"
  • "Does it create coupling that will make future changes harder?"

These are not just academic concerns. In complex systems, local optimizations can create global problems. A function that's perfect in isolation might introduce subtle dependencies that could cause issues months later.

The Automation Irony

There's an irony at play here. The tasks where AI assistants provide the most help are precisely the tasks that senior developers have already automated away in their minds. After years of experience, routine coding becomes muscle memory — you're barely thinking about it.

When a junior developer writes a CRUD endpoint, it's a careful step-by-step process that requires focus. When a senior developer writes the same endpoint, it's largely a matter of typing speed. AI assistance makes junior developers work faster, but it doesn't significantly impact senior developers, since they were already working at or near optimal speed for routine tasks.

Where AI could help senior developers — the genuinely novel problems, the complex architectural decisions, the subtle bug fixes — these are exactly the areas where current AI tools are weakest. As a result, senior developers get slowed down on routine tasks (because of verification overhead) without corresponding gains on complex tasks.

What This Tells Us About the Future

This productivity paradox reveals several important truths about AI-assisted development and the nature of software expertise.

Expertise Is More Than Speed

We've measured productivity in various ways, but the lines-of-code-per-day metric has always been flawed. AI assistants make that flaw more obvious. A senior developer who spends an hour thinking about architecture before writing twenty lines of code is more valuable than a developer who writes two hundred lines of AI-generated code that creates technical debt. Senior developers bring value not through their typing speed or raw problem-solving velocity but through their judgment, ability to see ripple effects, and wisdom about what not to build.

Trust Calibration Is the New Skill

The developers who will thrive with AI assistants will be neither those who accept every suggestion without question nor those who reject them all. The successful developers will build mental models that help them determine when to trust AI assistants and when to dig deeper. This requires a new kind of expertise: understanding the AI's strengths and weaknesses well enough to allocate verification effort efficiently. Some senior developers are learning to treat AI suggestions with the same calibrated skepticism they apply to code from junior team members — enough scrutiny to catch problems, but not so much that it becomes counterproductive.

Emerging Best Practice

The most effective senior developers I've seen aren't trying to verify everything AI-generated code does. Instead, they've developed heuristics for what to check carefully (security, performance, architectural fit) versus what to accept with minimal review (straightforward implementations of well-understood patterns). They're essentially building a "threat model" for AI code.

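As a purely illustrative sketch of what such a threat model might look like in practice — the categories, keywords, and review levels below are hypothetical, not a published standard — a team could encode its "check carefully vs. accept quickly" heuristics as a simple triage rule applied to each AI-assisted change:

Python

from dataclasses import dataclass

# Hypothetical signals about an AI-assisted change; in practice these might
# come from the diff, the file paths touched, and static analysis.
@dataclass
class Suggestion:
    files_touched: list[str]
    adds_dependency: bool
    loops_over_collections: bool

# Areas the team has decided always get a careful human review.
HIGH_SCRUTINY_PATHS = ("auth", "crypto", "payment", "migration")

def review_level(s: Suggestion) -> str:
    """Map a suggestion to a review depth using team-defined heuristics."""
    if any(p in f for f in s.files_touched for p in HIGH_SCRUTINY_PATHS):
        return "careful"   # security-sensitive: verify line by line
    if s.adds_dependency:
        return "careful"   # new dependencies change the threat surface
    if s.loops_over_collections:
        return "standard"  # check for the "map over a million items" trap
    return "light"         # well-understood pattern: skim and accept

The value is not in the specific rules but in making the calibration explicit, so scrutiny is spent where the verification tax actually buys something.
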
The Context Problem Won't Solve Itself

AI coding assistants operate with limited context. They can see the file you're working on and a few related files, but they don't truly understand your architecture, your team's conventions, your performance requirements, or your technical debt situation. Improving this will require more than just larger context windows. It requires AI systems capable of building and maintaining genuine architectural understanding — something that's still largely beyond current capabilities. Until then, the gap between "code that works" and "code that fits" will remain wide.

Practical Implications for Teams

Rethinking Code Review

Teams need to evolve their code review practices for the AI era. The question is not just whether the code is correct, but also whether it was AI-generated and whether the developer properly verified it. I've seen some teams require developers to flag AI-generated code in pull requests — not to ban it, but to ensure appropriate scrutiny.

In my view, AI assistants fundamentally change the economics of code creation. When they make code generation trivially easy, the bottleneck shifts to verification and integration. This makes code review more critical, and the skills required for effective review become more valuable.

Training and Skill Development

Junior developers who learn primarily with AI assistance face a real risk: they may never develop the deep understanding that comes from writing code the hard way. It's like a cook who learns with a chef who does all the prep work — they can still make meals, but they never develop essential knife skills.

Organizations should consider having junior developers work without AI assistants for their first six months to a year, just as we don't let new drivers use autopilot before they've learned to drive manually. The goal isn't to make them suffer, but to ensure they build the foundational understanding that makes AI assistance valuable rather than just fast.

The Meta-Lesson: Tools Shape Thinking

The senior developer productivity paradox reveals the deep connection between tools and thought. Senior developers are slower with AI, not despite their expertise, but because of it. The verification overhead they experience stems from the tool not aligning with their mental model of how development should work.

Junior developers are still building their mental models, so they adapt more easily to AI-assisted workflows. Senior developers, however, rely on approaches honed through years of experience, and AI assistants often work against these approaches rather than complementing them.

This isn't a criticism of either group. It's an observation about how expertise works. Actual expertise isn't just knowledge — it's intuition, pattern recognition, and deeply internalized workflows. Any tool that disrupts those workflows will face resistance, and that resistance often reflects genuine wisdom rather than mere stubbornness.

Looking Forward

The productivity paradox we're seeing today isn't permanent. As AI coding assistants improve, they'll develop better contextual awareness and respect for coding conventions. They'll provide the kind of high-level assistance that senior developers actually need.

However, we shouldn't expect the gap to close completely. The tension between AI's suggestions and human judgment will likely always exist, and that tension is healthy. The goal is not to eliminate verification but to make it more efficient.

Meanwhile, we should resist the temptation to measure developer productivity solely by output velocity. The fact that senior developers are slower with AI assistants doesn't mean they're less valuable. It often means they're doing exactly what we need them to do: applying judgment, considering implications, and protecting the codebase from well-intentioned but ultimately problematic suggestions.

Key Takeaway: The senior developer productivity paradox isn't a bug in how experienced developers use AI — it's a feature of expertise itself. The verification overhead they experience is the cost of judgment, and that judgment is precisely what makes them senior developers in the first place.

Conclusion: Redefining Productivity

We're in the middle of a fundamental shift in how software is built. AI coding assistants are potent tools, but like all transformative technologies, they bring complexity. The fact that they make senior developers slower in the short term tells us something important — we're not measuring what matters.

The value of software development has never been in raw coding speed. It's in thoughtfulness, judgment, design insight, and the ability to anticipate problems. If AI assistants help junior developers become more productive while making senior developers more deliberate, that may not be a productivity loss at all. It might represent a shift in where the bottleneck lies — from creation to curation, from typing to thinking.

In the long run, this shift could be exactly what the industry needs. We've built too much software with too little thought. If AI assistants force us to be more intentional about what we build, even if they slow the building process slightly, we may end up with better systems.

The question isn't whether senior developers should use AI assistants — that decision has already been made by the market. The question is how we adapt our workflows, metrics, and expectations to a world in which the relationship between experience and productivity has fundamentally changed. Those who figure this out first will have a significant advantage in the AI-augmented development landscape we're entering.

By Dinesh Elumalai

Culture and Methodologies

Agile
Career Development
Methodologies
Team Management

DZone's 2025 Community Survey

December 15, 2025 by Carisse Dumaua

Breaking Into Architecture: What Engineers Need to Know

December 10, 2025 by Syamanthaka B

Agile Is Dead, Long Live Agility

December 9, 2025 by Stefan Wolpers

Data Engineering

AI/ML
Big Data
Databases
IoT

AI Data Storage: Challenges, Capabilities, and Comparative Analysis

December 15, 2025 by Rui Su

Streaming vs In-Memory DataWeave: Designing for 1M+ Records Without Crashing

December 15, 2025 by Sree Harsha Meka

Escaping the "Excel Trap": Building an AI-Assisted ETL Pipeline Without a Data Team

December 15, 2025 by Dippu Kumar Singh

Software Design and Architecture

Cloud Architecture
Integration
Microservices
Performance

Streaming vs In-Memory DataWeave: Designing for 1M+ Records Without Crashing

December 15, 2025 by Sree Harsha Meka

From Metrics to Action: Adding AI Recommendations to Your SaaS App

December 15, 2025 by Vasanthi Jangala Naga

Beyond Containers: Docker-First Mobile Build Pipelines (Android and iOS) — End-to-End from Code to Artifact

December 15, 2025 by Swapnil Patil

Coding

Frameworks
Java
JavaScript
Languages
Tools

Beyond Containers: Docker-First Mobile Build Pipelines (Android and iOS) — End-to-End from Code to Artifact

December 15, 2025 by Swapnil Patil

Virtual Threads in JDK 21: Revolutionizing Java Multithreading

December 15, 2025 by Jiwan Gupta

Secrets in Code: Understanding Secret Detection and Its Blind Spots

December 12, 2025 by Jayson DeLancey

Testing, Deployment, and Maintenance

Deployment
DevOps and CI/CD
Maintenance
Monitoring and Observability

Escaping the "Excel Trap": Building an AI-Assisted ETL Pipeline Without a Data Team

December 15, 2025 by Dippu Kumar Singh

From Metrics to Action: Adding AI Recommendations to Your SaaS App

December 15, 2025 by Vasanthi Jangala Naga

2026 IaC Predictions: The Year Infrastructure Finally Grows Up

December 15, 2025 by Aharon Twizer

Popular

AI/ML
Java
JavaScript
Open Source

AI Data Storage: Challenges, Capabilities, and Comparative Analysis

December 15, 2025 by Rui Su

Escaping the "Excel Trap": Building an AI-Assisted ETL Pipeline Without a Data Team

December 15, 2025 by Dippu Kumar Singh

From Metrics to Action: Adding AI Recommendations to Your SaaS App

December 15, 2025 by Vasanthi Jangala Naga
