DZone Spotlight

Monday, December 15

DZone's 2025 Community Survey

By Carisse Dumaua
Another year passed right under our noses, and software development trends moved along with it. The steady rise of AI, the introduction of vibe coding — these are just among the many impactful shifts, and you've helped us understand them better. Now, as we move on to another exciting year, we would like to continue to learn more about you as software developers, your tech habits and preferences, and the topics you wish to know more about. With that comes our annual community survey — a great opportunity for you to give us more insights into your interests and priorities. We ask this because we want DZone to work for you. Click below to participate ⬇️ And as a small token, you will have a chance to win up to $300 in gift cards and exclusive DZone swag! All it will take is just 10–15 minutes of your time. Now, how cool is that? Over the years, DZone has remained an ever-growing avenue for exploring technology trends, looking for solutions to technical problems, and engaging in peer discussions — and we aim to keep it that way. We're going to need your help to create a more relevant and inclusive space for the DZone community. This year, we want to hear your thoughts on: Who you are as a developer: your experience and how you use toolsWhat you want to learn: your preferred learning formats and topics of interestYour DZone engagement: how often you visit DZone, which content areas pique your interest, and how you interact with the DZone community You are what drives DZone, so we want you to get the most out of every click and scroll. Every opinion is valuable to us, and we use it to equip you with the right resources to support your software development journey. And that will only be possible with your help — so thank you in advance! — Your DZone Content and Community team and our little friend, Cardy More

How to Test POST Requests With REST Assured Java for API Testing: Part II

By Faisal Khatri DZone Core
In the previous article, we learnt the basics, setup, and configuration of the REST Assured framework for API test automation. We also learnt to test a POST request with REST Assured by sending the request body as a String, as a JSON Array/JSON Object, using Java Collections, and using a POJO.

In this tutorial article, we will learn the following:
How to use JSON files as a request body for API testing.
Implement the Builder Design Pattern in Java to create request data dynamically.
Integrate the Datafaker library to generate realistic test data at runtime.
Perform assertions with the dynamic request data generated using the Builder design pattern and the Datafaker library.

Writing a POST API Test With a Request Body as a JSON File
The JSON files can be used as a request body to test the POST API requests. This approach comes in handy in the following scenarios:
Multiple test scenarios with different payloads, where you need to maintain test data separately from test code.
Large or complex payloads that need to be reused across multiple tests.
Frequently changing request payloads that are easier to update in the JSON files rather than using other approaches, like dynamically updating the request body using JSON Objects/Arrays or POJOs.

Apart from the above, JSON files can also be used when non-technical team members need to modify the test data before running the tests, without modifying the automation code. With the pros, this approach has some drawbacks as well. The JSON files must be updated with unique data before each test run to avoid duplicate data errors. If you prefer not to modify the JSON files before every execution, you’ll need to implement data cleanup procedures, which adds additional maintenance overhead.

We will be using the POST /addOrder API from the RESTful e-commerce demo application to write the POST API request tests. Let’s add a new Java class, TestPostRequestWithJsonFile, and add a new method, getOrdersFromJson(), to it.

Java public class TestPostRequestWithJsonFile { public List<Orders> getOrdersFromJson (String fileName) { InputStream inputStream = this.getClass () .getClassLoader () .getResourceAsStream (fileName); if (inputStream == null) { throw new IllegalArgumentException ("File not found!!"); } Gson gson = new Gson (); try (BufferedReader reader = new BufferedReader (new InputStreamReader (inputStream))) { Type listType = new TypeToken<List<Orders>> () { }.getType (); return gson.fromJson (reader, listType); } catch (IOException e) { throw new RuntimeException ("Error reading the JSON file: " + fileName, e); } } //... }

Code Walkthrough
The getOrdersFromJson() method accepts the JSON file name as a parameter and returns a list of orders. This method functions as explained below:
Locates the JSON file: The JSON file is placed in the src/test/resources folder; the method searches for it in the classpath using the getResourceAsStream() method. In case the file is not found, it will throw an IllegalArgumentException.
Deserialises the JSON to Java objects: The BufferedReader is used for efficiently reading the file. Google’s Gson library uses the TypeToken to specify the target type (List<Orders>) for proper generic type handling and converts the JSON array into a typed list of order objects. The try-with-resources statement automatically closes the resources to prevent memory leaks.
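Note: the Orders POJO that getOrdersFromJson() deserializes into is not shown in this part of the tutorial. The following is a minimal illustrative sketch, assuming its fields mirror the /addOrder request schema used by the demo application; the class shape, Lombok annotations, and field naming are one possible implementation rather than the article's actual class.

Java
import lombok.Getter;
import lombok.Setter;

// Illustrative sketch only: the field names deliberately match the
// snake_case keys in new_orders.json so that Gson (used to read the file)
// and the JSON serializer used by REST Assured can map them without any
// extra annotations.
@Getter
@Setter
public class Orders {
    private String user_id;
    private String product_id;
    private String product_name;
    private int product_amount;
    private int qty;
    private int tax_amt;
    private int total_amt;
}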
The following test method, testCreateOrders(), tests the POST /addOrder API request:

Java @Test public void testCreateOrders () { List<Orders> orders = getOrdersFromJson ("new_orders.json"); given ().contentType (ContentType.JSON) .when () .log () .all () .body (orders) .post ("http://localhost:3004/addOrder") .then () .log () .all () .statusCode (201) .and () .assertThat () .body ("message", equalTo ("Orders added successfully!")); }

The following line of code reads the file new_orders.json and uses its content as the request body to create new orders.

Java List<Orders> orders = getOrdersFromJson("new_orders.json")

The rest of the test method remains the same as explained in the previous tutorial: it sets the content type to JSON and sends the POST request. It verifies that the status code is 201 and also asserts the message field in the response body.

Writing a POST API Test With a Request Body Using the Builder Pattern and Datafaker
The recommended approach for real-time projects is to use the Builder Pattern with the Datafaker library, as it generates dynamic data at runtime, allowing random and fresh test data generation every time the tests are executed. The key advantages of using this approach are as follows:
It provides a faster test setup, as there are no I/O operations involved in searching, locating, and reading JSON files.
It can easily handle parallel test execution, as there is no conflict of test data between concurrent tests.
It helps with easy maintenance, as there is no need for manual updating of the test data.

The Builder Pattern with Datafaker can be implemented using the following steps:

Step 1: Generate a POJO for the Request Body
The following is the schema of the request body of the POST /addOrder API:

JSON [ { "user_id": "string", "product_id": "string", "product_name": "string", "product_amount": 0, "qty": 0, "tax_amt": 0, "total_amt": 0 } ]

Let’s create a new Java class for the POJO and name it OrderData. We will use Lombok in this POJO as it helps in reducing boilerplate code, such as getters, setters, and builders. By using annotations like @Builder, @Getter, and @Setter, the class can be made concise, readable, and easier to maintain.

Java @Getter @Setter @Builder @JsonPropertyOrder ({ "user_id", "product_id", "product_name", "product_amount", "qty", "tax_amt", "total_amt" }) public class OrderData { @JsonProperty ("user_id") private String userId; @JsonProperty ("product_id") private String productId; @JsonProperty ("product_name") private String productName; @JsonProperty ("product_amount") private int productAmount; private int qty; @JsonProperty ("tax_amt") private int taxAmt; @JsonProperty ("total_amt") private int totalAmt; }

The field names of the JSON request body contain an underscore ("_"), while Java standard conventions follow the camelCase pattern. To mitigate this mismatch, we can make use of the @JsonProperty annotation from the Jackson Databind library and provide the actual JSON field name in the annotation over the respective Java variable. The order of the JSON fields can be preserved by using the @JsonPropertyOrder annotation and passing the field names in the required order.

Step 2: Create a Builder Class for Generating Data at Runtime With Datafaker
In this step, we will create a new Java class, OrderDataBuilder, for generating test data at runtime using the Datafaker library.
Java public class OrderDataBuilder { public static OrderData getOrderData () { Faker faker = new Faker (); int productAmount = (faker.number () .numberBetween (1, 1999)); int qty = faker.number () .numberBetween (1, 10); int grossAmt = qty * productAmount; int taxAmt = (int) (grossAmt * 0.10); int totalAmt = grossAmt + taxAmt; return OrderData.builder () .userId (String.valueOf (faker.number () .numberBetween (301, 499))) .productId (String.valueOf (faker.number () .numberBetween (201, 533))) .productName (faker.commerce () .productName ()) .productAmount (productAmount) .qty (qty) .taxAmt (taxAmt) .totalAmt (totalAmt) .build (); } }

A static method, getOrderData(), has been created inside the class; it uses the Datafaker library and builds the OrderData object used to generate the request body in JSON format at runtime. The Faker class from the Datafaker library is instantiated first and is then used for creating fake data at runtime. It provides various methods to generate the required data, such as names, numbers, company names, product names, addresses, etc., at runtime. Using the OrderData POJO, we can populate the required fields through Java’s Builder design pattern. Since we have already applied the @Builder annotation from Lombok, it automatically enables an easy and clean way to construct OrderData objects.

Step 3: Write the POST API Request Test
Let’s create a new Java class, TestPostRequestWithBuilderPattern, for implementing the test.

Java public class TestPostRequestWithBuilderPattern { @Test public void testCreateOrders () { List<OrderData> orderDataList = new ArrayList<> (); for (int i = 0; i < 4; i++) { orderDataList.add (getOrderData ()); } given ().contentType (ContentType.JSON) .when () .log () .all () .body (orderDataList) .post ("http://localhost:3004/addOrder") .then () .statusCode (201) .and () .assertThat () .body ("message", equalTo ("Orders added successfully!")); } }

The request body requires the data to be sent as a JSON Array containing multiple JSON objects. The OrderDataBuilder class generates the individual JSON objects, while the JSON Array itself is assembled in the test.

Java List<OrderData> orderDataList = new ArrayList<> (); for (int i = 0; i < 4; i++) { orderDataList.add (getOrderData ()); }

This code generates four unique order records using the getOrderData() method and adds them to a list named orderDataList. Once the loop completes, the list holds four unique OrderData objects, each representing a new order ready to be included in the test request. The POST request is finally sent to the server, where it is executed, and the code checks for a status code of 201 and asserts the response body with the text “Orders added successfully!”

Performing Assertions With the Builder Pattern
When the request body and its data are generated dynamically, a common question arises: “Can we perform assertions on this dynamically created data?” The answer is “Yes.” In fact, it is much easier and quicker to perform the assertions with the request data generated using the Builder pattern and the Datafaker library.
The following is the response body generated after successful order creation using the POST /addOrder API:

JSON { "message": "Orders fetched successfully!", "orders": [ { "id": 1, "user_id": "412", "product_id": "506", "product_name": "Enormous Wooden Watch", "product_amount": 323, "qty": 7, "tax_amt": 226, "total_amt": 2487 }, { "id": 2, "user_id": "422", "product_id": "447", "product_name": "Ergonomic Marble Shoes", "product_amount": 673, "qty": 2, "tax_amt": 134, "total_amt": 1480 }, { "id": 3, "user_id": "393", "product_id": "347", "product_name": "Fantastic Bronze Plate", "product_amount": 135, "qty": 9, "tax_amt": 121, "total_amt": 1336 }, { "id": 4, "user_id": "398", "product_id": "526", "product_name": "Incredible Leather Bottle", "product_amount": 1799, "qty": 4, "tax_amt": 719, "total_amt": 7915 } ] }

Let’s say we need to perform the assertion for the user_id field in the second order and the total_amt field of the fourth order in the response. We can write the assertions with REST Assured as follows:

Java given ().contentType (ContentType.JSON) .when () .log () .all () .body (orderDataList) .post ("http://localhost:3004/addOrder") .then () .statusCode (201) .and () .assertThat () .body ("message", equalTo ("Orders added successfully!")) .and () .assertThat () .body ("orders[1].user_id", equalTo (orderDataList.get (1) .getUserId ()), "orders[3].total_amt", equalTo (orderDataList.get (3) .getTotalAmt ()));

The orders array in the response holds all the data related to the orders. Using the JSONPath “orders[1].user_id”, the user_id of the second order will be retrieved. Similarly, the total amount of the fourth order can be fetched using the JSONPath orders[3].total_amt. The Builder design pattern comes in handy for comparing the expected values, where we can use orderDataList.get(1).getUserId() and orderDataList.get(3).getTotalAmt() to get the dynamic values of user_id (second order) and total_amt (fourth order) that were generated and used in the request body for creating orders at runtime. A sketch that extends these assertions to every order in the list appears after the summary below.

Summary
The REST Assured framework provides flexibility to post the request body in the POST API requests. The request body can be posted using a String, JSON Object, or JSON Array, Java Collections such as List and Map, JSON files, and POJOs. The Builder design pattern in Java can be combined with the Datafaker library to generate a dynamic request body at runtime. Based on my experience, using the Builder Pattern in Java provides several advantages over other approaches for creating request bodies. It allows dynamic values to be easily generated and asserted, making test verification and validation more efficient and reliable.
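As referenced above, the spot-check assertions can be generalized to every generated order. The following is an illustrative sketch, not part of the original tutorial; it assumes a TestNG-style assertEquals(actual, expected), REST Assured's extract()/jsonPath() API, and that the POST /addOrder response echoes the created orders as shown in the sample response body.

Java
// Illustrative sketch: verify every order in the response against the
// dynamically generated request data, instead of spot-checking two fields.
Response response = given ().contentType (ContentType.JSON)
    .body (orderDataList)
    .when ()
    .post ("http://localhost:3004/addOrder")
    .then ()
    .statusCode (201)
    .extract ()
    .response ();

for (int i = 0; i < orderDataList.size (); i++) {
    OrderData expected = orderDataList.get (i);
    // JSONPath indices follow the same order in which the orders were sent.
    assertEquals (response.jsonPath ().getString ("orders[" + i + "].user_id"),
        expected.getUserId ());
    assertEquals (response.jsonPath ().getInt ("orders[" + i + "].total_amt"),
        expected.getTotalAmt ());
}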

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human driven and machine assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal for practitioners and leaders alike to reorient our collective understanding of how old models and new paradigms are converging to define what’s next for data management and storage.

Refcard #397

Secrets Management Core Practices

By Apostolos Giannakidis DZone Core

Refcard #375

Cloud-Native Application Security Patterns and Anti-Patterns

By Samir Behara

More Articles

Mastering Fluent Bit: Top 3 Telemetry Pipeline Filters for Developers (Part 11)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit filters for developers. In case you missed the previous article, check out three tips for using telemetry pipeline multiline parsers, where you explore how to handle complex multiline log messages. This article will be a hands-on exploration of filters that help you, as a developer, test out your Fluent Bit pipelines. We'll take a look at the top three filters you'll want to know about when building your telemetry pipeline configurations in Fluent Bit. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Where to Get Started You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below: Shell # For source installation. $ fluent-bit -i dummy -o stdout # For container installation. $ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout ... [0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}] ... Let's look at the top three filters that will help you with your local development testing of Fluent Bit pipelines. Filtering in a Telemetry Pipeline See this article for details about the service section of the configurations used in the rest of this article, but for now, we plan to focus on our Fluent Bit pipeline and specifically the filters that can be of great help in managing our telemetry data during testing in our inner developer loop. Below, in the figure, you see the phases of a telemetry pipeline. The third phase is filter, which is where we can modify, enrich, or drop records based on specific criteria. Filters in Fluent Bit are powerful tools that operate on records after they've been parsed but before they reach their destination. Unlike processors that work on raw data streams, filters work on structured records, giving you the ability to manipulate individual fields, add metadata, remove sensitive information, or exclude records entirely based on conditions. In production environments, you need full control of the data you're collecting. Filtering lets you alter the collected data before delivering it to a destination. Each available filter can be used to match, exclude, or enrich your logs with specific metadata. 
Fluent Bit supports many filters, and understanding the most useful ones will dramatically improve your development experience. Now, let's look at the most interesting filters that developers will want to know more about. 1. Modify Filter One of the most versatile filters for telemetry pipelines that developers will encounter is the Modify filter. The Modify filter allows you to change records using rules and conditions, giving you the power to add new fields, rename existing ones, remove unwanted data, and conditionally manipulate your telemetry based on specific criteria. To provide an example, we start by creating a test configuration file called fluent-bit.yaml that demonstrates the Modify filter's capabilities: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"environment":"dev","level":"info","message":"Application started","memory_mb":512}' filters: - name: modify match: '*' add: - service_name my-application - version 1.2.3 - processed true rename: - environment env - memory_mb mem_usage remove: - level outputs: - name: stdout match: '*' format: json_lines Our configuration uses the modify filter with several different operations. The add operation inserts new fields into the record. This is extremely useful for adding metadata that your observability backend expects, such as service names, versions, or deployment information. The rename operation changes field names to match your preferred naming conventions or to comply with backend requirements. The remove operation strips out fields you don't want to send to your destination, which can reduce storage costs and improve query performance. Let's run this configuration to see the Modify filter in action: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 14:23:45.678901","env":"dev","message":"Application started","mem_usage":512,"service_name":"my-application","version":"1.2.3","processed":"true"} {"date":"2025-12-05 14:23:46.789012","env":"dev","message":"Application started","mem_usage":512,"service_name":"my-application","version":"1.2.3","processed":"true"} ... Notice how the output has been transformed? The original environment field is now env, memory_mb is now mem_usage, the level field has been removed entirely, and we've added three new fields: service_name, version, and processed. This kind of transformation is essential when you're working with multiple services that produce logs in different formats but need to be standardized before sending to your observability backend. The Modify filter also supports conditional operations using the Condition parameter. This allows you to apply modifications only when specific criteria are met. 
Let's extend our example to demonstrate conditional modifications: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"environment":"production","level":"error","message":"Database connection failed","response_time":5000}' - name: dummy tag: app.logs dummy: '{"environment":"dev","level":"info","message":"Request processed","response_time":150}' filters: - name: modify match: '*' condition: - key_value_equals environment production add: - priority high - alert true - name: modify match: '*' condition: - key_value_equals level error add: - severity critical outputs: - name: stdout match: '*' format: json_lines Let's run this configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 14:30:12.345678","environment":"production","level":"error","message":"Database connection failed","response_time":5000,"priority":"high","alert":"true","severity":"critical"} {"date":"2025-12-05 14:30:13.456789","environment":"dev","level":"info","message":"Request processed","response_time":150} ... The first record matches both conditions (production environment AND error level), so it gets priority, alert, and severity fields added. The second record doesn't match any conditions, so it passes through unchanged. This conditional logic is incredibly powerful for implementing routing rules, prioritizing certain types of logs, or adding context based on the content of your telemetry data. 2. Grep Filter Another essential filter that developers need in their telemetry toolkit is the Grep filter. The Grep filter allows you to match or exclude specific records based on regular expression patterns, giving you fine-grained control over which events flow through your pipeline. This is particularly useful during development when you want to focus on specific types of logs or exclude noisy events that aren't relevant to your current debugging session. To demonstrate the power of the Grep filter, let's create a configuration that filters application logs: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"level":"DEBUG","message":"Processing request 12345","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"ERROR","message":"Failed to connect to database","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"INFO","message":"Request completed successfully","service":"api"}' - name: dummy tag: app.logs dummy: '{"level":"WARN","message":"High memory usage detected","service":"api"}' filters: - name: grep match: '*' regex: - level ERROR|WARN outputs: - name: stdout match: '*' format: json_lines Our configuration uses the grep filter with a regex parameter to keep only records where the level field matches either ERROR or WARN. This kind of filtering is invaluable when you're troubleshooting production issues and need to focus on problematic events while ignoring routine informational logs. Let's run this configuration: YAML # For source installation. 
$ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 15:10:23.456789","level":"ERROR","message":"Failed to connect to database","service":"api"} {"date":"2025-12-05 15:10:24.567890","level":"WARN","message":"High memory usage detected","service":"api"} ... Notice that only the ERROR and WARN level logs appear in the output. The DEBUG and INFO logs have been filtered out completely. This dramatically reduces the volume of logs you need to process during development and testing. The Grep filter also supports excluding records using the exclude parameter. Let's modify our configuration to demonstrate this: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"User login successful","user":"[email protected]"}' - name: dummy tag: app.logs dummy: '{"message":"Health check passed","endpoint":"/health"}' - name: dummy tag: app.logs dummy: '{"message":"Database query executed","query":"SELECT * FROM users"}' - name: dummy tag: app.logs dummy: '{"message":"Metrics endpoint called","endpoint":"/metrics"}' filters: - name: grep match: '*' exclude: - message /health|/metrics outputs: - name: stdout match: '*' format: json_lines Let's run this updated configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 15:20:34.567890","message":"User login successful","user":"[email protected]"} {"date":"2025-12-05 15:20:35.678901","message":"Database query executed","query":"SELECT * FROM users"} ... The health check and metrics endpoint logs have been excluded from the output. This is extremely useful for filtering out routine monitoring traffic that generates high volumes of logs but provides little value during debugging. By combining regex to include specific patterns and exclude to filter out unwanted patterns, you can create sophisticated filtering rules that give you exactly the logs you need. An important note about the Grep filter is that it supports matching nested fields using the record accessor format. For example, if you have JSON logs with nested structures like {"kubernetes":{"pod_name":"my-app-123"}, you can use $kubernetes['pod_name'] as the key to match against nested values. 3. Record Modifier Filter The third essential filter for developers is the Record Modifier filter. While the Modify filter focuses on adding, renaming, and removing fields using static values, the Record Modifier filter excels at appending fields with dynamic values, such as environment variables, and removing or allowing specific keys using pattern matching. This makes it ideal for injecting runtime context into your logs. 
Let's create a configuration that demonstrates the Record Modifier filter: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"Application event","request_id":"req-12345","response_time":250,"internal_debug":"sensitive data","trace_id":"trace-abc"}' filters: - name: record_modifier match: '*' record: - hostname ${HOSTNAME} - pod_name ${POD_NAME} - namespace ${NAMESPACE} remove_key: - internal_debug outputs: - name: stdout match: '*' format: json_lines Our configuration uses the record_modifier filter with several powerful features. The record parameter adds new fields with values from environment variables. This is incredibly useful in containerized environments where hostname, pod names, and namespace information are available as environment variables but need to be injected into your logs for proper correlation and filtering in your observability backend. The remove_key parameter strips out sensitive fields that shouldn't be sent to your logging destination. Let's run this configuration with some environment variables set: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm \ -e HOSTNAME=dev-server-01 \ -e POD_NAME=my-app-pod-abc123 \ -e NAMESPACE=production \ fb ... {"date":"2025-12-05 16:15:45.678901","message":"Application event","request_id":"req-12345","response_time":250,"trace_id":"trace-abc","hostname":"dev-server-01","pod_name":"my-app-pod-abc123","namespace":"production"} ... Notice how the environment variables have been injected into the log record, and the internal_debug field has been removed. This pattern is essential for enriching your logs with contextual information that helps you understand where the logs originated in your distributed system. The Record Modifier filter also supports the allowlist_key parameter (and its legacy alias whitelist_key), which works inversely to remove_key. Instead of specifying which fields to remove, you specify which fields to keep, and all others are removed: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"User action","user_id":"12345","email":"[email protected]","password_hash":"abc123","session_token":"xyz789","action":"login","timestamp":"2025-12-05T16:20:00Z"}' filters: - name: record_modifier match: '*' allowlist_key: - message - user_id - action - timestamp outputs: - name: stdout match: '*' format: json_lines Let's run this configuration: YAML # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-12-05 16:20:01.234567","message":"User action","user_id":"12345","action":"login","timestamp":"2025-12-05T16:20:00Z"} ... 
The sensitive fields (email, password_hash, session_token) have been completely stripped out, leaving only the allowlisted fields. This approach is particularly useful when you're dealing with logs that might contain sensitive information, and you want to take a cautious approach by explicitly defining what's safe to send to your logging backend. Another powerful feature of the Record Modifier filter is the ability to generate UUIDs for each record. This is invaluable for tracking and correlating individual log entries across your distributed system: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"message":"Processing request","service":"api"}' filters: - name: record_modifier match: '*' uuid_key: event_id outputs: - name: stdout match: '*' format: json_lines When you run this configuration, each record will have a unique event_id field added automatically, making it easy to reference specific log entries in your observability tools. This covers the top three filters for developers getting started with Fluent Bit while trying to transform and filter their telemetry data effectively and speed up their inner development loop. More in the Series In this article, you learned about three powerful Fluent Bit filters that improve the inner developer loop experience. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring Fluent Bit routing, as there are new ways for developers to leverage this feature.

By Eric D. Schabell DZone Core
Secrets in Code: Understanding Secret Detection and Its Blind Spots

In a world where attackers routinely scan public repositories for leaked credentials, secrets in source code represent a high-value target. But even with the growth of secret detection tools, many valid secrets still go unnoticed. It’s not because the secrets are hidden, but because the detection rules are too narrow or overcorrect in an attempt to avoid false positives. This creates a trade-off between wasting development time investigating false signals and risking a compromised account. This article highlights research that uncovered hundreds of valid secrets from various third-party services publicly leaked on GitHub. Responsible disclosure of the specific findings is important, but the broader learnings include which types of secrets are common, the patterns in their formatting that cause them to be missed, and how scanners work so that their failure points can be improved. Further, for platforms that are accessed with secrets, there are actionable improvements that can better protect developer communities. What Are “Secrets” in Source Code? When we say “secrets,” we’re not only talking about API tokens. Secrets include any sensitive value that, if exposed, could lead to unauthorized access, account compromise, or data leakage. This includes: API Keys: Tokens issued by services like OpenAI, GitHub, Stripe, or Gemini.Cloud Credentials: Access keys for managing AWS cloud resources or infrastructure.JWT Signing Keys: Secrets used to sign or verify JSON Web Tokens, often used in authentication logic.Session Tokens or OAuth Tokens: Temporary credentials for session continuity or authorization.One-Time Use Tokens: Password reset tokens, email verification codes, or webhook secrets.Sensitive User Data: Passwords or user attributes included in authentication payloads. Secrets can be hardcoded, generated dynamically, or embedded in token structures like JWTs. Regardless of the specific form, the goal is always to keep them out of source control management systems. How Secret Scanners Work Secret scanners generally detect secrets using patterns. For example, a GitHub Personal Access Token (PAT) like: JavaScript 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n might be matched by a regex rule such as: JavaScript ghp_[A-Za-z0-9]{36} To reduce false positives that string literal matching alone might flag, scanners often rely on: Validation: Once a match is found, some tools will try to validate the secret is in fact a secret and not a placeholder example. This can be done by contacting its respective service. Making an authentication request to an API and interpreting the response code would let the scanner know if it is an active credential.Word Boundaries: Ensure the pattern is surrounded by non-alphanumeric characters (e.g. \bghp_...\b), to avoid matching base64 blobs or gibberish.Keywords: Contextual terms nearby (e.g. “github” or “openai”) can better infer the token’s source or use. This works well for many credential-like secrets, but for some tools this isn’t done in a way that is much more clever than running grep. Take another example: JavaScript const s = "h@rdc0ded-s3cr3t"; const t = jwt.sign(payload, s); There’s no unique prefix in cases like this. No format. But it’s still a secret, and if leaked, it could let an attacker forge authentication tokens. Secret scanners that only look for credential-shaped strings would miss this entirely. A Few Common Secret Blind Spots 1. 
Hardcoded JWT Secrets In a review of over 2,000 Node.js modules using popular JWT libraries, many hardcoded JWT secrets were found: JavaScript const opts = { secretOrKey: "hardcoded-secret-here" }; passport.use(new JwtStrategy(opts, verify)); These are not always caught by conventional secret scanners, because they don’t follow known token formats. If committed to source control, they can be exploited to sign or verify forged JWTs. Tracking the semantic data flow of a hardcoded secret into an authorization function can lead to much better detection results. 2. JWTs With Sensitive Payloads A subtle but serious risk occurs when JWTs are constructed with entire user objects, including passwords or admin flags: JavaScript const token = jwt.sign(user, obj); This often happens when working with ORM objects like Mongoose or Sequelize. If the model evolves over time to include sensitive fields, they may inadvertently end up inside issued tokens. The result: passwords, emails, or admin flags get leaked in every authentication response. 3. Secrets Hidden by Word Boundaries In a separate research survey project, hundreds of leaks were detected from overfitting word boundaries. Word boundaries (\b) in regex patterns are used to reduce noise by preventing matches inside longer strings. But they also miss secrets embedded in HTML, comments, or a misplaced paste: JavaScript {/* <CardComponent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nents> */} Scanners requiring clean boundaries around the token will miss this even if the secret is valid. Similarly, URL-encoded secrets (like in logs or scripts) are frequently overlooked: JavaScript %22Bearer%20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n%22 A short sketch at the end of this article illustrates this failure mode.

Scanning GitHub Repos and Finding Missed Secrets
We wanted to learn how to better tune a tool and make adjustments for non-word-boundary checks, so we tested the best secret scanning tools on the market for strengths and weaknesses: GitHub, GitGuardian, Kingfisher, Semgrep, and Trufflehog. The main tokens discovered across a wide number of open-source projects were GitHub classic and fine-grained PATs, in addition to AI services such as OpenAI, Anthropic, Gemini, Perplexity, Huggingface, xAI, and Langsmith. Less common but also discovered were email providers and developer platform keys. We found that few of the providers we tested detected the valid tokens associated with GitHub. GitHub’s default secret scanning did not detect OpenAI tokens within word boundaries; this includes push protection and scanning once a token is leaked within a repository. The other tokens varied per provider; some detected or missed Anthropic, Gemini, Perplexity, Huggingface, xAI, DeepSeek, and others. The keys were missed due to either overly strict non-word boundaries or looking for specific keywords that either were in the wrong place or did not exist in the file. Some of the common problem classes with non-word boundaries include: unintentional placement, terminal output, encodings and escape formats, non-word character end-lines, unnecessary boundaries, or generalized regex.

Common Token Prefixes and Pattern Examples
Here's a sampling of secret token formats that scanners might detect or miss. The reasons include the word boundary problems described above, but non-unique prefixes can also prevent a scanner from validating a match against an authorization endpoint as a true leaked secret.

Service Provider | Patterns | Risk Factors
GitHub | ghp_, github_pat_, gho_, ghu_, ghr_, ghs_ | Multiple formats to look for. Often can be missed if embedded in strings or URL-encoded.
OpenAI | sk- | Using a hyphen can break some boundary-based detection methods. Ambiguity due to overlap with DeepSeek, but inclusion of the T3BlbkFJ pattern in some formats can be a signal, though it is not consistently used.
DeepSeek | sk- | Using a hyphen can break some boundary-based detection methods. Easily misclassified as OpenAI without additional hints.
Anthropic | sk-ant- | Using a hyphen can break some boundary-based detection methods. The AA end pattern and the ant- marker help with unique identification.
Stripe | sk_live_, sk_test_ | Shares prefixes with other service providers, creating collisions for auth validation when discovered.
APIDeck | sk_live_, sk_test_ | Shares prefixes with Stripe, which makes validation difficult.
Groq | gsk_ | Similar format, but the slightly different identifier can help with uniqueness.
Notion | secret_ | Common prefix for many services increases the prevalence of false positives by not being able to validate authentication.
ConvertAPI | secret_ | Common prefix for many services increases the prevalence of false positives by not being able to validate authentication.
LaunchDarkly | api- | Common prefix for many services increases the prevalence of false positives by not being able to validate authentication.
Robinhood | api- | Common prefix for many services increases the prevalence of false positives by not being able to validate authentication.
Nvidia | nvapi- | Allows the string to end in a hyphen (-), which can break some boundary-based detection methods.

This is just a sample of the many platforms that have secrets. To help safeguard them, it is important to distinguish between an example placeholder and the real thing, and without a uniquely identifiable format, attributing a token to its source service becomes challenging.

Improving Secret Detection
To improve the accuracy and completeness of secret detection, consider the following strategies:

For Development Teams
Avoid hardcoded secrets. Use environment variables or secret managers even for what is only meant to be a placeholder example, because placeholders can fire false positives and risk masking true positives when they occur.
Use static analysis. Catch patterns like string literals in crypto functions, but also data flow patterns that cross between files (inter-file) and expose secrets in unexpected ways.
Automate checking your codebase. Use tools that continuously monitor source code check-ins, preferably through pre-commit hooks, to identify whenever secrets are accidentally introduced into the code base. Relying on your SCM provider to do this is often not enough.

For Service Providers
Use unique, identifiable prefixes for secrets. It helps with detection.
Document exact token formats; the transparency makes it easier for tools to catch them.
Offer validation endpoints so that development teams can be confident that any findings are true positives.
Expire tokens automatically, or encourage rotating them, to minimize damage.

Conclusion
Secrets aren’t always easy to spot. They’re not always wrapped in clear delimiters, and they don’t always look like credentials. Sometimes they hide in authentication logic, passed into token payloads, or hardcoded during development. We explained how secret detection works, where it falls short, and how real-world leaks occur in ways many scanners don’t expect. From hardcoded JWT secrets to misplaced token strings, the cost of undetected secrets is high but preventable.
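As referenced in the word-boundary discussion above, here is a small illustrative Java sketch (not from the article) that compares a boundary-anchored pattern with a relaxed one against the URL-encoded token example shown earlier. The class name and the exact patterns are assumptions for demonstration purposes; real scanners use more sophisticated rules and validation.

Java
import java.util.regex.Pattern;

public class BoundaryDemo {
    public static void main(String[] args) {
        // URL-encoded leak, as in the earlier example; the token is a fake
        // placeholder in the GitHub classic PAT shape.
        String leaked = "%22Bearer%20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n%22";

        // Strict rule: requires word boundaries around the token. The "%20"
        // before the token ends in the digit 0, a word character adjacent to
        // 'g', so there is no boundary there and the match fails.
        Pattern strict = Pattern.compile("\\bghp_[A-Za-z0-9]{36}\\b");

        // Relaxed rule: same token shape, no boundary requirement, so the
        // embedded token is still found.
        Pattern relaxed = Pattern.compile("ghp_[A-Za-z0-9]{36}");

        System.out.println("strict match:  " + strict.matcher(leaked).find());
        System.out.println("relaxed match: " + relaxed.matcher(leaked).find());
    }
}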

By Jayson DeLancey
Mastering Fluent Bit: 3 Tips for Telemetry Pipeline Multiline Parsers for Developers (Part 10)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit multiline parsers for developers. In case you missed the previous article, check out using telemetry pipeline processors, where you explore the top three telemetry data processors for developers. This article will be a dive into parsers that help developers test Fluent Bit pipelines when dealing with difficult and long multiline log messages. We'll take a look at using multiline parsers for your telemetry pipeline configuration in Fluent Bit. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Where to Get Started You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below: Shell # For source installation. $ fluent-bit -i dummy -o stdout # For container installation. $ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout ... [0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}] ... Let's look at the three tips for multiline parsers and how they help you manage complex log entries during your local development testing. Multiline Parsing in a Telemetry Pipeline See this article for details about the service section of the configurations used in the rest of this article, but for now, we plan to focus on our Fluent Bit pipeline and specifically the multiline parsers that can be of great help in managing our telemetry data during testing in our inner developer loop. Below, in the figure, you see the phases of a telemetry pipeline. The second phase is the parser, which is where unstructured input data is turned into structured data. Note that in this article, we explore Fluent Bit using multiline parsers that we can configure to process data in the input of our telemetry pipeline, but this is shown here as a separate phase. The challenge developers often face is that real-world applications don't always log messages on a single line. Stack traces, error messages, and debug output frequently span multiple lines. These multiline messages need to be concatenated before they can be properly parsed and processed. Fluent Bit provides multiline parsers to solve this exact problem. 
A multiline parser can recognize when multiple lines of log data belong together and concatenate them into a single event before further processing. An example of multiline log data that developers encounter daily would be a Java stack trace: Shell Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) Without multiline parsing, each line would be treated as a separate log entry. With multiline parsing, all these lines are correctly concatenated into a single structured event that maintains the complete context of the error. The Fluent Bit multiline parser engine exposes two ways to configure the feature: Built-in multiline parsersConfigurable multiline parsers Fluent Bit provides pre-configured built-in parsers for common use cases such as: docker – Process log entries generated by Docker container engine.cri – Process log entries generated by CRI-O container engine.go – Process log entries from Go applications.python – Process log entries from Python applications.ruby – Process log entries from Ruby applications.java – Process log entries from Java applications. For cases where the built-in parsers don't fit your needs, you can define custom multiline parsers. These custom parsers use regular expressions and state machines to identify the start and continuation of multiline messages. Let's look at how to configure a custom multiline parser that developers will want to know more about. Now, let's look at the most interesting tips for multiline parsers that developers will want to know more about. 1. Configurable Multiline Parser One of the more common use cases for telemetry pipelines that developers will encounter is dealing with stack traces and error messages that span multiple lines. These multiline messages need special handling to ensure they are concatenated properly before being sent to their destination. To provide an example, we start by creating a test log file called test.log with multiline Java stack trace data: Shell single line... Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) another line... Next, let's create a multiline parser configuration. We create a new file called parsers_multiline.yaml in our favorite editor and add the following configuration: Shell parsers: - name: multiline-regex-test type: regex flush_timeout: 1000 rules: - state: start_state regex: '/([a-zA-Z]+ \d+ \d+\:\d+\:\d+)(.*)/' next_state: cont - state: cont regex: '/^\s+at.*/' next_state: cont Let's break down what this multiline parser does: name – We give our parser a unique name, multiline-regex-test.type – We specify the type as regex for regular expression-based parsing.flush_timeout – After 1000ms of no new matching lines, the buffer is flushed.rules – We define the state machine rules that control multiline detection. 
The rules section is where the magic happens. A multiline parser uses states to determine which lines belong together: The start_state rule matches lines that begin a new multiline message. In our case, the pattern matches a timestamp followed by any text, which identifies the first line of our Java exception.The cont (continuation) rule matches lines that are part of the multiline message. Our pattern matches lines starting with whitespace followed by "at", which identifies the stack trace lines.Each rule specifies a next_state, which tells Fluent Bit what state to transition to after matching. This creates a state machine that can handle complex multiline patterns. When the parser sees a line matching start_state, it begins a new multiline buffer. Any subsequent lines matching the cont pattern are appended to that buffer. When a line doesn't match either pattern, or when the flush timeout expires, the complete multiline message is emitted as a single event. Now let's create our main Fluent Bit configuration file, fluent-bit.yaml, that uses this multiline parser: Shell service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on parsers_file: parsers_multiline.yaml pipeline: inputs: - name: tail path: test.log read_from_head: true multiline.parser: multiline-regex-test outputs: - name: stdout match: '*' Note several important configuration points here: We include the parsers_file in the service section to load our multiline parser definitionsWe use the tail input plugin to read from our test log fileWe set read_from_head: true to read the entire file from the beginningMost importantly, we specify multiline.parser: multiline-regex-test to apply our multiline parser The multiline parser is applied at the input stage, which is the recommended approach. This ensures that lines are concatenated before any other processing occurs. Let's run this configuration to see the multiline parser in action: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... [0] tail.0: [[1750332967.679671000, {}], {"log"=>"single line... "}] [1] tail.0: [[1750332967.679677000, {}], {"log"=>"Dec 14 06:41:08 Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) "}] [2] tail.0: [[1750332967.679677000, {}], {"log"=>"another line... ... Notice how the output shows three distinct events: The single-line message passes through unchanged.The entire stack trace is concatenated into one event, preserving the complete error context.The final single-line message passes through unchanged. This is exactly what we want. The multiline parser successfully identified the start of the Java exception and concatenated all the stack trace lines into a single structured event. 2. 
Extracting Structured Data From Multiline Messages Once you have your multiline messages properly concatenated, you'll often want to extract specific fields from them. Fluent Bit supports this through the parser filter, which can be applied after multiline parsing. Let's extend our example to extract the date and message components from the concatenated stack trace. First, we'll add a regular expression parser to our parsers_multiline.yaml file: Shell parsers: - name: multiline-regex-test type: regex flush_timeout: 1000 rules: - state: start_state regex: '/([a-zA-Z]+ \d+ \d+\:\d+\:\d+)(.*)/' next_state: cont - state: cont regex: '/^\s+at.*/' next_state: cont - name: named-capture-test format: regex regex: '/^(?<date>[a-zA-Z]+ \d+ \d+\:\d+\:\d+)\s+(?<message>(.|\n)*)$/m' The new named-capture-test parser uses named capture groups to extract: date - The timestamp at the start of the messagemessage - The remaining content, including all newlines Note the /m modifier at the end of the regex, which enables multiline mode where . (dot) can match newline characters. Now we update our main configuration to apply this parser using the parser filter: Shell service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on parsers_file: parsers_multiline.yaml pipeline: inputs: - name: tail path: test.log read_from_head: true multiline.parser: multiline-regex-test filters: - name: parser match: '*' key_name: log parser: named-capture-test outputs: - name: stdout match: '*' We've added a parser filter that: Matches all events with match: '*'Looks at the log field with key_name: logApplies the named-capture-test parser to extract structured fields Running this enhanced configuration produces: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... [0] tail.0: [[1750333602.460984000, {}], {"log"=>"single line... "}] [1] tail.0: [[1750333602.460998000, {}], {"date"=>"Dec 14 06:41:08", "message"=>"Exception in thread "main" java.lang.RuntimeException: Something has gone wrong, aborting! at com.myproject.module.MyProject.badMethod(MyProject.java:22) at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) at com.myproject.module.MyProject.someMethod(MyProject.java:10) at com.myproject.module.MyProject.main(MyProject.java:6) "}] [2] tail.0: [[1750333602.460998000, {}], {"log"=>"another line... "}] ... Now the multiline Java exception event contains structured fields: date contains the timestampmessage contains the complete exception and stack trace This structured format makes it much easier to query, analyze, and alert on these error events in your observability backend. 3. Important Considerations for Multiline Parsers When working with multiline parsers, keep these important points in mind: Apply multiline parsing at the input stage. While you can apply multiline parsing using the multiline filter, the recommended approach is to configure it directly on the input plugin using multiline.parser. This ensures lines are concatenated before any other processing.Understand flush timeout behavior. 
The flush_timeout parameter determines how long Fluent Bit waits for additional matching lines before emitting the multiline buffer. Set this value based on your application's logging patterns. Too short and you might break up valid multiline messages. Too long and you'll introduce unnecessary latency.Use specific state patterns. Make your regular expressions as specific as possible to avoid false matches. The start_state pattern should uniquely identify the beginning of a multiline message, and continuation patterns should only match valid continuation lines.Be aware of resource implications. Multiline parsers buffer lines in memory until the complete message is ready. For applications with very large multiline messages (like huge stack traces), this can consume significant memory. The multiline parser bypasses the buffer_max_size limit to ensure complete messages are captured.Test with real data. Always test your multiline parser configurations with actual log data from your applications. Edge cases in log formatting can cause unexpected parsing behavior. This covers the three tips for developers getting started with Fluent Bit multiline parsers while trying to handle complex multiline log messages and speed up their inner development loop. More in the Series In this article, you learned how to use Fluent Bit multiline parsers to properly handle log messages that span multiple lines. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring some of the more interesting Fluent Bit filters for developers.

By Eric D. Schabell DZone Core CORE
A Guide for Deploying .NET 10 Applications Using Docker's New Workflow

Container deployment has become the cornerstone of scalable, repeatable application delivery. .NET 10 represents the latest evolution of Microsoft's cloud-native framework, offering exceptional performance, deep cross-platform support, and tight integration with modern DevOps practices. Developing with .NET 10 offers incredible performance and cross-platform capability. When paired with Docker, .NET 10 applications become truly portable artifacts that run identically across development laptops, CI/CD pipelines, staging environments, and production infrastructure — whether on-premises, cloud-hosted, or hybrid. This comprehensive guide walks you through a professional-grade containerization workflow using the .NET CLI and Docker's automated tooling, taking you from a fresh project scaffold to a production-ready, optimized container image. The next logical step is to deploy that application using Docker, which ensures that your code runs identically everywhere — from your local machine to any cloud environment. This guide outlines the most efficient process for containerizing any new .NET 10 web application using the integrated docker init tool. Why Docker and .NET 10 Are the Perfect Match The promise of containerization is straightforward in theory but demanding in practice: write once, deploy everywhere. .NET 10 and Docker together fulfill this promise with remarkable elegance. Reproducibility is the first pillar. Every developer, CI agent, and production server running your Docker image is executing identical bytecode in an identical runtime environment. No more "works on my machine" frustrations. Configuration drift — where servers gradually diverge due to manual patches, version mismatches, or environment-specific tweaks — becomes moot when your entire runtime is packaged as code. Portability extends beyond reproducibility. A .NET 10 Docker image can run anywhere Docker is supported: Linux and Windows containers, on-premises data centers, every major cloud provider (AWS ECS, Azure Container Instances, Google Cloud Run), Kubernetes clusters, edge devices, or developer workstations. Your investment in containerization unlocks unprecedented deployment flexibility. You're no longer locked into a single platform or hosting provider. Performance is where .NET 10 shines. The latest framework includes performance improvements across the runtime, IL compiler, and garbage collector. Combining this with Docker's efficient resource isolation means your containerized .NET 10 applications run lean and fast, scaling efficiently under load. Security and isolation are architectural benefits of containerization. Your application runs in a lightweight, isolated sandbox. Changes to one container don't cascade to others. Updates to your base image can be published centrally and adopted across your entire fleet without rewriting application code. This decoupling of application and infrastructure is essential for modern security practices. From a team perspective, Docker provides a shared contract between developers and the Operations team. Developers focus on code and dependencies within the Dockerfile; infrastructure teams focus on orchestration, networking, and resource allocation at the container level. This separation of concerns accelerates both development velocity and operational reliability. Setting Up Your Development Environment Prerequisites 1. Install .NET 10 SDK Download and install the .NET 10 SDK from dotnet.microsoft.com. 
Choose the installer for your operating system (Windows, macOS, or Linux). Verify installation: Shell dotnet --version dotnet --list-sdks You should see version 10.0.x listed. 2. Install Docker Desktop Download Docker Desktop from docker.com and run the installer for your operating system. Start Docker Desktop after installation. Verify installation: Shell docker --version Creating a New Web Application Using CLI, create a new web application and make sure to set the target framework to .NET 10.0. Shell dotnet new webapp -f net10.0 -o webapplication1 The -f net10.0 flag explicitly targets to create the project with .NET 10.0 as the target framework, as shown in the figure below. Once scaffolded, your project contains: Program.cs: The entry point, where you configure services and middlewareWebApplication1.csproj: The project file defining dependencies and build configurationProperties/launchSettings.json: Development launch profiles, including port mappings and loggingStandard folders like Pages, wwwroot, and others depending on your template choice Build and Test Your Application Locally Before moving to containers, verify the application runs correctly on your host: Shell dotnet run The CLI compiles your project, restores NuGet packages (if necessary), and starts the Kestrel web server. You should see output similar to the following. C# info: Microsoft.Hosting.Lifetime[14] Now listening on: http://localhost:5172 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down. Open a browser and navigate to the HTTPS URL (in this example, https://localhost:5172). You should see the default template page. If you're using a self-signed development certificate, your browser will warn you about the certificate; this is expected and safe to bypass during local development. This smoke test confirms that your application compiles, the Kestrel server starts correctly, and the basic request/response cycle works. Any configuration issues, missing dependencies, or logic errors will surface immediately. Catching these now saves time later in the Docker build pipeline. Containerizing With Docker Init Docker's init command is a game-changer for .NET developers. It analyzes your project structure and generates a production-grade Docker configuration tailored to your tech stack, eliminating tedious manual Dockerfile authoring for the common case. Make sure you complete the prerequisites above and ensure Docker Desktop is running. From your project root folder, run the command below: Shell docker init The command prompts you with a series of questions: Application platform: Select .NET (or .NET ASP.NET Core if more specific)Version: It will auto-detect .NET 10 from your project filePort: Enter the port your application should listen on (default is often 8080) After responding to the prompts, docker init generates three critical files as shown in the figure below. Dockerfile The Dockerfile is the recipe for building your container image. For .NET 10, Docker Init typically generates a multi-stage build file as shown below. Dockerfile # syntax=docker/dockerfile:1 # Comments are provided throughout this file to help you get started. # If you need more help, visit the Dockerfile reference guide at # https://docs.docker.com/go/dockerfile-reference/ # Want to help us make this template better? 
Share your feedback here: https://forms.gle/ybq9Krt8jtBL3iCk7 ################################################################################ # Learn about building .NET container images: # https://github.com/dotnet/dotnet-docker/blob/main/samples/README.md # Create a stage for building the application. FROM --platform=$BUILDPLATFORM mcr.microsoft.com/dotnet/sdk:10.0-alpine AS build COPY . /source WORKDIR /source # This is the architecture you're building for, which is passed in by the builder. # Placing it here allows the previous steps to be cached across architectures. ARG TARGETARCH # Build the application. # Leverage a cache mount to /root/.nuget/packages so that subsequent builds don't have to re-download packages. # If TARGETARCH is "amd64", replace it with "x64" - "x64" is .NET's canonical name for this and "amd64" doesn't # work in .NET 6.0. RUN --mount=type=cache,id=nuget,target=/root/.nuget/packages \ dotnet publish -a ${TARGETARCH/amd64/x64} --use-current-runtime --self-contained false -o /app # If you need to enable globalization and time zones: # https://github.com/dotnet/dotnet-docker/blob/main/samples/enable-globalization.md ################################################################################ # Create a new stage for running the application that contains the minimal # runtime dependencies for the application. This often uses a different base # image from the build stage where the necessary files are copied from the build # stage. # # The example below uses an aspnet alpine image as the foundation for running the app. # It will also use whatever happens to be the most recent version of that tag when you # build your Dockerfile. If reproducibility is important, consider using a more specific # version (e.g., aspnet:7.0.10-alpine-3.18), # or SHA (e.g., mcr.microsoft.com/dotnet/aspnet@sha256:f3d99f54d504a21d38e4cc2f13ff47d67235efeeb85c109d3d1ff1808b38d034). FROM mcr.microsoft.com/dotnet/aspnet:10.0-alpine AS final WORKDIR /app # Copy everything needed to run the app from the "build" stage. COPY --from=build /app . # Switch to a non-privileged user (defined in the base image) that the app will run under. # See https://docs.docker.com/go/dockerfile-user-best-practices/ # and https://github.com/dotnet/dotnet-docker/discussions/4764 USER $APP_UID ENTRYPOINT ["dotnet", "WebApplication1.dll"] Multi-stage builds are the cornerstone of this Dockerfile. They solve a critical problem: if you built your image using only the SDK stage, the final image would be over 2 GB, containing the entire .NET SDK, build tools, source code, and intermediate artifacts. None of these are needed at runtime; they're build-time concerns only. The multi-stage approach separates concerns: Stage 1 (build): Starts from the full .NET SDK (mcr.microsoft.com/dotnet/sdk:10.0), which includes compilers, build tools, and everything needed to compile C#.Stage 2 (publish): Runs dotnet publish, which compiles the application in Release mode and packages only the runtime-necessary binaries into an /app/publish folder. Source code is not included.Stage 3 (runtime): Starts from a lean ASP.NET Core runtime image (mcr.microsoft.com/dotnet/aspnet:10.0), which contains only the .NET runtime, without the SDK or build tools. The COPY --from=publish instruction brings only the published binaries from Stage 2. The result: a final image of roughly 150–300 MB (depending on your application), down from over 2 GB — an 80%+ reduction. 
This has cascading benefits: faster builds, quicker deployments, lower storage and bandwidth costs, and a smaller attack surface for security. Layer caching is another critical optimization baked into this structure. Docker caches each layer (each line in the Dockerfile). When you change your C# code, Docker rebuilds only the layers after the change, reusing earlier cached layers. By copying *.csproj and running dotnet restore early, you maximize cache hits. If only your code changes (not your dependencies), the restore layer is skipped, and the build is much faster. .dockerignore This file tells Docker which files to exclude when building the image context. Excluding bin and obj folders is important as these folders contain compiled binaries from your host machine, and they are not needed within the Docker context. The build would happen inside the container to generate new binaries. Similarly, all irrelevant files or folders are not needed and are added to the .dockerignore file. Dockerfile **/.git **/.gitignore **/.vs **/.vscode **/bin **/obj **/node_modules ...... Compose.yaml This file orchestrates the local containerized development and is shown below. Dockerfile services: server: build: context: . target: final ports: - 8080:8080 Visual Studio Code and Visual Studio are smart enough to provide an easy way to run these services by automatically creating a "Run all Services" button. Let's look at each section services: Defines services in your stack.build: Specifies how to build the image. context: . means "use the current directory as the build context."ports: Maps container ports to host ports. "8080:8080" means "forward host port 8080 to container port 8080." When you access localhost:8080 on your development machine, traffic is routed to port 8080 inside the container. Compose.yaml is your main starting point to run your application inside Docker as a container. Depending on your application, you can make adjustments to the compose.yaml file, and there are clear comments provided in the auto-generated file to give you more knowledge about how to add other services, like adding PostgreSQL or any other dependencies that your application can use. Readme.Docker.md These files provide detailed instructions on how to build and run your application, as shown below. Let's use these instructions to build and run your application. It also provides instructions on how to deploy your application to the cloud. Building and Running Your Application as a Container Inside Docker Once you have adjusted the configuration as per your project needs, you can build and run your application by running the following Docker command from the terminal: Shell docker compose up --build The docker compose up command starts all services defined in your compose.yaml. Instead of executing the command, you can also click on the "Run All Services" button within your Visual Studio editor. Depending on the size of your base images in your Docker file, it can take a few minutes to build and run your application, as shown in the image below. Once it's completed, you can navigate to your application by opening the URL https://localhost:8080 in your browser, and you can also verify the Docker image by navigating to the Images tab in Docker Desktop, as shown below. You can also view application logs directly within your Docker Desktop. Before pushing this to the Docker Hub repository or running it in production, scan for any vulnerabilities by using the Docker Scout command, which is built into the Docker CLI. 
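As a minimal sketch of that last step — assuming the Compose build produced an image named webapplication1-server (Compose's default project-service naming; check the Images tab in Docker Desktop or run docker images to confirm the actual name on your machine) — the Docker Scout commands built into the Docker CLI can be used to review vulnerabilities before pushing:
Shell
# Summarize the image: base image, known CVEs, and available base image updates.
docker scout quickview webapplication1-server

# List the individual CVEs found in the image, grouped by severity.
docker scout cves webapplication1-server

# Ask Scout for recommended base image refreshes that would remove known CVEs.
docker scout recommendations webapplication1-server
If Scout flags issues in the base layers rather than your application, pulling a fresh mcr.microsoft.com/dotnet/aspnet:10.0-alpine base and rebuilding is usually the quickest remediation.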
Conclusion Containerizing .NET 10 applications with Docker transforms development workflow and deployment reliability. The docker init tool streamlines the process, generating multi-stage Dockerfiles that produce lean, efficient images. Combined with Docker Compose for local development or managed container services for production, this workflow delivers reproducibility, portability, and operational excellence. From local development to global deployment, your .NET 10 application now runs consistently, scales elastically, and integrates seamlessly with modern cloud-native infrastructure. The investment in containerization pays dividends in deployment velocity, infrastructure cost, and team productivity.

By Naga Santhosh Reddy Vootukuri DZone Core CORE
How Migrating to Hardened Container Images Strengthens the Secure Software Development Lifecycle

Container images are the key components of the software supply chain. If they are vulnerable, the whole chain is at risk. This is why container image security should be at the core of any Secure Software Development Lifecycle (SSDLC) program. The problem is that studies show most vulnerabilities originate in the base image, not the application code. And yet, many teams still build their containers on top of random base images, undermining the security practices they already have in place. The result is hundreds of CVEs in security scans, failed audits, delayed deployments, and reactive firefighting instead of a clear vulnerability-management process. To establish reliable and efficient SSDLC processes, you need a solid foundation. This is where hardened base images enter the picture. This article explores the concept of hardened container images; how they promote SSDLC by helping teams reduce the attack surface, shift security left, and turn CVE management into a repeatable, SLA-backed workflow; and what measurable outcomes you can expect after switching to a hardened base. How the Container Security Issue Spirals Out of Control Across SSDLC Just as the life of an application starts with its programming language, the life of a container begins with its base image. Hence, the problem starts here and can be traced back as early as the requirements analysis stage of the SSDLC. This is because the requirements for selecting a base image — if they exist at all — rarely include security considerations. As a result, it is common for teams to pick a random base image. Such images often contain a full OS with numerous unnecessary components and may harbor up to 600 known vulnerabilities (CVEs) at once. Later, when the containerized application undergoes a security scan at the deployment stage, the results show hundreds of vulnerabilities. Most of them originate from the base image, not the application code, framework, or libraries. And yet, the security team must waste time addressing these flaws instead of focusing on application security. As a result: Vulnerabilities are ignored and make their way to production, orDeployments are delayed because of critical vulnerabilities, orThe team spends hours trying to patch the image. Sometimes, all three happen — if you are especially ‘lucky.’ When the container image finally reaches production, the risks associated with the existing CVEs grow as new critical CVEs appear. The team then scrambles to patch the base image, rebuild, and redeploy, hoping nothing breaks. But the problem doesn’t stop there. During preparation for a security audit, it may turn out that the base image lacks provenance data required by regulations, such as a software bill of materials (SBOM), a digital signature, or a strict update schedule. This makes it difficult for the team to meet audit requirements and may result in more than a fine for noncompliance. The presence of a package manager in the base image can worsen the problem, because the image may contain not only essential packages but many others. It is easy to add additional packages, but not as easy to trace their origin or determine whether they are required — especially when a package contains a critical CVE and you must act quickly. To summarize: a base image is not the only container security concern. However, it is the foundation of the container image — and often contains more security flaws than the application itself. 
This places unnecessary operational burden on the team and pulls their attention away from what truly requires strengthening and enhancement: the application. Hardened Container Images as an SSDLC Control Point If the foundation is rotten, the building won’t last long. Therefore, you fix the foundation. In the case of container images, you replace the underlying base image. What the team needs is not just another base image but a hardened container image that prevents the issues described above. So, what is a hardened container image? It is a strictly defined, minimal set of components required to run the application, which cannot be changed or inspected externally due to the absence of a package manager. This set of components is: Free from known CVEs from the start, guaranteeing a minimal attack surface throughout the lifecycleInventoried in an SBOM and signed with a digital signature, providing comprehensive security metadataContinuously monitored and patched by the vendor under an SLA, so the SRE and security teams can rely on a defined patch cadence Free from unnecessary packages and known vulnerabilities, a hardened container image reduces the attack surface of production containers immediately. But the image hardening is not just about reducing components — it is about helping teams establish a clear CVE management process where all components are listed, tracked, and continuously patched. As a result, hardened container images integrate naturally into the SSDLC program. Enhancing Secure SDLC Workflow with Hardened Images Thanks to the features described above, hardened container images can be smoothly integrated into SSDLC processes, allowing teams to shift security left without slowing down the release cadence or increasing developers' workload. If teams previously used random base images and dealt with patches and security audits reactively, hardened container images change the game from the start. According to the new workflow: The platform team selects a set of hardened container images as the only allowed bases at the planning stage.These hardened images are enforced during the build stage with CI templates and policies.Security scanners don’t choke on hundreds of CVEs during the testing stage; instead, scan results show only issues that matter.Immutable containers with a drastically reduced attack surface run in production; rolling updates are driven by business needs and base image updates, not manual patching.SBOMs, digital signatures, and SLA-backed patch timelines ensure compliance and simplify security audits.When a critical CVE appears, the vendor updates the hardened image, you rebuild your image on top of it, and the security team closes the ticket — now in days instead of weeks. At the same time, the developers’ workflow barely changes: they simply switch the base image and stop wasting time patching code that isn’t theirs. DIY vs. Vendor-Backed Hardened Images Creating and maintaining your own hardened container images is theoretically possible, but it imposes a tremendous operational burden on your team, effectively requiring them to become Linux and runtime maintainers. This requires: Deep knowledge of OS/runtime intrinsicsContinuous CVE monitoring and triageSigning, versioning, and SBOM policies But building a hardened base image is only part of the task. 
You must also patch it continuously, which requires: Monitoring security advisories for your distribution and runtime(s)Determining which CVEs matter to your environmentRebuilding images, running tests, coordinating rolloutsCommunicating breaking changes to all teams Therefore, maintaining your own hardened base implies high costs, resulting from engineering time spent maintaining the foundation instead of improving the product. Metaphorically, you must run an ultramarathon while maintaining sprinter speed. Fortunately, there is no need to hire a dedicated team solely for base images. Several reliable vendors — including BellSoft, Chainguard, and Docker — provide ready-made hardened container images for various runtimes. This means you can outsource the hard work of maintaining secure base images to experts who do it full-time. When selecting a vendor that ships hardened container images, make sure they provide: Teams focused on OS security, packaging, and complianceSigned images and standard attestationsSBOMs out of the boxRegularly updated images with tested patchesAn SLA for patchesOS and runtime built from source in every image, guaranteeing that no third-party binary — unknown CVEs or irregular update schedules — is included The full set of features depends on the vendor, so study their offerings carefully and select the base images that best fits your needs. This enables a centralized vulnerability-management process built around a trusted solution and allows engineers to focus on the product. Measurable Outcomes of Migrating to Hardened Container Images Migrating to hardened container images is not just about the abstract notion of "improved security." It’s about transforming the chaos of unmanaged base images and unmanageable CVEs into something measurable and controllable. The table below summarizes key areas where you can track improvements driven by hardened container images: Area/metric Result CVEs per image Low to Zero Scanner integration Major vulnerability scanners support base images; Base OS package ecosystem provides a scanner package Scanner noise Meaningful results, no false-positive alerts Package management Reliable ecosystem of verified packages Mean Time to Patch Days Compliance & Audit SBOMs, standardized images, documented patch flow and SLA Operational burden Low, base image patching is handled by the vendor Conclusion A secure software development lifecycle depends on the integrity of every layer in the stack. Hardened container images form the foundation of this stack and represent one of its key control points. Studies show that the majority of vulnerabilities in containerized workloads originate in the base image. Standardizing on hardened, minimal, vendor-supported base images reduces this risk, improves the signal quality of security scanners, and helps create a clear and auditable patching process. Importantly, migrating to hardened images is not difficult — and, surprisingly, hardened images can even be found for free. Therefore, migrating to hardened container images aligns day-to-day engineering practices with security and compliance objectives, shortens response times to critical vulnerabilities, and reduces the operational overhead of managing CVEs at scale — all without affecting product delivery timelines.

By Catherine Edelveis
Why Senior Developers Are Actually Less Productive with AI Copilot (And What That Tells Us)

I watched the tech lead spend forty-five minutes wrestling with GitHub Copilot suggestions for an API endpoint. The same task would have taken fifteen minutes without the AI assistant. That situation was not an isolated case. Across the organization, we started to notice a pattern: experienced developers were slower when using AI coding assistants than junior developers. This pattern made us rethink how we use these tools. While AI coding assistants slowed down experienced developers, junior developers maintained their momentum. Data from multiple organizations confirms what many of us are experiencing firsthand. While junior developers see productivity gains of 30-40% with AI assistants, senior developers often experience productivity decreases of 10-15%. This counterintuitive finding reveals something profound about expertise, trust, and the future of software development. The Trust Tax: When Verification Costs More Than Creation The main problem is not a technical one; it is psychological. Senior developers spend years building mental models of how systems work, gathering hard-earned knowledge about edge cases, performance implications, and architecture tradeoffs. When AI Copilot suggests code, they cannot simply accept it. Their expertise forces them to verify every line. A junior developer looks at AI-generated code and asks: "Does this work?" A senior developer looks at the same code and asks: "Does this work?""Is it optimal?""Are there edge cases?""What are the security implications?""How does this scale?" "What's the memory footprint?" "Are we introducing technical debt?" This verification tax is substantial. In a recent study of 250 developers across five organizations, senior developers spent an average of 4.3 minutes reviewing each AI suggestion compared to 1.2 minutes for junior developers. When you're reviewing dozens of suggestions per day, this adds hours to your workload. The Pattern Recognition Problem Here's where it gets interesting. Senior developers have honed their pattern recognition through years of debugging production incidents, seeing firsthand the consequences of code that looks harmless. When Copilot suggests using a simple map operation on a large dataset, a junior developer sees elegant functional code. A senior developer sees a potential memory spike during peak traffic because they've been paged at 3 AM for exactly this kind of issue before. The AI doesn't know about the time your service crashed because someone mapped over a million-item array. You do. Real-World Example: At a company I consulted with, a junior developer accepted an AI-generated authentication function that looked clean and passed all tests. A senior developer caught that it was vulnerable to timing attacks—a subtle security flaw that wouldn't show up in standard testing but could leak information about valid usernames. The junior developer didn't know to look for this. The senior developer couldn't not see it. The False Positive Burden I've watched senior developers struggle with a higher rate of false positives because of their heightened skepticism. They actively look for potential problems and sometimes find issues that aren't actually problems in the specific context. This often leads to unnecessary refactoring and over-engineering of AI-generated code. Senior developers sometimes reject AI suggestions because the code feels wrong based on patterns that don't match the current use case. 
They trust their gut-level instincts, which sometimes help but can slow down work when applied indiscriminately. Context Windows and Architectural Thinking The second major factor is how senior developers think about code. They don't focus solely on the immediate problem; instead, they consider broader system design, maintainability, and future extensibility. AI coding assistants excel at local optimization. They're remarkably good at solving the specific problem right in front of them, but they struggle to understand the architectural implications of their suggestions. A senior developer looks at AI-generated code and asks questions the AI cannot answer: "How does this fit with our service mesh architecture?" "Does it follow our team's coding standards?" "Will the next developer who touches this code understand the intent?" "Does it create coupling that will make future changes harder?" These are not just academic concerns. In complex systems, local optimizations can create global problems. A function that's perfect in isolation might introduce subtle dependencies that could cause issues months later. The Automation Irony There's an irony at play here. The tasks where AI assistants provide the most help are precisely the tasks that senior developers have already automated away in their minds. After years of experience, routine coding becomes muscle memory — you're barely thinking about it. When a junior developer writes a CRUD endpoint, it's a careful step-by-step process that requires focus. When a senior developer writes the same endpoint, it's largely a matter of typing speed. AI assistance makes junior developers work faster, but it doesn't significantly impact senior developers, since they were already working at or near optimal speed for routine tasks. Where AI could help senior developers — the genuinely novel problems, the complex architectural decisions, the subtle bug fixes — these are exactly the areas where current AI tools are weakest. As a result, senior devs get slowed down on routine tasks (because of verification overhead) without corresponding gains on complex tasks. What This Tells Us About the Future This productivity paradox reveals several important truths about AI-assisted development and the nature of software expertise: Expertise Is More Than Speed We've measured productivity in various ways, but the lines-of-code-per-day metric has always been flawed. AI assistants make that flaw more obvious. A senior developer who spends an hour thinking about architecture before writing twenty lines of code is more valuable than a developer who writes two hundred lines of AI-generated code that creates technical debt. Senior developers bring value not through their typing speed or raw problem-solving velocity but through their judgment, ability to see ripple effects, and wisdom about what not to build. Trust Calibration Is the New Skill The developers who will thrive with AI assistants will be neither those who accept every suggestion without question nor those who reject them all. The successful developers will build mental models that help them determine when to trust AI assistants and when to dig deeper. This requires a new kind of expertise: understanding the AI's strengths and weaknesses well enough to allocate verification effort efficiently. Some senior developers are learning to treat AI suggestions with the same calibrated skepticism they apply to code from junior team members — enough scrutiny to catch problems, but not so much that it becomes counterproductive. 
Emerging Best Practice The most effective senior developers I've seen aren't trying to verify everything AI-generated code does. Instead, they've developed heuristics for what to check carefully — security, performance, architectural fit — versus what to accept with minimal review — straightforward implementations of well-understood patterns). They're essentially building a "threat model" for AI code. The Context Problem Won't Solve Itself AI coding assistants operate with limited context. They can see the file you're working on and a few related files, but they don't truly understand your architecture, your team's conventions, your performance requirements, or your technical debt situation. Improving this will require more than just larger context windows. It requires AI systems capable of building and maintaining genuine architectural understanding — something that's still largely beyond current capabilities. Until then, the gap between "code that works" and "code that fits" will remain wide. Practical Implications for Teams Rethinking Code Review Teams need to evolve their code review practices for the AI era. The question is not just whether the code is correct, but also whether it was AI-generated and whether the developer properly verified it. I've seen some teams require developers to flag AI-generated code in pull requests—not to ban it, but to ensure appropriate scrutiny. In my view, AI assistants fundamentally change the economics of code creation. When they make code generation trivially easy, the bottleneck shifts to verification and integration. This makes code review more critical, and the skills required for effective review become more valuable. Training and Skill Development Junior developers who learn primarily with AI assistance face a real risk: they may never develop the deep understanding that comes from writing code the hard way. It's like a cook who learns with a chef who does all the prep work—they can still make meals, but they never develop essential knife skills. Organizations should consider having junior developers work without AI assistants for their first six months to a year, just as we don't let new drivers use autopilot before they've learned to drive manually. The goal isn't to make them suffer, but to ensure they build the foundational understanding that makes AI assistance valuable rather than just fast. The Meta-Lesson: Tools Shape Thinking The senior developer productivity paradox reveals the deep connection between tools and thought. Senior developers are slower with AI, not despite their expertise, but because of it. The verification overhead they experience stems from the tool not aligning with their mental model of how development should work. Junior developers are still building their mental models, so they adapt more easily to AI-assisted workflows. Senior developers, however, rely on approaches honed through years of experience, and AI assistants often work against these approaches rather than complementing them. This isn't a criticism of either group. It's an observation about how expertise works. Actual expertise isn't just knowledge—it's intuition, pattern recognition, and deeply internalized workflows. Any tool that disrupts those workflows will face resistance, and that resistance often reflects genuine wisdom rather than mere stubbornness. Looking Forward The productivity paradox we're seeing today isn't permanent. As AI coding assistants improve, they'll develop better contextual awareness and respect for coding conventions. 
They'll provide the kind of high-level assistance that senior developers actually need. However, we shouldn't expect the gap to close completely. The tension between AI's suggestions and human judgment will likely always exist, and that tension is healthy. The goal is not to eliminate verification but to make it more efficient. Meanwhile, we should resist the temptation to measure developer productivity solely by output velocity. The fact that senior developers are slower with AI assistants doesn't mean they're less valuable. It often means they're doing exactly what we need them to do: applying judgment, considering implications, and protecting the codebase from well-intentioned but ultimately problematic suggestions. Key Takeaway: The senior developer productivity paradox isn't a bug in how experienced developers use AI—it's a feature of expertise itself. The verification overhead they experience is the cost of judgment, and that judgment is precisely what makes them senior developers in the first place. Conclusion: Redefining Productivity We're in the middle of a fundamental shift in how software is built. AI coding assistants are potent tools, but like all transformative technologies, they bring complexity. The fact that they make senior developers slower in the short term tells us something important — we're not measuring what matters. The value of software development has never been in raw coding speed. It's in thoughtfulness, judgment, design insight, and the ability to anticipate problems. If AI assistants help junior developers become more productive while making senior developers more deliberate, that may not be a productivity loss at all. It might represent a shift in where the bottleneck lies — from creation to curation, from typing to thinking. In the long run, this shift could be exactly what the industry needs. We've built too much software with too little thought. If AI assistants force us to be more intentional about what we build, even if they slow the building process slightly, we may end up with better systems. The question isn't whether senior developers should use AI assistants — that decision has already been made by the market. The question is how we adapt our workflows, metrics, and expectations to a world in which the relationship between experience and productivity has fundamentally changed. Those who figure this out first will have a significant advantage in the AI-augmented development landscape we're entering.

By Dinesh Elumalai
Mastering Fluent Bit: Top 3 Telemetry Pipeline Processors for Developers (Part 9)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, using Fluent Bit processors for developers. In case you missed the previous article, check out the top tips on using telemetry pipeline parsers for developers, where you get tips on cleaning up your telemetry data for better developer experiences. This article will be a hands-on tour of the things that help you as a developer testing out your Fluent Bit pipelines. We'll take a look at the top three processors you'll want to know about when building your telemetry pipeline configurations in Fluent Bit. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines. Where to Get Started You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, either using a source installation or a container installation, as shown below: Shell # For source installation. $ fluent-bit -i dummy -o stdout # For container installation. $ podman run -ti ghcr.io/fluent/fluent-bit:4.0.8 -i dummy -o stdout ... [0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}] [0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}] ... Let's look at the top three processors that will help you with your local development testing of Fluent Bit pipelines. Processing in a Telemetry Pipeline See this article for details about the service section of the configurations used in the rest of this article, but for now, we plan to focus on our Fluent Bit pipeline and specifically the processors that can be of great help in managing our telemetry data during testing in our inner developer loop. Processors in Fluent Bit are powerful components that sit between the input and output phases of your telemetry pipeline. They allow you to manipulate, transform, and enrich your telemetry data before it reaches its destination. Unlike filters, which operate on records, processors work on the raw data stream level, giving you fine-grained control over how your data flows through the pipeline. The processor phase happens after data is ingested but before it's formatted for output. This makes processors ideal for operations that need to happen at scale across your entire data stream, such as content modification, metrics extraction, and data aggregation. Keeping all of this in mind, let's look at the most interesting processors that developers will want to know more about. 1. 
Content Modifier Processor One of the most common use cases for telemetry pipelines that developers will encounter is the need to add, modify, or remove fields from their telemetry data. The Content Modifier processor gives you the ability to manipulate the structure and content of your events as they flow through the pipeline. To provide an example, we start with a simple Fluent Bit configuration file fluent-bit.yaml containing a configuration using the dummy plugin to generate events that we'll then modify: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"environment":"dev","message":"Application started"}' processors: logs: - name: content_modifier action: insert key: pipeline_version value: "1.0.0" - name: content_modifier action: insert key: processed_timestamp value: "${HOSTNAME}" - name: content_modifier action: rename renamed_key: env key: environment outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Our configuration uses the content_modifier processor three times to demonstrate different actions. First, we insert a new field called pipeline_version with a static value. Second, we insert a processed_timestamp field that references an environment variable. Third, we rename the environment field to env for consistency. Let's run this to confirm our working test environment: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-10-26 20:45:12.123456","env":"dev","message":"Application started","pipeline_version":"1.0.0","processed_timestamp":"localhost"} {"date":"2025-10-26 20:45:13.234567","env":"dev","message":"Application started","pipeline_version":"1.0.0","processed_timestamp":"localhost"} ... Note how each event now contains the additional fields we configured, and the original environment field has been renamed to env. This processor is invaluable for standardizing your telemetry data before it reaches your backend systems. 2. Metrics Selector Processor Another critical use case for developers working with telemetry data is the ability to extract and select specific metrics from your event streams. The Metrics Selector processor allows you to filter and route metrics based on their labels and values, giving you precise control over which metrics flow to which destinations. 
To demonstrate this, we'll create a configuration that generates different types of metrics and uses the metrics selector to route them appropriately: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: metrics.cpu dummy: '{"metric":"cpu_usage","value":75.5,"host":"server01","env":"production"}' - name: dummy tag: metrics.memory dummy: '{"metric":"memory_usage","value":82.3,"host":"server01","env":"production"}' - name: dummy tag: metrics.disk dummy: '{"metric":"disk_usage","value":45.2,"host":"server02","env":"staging"}' processors: logs: - name: metrics_selector metric_name: cpu_usage action: include label: env operation_type: prefix_match match: prod outputs: - name: stdout match: 'metrics.cpu' format: json_lines json_date_format: java_sql_timestamp - name: stdout match: 'metrics.*' format: json_lines json_date_format: java_sql_timestamp Our configuration generates three different metric types and uses the metrics_selector processor to filter CPU metrics that match production environments. This allows you to create sophisticated routing rules based on your metric characteristics. Let's run this configuration: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-10-26 21:10:33.456789","metric":"cpu_usage","value":75.5,"host":"server01","env":"production"} {"date":"2025-10-26 21:10:33.567890","metric":"memory_usage","value":82.3,"host":"server01","env":"production"} {"date":"2025-10-26 21:10:33.678901","metric":"disk_usage","value":45.2,"host":"server02","env":"staging"} ... The metrics selector processor helps you focus on the metrics that matter most during development and testing, reducing noise and improving the signal-to-noise ratio in your telemetry data. 3. OpenTelemetry Envelope Processor The third essential processor that developers need to understand is the OpenTelemetry Envelope processor. This processor transforms your Fluent Bit telemetry data into the OpenTelemetry protocol format, enabling seamless integration with the broader OpenTelemetry ecosystem. As organizations increasingly adopt OpenTelemetry as their standard for observability data, this processor becomes critical for ensuring your Fluent Bit pipelines can communicate effectively with OpenTelemetry collectors and backends. The OpenTelemetry Envelope processor wraps your telemetry data in the standard OpenTelemetry format, preserving all the semantic conventions and structures that make OpenTelemetry powerful. This includes proper handling of resource attributes, instrumentation scope, and the telemetry signal types that are core to OpenTelemetry. 
For comprehensive coverage of integrating Fluent Bit with OpenTelemetry, I highly recommend exploring these detailed articles: Telemetry Pipelines: Integrating Fluent Bit with OpenTelemetry, Part 1 – This article covers the fundamentals of integrating Fluent Bit with OpenTelemetry, including configuration patterns and best practices for getting started.Integrating Fluent Bit with OpenTelemetry, Part 2 – This follow-up article dives deeper into advanced integration scenarios, troubleshooting tips, and real-world use cases for production deployments. To demonstrate how the OpenTelemetry Envelope processor works, let's create a configuration that wraps application logs in OpenTelemetry format: YAML service: flush: 1 log_level: info http_server: on http_listen: 0.0.0.0 http_port: 2020 hot_reload: on pipeline: inputs: - name: dummy tag: app.logs dummy: '{"level":"info","service":"user-api","message":"User login successful","user_id":"12345"}' - name: dummy tag: app.logs dummy: '{"level":"error","service":"payment-api","message":"Payment processing failed","transaction_id":"tx-9876"}' processors: logs: - name: opentelemetry_envelope resource: service_name: my-application service_version: 1.2.3 deployment_environment: production instrumentation_scope: name: fluent-bit version: 4.2.0 outputs: - name: stdout match: '*' format: json_lines json_date_format: java_sql_timestamp Our configuration uses the opentelemetry_envelope processor to wrap each log entry with OpenTelemetry metadata. The resource section adds attributes that describe the source of the telemetry data, such as the service name and deployment environment. The instrumentation_scope section identifies the tool that collected the data, which is essential for proper attribution in OpenTelemetry systems. Let's run this configuration to see the OpenTelemetry envelope in action: Shell # For source installation. $ fluent-bit --config fluent-bit.yaml # For container installation after building new image with your # configuration using a Buildfile as follows: # # FROM ghcr.io/fluent/fluent-bit:4.1.0 # COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml # CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ] # $ podman build -t fb -f Buildfile $ podman run --rm fb ... {"date":"2025-10-26 22:15:30.123456","resource":{"service_name":"my-application","service_version":"1.2.3","deployment_environment":"production"},"instrumentation_scope":{"name":"fluent-bit","version":"4.1.0"},"level":"info","service":"user-api","message":"User login successful","user_id":"12345"}{"date":"2025-10-26 22:15:31.234567","resource":{"service_name":"my-application","service_version":"1.2.3","deployment_environment":"production"},"instrumentation_scope":{"name":"fluent-bit","version":"4.1.0"},"level":"error","service":"payment-api","message":"Payment processing failed","transaction_id":"tx-9876"} ... Notice how each log entry now includes the OpenTelemetry resource attributes and instrumentation scope information. This standardized format ensures that when your telemetry data reaches an OpenTelemetry collector or backend, it will be properly categorized and can be correlated with other telemetry signals like traces and metrics from your distributed system. This covers the top three processors for developers getting started with Fluent Bit while trying to leverage processors to transform and enrich their telemetry data quickly and speed up their inner development loop. 
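One last inner-loop tip: every configuration above enables Fluent Bit's built-in HTTP server on port 2020 together with hot reloading. Assuming that port is reachable from your shell (for the container examples you would need to publish it, for example podman run -p 2020:2020 --rm fb), the sketch below shows how those endpoints can be used while iterating on processor configurations. The endpoint paths follow the Fluent Bit monitoring API, so double-check them against your installed version's documentation.
Shell
# Show build information for the running Fluent Bit instance.
curl -s http://127.0.0.1:2020/

# Inspect internal pipeline metrics, such as records processed per input and output.
curl -s http://127.0.0.1:2020/api/v1/metrics

# After editing fluent-bit.yaml, trigger a hot reload without restarting the process.
curl -s -X POST http://127.0.0.1:2020/api/v2/reload

# Report how many hot reloads have been performed so far.
curl -s http://127.0.0.1:2020/api/v2/reload
This keeps the edit-reload-verify loop short: change a processor definition, reload, and immediately check the stdout output and pipeline metrics without rebuilding or restarting the container.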
More in the Series In this article, you learned about three powerful Fluent Bit processors that improve the inner developer loop experience. This article is based on this online free workshop. There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, exploring some of the more interesting Fluent Bit filters for developers.

By Eric D. Schabell DZone Core CORE
Agile Is Dead, Long Live Agility

TL; DR: Why the Brand Failed While the Ideas Won Your LinkedIn feed is full of it: Agile is dead. They’re right. And, at the same time, they’re entirely wrong. The word is dead. The brand is almost toxic in many circles; check the usual subreddits. But the principles? They’re spreading faster than ever. They just dropped the name that became synonymous with consultants, certifications, transformation failures, and the enforcement of rituals. You all know organizations that loudly rejected “Agile” and now quietly practice its core ideas more effectively than any companies running certified transformation programs. The brand failed. The ideas won. So why are we still fighting about the label? How Did We Get Here? Let’s trace Agile’s trajectory: From 2001 to roughly 2010, Agile was a practitioner movement. Seventeen people wrote a one-page manifesto with four values and twelve principles. The ideas spread through communities of practice, conference hallways, and teams that tried things and shared what worked. The word meant something specific: adaptive, collaborative problem-solving over rigid planning and process compliance. Then came corporate capture. From 2010 to 2018, enterprises discovered Agile and sought to adopt it at scale. Scaling frameworks emerged. Consultancies noticed new markets for their change management practices and built transformation practices. The word shifted: no longer a set of principles but a product to be purchased, a transformation to be managed, a maturity level to be assessed. The final phase completed the inversion. The major credentialing bodies have now issued millions of certifications. “Agile coaches” who’ve never created software in complex environments advise teams on how to ship software, clinging to their tribe’s gospel. Transformation programs run for years without arriving anywhere. The Manifesto warned against this: “Individuals and interactions over processes and tools.” The industry inverted it. Processes and tools became the product. (Admittedly, they are also easier to budget, procure, KPI, and track.) The word “Agile” now triggers eye-rolls from practitioners who actually deliver. It signals incoming consultants, mandatory training, and new rituals that accomplish practically nothing that could not have been done otherwise. The term didn’t become unsalvageable because the ideas failed. It became unsalvageable because the implementation industry hollowed it out. The Victory Nobody Talks About However, the “Agile is dead” crowd stops too early. Yes, the brand is probably toxic by now. But look at what’s actually happening. Look at startups that never adopted the terminology. They run rapid experiments, ship incrementally, learn from customers, and adapt continuously. Nobody calls it Agile. They call it “how we work.” Look at enterprises that “moved past Agile” into product operating models. What do these models emphasize? Autonomous teams. Outcome orientation. Continuous discovery. Customer feedback loops. Iterative delivery. Read that list again. Those are the Manifesto’s principles with a fresh coat of paint and, critically, without the baggage of failed transformation programs. You can watch this happen in real time. A client told me this year, “We don’t do Agile anymore. We do product discovery and continuous delivery.” I asked what that looked like. He described Scrum without ever using the word. That organization is more agile than most “Agile transformations” I’ve seen. And now AI accelerates this further. 
Pattern analysis surfaces customer insights faster. Vibe coding produces working prototypes in hours rather than weeks, dramatically compressing learning loops. Teams can test assumptions at speeds that would have seemed impossible five years ago. None of this requires the word “Agile.” All of it embodies what the Agile Manifesto was actually about. The principles won by shedding their label.

The Losing Battle

Some practitioners still fight to rehabilitate the term. They write articles explaining what “real Agile” means. They distinguish between “doing Agile” and “being Agile.” They insist that failed transformations weren’t really Agile at all, which reminds me of the old joke that “Communism did not fail; it has never been tried properly.” At some point, if every implementation fails, the distinction between theory and practice stops mattering. This discussion is a losing battle. Worse, it’s the wrong battle. When you fight for terminology, you fight for something that doesn’t matter. The goal was never the adoption of a word. The goal was to solve customer problems through adaptive, collaborative work. If that is happening without the label, I would call it “mission accomplished.” If it’s not happening with the label, the mission failed, regardless of how many certifications the organization purchased. The energy spent defending “Agile” as a term could be spent actually helping teams deliver value. The debates about what counts as “true Agile” could be debates about what actually works in this specific context for this particular problem. Language evolves. Words accumulate meaning through use, and sometimes that meaning becomes toxic. “Agile” joined “synergy,” “empowerment,” and “best practices” in the graveyard of terms that meant something important until they didn’t. Fighting to resurrect a word while the ideas thrive elsewhere is nostalgia masquerading as principle.

What “Agile Is Dead” Means for You

Stop defending “Agile” as a brand. Start demonstrating value through results. This suggestion isn’t about abandoning the community you serve. Agile practitioners remain a real audience with real problems worth solving. The shift is about where you direct your energy. Defending the brand is a losing game. Helping practitioners deliver outcomes isn’t. When leadership asks whether your team is “doing Scrum correctly,” redirect: “We’re delivering solutions customers use. Here’s what we learned this Sprint and what we’re changing based on that learning.” When transformation programs demand compliance metrics, offer outcome metrics instead. And accept this: the next generation of practitioners may never use the word “Agile.” They’ll talk about product operating models, continuous discovery, outcome-driven teams, and AI-assisted development. They’ll practice everything the Manifesto advocated without ever reading it. That’s fine. The ideas won. The word was only ever a vehicle.

The Bottom Line

We were never paid to practice Agile. Read that again. No one paid us to practice Scrum, Kanban, SAFe, or any other framework. We were paid to solve our customers’ problems within given constraints while contributing to our organization’s sustainability. If the label now obstructs that goal, discard the label. Keep the thinking.

Conclusion: Agile Is Dead, or the Question You’re Avoiding

If “Agile” disappeared from your vocabulary tomorrow, would your actual work change? If not, you’ve already moved on. You’re already practicing the principles without needing the brand.
You are already focusing on what matters. So act like it: “Le roi est mort, vive le roi!” What’s your take? Is there still something worth saving, or is it time to let the brand go? I’m genuinely curious.

By Stefan Wolpers DZone Core CORE
An Analysis of Modern Distributed SQL

Editor’s Note: The following is an article written for and published in DZone’s 2025 Trend Report, Database Systems: Fusing Transactional Speed and Analytical Insight in Modern Data Ecosystems.

Distributed SQL merges traditional RDBMS reliability with cloud-native elasticity. The approach combines ACID semantics, SQL interface, and relational integrity with multi-region resilience, disaggregated compute-storage, and adaptive sharding. This article examines distributed SQL from a practitioner’s perspective. It evaluates consensus algorithms, partitioning strategies, serverless implementations, vector integration, and cross-region routing techniques.

The State of Consensus

Consensus algorithms form the foundation of distributed SQL reliability guarantees: They ensure a majority of replicas agree on operation order before acknowledging writes. Without consensus, distributed databases cannot commit transactions across nodes, handle leader failures, or maintain consistent data views during network partitions.

Consensus Algorithms

Paxos provides theoretical correctness guarantees, but it is difficult to understand and implement correctly. Multi-Paxos handles sequences of decisions and addresses some practical limitations but is still opaque to most engineers. Raft solves the same problem, with understandability as its explicit design goal. It decomposes consensus into three sub-problems: leader election (selecting one node to coordinate writes), log replication (distributing operations to replicas), and safety (preventing replica divergence). The majority of modern distributed SQL systems adopt Raft, with only legacy architectures retaining Paxos variants. Raft’s leader-based model maps naturally to SQL transactional semantics. A write becomes durable once a majority of replicas acknowledge it, delivering strong consistency without complex coordination protocols.

Operational Complexity vs. Performance Trade-Offs

Consensus creates operational overhead mainly across three areas:

Leader elections – When a leader node becomes unreachable, the cluster elects a replacement. This process spans milliseconds to seconds depending on heartbeat and timeout settings. Writes stall during election windows because no leader exists to coordinate them. This is mitigated by tuning heartbeat intervals and distributing replicas across independent failure domains (racks, zones, regions).
Write amplification – Every write requires acknowledgment from a majority of replicas before commit. A typical three-replica setup generates 2 to 3x the network traffic and disk I/O of a single-node database. Cross-region deployments multiply this overhead when replicas span continents.
Tail latency under contention – Multiple transactions competing for the same key range force the leader to serialize commits for consistency. This bottlenecks write throughput at the leader’s capacity. Adding replicas does not help in this situation. Systems offload reads to follower replicas, but write-heavy workloads with hotspots degrade performance significantly.

Where Consensus Fits and Where It Breaks

Managed consensus services abstract implementation complexity behind cloud APIs and deliver strong resilience with automated failovers. However, this also brings along issues tied to provider architectural decisions: Auto-scaling operations may spike latency unpredictably, misconfigured network policies could render entire regions unwritable, and multi-partition transactions demand additional coordination overhead.
For most workloads, network latency, query planning, and inefficient indexing are far less concerning than consensus overhead. The consensus “cost” is often overestimated without accounting for read scalability and fault tolerance gains. Consensus bottlenecks emerge in specific scenarios such as extreme write throughput demands (tens of thousands of writes per second per range) and latency-sensitive workloads where milliseconds matter. The consensus layer establishes a reliability floor but does not dictate the performance ceiling.

Partitioning and Sharding in the Real World

Consensus determines how distributed SQL systems replicate data safely, and partitioning determines how they distribute it efficiently. Poor partitioning strategies transform horizontal scale into a liability.

Partitioning Strategies and Their Trade-Offs

Serious workloads demand an understanding of partitioning trade-offs. The table below summarizes the core characteristics of each partitioning strategy:

Strategy | Primary Strength | Primary Weakness | Best-Fit Workload | Operational Complexity
Hash-based | Uniform distribution eliminates write hotspots | Range scans hit all partitions | Write-heavy with point lookups, key-value access patterns | Low: fixed partition count, predictable behavior
Range-based | Preserves order for efficient range scans | Creates hotspots with skewed data (timestamps, high-value keys) | Time series, analytical queries, sequential access | Medium: requires ongoing monitoring and boundary tuning
Hybrid (range within hash, geo-partitioning) | Combines benefits: locality and distribution | Multiple failure modes, complex mid-migration states | Multi-tenant SaaS, data residency requirements | High: demands deep access pattern understanding

Hash-based partitioning uses hashing functions to distribute rows uniformly across partitions without manual tuning. This trade-off is evident in query patterns. Analytical queries performing range scans (WHERE created_at > '2024-01-01') turn into scatter-gather operations and end up hitting every partition. This makes cross-tenant aggregations and time series analysis inefficient. Range-based partitioning performs optimally when data distribution aligns naturally with query patterns. This could be time series data partitioned by month or multi-tenant systems partitioned by customer ID. A single high-value customer or recent timestamp range may end up creating hot partitions. Hybrid schemes succeed when teams thoroughly understand access patterns and possess engineering resources to maintain partition metadata, monitor split/merge operations, and handle failure modes that simpler strategies avoid. The sketch below illustrates what these choices look like in DDL terms.
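To make the hash-versus-range contrast concrete, here is a minimal sketch that is not from the original article: it assumes a PostgreSQL-compatible engine reachable through the psycopg2 driver, and the orders/events tables, column names, and connection string are purely illustrative. Distributed SQL engines layer their own sharding and replication controls on top of DDL like this, so treat it as a sketch of the access-pattern trade-off rather than any vendor's exact syntax.

Python
# Minimal sketch (illustrative names, PostgreSQL dialect): hash vs. range partitioning DDL.
import psycopg2

DDL = """
-- Hash partitioning: uniform write distribution, but range scans touch every partition.
CREATE TABLE orders (
    tenant_id  BIGINT NOT NULL,
    order_id   BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    amount     NUMERIC,
    PRIMARY KEY (tenant_id, order_id)
) PARTITION BY HASH (tenant_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Range partitioning: efficient time-based scans, but the newest partition becomes a hotspot.
CREATE TABLE events (
    event_id   BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    payload    JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
"""

# Placeholder connection string; psycopg2 allows multiple semicolon-separated
# statements in a single execute() call when no query parameters are used.
with psycopg2.connect("dbname=demo user=demo") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)

Against the hash-partitioned orders table, a query such as WHERE created_at > '2024-01-01' has no partition key to prune on and fans out to every partition, which is exactly the scatter-gather cost described above; the range-partitioned events table answers the same predicate from a handful of partitions but concentrates new writes on the latest one.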
Global Tables, Schema Changes, and Rebalancing

Most distributed SQL systems support global or reference tables: small, read-heavy tables replicated fully to every node to avoid cross-partition joins. Since every update propagates cluster-wide, it could transform a 10 MB table into a 10 GB problem when replicated across 1,000 nodes. Similar issues are associated with schema evolution. Adding columns, creating indexes, or altering constraints becomes a distributed transaction coordinating across all partitions — all this while serving production traffic. This takes hours for large tables, during which queries reconcile multiple schema versions. Another common concern is rebalancing overhead, a by-product of automatic scaling and sharding. Adding nodes triggers data redistribution, which competes with production traffic for network, disk, and CPU. When partitions hit size thresholds after traffic spikes, they split, move to new nodes, and trigger further splits as the load redistributes. This can hurt performance as the system spends more time rebalancing than serving queries.

Academic Designs vs. Production Stability

Distributed systems research explores many partitioning schemes, such as adaptive partitioning, which automatically adjusts boundaries based on access patterns, and learned partitioning, which uses ML models to predict data distribution. But these schemes often face practical challenges when implemented in production. Adaptive schemes create unpredictable behavior when workloads shift, complicating capacity planning. ML-driven approaches complicate debugging since operators interpret model outputs rather than review configuration files. Production systems favor predictability. It’s easier to reason about hash partitioning with fixed counts, range partitioning with manually reviewed boundaries, and hybrid schemes with explicit geo-pinning. Building debuggable systems that work for real workloads requires upfront schema design and continuous monitoring, as opposed to relying on theoretical claims.

Serverless and Autoscaling Claims

Serverless distributed SQL separates stateless compute (query execution, transaction coordination) from stateful storage (consensus, persistence), allowing compute to scale independently or down to zero without moving data. This separation introduces a performance trade-off where queries cross the compute-storage boundary over the network rather than reading from local storage.

Scaling, Storage Separation, and Cold-Start Realities

Serverless databases balance fast scaling against cost savings. Systems maintaining warm compute pools scale quickly by adding pre-provisioned nodes, while true cold-start provisioning faces significant delays that create unacceptable latency for user-facing applications. Industry implementations converge on warm-start optimizations rather than true zero-capacity scaling. Most systems keep compute nodes idle but provisioned to reduce start-up latency. Production teams running latency-sensitive workloads configure minimum compute thresholds to maintain always-warm capacity, undermining the cost savings of scaling to zero. Serverless delivers value for bursty workloads like nightly ETL jobs or end-of-month reporting, where teams pay for compute during active periods rather than running a 24/7 cluster. Always-on workloads with occasional spikes often cost more than right-sized provisioned clusters due to serverless pricing and warm pool overhead. Serverless provides fast scaling for anticipated load but struggles with unanticipated spikes. On the other hand, over-provisioning warm pools reintroduces the fixed costs that serverless was designed to eliminate.

What Serverless Actually Delivers

Serverless distributed SQL delivers value in specific scenarios but faces practical constraints. Systems separating compute from storage scale query layers independently without eliminating operational complexity. The term “serverless” is associated with consumption-based pricing (pay for actual usage), managed operations (abstracted infrastructure), and elastic scaling (dynamic resource adjustment), but implementations vary significantly in resource allocation, scaling speed, and performance isolation. Scaling operates within capacity boundaries rather than infinitely. Systems maintain resource pools to reduce startup latency.
Workloads with predictable patterns and acceptable latency variance benefit most from serverless architectures. Those requiring consistent sub-millisecond performance or sustained high throughput find provisioned clusters more suitable. When evaluating serverless options, examine scaling speed under load, latency penalties during scaling events, throttling behavior under resource pressure, and whether operational simplifications justify the performance trade-offs.

The Vector Era: Indexing for Embeddings

Generative AI has pushed distributed SQL systems to support high-dimensional vector embeddings alongside traditional relational data. SQL engines optimize for exact matches and structured queries, while vector search relies on approximate nearest neighbor (ANN) algorithms that fit unnaturally into relational query planning. This creates performance and integration challenges that teams evaluate against unified data platform convenience. Distributed SQL systems integrate vector search through extensions like pgvector or native implementations. Common indexing algorithms include Hierarchical Navigable Small World (HNSW) for graph-based approximate search, Inverted File with Product Quantization (IVF-PQ) for clustering-based approaches, and flat indexes for exact search. Distributed query execution scatters vector similarity searches across shards and merges top-k results at the coordinator.

Performance Bottlenecks

Vector search in distributed SQL encounters bottlenecks that stem from fundamental mismatches between ANN algorithms and traditional SQL query execution models:

Index construction overhead – Building vector indexes is computationally intensive and competes with production traffic. Distributed environments compound this by fragmenting indexes across partitions, requiring result merging that degrades recall.
Query planning limitations – SQL optimizers lack statistics to efficiently plan queries that combine vector similarity with traditional predicates. Systems struggle to determine optimal execution order, often defaulting to strategies that perform poorly for certain access patterns.
Cross-partition execution costs – Vector queries require scatter-gather operations across all partitions, with distance recalculation at the coordinator. This doubles computational work and scales latency with partition count.

Inside or Beside: The Architectural Debate

Integrated vector support succeeds when consistency and operational simplicity matter more than raw performance, making distributed SQL viable for moderate-scale workloads without adding another system. The separation becomes necessary when scale demands specialized optimizations, similar to how teams use dedicated search engines for full-text queries. Most production deployments adopt a hybrid approach where SQL remains the source of truth while vector databases handle high-throughput similarity searches, accepting looser consistency and extra operational overhead in exchange for performance where it matters most. The sketch below shows what integrated vector support looks like in practice.
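As a concrete illustration (not from the original article), here is a minimal sketch of integrated vector support using the pgvector extension mentioned above. It assumes PostgreSQL with pgvector installed and the psycopg2 driver; the documents table, the tiny 3-dimensional embeddings, and the connection string are hypothetical stand-ins, since real embeddings typically have hundreds or thousands of dimensions.

Python
# Minimal pgvector sketch: store embeddings next to relational data, index them with
# HNSW, and run a hybrid top-k similarity query. Names and dimensions are illustrative.
import psycopg2

with psycopg2.connect("dbname=demo user=demo") as conn:  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                doc_id    BIGSERIAL PRIMARY KEY,
                tenant_id BIGINT NOT NULL,
                body      TEXT,
                embedding VECTOR(3)  -- toy dimensionality for the example
            );
        """)
        # Approximate nearest neighbor index (HNSW, cosine distance)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING hnsw (embedding vector_cosine_ops);
        """)
        cur.execute(
            "INSERT INTO documents (tenant_id, body, embedding) VALUES (%s, %s, %s::vector)",
            (42, "intro to distributed sql", "[0.1, 0.2, 0.3]"),
        )
        # Hybrid query: relational predicate plus vector similarity, top-k by distance
        cur.execute(
            """
            SELECT doc_id, body
            FROM documents
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s::vector
            LIMIT 5;
            """,
            (42, "[0.1, 0.2, 0.3]"),
        )
        print(cur.fetchall())

In a sharded deployment, the ORDER BY ... LIMIT 5 step is where the scatter-gather cost described earlier shows up: each shard computes its own candidate top-k and the coordinator merges and re-ranks them.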
Cross-Region Latency and Smart Routing

Multi-region deployments expose fundamental limitations imposed by network latency. Cross-region round-trips add measurable overhead that consensus algorithms and caching strategies cannot eliminate. Mature systems provide explicit controls for balancing consistency, locality, and latency per query, while simpler implementations rely on fixed defaults that work for common cases but lack the flexibility for edge scenarios.

Latency Mitigation Techniques

Three techniques dominate cross-region optimization, each addressing latency through different trade-offs:

Follower reads route queries to local replicas instead of distant leaders, reducing latency at the cost of serving slightly stale data. This performs well for read-heavy workloads like dashboards and analytics, but it requires careful handling for read-modify-write patterns where stale reads cause data inconsistencies.
Regional replicas (geo-partitioning) pin data to specific regions based on locality, keeping queries within a single region fast, while cross-region transactions still face full latency costs. This approach aligns well with data residency requirements but does not eliminate cross-region coordination entirely.
Adaptive routing attempts to optimize query placement dynamically based on current latency and load conditions, but most production systems rely on simpler static routing rules because they offer greater predictability and easier debugging.

Common Production Practices and How To Strike a Balance

Most deployments start single-region, add read replicas for disaster recovery, then enable active-active writes only when necessary. Active-active multi-region is a good fit for applications that need global writes. The fundamental challenge is not eliminating cross-region latency but deciding where to accept it. Systems differ in how they distribute costs between write latency, read consistency, and operational complexity. Single-region leaders keep reads fast through follower replicas while penalizing cross-region writes, whereas multi-region write capabilities reduce regional write latency but add coordination overhead for consistency. Production-ready systems make these trade-offs transparent through documented performance characteristics, explicit configuration options for staleness tolerance, and detailed metrics that cover query routing and replication behavior. Observability is key to successful deployments. Teams test failover procedures regularly since disaster recovery configurations often fail during actual outages due to DNS propagation delays or misconfigured routing. Cross-region bandwidth costs drive design choices that pricing calculators obscure.

A Rubric for Future-Proofing Distributed SQL

Production-ready implementations require evaluation against multiple criteria beyond ACID compliance and horizontal scalability claims:

Observability and operational maturity – Mature systems expose metrics for consensus health, partition-level query rates, and transaction coordination, and provide snapshot backups with automated failover capabilities.
Elasticity and resource sharing – Scaling capabilities range from manual node addition with slow rebalancing to automatic scale-out. Multi-tenancy provides cost efficiency at the expense of workload isolation; single-tenancy provides isolation at a higher cost.
Consistency guarantees – Strong consistency delivers traditional RDBMS correctness with a latency cost, particularly across regions. Many systems allow per-query configuration with options like follower reads and bounded staleness for workloads that can tolerate slight data lag.
Vector support for AI workloads – Mature implementations provide native vector types and indexing algorithms like HNSW or IVF. Some systems explore ML-driven query planning to optimize execution paths for hybrid vector and relational queries.
Community and ecosystem – Strong ecosystems include wide ranges of client libraries, monitoring tools, and operational documentation beyond vendor materials. Evaluate through third-party conference talks, active community channels, and contributor diversity, not just GitHub star counts.

Guidance for Teams Modernizing From a Monolithic or Legacy RDBMS

Single-node best practices like joins, secondary indexing, and schema flexibility become distributed anti-patterns where cross-partition joins are expensive, indexes multiply write amplification, and schema changes coordinate across hundreds of nodes. The lowest-risk path starts with distributed SQL as a read layer: Keep the monolith authoritative for writes, replicate to a distributed cluster, and route reads there for immediate scalability. Migrate writes incrementally, starting with partition-friendly workloads. Schema must be partition-aligned early by replacing auto-incrementing IDs with composite keys like (tenant_id, user_id) or uniformly distributed UUIDs, and ensuring that frequent queries include partition keys in WHERE clauses. Multi-table updates that are trivial in single-node databases become expensive distributed transactions spanning partitions. Identify early whether they can be denormalized, made asynchronous via event-driven architectures, or batched to reduce coordination overhead. Budget sufficient time for phased migration since moving from monolithic SQL to distributed SQL is more of an architectural transformation than just a lift-and-shift.

Conclusion

Distributed SQL has matured from research concepts into production-ready systems. While partitioning schemes and consensus algorithms are established, standards for emerging capabilities still require careful evaluation. Prioritize systems with proven architectures (strong consistency, partition-aligned schemas, predictable behavior) before adopting features that introduce new complexity. Evaluate each against actual requirements rather than marketing claims. The convergence of distributed SQL with AI infrastructure will reshape query optimization and indexing strategies as vector embeddings and traditional relational data increasingly coexist.

Additional resources:

Designing Data-Intensive Applications by Martin Kleppmann
Jepsen analysis reports – rigorous fault-injection testing exposing consistency gaps
Google Site Reliability Engineering principles
ANN Benchmarks – comparative analysis of HNSW, IVF, and other indexing algorithms
pgvector documentation
OpenTelemetry documentation

This is an excerpt from DZone’s 2025 Trend Report, Database Systems: Fusing Transactional Speed and Analytical Insight in Modern Data Ecosystems. Read the Free Report

By Abhishek Gupta DZone Core CORE
How to Prevent Quality Failures in Enterprise Big Data Systems

Problem

Modern enterprises run on data pipelines, and the quality of these pipelines directly determines the quality of business decisions. In many organizations, a critical flaw persists: data quality checks still happen at the very end, after data has already passed through multiple systems, transformations, and dashboards. By the time issues finally surface, they have already spread across layers and become much harder to diagnose. This systemic lag directly undermines the reliability of mission-critical decisions.

Solution

Medallion architecture (Bronze, Silver, Gold), shown in the diagrams, has become a preferred approach for building reliable pipelines. The true power of this architecture is the opportunity it creates for predictable data quality checkpoints. By embedding specific quality checks early and consistently, data teams can catch issues immediately and explain changes to prevent bad data from moving downstream. I will explain how to execute these critical quality controls, walking through three essential quality checkpoints:

Completeness checks in Bronze
Transformation integrity checks in Silver
Reconciliation tests in Gold

I'll also discuss where these checks naturally fit into pipeline execution using PySpark examples and real-world failure scenarios. The diagrams included highlight both pre-production and production flows, helping you understand where these checks naturally fit. Our ultimate goal is straightforward: Build observable pipelines that catch data problems early, long before they reach dashboards or impact decision-makers.

The Silent Data Failure

Most data quality failures go undetected until they reach the dashboard. A PySpark job aggregates daily trading positions. The job runs successfully — no errors. Three days later, risk officers notice portfolio positions are 8% understated. Investigation reveals a join condition silently excluded records due to a schema mismatch. Nobody caught it because the job didn't crash. The data was wrong, but invisible. This happens at scale because data problems compound downstream. One bad record in Bronze becomes 100 bad records in Gold after joins and aggregations. By the time it reaches dashboards, the damage is exponential. The solution isn't better dashboards. It's predictable validation checkpoints embedded in the pipeline architecture. This is the medallion architecture.

Pre-Production Data Quality Flow

Production Data Quality Flow

Pre-Production vs. Production Strategy

Three Checkpoints With PySpark

DQ Check 1: Bronze Completeness

What it validates: Row count comparison. Expected 50,000 records, got only 47,000.

Python
from pyspark.sql.functions import count, col, lag, current_date, when
from pyspark.sql.window import Window

# Read Bronze layer
bronze_df = spark.read.table("bronze.orders")

# Calculate row counts with comparison to previous day
window_spec = Window.orderBy("ingestion_date")

check_1 = (bronze_df
    .filter(col("ingestion_date") >= current_date() - 1)
    .groupBy("ingestion_date")
    .agg(count("*").alias("rows_loaded"))
    .withColumn("yesterday_count", lag("rows_loaded").over(window_spec))
    .withColumn("pct_change",
        ((col("rows_loaded") - col("yesterday_count")) / col("yesterday_count") * 100))
    .withColumn("status",
        when(col("pct_change") < -5, "FAIL: >5% drop")
        .when(col("rows_loaded") == 0, "FAIL: No data")
        .otherwise("PASS")))

check_1.show()
# Alert if status = FAIL

Real-world pattern: IoT sensor ingestion dropping to 25% volume. DQ Check 1 fired immediately. Root cause: upstream API rate limiting.
Team adjusted connection pooling and circuit breaker patterns within 30 minutes. Without this check, downstream analytics would show incorrect sensor data for days.

DQ Check 2: Silver Transformation Integrity

What it validates: Data loss during transformation. If 5,000 records are removed, the audit table explains why.

Python
from pyspark.sql.functions import count, when, col
from pyspark.sql import functions as F

# Read Bronze and Silver
bronze_df = spark.read.table("bronze.customers")
silver_df = spark.read.table("silver.customers")

bronze_count = bronze_df.count()
silver_count = silver_df.count()

# Log what was removed
removed_df = (bronze_df
    .join(silver_df, "customer_id", "anti")  # Records in Bronze but not in Silver
    .withColumn("removal_reason",
        when(~col("email").rlike(r"^[^\s@]+@[^\s@]+\.[^\s@]+$"), "Invalid email format")
        .when(col("age") < 0, "Negative age")
        .when(col("age") > 150, "Unrealistic age")
        .otherwise("Duplicate customer_id")))

audit_summary = (removed_df
    .groupBy("removal_reason")
    .agg(count("*").alias("removal_count"))
    .withColumn("pct_of_total", col("removal_count") / bronze_count * 100)
    .orderBy("removal_count", ascending=False))

# Write to audit table
audit_summary.write.mode("append").option("mergeSchema", "true").saveAsTable("silver.audit_log")

# Check if loss is reasonable (pre-prod >5%, prod >15%)
loss_pct = (bronze_count - silver_count) / bronze_count * 100
status = "PASS" if loss_pct < 5 else "FAIL: Unexpected data loss"
print(f"Bronze: {bronze_count}, Silver: {silver_count}, Loss: {loss_pct}%, Status: {status}")

Real-world pattern: Email validation transformation silently dropped 12% of customer records. Audit table showed "Invalid email format: 1,000 rows removed." The investigation revealed that the regex pattern changed during the library dependency upgrade. Caught in 5 minutes via audit trail instead of 5 days of incorrect customer analytics.

DQ Check 3: Gold Reconciliation

What it validates: Aggregations in Gold reconcile to Silver. If Silver shows $1M but Gold shows $950K, something's broken.
Python
from pyspark.sql.functions import sum as spark_sum, count, countDistinct, col, abs, current_date, when
from pyspark.sql import functions as F

# Read Silver transactions
silver_df = spark.read.table("silver.transactions").filter(col("transaction_date") >= current_date() - 7)

# Silver totals
silver_totals = (silver_df
    .groupBy("transaction_date", "region_id")
    .agg(
        spark_sum("transaction_amount").alias("silver_revenue"),
        count("*").alias("silver_records"),
        countDistinct("customer_id").alias("silver_customers")))

# Read Gold aggregations
gold_df = spark.read.table("gold.daily_revenue").filter(col("report_date") >= current_date() - 7)

gold_totals = (gold_df
    .select(
        col("report_date").alias("transaction_date"),
        "region_id",
        col("total_revenue").alias("gold_revenue"),
        col("transaction_count").alias("gold_records"),
        col("unique_customers").alias("gold_customers")))

# Reconcile
reconciliation = (silver_totals
    .join(gold_totals, ["transaction_date", "region_id"], "full")
    .withColumn("revenue_variance", abs(col("silver_revenue") - col("gold_revenue")))
    .withColumn("variance_pct", (col("revenue_variance") / col("silver_revenue") * 100))
    .withColumn("status",
        when(col("gold_revenue").isNull(), "FAIL: Missing in Gold")
        .when(col("variance_pct") > 1, "FAIL: Revenue variance > 1%")
        .when(col("silver_records") != col("gold_records"), "FAIL: Record count mismatch")
        .otherwise("PASS")))

# Show failures only
failures = reconciliation.filter(col("status") != "PASS")
failures.show()

# Write to monitoring table
reconciliation.write.mode("append").option("mergeSchema", "true").saveAsTable("monitoring.dq_check_3")

Real-world pattern: The credit risk dashboard showed a 2% variance between Silver transaction totals and Gold metrics. Reconciliation check flagged immediately. Root cause: LEFT JOIN excluding records with null counterparty IDs, silently underreporting portfolio exposure. Fix: FULL OUTER JOIN with explicit NULL handling. Prevented incorrect risk metrics from reaching stakeholders.

Statistical Monitoring: Catching Silent Issues

Python
from pyspark.sql.functions import col, avg, stddev_pop, abs as spark_abs, lag, current_date, when
from pyspark.sql.window import Window

# Read Gold revenue data
gold_df = spark.read.table("gold.daily_revenue").filter(col("report_date") >= current_date() - 90)

# Define window for 30-day statistics
# (range frames need a numeric ordering expression, so order by epoch seconds)
window_30d = (Window
    .orderBy(col("report_date").cast("timestamp").cast("long"))
    .rangeBetween(-30 * 24 * 3600, 0))

# Calculate statistical anomalies
monitoring = (gold_df
    .withColumn("avg_30d", avg("total_revenue").over(window_30d))
    .withColumn("stddev_30d", stddev_pop("total_revenue").over(window_30d))
    .withColumn("std_devs_from_avg", spark_abs(col("total_revenue") - col("avg_30d")) / col("stddev_30d"))
    .withColumn("anomaly_flag",
        when(col("std_devs_from_avg") > 3, "ANOMALY: 3+ std devs")
        .when(col("total_revenue") < col("avg_30d") * 0.85, "WARNING: 15% below average")
        .otherwise("NORMAL")))

# Show anomalies
anomalies = monitoring.filter(col("anomaly_flag") != "NORMAL")
anomalies.show()

# Write monitoring results
monitoring.write.mode("append").option("mergeSchema", "true").saveAsTable("monitoring.statistical_checks")

This catches silent failures: data that passes threshold checks but is statistically wrong. A join that silently excludes 8% of records can pass row count checks but will fail statistical monitoring.
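The checks above only print or persist their results; the comment "Alert if status = FAIL" is left open. As one possible wiring, and not part of the original article, here is a minimal sketch of a fail-fast helper that stops the run when any check that produces a status column (DQ Check 1 and DQ Check 3 above) reports a failure:

Python
# Minimal sketch (not from the original article): abort the pipeline run when any
# data quality check reports a FAIL status, so bad data never reaches Gold consumers.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def enforce_dq(check_name: str, check_df: DataFrame) -> None:
    """Raise if the given check DataFrame contains any FAIL rows."""
    failures = check_df.filter(col("status").startswith("FAIL"))
    failure_count = failures.count()
    if failure_count > 0:
        failures.show(truncate=False)  # surface the failing rows in the job logs
        raise RuntimeError(f"{check_name}: {failure_count} data quality failure(s) detected")

# Example usage with the DataFrames built above
enforce_dq("DQ Check 1 (Bronze completeness)", check_1)
enforce_dq("DQ Check 3 (Gold reconciliation)", reconciliation)

In most orchestrators, the raised exception marks the task as failed and blocks downstream steps, which is the "fail fast" behavior the conclusion below argues for.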
Implementation Roadmap

Week 1: Set up Bronze monitoring table, implement DQ Check 1 with PySpark.
Week 2: Implement DQ Check 2 (transformation audit) with removal tracking.
Week 3: Implement DQ Check 3 (reconciliation), comparing Silver to Gold.
Week 4: Deploy to production with conservative thresholds (>10% variance).

Quick Reference

Checkpoint | Pre-Production Threshold | Production Threshold
Bronze (Row count) | >1% variance | >10% variance
Silver (Data loss) | >5% unexplained | >15% unexplained
Gold (Reconciliation) | >0.5% variance | >1% variance

Conclusion

Bad data problems often appear quietly and are usually found too late, when dashboards show incorrect figures. When this happens, the error has already moved through different steps, making it tough to figure out what went wrong and causing problems for important business decisions. To fix this, the medallion architecture (which uses layers called Bronze, Silver, and Gold) is a good way to build reliable data systems. This design sets up important checkpoints to check the data quality. These checkpoints help teams catch problems quickly, explain changes clearly, and keep bad data from going any further. The main checks include completeness checks in the Bronze layer, checks to ensure data changes are applied correctly in the Silver layer, and reconciliation tests in the Gold layer. The simple goal is to build systems where data issues "fail fast," meaning they stop quickly and never reach the people making decisions. By making data quality a basic part of the system's structure, organizations make sure they are running on trustworthy data.

By Ram Ghadiyaram DZone Core CORE
