Data Engineering Resources

DZone's Featured Data Engineering Resources

The Role of DQ Checks in Data Pipelines

By Ajay Krishnan Prabhakaran

Overview One of the key principles of writing a good data pipeline is ensuring accurate data is loaded into the target table. We have no control over the quality of the upstream data we read from, but we can have a few data quality (DQ) checks in our pipeline to ensure any bad data would be caught early on without letting it propagate downstream. DQ checks are critical in making sure the data that gets processed every day is reliable, and that downstream tables can query them safely. This will save a lot of time and resources, as we will be able to halt the data flow, giving us some time to do RCA and fix the issue rather than pass incorrect data. The biggest challenge with large data warehouses with multiple interdependent pipelines is that we would have no idea about the data issue if bad data gets introduced in one of the pipelines, and sometimes, it could take days, even before it's detected. Even though DQ check failures could cause some temporary delay in landing the data, it's much better than customers or users reporting data quality issues and then having to backfill all the impacted tables. Some of the common data quality issues that could occur are: Duplicate rows – a table at user grain (which means there can only be one row per user), having duplicates0 or null values – you expect certain critical columns not to have any null or 0 values, e.g., SSN, age, country columnsAbnormal row count – the overall row count of the table suddenly increases or drops compared to the historical valuesAbnormal metric value – a specific metric, say '# of daily user logins' suddenly spikes or drops compared to historical values Note: The operators we will be referencing below are part of the Dataswarm and Presto tech stack, which are a proprietary data pipeline building tool and an SQL query engine, respectively, developed at Facebook. Importance of Signal Tables It's a good practice to publish signal tables, which should serve as the source for downstream pipelines. These are essentially linked views that can be created on top of any table. Since they are views, they don’t take up any storage, so there is no reason not to build them. These should be created only after the DQ checks pass, and downstream pipelines should wait for these signal tables to be available rather than waiting for the source table directly, as these would have been vetted for any data anomalies. Building the Right DAG In the data lineage flow below, if bad data gets loaded into table1, then without DQ checks, they would get passed on to table2 and table3 as there is no way for pipelines2 and 3 to know of any data issues, as all they do is simply check if the table1 data has landed. But if DQ checks had been implemented, then it would fail the job/pipeline, and the table1_sigal wouldn’t have been created; thus, the downstream WaitForOperators would still be waiting, stopping the propagation of bad data. Types of DQ Failures to Enforce Hard failure. If these DQ checks fail, the job will fail and notify the oncall or table owner, so the signal table will not be created. These could potentially cause downstream pipelines to be delayed and could be an issue if they have tighter Service Level Agreements (SLAs). But for critical pipelines, this might be worth it, as sending bad data could have catastrophic ripple effects.Soft failure. If these fail, the oncall and table owner would be notified, but the job won't fail, so the signal table would still get published, and the data would get loaded and propagated downstream. For cases where the data quality loss is tolerable, this can be used. Setting Up DQ Checks We will go over some examples of how we can set up the different DQ checks and some simplified trigger logic behind each of the DQ operators. Some things to know beforehand: '<DATEID>' is a macro that will resolve to the date the Dataswarm pipeline is scheduled to run (e.g., when the job runs on Oct 1, 2020, it will resolve to ds = '2020-10-01').The output of presto_api will be an array of dictionaries, e.g., [{'ds': '2020-10-01', 'userID': 123, ‘country’: ‘US’}, {'ds': '2020-10-01', 'userID': 124, ‘country’: ‘CA’}, {...}], where each dictionary value represents the corresponding row values of the table being queried, and the key is the column name. Below would be the table representation of the data, Duplicate Rows We can simply aggregate by the key column (e.g., userID) specified by the user and check if there are any duplicate rows present by peforming a simple GROUP BY with a HAVING clause, and limiting to just 1 row. The presto_results variable should be empty ([]); if not, then there are duplicates present in the table. Python # output will be an array of dict representing reach row in the table # eg [{'ds': '2020-10-01', 'userID': 123}, {...}] presto_results = presto_api( namespace = 'namespace_name', sql = ''' SELECT useriID FROM table WHERE ds = '<DATEID>' GROUP BY 1 HAVING SUM(1) > 1 LIMIT 1 ''' ) if len(presto_results) > 0: # NOTIFY oncall/owner # JOB KILLED else: # JOB SUCCESS 0 or Null Values We can check if any of the specified columns have any invalid values by leveraging count_if presto UDF. Here, the output, if there are no invalid values, should be [{'userid_null_count': 0}]. Python presto_results = presto_api( namespace = 'namespace_name', sql = ''' SELECT COUNT_IF( userid IS NULL OR userid = 0 ) AS userid_null_count FROM table WHERE ds = '<DATEID>' ''' ) if presto_results[0]['userid_null_count'] > 0: # NOTIFY oncall/owner # JOB KILLED else: # JOB SUCCESS Abnormal Row Count To get a sense of what the normal/expected row count is for a table on a daily basis, we can do a simple 7-day average of the previous 7 days, and if today's value deviates too much from that, we can trigger the alert. The thresholds can be either: Static – a fixed upper and lower threshold that is always static. Every day, the operator checks if today’s row count is either over or below the thresholds.Dynamic – use a +x% and -x% threshold value (you can start with, say, 15%, and adjust as needed), and if today's value is greater than the 7d avg + x% or lower than the 7d avg - x%, then trigger the alert. Python dq_insert_operator = PrestoInsertOperator( input_data = {"in": "source_table"}, output_data = {"out": "dq_check_output_table"}, select = """ SELECT SUM(1) AS row_count FROM source_table WHERE ds = '<DATEID>' """, ) dq_row_check_result = presto_api( namespace = 'namespace_name', sql = ''' SELECT ds, row_count FROM dq_check_output_table WHERE ds >= '<DATEID-7>' ORDER BY 1 ''' ) # we will loop through the dq_row_check_result object, which will have 8 values # where we will find the average between DATEID-7 and DATEID-1 and compare against DATEID x = .15 # threshold prev_7d_list = dq_row_check_result[0:7] prev_7d_sum = sum([prev_data['row_count'] for prev_data in prev_7d_list]) prev_7d_avg = prev_7d_sum/7 today_value = dq_row_check_result[-1]['row_count'] upper_threshold = prev_7d_avg * (1 + x) lower_threshold = prev_7d_avg * (1 - x) if today_value > upper_threshold or today_value < lower_threshold: # NOTIFY oncall/owner # JOB KILLED else: # JOB SUCCESS So, every day, we calculate the sum of the total row count and load it into a dq_check_output_table (a temporary intermediate table that is specially used for storing DQ aggregated results). Then, we query the last 7 days and today's data from that table and store the values in an object, which we then loop through to calculate the upper and lower thresholds and check if today's value is violating either of them. Abnormal Metric Value If there are specific metrics that you want to track to see if there are any anomalies, you can set them up similarly to the above 'abnormal row count' check. Python dq_insert_operator = PrestoInsertOperator( input_data={"in": "source_table"}, output_data={"out": "dq_check_output_table"}, select=""" SELECT APPROX_DISTINCT(userid) AS distinct_user_count, SUM(cost) AS total_cost, COUNT_IF(has_login = True) AS total_logins FROM source_table WHERE ds = '<DATEID>' """, ) dq_row_check_result = presto_api( namespace='namespace_name', sql=''' SELECT ds, distinct_user_count, total_cost, total_logins FROM table WHERE ds >= '<DATEID-7>' ORDER BY 1 ''' ) Here, we calculate the distinct_user_count, total_cost, and total_logins metric and load it into a dq_check_output_table table, which we will query to find the anomalies. Takeaways You can extend this to any kind of custom checks/alerts like month-over-month value changes, year-over-year changes, etc. You can also specify GROUP BY clauses, for example, track the metric value at the interface or country level over a period of time. You can set up a DQ check tracking dashboard, especially for important metrics, to see how they have been behaving over time. In the screenshot below, you can see that there have been DQ failures for two of the dates in the past, while for other days, it has been within the predefined range. This can also be used to get a sense of how stable the upstream data quality is. They can save a lot of time as developers would be able to catch issues early on and also figure out where in the lineage the issue is occurring.Sometimes, the alerts could be false positive (FP) (alerts generated not due to bad/incorrect data, but maybe due to seasonality/new product launch, there could be a genuine volume increase or decrease). We need to ensure such edge cases are handled correctly to avoid noisy alerts. There is nothing worse than oncall being bombarded with FP alerts, so we want to be mindful of the thresholds we set and tune them as needed periodically. More

Exploring Operator, OpenAI’s New AI Agent

By Kailash Pathak

CORE

Testing is a critical yet often time-consuming process. Ensuring that every feature, flow, and edge case works as intended can take up significant resources — both in terms of time and manpower. Manual testing, while thorough, is prone to human error and inefficiency, especially when dealing with repetitive tasks or complex workflows. OpenAI recently introduced an advanced AI agent that would enhance our approach to software testing. In this article, we’ll explore what Operator is, how it functions, and, most importantly, how it can drastically reduce manual testing time for developers and QA teams. We’ll also walk through some real-world examples to demonstrate its potential impact on testing various application flows and some potential limitations. What Is Operator? Operator is an AI-powered agent designed to interact with digital systems in a way that mimics human behavior. Unlike traditional automation tools that require explicit scripting and predefined rules, Operator leverages natural language processing (NLP) and machine learning to understand instructions and execute actions dynamically. It’s like having a virtual assistant that can navigate applications, perform tasks, and even troubleshoot issues — all without requiring extensive coding knowledge. The key features of Operator include: Natural language understanding. You can provide instructions in plain English, such as "Log into the app using test credentials" or "Verify if the payment gateway redirects correctly."Dynamic adaptability. Operator adapts to changes in UI elements, making it more resilient than static scripts.Task automation. From filling out forms to simulating multi-step user journeys, Operator handles repetitive tasks effortlessly.Error detection. The agent can identify anomalies during execution and flag them for review. These capabilities make Operator particularly well-suited for automating end-to-end testing scenarios, where flexibility and adaptability are crucial. Why Manual Testing Still Dominates and Its Challenges Despite advances in automated testing frameworks, many organizations still rely heavily on manual testing for several reasons: Complex workflows. Some applications have intricate user paths that are difficult to script.Frequent updates. Agile development cycles mean frequent updates, rendering pre-written scripts obsolete quickly.Edge cases. Identifying and testing rare but critical edge cases requires creativity and intuition, which scripted tests lack. However, manual testing comes with its own set of challenges: Time-consuming. Repetitive tasks eat up valuable hours that could be spent on innovation.Human error. Even experienced testers can miss subtle bugs due to fatigue or oversight.Scalability issues. As projects grow larger, scaling manual efforts becomes impractical. This is where Operator shines — it combines the precision of automation with the adaptability of human-like interaction, addressing these pain points effectively. Reducing Manual Testing Time With Operator Let’s dive into a practical example to illustrate how Operator can streamline testing processes and save time. Imagine you’re working on an e-commerce platform with the following core functionalities: User registration and loginProduct search and filteringAdding items to the cartCheckout process, including payment integration Each of these steps involves multiple sub-tasks, validations, and possible error conditions. Let’s see how Operator can help automate the testing of these flows. Scenario 1: Testing User Registration and Login Traditional Approach A manual tester would need to: Create new accounts repeatedly with different datasets (valid emails, invalid formats, duplicate entries)Test password strength requirementsAttempt logins with correct/incorrect credentialsCheck email verification links. This process could easily take 1–2 hours per round of testing, depending on the number of variations. With Operator: You simply instruct Operator in natural language: Prompt Create five new user accounts with valid details, one account with an invalid email format, and another with a weak password. Then, attempt to log in with each set of credentials and verify error messages. Operator will: Generate test data automaticallyExecute registration attempts across all specified scenariosLog in with each credential combinationValidate responses against expected outcomes What once took hours now takes mere minutes, freeing up your team to focus on higher-value activities. Scenario 2: Testing Product Search and Filtering Traditional Approach Testers manually search for products using various keywords, filters (price range, category), and sorting options. They must ensure results align with expectations and handle cases where no matches exist. With Operator Provide a simple command: Prompt Search for 'laptop' and apply filters: price between $100–$1000, brand='Apple', sort by relevance. Repeat with non-existent product names like 'unicorn laptop.' Operator will: Perform searches and apply filters systematicallyCompare actual results with expected outputsFlag discrepancies, such as incorrect filter applications or missing items Scenario 3: End-to-End Checkout Process Traditional Approach Manually adding items to the cart, entering shipping details, selecting payment methods, and verifying confirmation pages is tedious. Any change in the checkout flow necessitates retesting everything from scratch. With Operator Use a straightforward instruction: Prompt Add three random products to the cart, proceed to checkout, enter dummy shipping info, select PayPal as the payment method, and confirm the order. Operator will: Automate the entire checkout journeyHandle both successful and failure scenariosEnsure error messages appear appropriately and transactions reflect accurately Benefits Beyond Time Savings While reducing manual testing time is a significant advantage, Operator offers additional benefits that enhance the overall testing process: Improved accuracy. Operator eliminates human errors associated with repetitive tasks, leading to more reliable results.Enhanced collaboration. Since Operator uses natural language, non-technical stakeholders can easily participate in defining test scenarios.Cost efficiency. Automating routine tests reduces dependency on large QA teams, lowering operational costs.Focus on innovation. Freed from manual tasks, testers can dedicate more time to exploratory testing and creative problem-solving. Potential Limitations and Considerations While Operator holds immense promise, it’s essential to acknowledge certain limitations: Learning curve. Teams must learn to phrase test requirements effectively for the AI.Complex UI interactions. Highly dynamic interfaces (e.g., games, AR apps) may still require human intervention.Ethical oversight. Over-reliance on AI could lead to complacency. Human review remains essential for critical systems. That said, these challenges are outweighed by the long-term gains in efficiency and reliability. Conclusion As software complexity continues to rise, so does the demand for smarter, faster, and more adaptable testing solutions. Operator represents a paradigm shift in how we approach quality assurance, bridging the gap between human expertise and machine efficiency. With Operator, development teams can significantly cut down on manual testing time, achieve broader test coverage, and deliver high-quality products at a faster pace. In my next blog, I will provide a live example and explain it in greater detail. More

SOC 2 Made Simple: Your Guide to Certification

By Roman Misyurin

Build a Stateless Microservice With GitHub Copilot in VSCode

By Sibasis Padhi

A Guide to Automating AWS Infrastructure Deployment

By Praveen Kumar Thopalle

A View on Understanding Non-Human Identities Governance

Can an identity exist without being referenced by another identity? How would we know? That might seem a bit philosophical for a security tech article, but it is an important point to keep in mind when tackling the subject of non-human identities. A better question around security would actually be, "Should an identity exist if it can not be interacted with?" We might not be able to reach the answer to that first question, as proving the nature of reality is a little out of scope for computer science. However, a lot of folks have been hard at work building the NHI Governance tools to determine if a machine identity exists, why it exists, and answer the question of whether it should exist. The future of eliminating secrets sprawl means getting a handle on the lifecycles and interdependencies of the non-human identities that rely on secrets. But why now? Let's step back and re-examine some of our assumptions about NHIs and their existence. What Are Non-Human Identities? Before we proceed, let's define NHI in the context of this conversation. In the simplest terms, a non-human identity, also commonly referred to as a machine identity or a workload identity, is any entity that is not human and can perform an action within your system, most commonly interacting exclusively with other non-humans. This could be a Kubernetes pod that needs to interact with a data source and send the processed data to a reporting system. This could be an Internet of Things (IoT) sensor feeding data to a central server. This could be a Slack-based chatbot. If no human input is directly needed after the initial creation for the entity to get work done, then we should consider that identity 'non-human.' The one thing all these examples have in common is that they interact with another system. If we want them to communicate with the entire world, that is easy, as we simply point to the other non-human identities and programmatically describe how they should interact. However, we most likely want these systems to communicate securely, only authorizing specific identities under specific circumstances. This has driven the evolution of secrets for access management, from simple username/password pairs to API keys to certificates. Admittedly, that is a broad definition of NHI. However, we can narrow down what we care about with machine identities by stepping back and considering how these entities relate to one another through the lens of their secrets, allowing access and communication. All NHIs Connect to Other Systems Can you build a stand-alone application that does not take in any input, produce any output, and has no addressable interface? Does such an application exist outside of a thought experiment? While fun to think about, the reality is that all NHIs we care about exist to communicate with other identities. NHIs inherently require connections to other systems and services to fulfill their purpose. This interconnectivity means every NHI becomes a node in a web of interdependencies. From an NHI governance perspective, this necessitates maintaining an accurate and dynamic inventory of these connections to manage the associated risks. For example, if a single NHI is compromised, what does it connect to, and what would an attacker be able to access to laterally move into? Proper NHI governance must include tools to map and monitor these relationships. While there are many ways to go about this manually, what we actually want is an automated way to tell what is connected to what, what is used for what, and by whom. When thinking in terms of securing our systems, we can leverage another important fact about all NHIs in a secured application to build that map, they all, necessarily, have secrets. All Secure NHIs Must Have a Secret In order to establish trusted communication between any two NHIs, a unique secret, such as an API key, token, or certificate, must exist for those entities to authenticate. We can use the secret to prove an NHI's identity and map it in the ecosystem. The question becomes, where do we look for these secrets? In the modern enterprise, especially larger ones, there are essentially only two places a secret can live. Your first option is the best practice and safest option: a secrets management system, such as CyberArk's Conjur, Vault by HashiCorp, or AWS Secrets Manager. The other option is much less secure but, unfortunately, all too common: outside of a vault, in code, or configuration in plaintext. Enterprise secrets management platforms, often referred to as vaults, are critical for storing and protecting secrets used by NHIs. Vaults can provide a single source of truth for all secrets, ensuring they are encrypted at rest, tightly access-controlled, and monitored for unauthorized access attempts. This assumes you have standardized on a single enterprise secret management platform. Most organizations actually have many vaults in use at the same time, making synchronization between all vaults an additional challenge. Teams can map all existing machine identities based on the existence of these secrets. For enterprises with multiple secret management solutions in place, you need to know which vaults do and do not contain a secret and to reduce the overhead of storing the same key redundantly across several vaults. All NHI Secrets Have an Origin Story Machines can't grant themselves permissions and access. Every machine identity was created by or represents a human identity. Governance of NHIs must include secret creation tracking to ensure every secret is traceable to its origin, securely distributed, and linked to a legitimate identity. While this aspect could be accounted for with the proper use of a secret management platform, our data keeps telling us that a certain percentage of secrets leak year after year because we are not consistently using these vault solutions. We know from years of experience helping teams remediate incidents that the creator of a secret will almost always be the person who first introduces the credential into an ecosystem. We can also tell from the code history or other system timestamp information when this was first seen, which is the most probable time for it to be created or at least come into existence in a meaningful way. This is a critical detail that might never have been properly logged or documented anywhere else. Once you understand who created a secret to be able to leverage an NHI, then you truly understand the beginning of our NHI lifecycle. All NHI Secrets Must Grant Some Set of Permissions When created, every NHI secret must be granted a certain set of permissions. The scope determines what actions an identity can perform and on which systems. This makes permission scoping and enforcement crucial components of governance. Essentially, two risks make understanding the scope of a secret critical for enterprise security. First is that misconfigured or over-privileged secrets can inadvertently grant access to sensitive data or critical systems, significantly increasing the attack surface. Imagine accidentally giving write privileges to a system that can access your customer's PII. That is a ticking clock waiting for a threat actor to find and exploit it. Also, just as troubling is that when a secret is leaked or compromised, a team can not replace it until they first understand how those permissions were configured. For example, suppose you know a mission-critical microservice's secret was accidentally pushed to a public GitHub repo. In that case, it is only a matter of time before it will be discovered and used by someone outside of your organization. In our recent Voice of the Practitioner report, IT decision-makers admitted it took, on average, 27 days to rotate these critical secrets. Teams should be able to act in seconds or minutes, not days. Tools that provide additional context about detected secrets, including their roles and permissions, are needed. Rapidly understanding what assets are exposed when a leak occurs and what potential damage can be inflicted by a threat actor goes a long way when responding to an incident. Knowing exactly how to replace it from a dashboard view or API call can mean the difference between a breach and a frustrated attacker finding the key they have is invalid. All NHI Secrets Need to be Rotated A machine identity can, and likely should, have many secrets in its lifetime. If credentials are left to live for months or years, or in the worst case, forever, NHI secrets exposure or compromise becomes increasingly likely. Manual rotation is error-prone and operationally taxing, particularly in environments with thousands of NHIs. Automating the secret rotation process is a cornerstone of NHI governance, ensuring that secrets are refreshed before they expire or are leaked. For any of the secrets in your vaults, rotation should be a simple matter of scripting. Most secret management platforms provide scripts or some other mechanism to handle the delicate dance of safely replacing and revoking the old secret. But what about the NHI secrets that live outside of these vaults, or perhaps the same secret that is spread across multiple vaults? A good secret scanning platform needs seamless integration with these vaults so that your team can more easily find and safely store these secrets in the secrets manager and prepare the way for automated rotation. GitGuardian's reference implementation with CyberArk's Conjur goes into more detail on how you can fully automate the entire storage and rotation process. By identifying all the NHIs and knowing when they were created, we can also predict when they need to be rotated. While every team will judge exactly how long each secret should live, any secrets that have never been rotated after creation are ripe to be replaced. Any secret older than a year, or for some mission-critical systems, a few days should also be prioritized for rotation asap. All NHIs Will Have an End-of-Life NHIs, like their human counterparts, have finite lifecycles. They may be decommissioned when a service is retired, replaced, or no longer needed. Without addressing the deactivation and cleanup of NHIs to prevent the persistence of unused secrets or stale connections, we are creating security blind spots. But how do we know when we are at the end of the road for an NHI, especially if its secret remains valid? One answer is that it should no longer exist when an NHI no longer connects to another active system. This ensures attackers cannot exploit defunct NHI secrets to gain a foothold in your environment. Remember that attackers do not care how a secret should be appropriately used; they only care about what they can do with it. By mapping all the relationships an NHI's secrets allow, you can identify when a system is no longer connected to any other identity. Once there are no more ways for an identity to communicate, then it and its secrets should no longer exist. It also means the secret no longer needs to be stored in your secrets managers, giving you one less thing to store and manage. Understanding the World Around Your NHIs is Critical to Security In 2022, CyberArk's research showed that for every human identity in an environment, at least 45 non-human identities need to be managed. That ratio today is likely closer to 1 to 100 and is ever-increasing. The best time to come to terms with your NHI governance and lifecycle management was years ago. The next best time is right now. It is time for a full-cycle approach to non-human identity security, mapping out not just where your NHI secrets are but, just as importantly, what other NHIs are connected. We are overdue, across all industries, to implement NHI governance at scale. Finding and properly storing your secrets is just the beginning of the story. We must better document and understand the scope of NHI secrets, their age, who implemented them, and other contextual information, such as when they should be rotated. Even though machine identities outnumber human beings, there is no reason to work alone to solve this problem; we are all in it together.

By Dwayne McDaniel

Build a URL Shortener With Neon, Azure Serverless Functions

Neon is now available on the Azure marketplace. The new integration between Neon and Azure allows you to manage your Neon subscription and billing through the Azure portal as if Neon were an Azure product. Azure serverless and Neon are a natural combination — Azure serverless frees you from managing your web server infrastructure. Neon does the same for databases, offering additional features like data branching and vector database extensions. That said, let's try out this new integration by building a URL shortener API with Neon, Azure serverless, and Node.js. Note: You should have access to a terminal, an editor like VS Code, and Node v22 or later installed. Setting Up the Infrastructure We are going to have to do things a little backward today. Instead of writing the code, we will first first set up our serverless function and database. Step 1. Open up the Azure web portal. If you don’t already have one, you will need to create a Microsoft account. Step 2. You will also have to create a subscription if you don’t have one already, which you can do in Azure. Step 3. Now, we can create a resource group to store our serverless function and database. Go to Azure's new resource group page and fill out the form like this: This is the Azure Resource Group creation page with the resource group set to "AzureNeonURLShortener" and the location set to West US 3.In general, use the location closest to you and your users, as the location will determine the default placement of serverless functions and what areas have the lowest latency. It isn’t vital in this example, but you can search through the dropdown if you want to use another. However, note that Neon doesn’t have locations in all of these regions yet, meaning you would have to place your database in a region further from your serverless function. Step 4. Click "Review & Create" at the bottom to access a configuration review page. Then click "Create" again. Step 5. Now, we can create a serverless function. Unfortunately, it includes another form. Go to the Azure Flex consumption serverless app creation page and complete the form. Use the resource group previously created, choose a unique serverless function name, place the function in your resource group region, and use Node v20. Step 6. The name you choose for your serverless app will be the subdomain Azure gives you to access your API, so choose wisely. After you finish filling everything out, click "Review and Create" and finally, "Create." Azure should redirect you to the new app's page. Now we can set up Neon. Open the new Neon Resource page on the Azure portal, and, you guessed it, fill out the form. How to Create a Neon Database on Azure Step 1. Create a new Neon resource page with "AzureURLNeonShortener" as the resource group, "URLShortenerDB" as the resource name, and "West US 3" as the location. If the area you chose isn’t available, choose the next closest region. Once you complete everything, click "Review & Create" and then "Create," as you did for previous resources. Step 2. You might have to wait a bit for the Neon database to instantiate. Once it does, open its configuration page and click "Go To Neon." Step 3. You will be redirected to a login page. Allow Neon to access your Azure information, and then you should find yourself on a project creation page. Fill out the form below: The project and database name aren't significant, but make sure to locate the database in Azure West US 3 (or whatever region you choose). This will prevent database queries from leaving the data center, decreasing latency. Step 4. Click "Create" at the bottom of the page, keeping the default autoscaling configuration. You should now be redirected to a Neon database page. This page has our connection string, which we will need to connect to our database from our code. Click "Copy snippet" to copy the connection string. Make sure you don’t lose this, as we will need it later, but for now, we need to structure our database. Step 5. Click “SQL Editor” on the side navigation, and paste the following SQL in: SQL CREATE TABLE IF NOT EXISTS urls(id char(12) PRIMARY KEY, url TEXT NOT NULL); Then click "Run." This will create the table we will use to store URLs. The table is pretty simple: The primary key ID is a 12 — character random string that we will use to refer to URLs, and the URL is a variable-length string that will store the URL itself. Step 6. If you look at the Table view on the side navigation, you should see a “urls” table. Finally, we need to get our connection string. Click on “Dashboard” on the side nav, find the connection string, and click “Copy snippet.” Now, we can start writing code. Building the API Step 1. First, we must install Azure’s serverless CLI, which will help us create a project and eventually test/publish it. Open a terminal and run: Plain Text npm install -g azure-functions-core-tools --unsafe-perm true Step 2. If you want to use other package managers like Yarn or pnpm, just replace npm with your preferred package manager. Now, we can start on our actual project. Open the folder you want the project to be in and run the following three commands: Plain Text func init --javascript func new --name submit --template "HTTP trigger" func new --name url --template "HTTP trigger" npm install nanoid @neondatabase/serverless Now, you should see a new Azure project in that folder. The first command creates the project, the two following commands create our serverless API routes, and the final command installs the Neon serverless driver for interfacing with our database and Nano ID for generating IDs. We could use a standard Postgres driver instead of the Neon driver, but Neon’s driver uses stateless HTTP queries to reduce latency for one-off queries. Because we are running a serverless function that might only process one request and send one query, one-off query latency is important. You will want to focus on the code in src/functions, as that is where our routes are. You should see two files there: submit.js and redirect.js. submit.js submit.js will store the code we use to submit URLs. First, open submit.js and replace its code with the following: TypeScript import { app } from "@azure/functions"; import { neon } from "@neondatabase/serverless"; import { nanoid } from "nanoid"; const sql = neon("[YOUR_POSTGRES_CONNECTION_STRING]"); app.http("submit", { methods: ["GET"], authLevel: "anonymous", route: "submit", handler: async (request, context) => { if (!request.query.get("url")) return { body: "No url provided", status: 400, }; if (!URL.canParse(request.query.get("url"))) return { body: "Error parsing url", status: 400, }; const id = nanoid(12); await sql`INSERT INTO urls(id,url) VALUES (${id},${request.query.get( "url" )})`; return new Response(`Shortened url created with id ${id}`); }, }); Let’s break this down step by step. First, we import the Azure functions API, Neon serverless driver, and Nano ID. We are using ESM (ES Modules) here instead of CommonJS. We will need to make a few changes later on to support this. Next, we create the connection to our database. Replace [YOUR_POSTGRES_CONNECTION_STRING] with the string you copied from the Neon dashboard. For security reasons, you would likely want to use a service like Azure Key Vault to manage your keys in a production environment, but for now, just placing them in the script will do. Now, we are at the actual route. The first few properties define when our route handler should be triggered: We want this route to be triggered by a GET request to submit. Our handler is pretty simple. We first check if a URL has been passed through the URL query parameter (e.g., /submit?url=https://google.com), then we check whether it is a valid URL via the new URL.canParse API. Next, We generate the ID with Nano ID. Because our IDs are 12 characters long, we have to pass 12 to the Nano ID generator. Finally, we insert a new row with the new ID and URL into our database. The Neon serverless driver automatically parameterizes queries, so we don’t need to worry about malicious users passing SQL statements into the URL. redirect.js redirect.js is where our actual URL redirects will happen. Replace its code with the following: TypeScript import { app } from "@azure/functions"; import { neon } from "@neondatabase/serverless"; const sql = neon("[YOUR_POSTGRES_CONNECTION_STRING]"); app.http("redirect", { methods: ["GET"], authLevel: "anonymous", route: "{id:length(12)}", handler: async (request, context) => { const url = await sql`SELECT * FROM urls WHERE urls.id=${request.params.id}`; if (!url[0]) return new Response("No redirect found", { status: 404 }); return Response.redirect(url[0].url, 308); }, }); The first section of the script is the same as submit.js. Once again, replace it \[YOUR\_POSTGRES\_CONNECTION\_STRING\] with the string you copied from the Neon dashboard. The route is where things get more interesting. We need to accept any path that could be a redirect ID, so we use a parameter with the constraint of 12 characters long. Note that this could overlap if you ever have another 12-character route. If it does, you can rename the redirect route to start with a Z or other alphanumerically greater character to make Azure serverless load the redirect route after. Finally, we have our actual handler code. All we need to do here is query for a URL matching the given ID and redirect to it if one exists. We use the 308 status code in our redirect to tell browsers and search engines to ignore the original shortened URL. Config Files There are two more changes we need to make. First, we don’t want a /api prefix on all our functions. To remove this, open host.json, which should be in your project directory, and add the following: TypeScript "extensions": { "http": { "routePrefix": "" } } This allows your routes to operate without any prefixes. The one other thing we need to do is switch the project to ES Modules. Open package.json and insert the following at the end of the file: Plain Text "type": "module" That’s it! Testing and Deploying Now, you can try testing locally by running func start. You can navigate to http://localhost:7071/submit?url=https://example.com, then use the ID it gives you and navigate to http://localhost:7071/[YOUR_ID]. You should be redirected to example.com. Of course, we can’t just run this locally. To deploy, we need to install the Azure CLI, which you can do with one of the following commands, depending on your operating system: macOS (Homebrew) Plain Text brew install azure-cli Windows (WPM) Plain Text winget install -e --id Microsoft.AzureCLI Linux Plain Text curl -L <https://aka.ms/InstallAzureCli> | bash Now, restart the terminal, log in by running az login, and run the following in the project directory: Plain Text func azure functionapp publish [FunctionAppName] Replace [FunctionAppName] with whatever you named your function earlier. Now, you should be able to access your API at [FunctionAppName].azurewebsites.net. Conclusion You should now have a fully functional URL Shortener. You can access the code here and work on adding a front end. If you want to keep reading about Neon and Azure’s features, we recommend checking out Branching. Either way, I hope you learned something valuable from this guide.

By Bobur Umurzokov

Relational DB Migration to S3 Data Lake Via AWS DMS, Part I

AWS Database Migration Service is a cloud service that migrates relational databases, NoSQL databases, data warehouses, and all other types of data stores into AWS Cloud or between cloud and on-premises setups efficiently and securely. DMS supports several types of source and target databases such as Oracle, MS SQL Server, MySQL, Postgres SQL, Amazon Aurora, AWS RDS, Redshift, and S3, etc. Observations During the Data Migration We worked on designing and creating an AWS S3 data lake and data warehouse in AWS Redshift with the data sources from on-premises for Oracle, MS SQL Server, MySQL, Postgres SQL, and MongoDB for relational databases. We used AWS DMS for the initial full load and daily incremental data transfer from these sources into AWS S3. With this series of posts, I want to explain the various challenges faced during the actual data migration with different relational databases. 1. Modified Date Not Populated Properly at the Source AWS DMS is used for full load and change data capture from source databases. AWS DMS captures changed records based on the transaction logs, but a modified date column updated properly can help to apply deduplication logic, and extract the latest modified record for a given row on the target in S3. In case modified data is not available for a table or it is not updated properly, AWS DMS provides an option of transformation rules to add a new column while extracting data from the source database. Here, the AR_H_CHANGE_SEQ header helps to add a new column with value as a unique incrementing number from the source database, which consists of a timestamp and an auto-incrementing number. The below code example adds a new column as DMS_CHANGE_SEQ to the target, which has a unique incrementing number from the source. This is a 35-digit unique number with the first 16 digits for the timestamp and the next 19 digits for the record ID number incremented by the database. JSON { "rule-type": "transformation", "rule-id": "2", "rule-name": "2", "rule-target": "column", "object-locator": { "schema-name": "%", "table-name": "%" }, "rule-action": "add-column", "value": "DMS_CHANGE_SEQ", "expression": "$AR_H_CHANGE_SEQ", "data-type": { "type": "string", "length": 100 } } 2. Enabling Supplemental Logging for Oracle as a Source For Oracle as a source database, to capture ongoing changes, AWS DMS needs minimum supplemental logging to be enabled on the source database. Accordingly, this will include additional information and columns in the redo logs to identify the changes at the source. Supplemental logging can be enabled for primary, unique keys, sets of columns, or all the columns. Supplemental logging for all columns captures all the columns for the tables in the source database and helps to overwrite the complete records in the target AWS S3 layer. Supplemental logging of all columns will increase the redo logs size, as all the columns for the table are logged into the logs. One needs to configure, redo, and archive logs accordingly to consider additional information in them. 3. Network Bandwidth Between Source and Target Databases Initial full load from the on-premises sources for Oracle, MS SQL Server, etc., worked fine and changed data capture, too, for most of the time. There used to be a moderate number of transactions most of the time of the day in a given month, except for the end-of-business-day process, daily, post-midnight, and month-end activities. We observed DMS migration tasks were out of sync or failed during this time. We reviewed the source, target, and replication instance metrics in the logs and found the following observations: CDCLatencySource – the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance.CDCIncomingchanges – the total number of change events at a point in time that is waiting to be applied to the target. This increases from zero to thousands during reconciliation activities in the early morning.CDCLatencySource – the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance. This increases from zero to a few thousand up to 10-12K seconds during daily post-midnight reconciliation activities. This value was up to 40K during month-end activities. Upon further logs analysis and reviewing other metrics, we observed that: AWS DMS metrics NetworkReceiveThroughput is to understand the incoming traffic on the DMS Replication instance for both customer database and DMS traffic. These metrics help to understand the network-related issues, if any, between the source database and the DMS replication instance. It was observed that the network receive throughput was up to 30MB/s, i.e., 250Mb/s, due to the VPN connection between the source and AWS, which was also shared for other applications. The final conclusion to this issue is that connectivity between source and target databases is critical for successful data migration. You should ensure sufficient bandwidth between on-premises or other cloud source databases and the AWS environment is set up before the actual data migration. A VPN tunnel such as AWS Site-to-Site VPN or Oracle Cloud Infrastructure (OCI) Site-to-Site VPN (Oracle AWS) can provide a throughput of up to 1.25 Gbps. This would be sufficient for small tables migration or tables with less DML traffic migration.For large data migrations with heavy transactions per second on the tables, you should consider AWS Direct Connect. It provides an option to create a dedicated private connection with 1 Gbps, 10 Gbps, etc. bandwidth supported. Conclusion This is Part I of the multi-part series for the relational databases migration challenges using AWS DMS and their solutions implemented. Most of these challenges mentioned in this series could happen during the database migration process and these solutions can be referred.

By Vijay Bhosale

Control Your Services With OTEL, Jaeger, and Prometheus

Let's discuss an important question: how do we monitor our services if something goes wrong? On the one hand, we have Prometheus with alerts and Kibana for dashboards and other helpful features. We also know how to gather logs — the ELK stack is our go-to solution. However, simple logging isn’t always enough: it doesn’t provide a holistic view of a request’s journey across the entire ecosystem of components. You can find more info about ELK here. But what if we want to visualize requests? What if we need to correlate requests traveling between systems? This applies to both microservices and monoliths — it doesn’t matter how many services we have; what matters is how we manage their latency. Indeed, each user request might pass through a whole chain of independent services, databases, message queues, and external APIs. In such a complex environment, it becomes extremely difficult to pinpoint exactly where delays occur, identify which part of the chain acts as a performance bottleneck, and quickly find the root cause of failures when they happen. To address these challenges effectively, we need a centralized, consistent system to collect telemetry data — traces, metrics, and logs. This is where OpenTelemetry and Jaeger come to the rescue. Let's Look at the Basics There are two main terms we have to understand: Trace ID A Trace ID is a 16-byte identifier, often represented as a 32-character hexadecimal string. It’s automatically generated at the start of a trace and stays the same across all spans created by a particular request. This makes it easy to see how a request travels through different services or components in a system. Span ID Every individual operation within a trace gets its own Span ID, which is typically a randomly generated 64-bit value. Spans share the same Trace ID, but each one has a unique Span ID, so you can pinpoint exactly which part of the workflow each span represents (like a database query or a call to another microservice). How Are They Related? Trace ID and Span ID complement each other. When a request is initiated, a Trace ID is generated and passed to all involved services. Each service, in turn, creates a span with a unique Span ID linked to the Trace ID, enabling you to visualize the full lifecycle of the request from start to finish. Okay, so why not just use Jaeger? Why do we need OpenTelemetry (OTEL) and all its specifications? That’s a great question! Let’s break it down step by step. Find more about Jaeger here. TL;DR Jaeger is a system for storing and visualizing distributed traces. It collects, stores, searches, and displays data showing how requests “travel” through your services.OpenTelemetry (OTEL) is a standard (and a set of libraries) for collecting telemetry data (traces, metrics, logs) from your applications and infrastructure. It isn’t tied to any single visualization tool or backend. Put simply: OTEL is like a “universal language” and set of libraries for telemetry collection.Jaeger is a backend and UI for viewing and analyzing distributed traces. Why Do We Need OTEL if We Already Have Jaeger? 1. A Single Standard for Collection In the past, there were projects like OpenTracing and OpenCensus. OpenTelemetry unifies these approaches to collecting metrics and traces into one universal standard. 2. Easy Integration You write your code in Go (or another language), add OTEL libraries for auto-injecting interceptors and spans, and that’s it. Afterward, it doesn’t matter where you want to send that data—Jaeger, Tempo, Zipkin, Datadog, a custom backend—OpenTelemetry takes care of the plumbing. You just swap out the exporter. 3. Not Just Traces OpenTelemetry covers traces, but it also handles metrics and logs. You end up with a single toolset for all your telemetry needs, not just tracing. 4. Jaeger as a Backend Jaeger is an excellent choice if you’re primarily interested in distributed tracing visualization. But it doesn’t provide the cross-language instrumentation by default. OpenTelemetry, on the other hand, gives you a standardized way to collect data, and then you decide where to send it (including Jaeger). In practice, they often work together: Your application uses OpenTelemetry → communicates via OTLP protocol → goes to the OpenTelemetry Collector (HTTP or grpc) → exports to Jaeger for visualization. Tech Part System Design (A Little Bit) Let's quickly sketch out a couple of services that will do the following: Purchase Service – processes a payment and records it in MongoDBCDC with Debezium – listens for changes in the MongoDB table and sends them to KafkaPurchase Processor – consumes the message from Kafka and calls the Auth Service to look up the user_id for validationAuth Service – a simple user service In summary: 3 Go servicesKafkaCDC (Debezium)MongoDB Code Part Let’s start with the infrastructure. To tie everything together into one system, we’ll create a large Docker Compose file. We’ll begin by setting up telemetry. Note: All the code is available via a link at the end of the article, including the infrastructure. YAML services: jaeger: image: jaegertracing/all-in-one:1.52 ports: - "6831:6831/udp" # UDP port for the Jaeger agent - "16686:16686" # Web UI - "14268:14268" # HTTP port for spans networks: - internal prometheus: image: prom/prometheus:latest volumes: - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro ports: - "9090:9090" depends_on: - kafka - jaeger - otel-collector command: --config.file=/etc/prometheus/prometheus.yml networks: - internal otel-collector: image: otel/opentelemetry-collector-contrib:0.91.0 command: ['--config=/etc/otel-collector.yaml'] ports: - "4317:4317" # OTLP gRPC receiver volumes: - ./otel-collector.yaml:/etc/otel-collector.yaml depends_on: - jaeger networks: - internal We’ll also configure the collector — the component that gathers telemetry. Here, we choose gRPC for data transfer, which means communication will happen over HTTP/2: YAML receivers: # Add the OTLP receiver listening on port 4317. otlp: protocols: grpc: endpoint: "0.0.0.0:4317" processors: batch: # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/memorylimiterprocessor memory_limiter: check_interval: 1s limit_percentage: 80 spike_limit_percentage: 15 extensions: health_check: {} exporters: otlp: endpoint: "jaeger:4317" tls: insecure: true prometheus: endpoint: 0.0.0.0:9090 debug: verbosity: detailed service: extensions: [health_check] pipelines: traces: receivers: [otlp] processors: [memory_limiter, batch] exporters: [otlp] metrics: receivers: [otlp] processors: [memory_limiter, batch] exporters: [prometheus] Make sure to adjust any addresses as needed, and you’re done with the base configuration. We already know OpenTelemetry (OTEL) uses two key concepts — Trace ID and Span ID — that help track and monitor requests in distributed systems. Implementing the Code Now, let’s look at how to get this working in your Go code. We need the following imports: Go "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/exporters/otlp/otlptrace" "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc" "go.opentelemetry.io/otel/sdk/resource" "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.17.0" Then, we add a function to initialize our tracer in main() when the application starts: Go func InitTracer(ctx context.Context) func() { exp, err := otlptrace.New( ctx, otlptracegrpc.NewClient( otlptracegrpc.WithEndpoint(endpoint), otlptracegrpc.WithInsecure(), ), ) if err != nil { log.Fatalf("failed to create OTLP trace exporter: %v", err) } res, err := resource.New(ctx, resource.WithAttributes( semconv.ServiceNameKey.String("auth-service"), semconv.ServiceVersionKey.String("1.0.0"), semconv.DeploymentEnvironmentKey.String("stg"), ), ) if err != nil { log.Fatalf("failed to create resource: %v", err) } tp := trace.NewTracerProvider( trace.WithBatcher(exp), trace.WithResource(res), ) otel.SetTracerProvider(tp) return func() { err := tp.Shutdown(ctx) if err != nil { log.Printf("error shutting down tracer provider: %v", err) } } } With tracing set up, we just need to place spans in the code to track calls. For example, if we want to measure database calls (since that’s usually the first place we look for performance issues), we can write something like this: Go tracer := otel.Tracer("auth-service") ctx, span := tracer.Start(ctx, "GetUserInfo") defer span.End() tracedLogger := logging.AddTraceContextToLogger(ctx) tracedLogger.Info("find user info", zap.String("operation", "find user"), zap.String("username", username), ) user, err := s.userRepo.GetUserInfo(ctx, username) if err != nil { s.logger.Error(errNotFound) span.RecordError(err) span.SetStatus(otelCodes.Error, "Failed to fetch user info") return nil, status.Errorf(grpcCodes.NotFound, errNotFound, err) } span.SetStatus(otelCodes.Ok, "User info retrieved successfully") We have tracing at the service layer — great! But we can go even deeper, instrumenting the database layer: Go func (r *UserRepository) GetUserInfo(ctx context.Context, username string) (*models.User, error) { tracer := otel.Tracer("auth-service") ctx, span := tracer.Start(ctx, "UserRepository.GetUserInfo", trace.WithAttributes( attribute.String("db.statement", query), attribute.String("db.user", username), ), ) defer span.End() var user models.User // Some code that queries the DB... // err := doDatabaseCall() if err != nil { span.RecordError(err) span.SetStatus(codes.Error, "Failed to execute query") return nil, fmt.Errorf("failed to fetch user info: %w", err) } span.SetStatus(codes.Ok, "Query executed successfully") return &user, nil } Now, we have a complete view of the request journey. Head to the Jaeger UI, query for the last 20 traces under auth-service, and you’ll see all the spans and how they connect in one place. Now, everything is visible. If you need it, you can include the entire query in the tags. However, keep in mind that you shouldn’t overload your telemetry — add data deliberately. I’m simply demonstrating what’s possible, but including the full query, this way isn’t something I’d generally recommend. gRPC client-server If you want to see a trace that spans two gRPC services, it’s quite straightforward. All you need is to add the out-of-the-box interceptors from the library. For example, on the server side: Go server := grpc.NewServer( grpc.StatsHandler(otelgrpc.NewServerHandler()), ) pb.RegisterAuthServiceServer(server, authService) On the client side, the code is just as short: Go shutdown := tracing.InitTracer(ctx) defer shutdown() conn, err := grpc.Dial( "auth-service:50051", grpc.WithInsecure(), grpc.WithStatsHandler(otelgrpc.NewClientHandler()), ) if err != nil { logger.Fatal("error", zap.Error(err)) } That’s it! Ensure your exporters are configured correctly, and you’ll see a single Trace ID logged across these services when the client calls the server. Handling CDC Events and Tracing Want to handle events from the CDC as well? One simple approach is to embed the Trace ID in the object that MongoDB stores. That way, when Debezium captures the change and sends it to Kafka, the Trace ID is already part of the record. For instance, if you’re using MongoDB, you can do something like this: Go func (r *mongoPurchaseRepo) SavePurchase(ctx context.Context, purchase entity.Purchase) error { span := r.handleTracing(ctx, purchase) defer span.End() // Insert the record into MongoDB, including the current span's Trace ID _, err := r.collection.InsertOne(ctx, bson.M{ "_id": purchase.ID, "user_id": purchase.UserID, "username": purchase.Username, "amount": purchase.Amount, "currency": purchase.Currency, "payment_method": purchase.PaymentMethod, // ... "trace_id": span.SpanContext().TraceID().String(), }) return err } Debezium then picks up this object (including trace_id) and sends it to Kafka. On the consumer side, you simply parse the incoming message, extract the trace_id, and merge it into your tracing context: Go // If we find a Trace ID in the payload, attach it to the context newCtx := ctx if traceID != "" { log.Printf("Found Trace ID: %s", traceID) newCtx = context.WithValue(ctx, "trace-id", traceID) } // Create a new span tracer := otel.Tracer("purchase-processor") newCtx, span := tracer.Start(newCtx, "handler.processPayload") defer span.End() if traceID != "" { span.SetAttributes( attribute.String("trace.id", traceID), ) } // Parse the "after" field into a Purchase struct... var purchase model.Purchase if err := mapstructure.Decode(afterDoc, &purchase); err != nil { log.Printf("Failed to map 'after' payload to Purchase struct: %v", err) return err } Go // If we find a Trace ID in the payload, attach it to the context newCtx := ctx if traceID != "" { log.Printf("Found Trace ID: %s", traceID) newCtx = context.WithValue(ctx, "trace-id", traceID) } // Create a new span tracer := otel.Tracer("purchase-processor") newCtx, span := tracer.Start(newCtx, "handler.processPayload") defer span.End() if traceID != "" { span.SetAttributes( attribute.String("trace.id", traceID), ) } // Parse the "after" field into a Purchase struct... var purchase model.Purchase if err := mapstructure.Decode(afterDoc, &purchase); err != nil { log.Printf("Failed to map 'after' payload to Purchase struct: %v", err) return err } Alternative: Using Kafka Headers Sometimes, it’s easier to store the Trace ID in Kafka headers rather than in the payload itself. For CDC workflows, this might not be available out of the box — Debezium can limit what’s added to headers. But if you control the producer side (or if you’re using a standard Kafka producer), you can do something like this with Sarama: Injecting a Trace ID into Headers Go // saramaHeadersCarrier is a helper to set/get headers in a Sarama message. type saramaHeadersCarrier *[]sarama.RecordHeader func (c saramaHeadersCarrier) Get(key string) string { for _, h := range *c { if string(h.Key) == key { return string(h.Value) } } return "" } func (c saramaHeadersCarrier) Set(key string, value string) { *c = append(*c, sarama.RecordHeader{ Key: []byte(key), Value: []byte(value), }) } // Before sending a message to Kafka: func produceMessageWithTraceID(ctx context.Context, producer sarama.SyncProducer, topic string, value []byte) error { span := trace.SpanFromContext(ctx) traceID := span.SpanContext().TraceID().String() headers := make([]sarama.RecordHeader, 0) carrier := saramaHeadersCarrier(&headers) carrier.Set("trace-id", traceID) msg := &sarama.ProducerMessage{ Topic: topic, Value: sarama.ByteEncoder(value), Headers: headers, } _, _, err := producer.SendMessage(msg) return err } Extracting a Trace ID on the Consumer Side Go for message := range claim.Messages() { // Extract the trace ID from headers var traceID string for _, hdr := range message.Headers { if string(hdr.Key) == "trace-id" { traceID = string(hdr.Value) } } // Now continue your normal tracing workflow if traceID != "" { log.Printf("Found Trace ID in headers: %s", traceID) // Attach it to the context or create a new span with this info } } Depending on your use case and how your CDC pipeline is set up, you can choose the approach that works best: Embed the Trace ID in the database record so it flows naturally via CDC.Use Kafka headers if you have more control over the producer side or you want to avoid inflating the message payload. Either way, you can keep your traces consistent across multiple services—even when events are asynchronously processed via Kafka and Debezium. Conclusion Using OpenTelemetry and Jaeger provides detailed request traces, helping you pinpoint where and why delays occur in distributed systems. Adding Prometheus completes the picture with metrics — key indicators of performance and stability. Together, these tools form a comprehensive observability stack, enabling faster issue detection and resolution, performance optimization, and overall system reliability. I can say that this approach significantly speeds up troubleshooting in a microservices environment and is one of the first things we implement in our projects. Links Infra codeOTEL Getting StartedOTEL SQLOTEL CollectorGo gRPCGo ReflectionKafkaDebezium MongoDB Connector DocsUnwrap MongoDB SMT Example

By Ilia Ivankin

Community Over Code Keynotes Stress Open Source's Vital Role

At the ASF's flagship Community Over Code North America conference in October 2024, keynote speakers underscored the vital role of open-source communities in driving innovation, enhancing security, and adapting to new challenges. By highlighting the Cybersecurity and Infrastructure Security Agency's (CISA) intensified focus on open source security, citing examples of open source-driven innovation, and reflecting on the ASF's 25-year journey, the keynotes showcased a thriving but rapidly changing ecosystem for open source. Opening Keynote: CISA's Vision for Open Source Security Aeva Black from CISA opened the conference with a talk about the government's growing engagement with open source security. Black, a long-time open source contributor who helps shape federal policy, emphasized how deeply embedded open source has become in critical infrastructure. To help illustrate open source's pervasiveness, Black noted that modern European cars have more than 100 computers, "most of them running open source, including open source orchestration systems to control all of it." CISA's open-source roadmap aims to "foster an open source ecosystem that is secure, sustainable and resilient, supported by a vibrant community." Black also highlighted several initiatives, including new frameworks for assessing supply chain risk, memory safety requirements, and increased funding for security tooling. Notably, in the annual Administration Cybersecurity Priorities Memo M-24-14, the White House has encouraged Federal agencies to include budget requests to establish Open Source Program Offices (OSPOs) to secure their open source usage and develop contribution policies. Innovation Showcase: The O.A.S.I.S Project Chris Kersey delivered a keynote demonstrating the O.A.S.I.S Project, an augmented-reality helmet system built entirely with open-source software. His presentation illustrated how open source enables individuals to create sophisticated systems by building upon community-maintained ecosystems. Kersey's helmet integrates computer vision, voice recognition, local AI processing, and sensor fusion - all powered by open source. "Open source is necessary to drive this level of innovation because none of us know all of this technology by ourselves, and by sharing what we know with each other, we can build amazing things," Kersey emphasized while announcing the open-sourcing of the O.A.S.I.S Project. State of the Foundation: Apache at 25 David Nalley, President of the Apache Software Foundation (ASF), closed the conference with the annual 'State of the Foundation' address, reflecting on the ASF's evolution over 25 years. He highlighted how the foundation has grown from primarily hosting the Apache web server to becoming a trusted home for hundreds of projects that "have literally changed the face of the (open source) ecosystem and set a standard that the rest of the industry is trying to copy." Nalley emphasized the ASF's critical role in building trust through governance: "When something carries the Apache brand, people know that means there's going to be governance by consensus, project management committees, and people who are acting in their capacity as an individual, not as a representative of some other organization." Looking ahead, Nalley acknowledged the need for the ASF to adapt to new regulatory requirements like Europe's Cyber Resiliency Act while maintaining its core values. He highlighted ongoing collaboration with other foundations like the Eclipse Foundation to set standards for open-source security compliance. "There is a lot of new work we need to do. We cannot continue to do the things that we have done for many years in the same way that we did them 25 years ago," Nalley noted while expressing confidence in the foundation's ability to evolve. Conclusion This year's Community Over Code keynotes highlighted a maturing open-source ecosystem tackling new challenges around security, regulation, and scalability while showing how community-driven innovation continues to push technical limits. Speakers stressed that the ASF's model of community-led development and strong governance is essential for fostering trust and driving innovation in today's complex technology landscape.

By Brian Proffitt

How Spring Boot Starters Integrate With Your Project

When developers set up and integrate services, they often face challenges that can take up a lot of time. Starters help simplify this process by organizing code and making it easier to manage. Let's take a look at creating two starters, configuring their settings automatically, and using them in a service. So, what are Spring Boot Starters, exactly? What benefits do they provide? Spring Boot Starters are like packages that streamline the process of incorporating libraries and components into Spring projects, making it simpler and more efficient to manage dependencies while cutting down development time significantly. Benefits of Using Spring Boot Starter Integration of Libraries Starters include all the dependencies needed for technologies. For example spring-boot-starter-web provides everything for building web applications, while spring-boot-starter-data-jpa helps with JPA database work.By adding these starters to a project, developers can start working with the desired technology without worrying about compatibility issues or version differences. Focus on Business Logic Developers can concentrate on creating business logic for dealing with infrastructure code.This approach speeds up development and feature deployment, ultimately boosting team productivity. Using Configurations Using predefined setups helps ensure consistency in setting up and organizing projects, making it easier to maintain and advance code. Moreover, it aids in onboarding team members to the project by offering a code structure and setup. Project Enhancements Furthermore, using starters that include known libraries simplifies updating dependencies and integrating Spring Boot versions.The support from the Spring team community linked with these starters also guarantees to resolve any questions or obstacles that might come up during development. Task Description In this article, we will address the issue of consolidating data from sources such as REST and GraphQL services. This problem is often encountered in projects with microservice architecture, where it is necessary to combine data coming from different services. When it comes to solutions in a microservices setup, it’s possible to establish microservices for each integration. This approach is justifiable when the integration is extensive, and there are resources for its maintenance. However, in scenarios like working with a monolith or lacking the resources for multiple microservices support, opting for starters could be more practical. The rationale behind selecting a library starter includes: Business logic segmentation. Starters facilitate the separation of business logic and integration configuration.Following the SOLID principles. Breaking down functionality into modules aligns with principles enhancing code maintainability and scalability.Simplified setup. Starters streamline the process of configuring services by minimizing the required amount of configuration code.Ease of use. Integrating a service becomes more straightforward by adding a dependency and configuring essential parameters. Our Scenario Let's illustrate the solution with an example involving a tour aggregator that gathers data from tour operators and merges them. To start off, we will develop two starters (tour-operator-one-starter and tour-operator-two-starter) both of which will use a shared module (common-model) containing fundamental models and interfaces. These starter libraries will connect to the aggregator service (tour-aggregator). Creating tour-operator-one-starter Starter is designed to integrate with the tour operator and fetch data via the REST API. All official starters use the naming scheme spring-boot-starter-*, where * denotes a specific type of application. Third-party starters should not start with spring-boot as it is reserved for official starters from the Spring team. Typically, third-party starters begin with the project name. For example, my starter will be named tour-operator-one-spring-boot-starter. 1. Create pom.xml Add dependencies. XML <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-configuration-processor</artifactId> <optional>true</optional> </dependency> <dependency> <groupId>com.common.model</groupId> <artifactId>common-model</artifactId> <version>0.0.1-SNAPSHOT</version> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> </dependencies> 2. Create TourOperatorOneProperties These are the properties we will set in tour-aggregator to configure our starter. XML @ConfigurationProperties(prefix = "tour-operator.one.service") public class TourOperatorOneProperties { private final Boolean enabled; private final String url; private final Credentials credentials; public TourOperatorOneProperties( Boolean enabled, String url, Credentials credentials) { this.enabled = enabled; this.url = url; this.credentials = credentials; } //getters public static class Credentials { private final String username; private final String password; public Credentials(String username, String password) { this.username = username; this.password = password; } //getters } } 3. Create TourOperatorOneAutoConfiguration @AutoConfiguration – indicates that this class is a configuration class for Spring Boot auto-configuration.@ConditionalOnProperty – activates the configuration if the property tour-operator.one.service.enabled is set to true. If the property is missing, the configuration is also activated due to matchIfMissing = true.@EnableConfigurationProperties(TourOperatorOneProperties.class) – enables support for @ConfigurationProperties annotations for the TourOperatorOneProperties class. XML @AutoConfiguration @ConditionalOnProperty(prefix = "tour-operator.one.service", name = "enabled", havingValue = "true", matchIfMissing = true) @EnableConfigurationProperties(TourOperatorOneProperties.class) public class TourOperatorOneAutoconfiguration { private static final Logger log = LoggerFactory.getLogger(TourOperatorOneAutoconfiguration.class); private final TourOperatorOneProperties properties; public TourOperatorOneAutoconfiguration(TourOperatorOneProperties properties) { this.properties = properties; } @Bean("operatorOneRestClient") public RestClient restClient(RestClient.Builder builder) { log.info("Configuration operatorRestClient: {}", properties); return builder .baseUrl(properties.getUrl()) .defaultHeaders(httpHeaders -> { if (null != properties.getCredentials()) { httpHeaders.setBasicAuth( properties.getCredentials().getUsername(), properties.getCredentials().getPassword()); } }) .build(); } @Bean("tourOperatorOneService") public TourOperatorOneServiceImpl tourOperatorService(TourOperatorOneProperties properties, @Qualifier("operatorOneRestClient") RestClient restClient) { log.info("Configuration tourOperatorService: {} and restClient: {}", properties, restClient); return new TourOperatorOneServiceImpl(restClient); } } In this example, I use @ConditionalOnProperty, but there are many other conditional annotations: @ConditionalOnBean – generates a bean when a specified bean exists in the BeanFactory@ConditionalOnMissingBean – facilitates creating a bean if a particular bean is not found in the BeanFactory@ConditionalOnClass – produces a bean when a specific class is present, in the classpath@ConditionalOnMissingClass – acts oppositely to @ConditionalOnClass You should choose what suits your needs best. You can learn more about conditional annotations here. 4. Create TourOperatorOneServiceImpl In this class, we implement the base interface and lay down the main business logic for retrieving data from the first tour operator and standardizing it according to the common interface. Plain Text public class TourOperatorOneServiceImpl implements TourOperatorService { private final RestClient restClient; public TourOperatorOneServiceImpl(@Qualifier("operatorOneRestClient") RestClient restClient) { this.restClient = restClient; } @Override public TourOperatorResponse makeRequest(TourOperatorRequest request) { var tourRequest = mapToOperatorRequest(request); // transformation of our request into the one that the tour operator will understand var responseList = restClient .post() .body(tourRequest) .retrieve() .toEntity(new ParameterizedTypeReference<List<TourProposition>>() { }); return TourOperatorResponse.builder() .deals(responseList .getBody() .stream() .map(ModelUtils::mapToCommonModel) .toList()) .build(); } } 5. Create Auto-Configuration File To register auto-configurations, we create the file resources/META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports . Plain Text com.tour.operator.one.autoconfiguration.TourOperatorOneAutoConfiguration This file contains a collection of configurations. In my scenario, one configuration is listed. If you have multiple configurations, make sure that each configuration is listed on a separate line. By creating this file, you are informing Spring Boot that it should load and utilize the TourOperatorOneAutoConfiguration class for setup when certain conditions specified by the @ConditionalOnProperty annotation are satisfied. Thus, we have established the setup for collaborating with the tour operator by developing configuration classes and beans and leveraging properties. Creating tour-operator-two-starter Up is creating tour-operator-two-starter a kit designed to integrate with the second tour operator and retrieve data from a GraphQL server through a straightforward HTTP request. Let's proceed with the process used for tour-operator-one-starter. 1. Create pom.xml Add dependencies. XML <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-configuration-processor</artifactId> <optional>true</optional> </dependency> <dependency> <groupId>com.common.model</groupId> <artifactId>common-model</artifactId> <version>0.0.1-SNAPSHOT</version> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> </dependencies> 2. Create TourOperatorTwoProperties These are the properties we will set in tour-aggregator to configure our starter. @ConfigurationProperties(prefix = "tour-operator.two.service") public class TourOperatorTwoProperties { private final Boolean enabled; private final String url; private final String apiKey; public TourOperatorTwoProperties( Boolean enabled, String url, String apiKey) { this.enabled = enabled; this.url = url; this.apiKey = apiKey; } //getters } 2. Create TourOperatorOneAutoConfiguration Java @AutoConfiguration @ConditionalOnProperty(prefix = "tour-operator.two.service", name = "enabled", havingValue = "true", matchIfMissing = true) @EnableConfigurationProperties(TourOperatorTwoProperties.class) public class TourOperatorTwoAutoconfiguration { private static final Logger log = LoggerFactory.getLogger(TourOperatorTwoAutoconfiguration.class); private final TourOperatorTwoProperties properties; public TourOperatorTwoAutoconfiguration(TourOperatorTwoProperties properties) { log.info("Configuration with: {}", properties); this.properties = properties; } @Bean("operatorTwoRestClient") public RestClient restClient(RestClient.Builder builder) { log.info("Configuration operatorRestClient: {}", properties); return builder .baseUrl(properties.getUrl()) .defaultHeaders(httpHeaders -> { httpHeaders.set("X-Api-Key", properties.getApiKey()); }) .build(); } @Bean("tourOperatorTwoService") public TourOperatorTwoServiceImpl tourOperatorService(TourOperatorTwoProperties properties, @Qualifier("operatorTwoRestClient") RestClient restClient) { log.info("Configuration tourOperatorService: {} and restClient: {}", properties, restClient); return new TourOperatorTwoServiceImpl(restClient); } } 3. Create TourOperatorOneServiceImpl Receiving data from the second tour operator. Java public class TourOperatorTwoServiceImpl implements TourOperatorService { private static final String QUERY = """ query makeTourRequest($request: TourOperatorRequest) { makeTourRequest(request: $request) { id startDate endDate price currency days hotel { hotelName hotelRating countryCode } } } """; private final RestClient restClient; public TourOperatorTwoServiceImpl(@Qualifier("operatorTwoRestClient") RestClient restClient) { this.restClient = restClient; } @Override public TourOperatorResponse makeRequest(TourOperatorRequest request) { var tourRequest = mapToOperatorRequest(request); var variables = Map.ofEntries(Map.entry("request", tourRequest)); var requestBody = Map.ofEntries( Map.entry("query", QUERY), Map.entry("variables", variables)); var response = restClient .post() .body(requestBody) .retrieve() .toEntity(QueryResponse.class); return TourOperatorResponse.builder() .deals(response.getBody() .data() .makeTourRequest() .stream() .map(ModelUtils::mapToCommonModel).toList()) .build(); } } Create Auto-Configuration File Create the file resources/META-INF/spring/org.springframework.boot. autoconfigure.AutoConfiguration.imports. Plain Text com.tour.operator.two.autoconfiguration.TourOperatorTwoAutoconfiguration Creating and Using the Aggregator Service An aggregator service is designed to gather data from tour operators. This involves linking starters, configuring parameters, and using beans with a shared interface. 1. Connect Starter Libraries Include dependencies for the two libraries in the pom.xml. XML <dependencies> ... <dependency> <groupId>com.tour.operator</groupId> <artifactId>tour-operator-one-spring-boot-starter</artifactId> <version>0.0.2-SNAPSHOT</version> </dependency> <dependency> <groupId>com.tour.operator</groupId> <artifactId>tour-operator-two-spring-boot-starter</artifactId> <version>0.0.1-SNAPSHOT</version> </dependency> ... </dependencies> Configure Parameters in application.yaml Specify the necessary data, such as URLs and connection parameters, in the application.yaml. YAML spring: application: name: tour-aggregator tour-operator: one: service: enabled: true url: http://localhost:8090/api/tours credentials: username: user123 password: pass123 two: service: enabled: true url: http://localhost:8091/graphql api-key: 11d1de45-5743-4b58-9e08-f6038fe05c8f Use Services We use the established beans, which implement the TourOperatorService interface within the TourServiceImpl class. This class outlines the process of retrieving and aggregating data from various tour operators. Java @Service public class TourServiceImpl implements TourService { private static final Logger log = LoggerFactory.getLogger(TourServiceImpl.class); private final List<TourOperatorService> tourOperatorServices; private final Executor tourOperatorExecutor; private final Integer responseTimeout; public TourServiceImpl(List<TourOperatorService> tourOperatorServices, @Qualifier("tourOperatorTaskExecutor") Executor tourOperatorExecutor, @Value("${app.response-timeout:5}") Integer responseTimeout) { this.tourOperatorServices = tourOperatorServices; this.tourOperatorExecutor = tourOperatorExecutor; this.responseTimeout = responseTimeout; } public List<TourOffer> getTourOffers(@RequestBody TourOperatorRequest request) { log.info("Send request: {}", request); var futures = tourOperatorServices.stream() .map(tourOperator -> CompletableFuture.supplyAsync(() -> tourOperator.makeRequest(request), tourOperatorExecutor) .orTimeout(responseTimeout, TimeUnit.SECONDS) .exceptionally(ex -> TourOperatorResponse.builder().deals(List.of()).build()) ) .toList(); var response = futures.stream() .map(CompletableFuture::join) .map(TourOperatorResponse::getDeals) .filter(Objects::nonNull) .flatMap(List::stream) .toList(); return response; } } Allocating Resources for Calls It’s good practice to allocate separate resources for calls, allowing better thread management and performance optimization. Java @Configuration public class ThreadPoolConfig { private final Integer threadCount; public ThreadPoolConfig(@Value("${app.thread-count:5}") Integer threadCount) { this.threadCount = threadCount; } @Bean(name = "tourOperatorTaskExecutor") public Executor tourOperatorTaskExecutor() { return Executors.newFixedThreadPool(threadCount); } } This code ensures efficient management of asynchronous tasks and helps avoid blocking the main thread, thereby improving overall system performance. Conclusion In this article, we’ve created two starters for reaching out to tour operators through REST and GraphQL technology interfaces. These steps include all the configurations and elements to simplify their usage. Afterward, we merged them into a system that communicates with them in an asynchronous manner and aggregates data. This approach solved several problems: Simplified integration and setup. By using auto-configuration and properties of coding, we saved time during development.Improved flexibility and usability. Separating functions into starters improved code structure and simplified maintenance.System flexibility. We can easily add new integrations without breaking the existing logic. Now, our system is better equipped to adapt and scale effortlessly while being easier to manage, leading to enhancements in its architecture and performance. Here’s the full code. I appreciate you reading this article. I look forward to hearing your thoughts and feedback!

By Taras Ivashchuk

Demystifying Sorting Assertions With AssertJ

There are times when a new feature containing sorting is introduced. Obviously, we want to verify that the implemented sorting works correctly. AssertJ framework provides first-class support for such tasks. This article shows how to write such tests. In this article, you will learn the following: Two main methods provided by Assert frameworks for sorting assertionHow to assert data sorted in ascending or descending wayHow to assert data sorted by multiple attributesHow to deal with nulls or case-insensitive sorting Introduction First of all, we need to know that AssertJ provides the AbstractListAssert abstract class for asserting any List argument. This class also contains the isSorted and isSortedAccordingTo methods for our purpose. The solution used there is based on the Comparator interface and its several static methods. Before moving to our examples, let's have a short introduction to the data and technology used in this article. Almost every case presented here is demonstrated with two examples. The simple one is always presented first, and it's based on a list of String values. The goal is to provide the simplest example which can be easily taken and tested. However, the second example is based on the DB solution introduced in the Introduction: Querydsl vs. JPA Criteria article. There, we use Spring Data JPA solution for the PDM model defined as: These tables are mapped to City and Country entities. Their implementation is not mentioned here, as it's available in the article mentioned earlier. Nevertheless, the data used in the examples below are defined like this: Plain Text [ City(id=5, name=Atlanta, state=Georgia, country=Country(id=3, name=USA)), City(id=13, name=Barcelona, state=Catalunya, country=Country(id=7, name=Spain)), City(id=14, name=Bern, state=null, country=Country(id=8, name=Switzerland)), City(id=1, name=Brisbane, state=Queensland, country=Country(id=1, name=Australia)), City(id=6, name=Chicago, state=Illionis, country=Country(id=3, name=USA)), City(id=15, name=London, state=null, country=Country(id=9, name=United Kingdom)), City(id=2, name=Melbourne, state=Victoria, country=Country(id=1, name=Australia)), City(id=7, name=Miami, state=Florida, country=Country(id=3, name=USA)), City(id=4, name=Montreal, state=Quebec, country=Country(id=2, name=Canada)), City(id=8, name=New York, state=null, country=Country(id=3, name=USA)), City(id=12, name=Paris, state=null, country=Country(id=6, name=France)), City(id=11, name=Prague, state=null, country=Country(id=5, name=Czech Republic)), City(id=9, name=San Francisco, state=California, country=Country(id=3, name=USA)), City(id=3, name=Sydney, state=New South Wales, country=Country(id=1, name=Australia)), City(id=10, name=Tokyo, state=null, country=Country(id=4, name=Japan)) ] Finally, it's time to move to our examples. Let's start with a simple isSorted method. isSorted Method AssertJ framework provides the isSorted method in order to verify values that implement the Comparable interface, and these values are in a natural order. The simplest usage can be seen in the dummyAscendingSorting test as: Define sorted values (line 3)Assert the correct order with isSorted method (line 5) Java @Test void dummyAscendingSorting() { var cities = List.of("Atlanta", "London", "Tokyo"); assertThat(cities).isSorted(); } Now, let's imagine a more real-like example where the data is provided by Spring Data JPA. This use case is demonstrated in the sortingByNameAscending test as: Define a pagination request for loading data. Here, we request data sorted just by city name and the page with the size of 5 (line 5).Load cities from cityRepository with the findAll method (line 5).Assert the loaded data as: Check the number of cities returned by the search (line 10) -> to be equal to the requested page size.Extract a name attribute from the City entity (line 11). This is necessary as our City entity doesn't implement the Comparable interface. More details are covered at the end of this article.Assert the correct order with isSorted (line 12) -> this is our goal. Java import static org.springframework.data.domain.Sort.Direction.ASC; @Test void sortingByNameAscending() { var pageable = PageRequest.of(0, 5, ASC, City_.NAME); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .hasSize(5) .map(City::getName) .isSorted(); } Reverse Order In some cases, we use descending order. The sorting assertion can be handled in a pretty similar way, but we need to use the isSortedAccordingTo method instead of the isSorted method. This method is used for the advanced sorting assertion. The simplest assertion for values sorted in descending ways can be seen in the dummyDescendingSorting test. This is the same as the usage of the isSorted method, but this time, we need to use the already-mentioned isSortedAccordingTo method with the Collections.reverseOrder comparator. Java import static java.util.Collections.reverseOrder; @Test void dummyDescendingSorting() { assertThat(List.of("Tokyo", "London", "Atlanta")).isSortedAccordingTo( reverseOrder() ); } The real-like solution is demonstrated by the sortingByNameDescending test. It's very similar to the previous sortingByNameAscending test, but this time, we use data loaded from DB. Java import static java.util.Collections.reverseOrder; import static org.springframework.data.domain.Sort.Direction.DESC; @Test void sortingByNameDescending() { var pageable = PageRequest.of(0, 5, DESC, City_.NAME); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .hasSize(5) .map(City::getName) .isSortedAccordingTo( reverseOrder() ); } Custom Comparator Sometimes, we use sorting by multiple attributes. Therefore, we cannot use the simple approach shown in previous examples. For asserting sorting by multiple attributes, we need to have a comparator. This case is demonstrated in the sortingByCountryAndCityNames test as: Define a pagination request with ascending sorting first by the country name and then by the city name (line 4). Now, we use a higher page size in order to load all available data.Assert the loaded data as: Assert the correct order by the country name (line 4) with the custom comparator implemented in the getCountryNameComparator method (lines 13-15).Assert the correct order by the city name (line 10) simply by providing the desired function to the Comparator.thenComparing method. Java @Test void sortingByCountryAndCityNames() { var countryNameSorting = City_.COUNTRY + "." + Country_.NAME; var pageable = PageRequest.of(0, 15, ASC, countryNameSorting, City_.NAME); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .isSortedAccordingTo( getCountryNameComparator() .thenComparing( City::getName )); } private Comparator<City> getCountryNameComparator() { return ( c1, c2 ) -> c1.getCountry().getName().compareTo(c2.getCountry().getName()); } In order to promote your understanding, the data by Spring Data JPA loaded in the sortingByCountryAndCityNames test is listed below as: Plain Text [ City(id=1, name=Brisbane, state=Queensland, country=Country(id=1, name=Australia)), City(id=2, name=Melbourne, state=Victoria, country=Country(id=1, name=Australia)), City(id=3, name=Sydney, state=New South Wales, country=Country(id=1, name=Australia)), City(id=4, name=Montreal, state=Quebec, country=Country(id=2, name=Canada)), City(id=11, name=Prague, state=null, country=Country(id=5, name=Czech Republic)), City(id=12, name=Paris, state=null, country=Country(id=6, name=France)), City(id=10, name=Tokyo, state=null, country=Country(id=4, name=Japan)), City(id=13, name=Barcelona, state=Catalunya, country=Country(id=7, name=Spain)), City(id=14, name=Bern, state=null, country=Country(id=8, name=Switzerland)), City(id=5, name=Atlanta, state=Georgia, country=Country(id=3, name=USA)), City(id=6, name=Chicago, state=Illionis, country=Country(id=3, name=USA)), City(id=7, name=Miami, state=Florida, country=Country(id=3, name=USA)), City(id=8, name=New York, state=null, country=Country(id=3, name=USA)), City(id=9, name=San Francisco, state=California, country=Country(id=3, name=USA)), City(id=15, name=London, state=null, country=Country(id=9, name=United Kingdom)) ] Sorting With NULLs Some data might contain a null value, and we need to deal with it. This case is covered in the dummyAscendingSortingWithNull test as: Define data with null value in the beginning (line 6).Assert null value in the beginning with Comparator.nullsFirst comparator and the ascending order by using Comparator.naturalOrder comparator. Java import static java.util.Comparator.naturalOrder; import static java.util.Comparator.nullsFirst; @Test void dummyAscendingSortingWithNull() { assertThat(Arrays.asList(new String[] { null, "London", "Tokyo" })) .isSortedAccordingTo(nullsFirst(naturalOrder())); } The same approach in our real-like solution is available in the sortingByStateAscending test. Java import static java.util.Comparator.naturalOrder; import static java.util.Comparator.nullsFirst; @Test void sortingByStateAscending() { var pageable = PageRequest.of(0, 15, ASC, City_.STATE); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .map(City::getState) .isSortedAccordingTo(nullsFirst(naturalOrder())); } It is also possible to receive a null at the end instead of the beginning. Let's see this case in our last example. A Complex Sorting Example Our last example demonstrates a more complex scenario. Our goal is to verify the order of our data sorted in descending and case-insensitive order. Additionally, this data contains null values. The simple usage is in the dummyDescendingSortingWithNull test as: Define sorted values (line 7).Assert the correct order with isSortedAccordingTo(line 8) and Comparator.nullsLast – to check that nulls are at the end -> as we have descending sorting,Collections.reverseOrder – to check the descending order andString.CASE_INSENSITIVE_ORDER – to compare values ignoring the case sensitivity. Java import static java.util.Collections.reverseOrder; import static java.util.Comparator.nullsLast; import static String.CASE_INSENSITIVE_ORDER; @Test void dummyDescendingSortingWithNull() { assertThat(Arrays.asList(new String[] { "London", "atlanta", "Alabama", null})) .isSortedAccordingTo(nullsLast(reverseOrder(CASE_INSENSITIVE_ORDER))); } The same approach in our real-like solution is available in sortingByStateDescending test. Java import static java.util.Collections.reverseOrder; import static java.util.Comparator.nullsLast; import static String.CASE_INSENSITIVE_ORDER; @Test void sortingByStateDescending() { var pageable = PageRequest.of(0, 15, DESC, City_.STATE); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .map(City::getState) .isSortedAccordingTo(nullsLast(reverseOrder(CASE_INSENSITIVE_ORDER))); } Known Pitfall When dealing with sorting, it's easy to forget we can apply sorting functions only to instances implementing the Comparable interface. In our case, the City entity doesn't implement this interface. The appropriate comparator depends on our sorting. Therefore, the comparator can be different for every sortable attribute or their combinations. Let's demonstrate this situation from our first example by the failByNotProvidingCorrectComparator test as: Java import static org.springframework.data.domain.Sort.Direction.ASC; @Test void failByNotProvidingCorrectComparator() { var pageable = PageRequest.of(0, 5, ASC, City_.NAME); Page<City> page = cityRepository.findAll(pageable); assertThat(page.getContent()) .hasSize(5) // .map(City::getName) .isSorted(); } We get the some elements are not mutually comparable in group error when the map function is commented out (line 11). Plain Text java.lang.AssertionError: some elements are not mutually comparable in group: [City(id=5, name=Atlanta, state=Georgia, country=Country(id=3, name=USA)), City(id=13, name=Barcelona, state=Catalunya, country=Country(id=7, name=Spain)), City(id=14, name=Bern, state=null, country=Country(id=8, name=Switzerland)), City(id=1, name=Brisbane, state=Queensland, country=Country(id=1, name=Australia)), City(id=6, name=Chicago, state=Illionis, country=Country(id=3, name=USA))] at com.github.aha.sat.jpa.city.CityRepositoryTests$FindAll.sortingByNameAscending(CityRepositoryTests.java:71) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) } Such simplification is wrong, but it can happen from time to time when we try to simplify our code. Summary and Source Code First, the article explained the basics of sorting with the isSorted method. Next, sorting assertions for data in reverse order and sorting by two criteria using the custom comparator were demonstrated. After that, the sorting for data with null values was covered. Finally, the pitfall related to the misuse of sorting assertions provided by the AssertJ framework was explained. The complete source code presented above is available in my GitHub repository.

By Arnošt Havelka

CORE

Mastering the Transition: From Amazon EMR to EMR on EKS

Amazon Elastic MapReduce (EMR) is a platform to process and analyze big data. Traditional EMR runs on a cluster of Amazon EC2 instances managed by AWS. This includes provisioning the infrastructure and handling tasks like scaling and monitoring. EMR on EKS integrates Amazon EMR with Amazon Elastic Kubernetes Service (EKS). It allows users the flexibility to run Spark workloads on a Kubernetes cluster. This brings a unified approach to manage and orchestrate both compute and storage resources. Key Differences Between Traditional EMR and EMR on EKS Traditional EMR and EMR on EKS differ in several key aspects: Cluster management. Traditional EMR utilizes a dedicated EC2 cluster, where AWS handles the infrastructure. EMR on EKS, on the other hand, runs on an EKS cluster, leveraging Kubernetes for resource management and orchestration.Scalability. While both services offer scalability, Kubernetes in EMR on EKS provides more fine-grained control and auto-scaling capabilities, efficiently utilizing compute resources.Deployment flexibility. EMR on EKS allows multiple applications to run on the same cluster with isolated namespaces, providing flexibility and more efficient resource sharing. Benefits of Transitioning to EMR on EKS Moving to EMR on EKS brings several key benefits: Improved resource utilization. Enhanced scheduling and management of resources by Kubernetes ensure better utilization of compute resources, thereby reducing costs.Unified management. Big data analytics can be deployed and managed, along with other applications, from the same Kubernetes cluster to reduce infrastructure and operational complexity.Scalable and flexible. The granular scaling offered by Kubernetes, alongside the ability to run multiple workloads in isolated environments, aligns closely with modern cloud-native practices.Seamless integration. EMR on EKS integrates smoothly with many AWS services like S3, IAM, and CloudWatch, providing a consistent and secure data processing environment. Transitioning to EMR on EKS can modernize the way organizations manage their big data workloads. Up next, we'll delve into understanding the architectural differences and the role Kubernetes plays in EMR on EKS. Understanding the Architecture Traditional EMR architecture is based on a cluster of EC2 instances that are responsible for running big data processing frameworks like Apache Hadoop, Spark, and HBase. These clusters are typically provisioned and managed by AWS, offering a simple way to handle the underlying infrastructure. The master node oversees all operations, and the worker nodes execute the actual tasks. This setup is robust but somewhat rigid, as the cluster sizing is fixed at the time of creation. On the other hand, EMR on EKS (Elastic Kubernetes Service) leverages Kubernetes as the orchestration layer. Instead of using EC2 instances directly, EKS enables users to run containerized applications on a managed Kubernetes service. In EMR on EKS, each Spark job runs inside a pod within the Kubernetes cluster, allowing for more flexible resource allocation. This architecture also separates the control plane (Amazon EKS) from the data plane (EMR pods), promoting more modular and scalable deployments. The ability to dynamically provision and de-provision pods helps achieve better resource utilization and cost-efficiency. Role of Kubernetes Kubernetes plays an important role in the EMR on EKS architecture because of its strong orchestration capabilities for containerized applications. Following are some of the significant roles. Pod management. Kubernetes maintains the pod as the smallest manageable unit inside of a Kubernetes Cluster. Therefore, every Spark Job in an EMR on EKS operates on a Pod of its own with a high degree of isolation and flexibility.Resource scheduling. Kubernetes intelligently schedules pods based on resource requests and constraints, ensuring optimal utilization of available resources. This results in enhanced performance and reduced wastage.Scalability. Kubernetes supports both horizontal and vertical scaling. It could dynamically adjust the number of pods depending on the workload at that moment in time, scaling up in high demand and scaling down in low usage periods of time.Self-healing. In case some PODs fail, Kubernetes will independently detect them and replace those to ensure the high resiliency of applications running in the cluster. Planning the Transition Assessing Current EMR Workloads and Requirements Before diving into the transition from traditional EMR to EMR on EKS, it is essential to thoroughly assess your current EMR workloads. Start by cataloging all running and scheduled jobs within your existing EMR environment. Identify the various applications, libraries, and configurations currently utilized. This comprehensive inventory will be the foundation for a smooth transition. Next, analyze the performance metrics of your current workloads, including runtime, memory usage, CPU usage, and I/O operations. Understanding these metrics helps to establish a baseline that ensures the new environment performs at least as well, if not better,r than the old one. Additionally, consider the scalability requirements of your workloads. Some workloads might require significant resources during peak periods, while others run constantly but with lower resource consumption. Identifying Potential Challenges and Solutions Transitioning to EMR on EKS brings different technical and operational challenges. Recognizing these challenges early helps in crafting effective strategies to address them. Compatibility issues. EMR on EKS might be different in terms of specific configurations and applications. Test applications for compatibility and be prepared to make adjustments where needed.Resource management. Unlike traditional EMR, EMR on EKS leverages Kubernetes for resource allocation. Learn Kubernetes concepts such as nodes, pods, and namespaces to efficiently manage resources.Security concerns. System transitions can reveal security weaknesses. Evaluate current security measures and ensure they can be replicated or improved upon in the new setup. This includes network policies, IAM roles, and data encryption practices.Operational overheads. Moving to Kubernetes necessitates learning new operational tools and processes. Plan for adequate training and the adoption of tools that facilitate Kubernetes management and monitoring. Creating a Transition Roadmap The subsequent step is to create a detailed transition roadmap. This roadmap should outline each phase of the transition process clearly and include milestones to keep the project on track. Step 1. Preparation Phase Set up a pilot project to test the migration with a subset of workloads. This phase includes configuring the Amazon EKS cluster and installing the necessary EMR on EKS components. Step 2. Pilot Migration Migrate a small, representative sample of your EMR jobs to EMR on EKS. Validate compatibility and performance, and make adjustments based on the outcomes. Step 3. Full Migration Roll out the migration to encompass all workloads gradually. It’s crucial to monitor and compare performance metrics actively to ensure the transition is seamless. Step 4. Post-Migration Optimization Following the migration, continuously optimize the new environment. Implement auto-scaling and right-sizing strategies to guarantee effective resource usage. Step 5. Training and Documentation Provide comprehensive training for your teams on the new tools and processes. Document the entire migration process, including best practices and lessons learned. Best Practices and Considerations Security Best Practices for EMR on EKS Security will be given the highest priority while moving to EMR on EKS. Data security and compliance laws will ensure the smooth and secure running of the processes. IAM roles and policies. Use AWS IAM roles for least-privilege access. Create policies to grant permissions to users and applications based on their needs.Network security. Leverage VPC endpoints to their maximum capacity in establishing a secure connection between your EKS cluster and any other AWS service. Inbound and outbound traffic at the instance and subnet levels can be secured through security groups and network ACLs.Data encryption. Implement data encryption in transit and at rest. To that end, it is possible to utilize AWS KMS, which makes key management easy. Turn on encryption for any data held on S3 buckets and in transit.Monitoring and auditing. Implement ongoing monitoring with AWS CloudTrail and Amazon CloudWatch for activity tracking, detection of any suspicious activity, and security standards compliance. Performance Tuning and Optimization Techniques Performance tuning on EMR on EKS is crucial to keep the resources utilized effectively and the workloads executed suitably. Resource allocation. The resources need to be allocated based on the workload. Kubernetes node selectors and namespaces allow effective resource allocation.Spark configurations tuning. Spark configuration parameters like spark.executor.memory, spark.executor.cores, and spark.sql.shuffle.partitions are required to be tuned. Tuning needs to be job-dependent based on utilization and capacity in the cluster.Job distribution. Distribute jobs evenly across nodes using Kubernetes scheduling policies. This aids in preventing bottlenecks and guarantees balanced resource usage.Profiling and monitoring. Use tools like CloudWatch and Spark UI to monitor job performance. Identify and address performance bottlenecks by tuning configurations based on insights. Scalability and High Availability Considerations Auto-scaling. Leverage auto-scaling of your cluster and workloads using Kubernetes Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler. This automatically provisions resources on demand to keep up with the needs of jobs.Fault tolerance. Set up your cluster for high availability by spreading the nodes across numerous Availability Zones (AZs). This reduces the likelihood of downtime due to AZ-specific failures.Backup and recovery. Regularly back up critical data and cluster configurations. Use AWS Backup and snapshots to ensure you can quickly recover from failures.Load balancing. Distribute workloads using load balancing mechanisms like Kubernetes Services and AWS Load Balancer Controller. This ensures that incoming requests are evenly spread across the available nodes. Conclusion For teams that are thinking about the shift to EMR on EKS, the first step should be a thorough assessment of their current EMR workloads and infrastructure. Evaluate the potential benefits specific to your operational needs and create a comprehensive transition roadmap that includes pilot projects and phased migration plans. Training your team on Kubernetes and the nuances of EMR on EKS will be vital to ensure a smooth transition and long-term success. Begin with smaller workloads to test the waters and gradually scale up as confidence in the new environment grows. Prioritize setting up robust security and governance frameworks to safeguard data throughout the transition. Implement monitoring tools and cost management solutions to keep track of resource usage and expenditures. I would also recommend adopting a proactive approach to learning and adaptation to leverage the full potential of EMR on EKS, driving innovation and operational excellence.

By Satrajit Basu

CORE

AI Regulation in the U.S.: Navigating Post-EO 14110

As the Trump administration revokes Executive Order 14110, the U.S. shifts toward a market-driven AI strategy, departing from the Biden administration’s regulatory framework. While proponents see this as a catalyst for innovation and economic growth, critics warn of increased risks, regulatory fragmentation, and strained transatlantic relations. With Europe reinforcing its AI Act and states like California exploring their own regulations, the future of AI governance in the U.S. remains uncertain. Will deregulation accelerate progress, or does it open the door to new challenges in ethics, security, and global cooperation? Just days after taking office, Donald Trump, the 47th President of the United States, issued a series of executive actions aimed at dismantling key initiatives from the Biden administration. Among them was the revocation of Executive Order (EO) 14110, a landmark policy that established a framework for AI governance and regulation. This decision marks a turning point in U.S. AI policy. For its supporters, it is a necessary reform; for its critics, it is a dangerous setback. While EO 14110 aimed to structure AI adoption by balancing innovation and oversight, its repeal raises critical questions about the future of AI in the United States and its global impact. Background on Executive Order 14110 Executive Order 14110 was issued on October 30, 2023, under the Biden administration. This major initiative aimed to regulate the development and deployment of artificial intelligence. Its goal was to balance innovation, security, and economic stability while ensuring that AI systems remained reliable, safe, and transparent. In the Biden administration’s vision, EO 14110 was designed to address key concerns such as algorithmic bias, misinformation, job displacement, and cybersecurity vulnerabilities. It was not intended to impose direct restrictions on the private sector but rather to establish security and ethical standards, particularly for AI used by federal agencies and in public sector contracts, while also influencing broader AI governance. From an international perspective, EO 14110 also aimed to strengthen the United States' role in global AI governance. It aligned with the European Union’s approach, particularly as the EU was developing its AI Act. The order was part of a broader transatlantic effort to establish ethical and security standards for AI. "Artificial Intelligence (AI) holds extraordinary potential for both promise and peril. Responsible AI use has the potential to help solve urgent challenges while making our world more prosperous, productive, innovative, and secure. At the same time, irresponsible use could exacerbate societal harms such as fraud, discrimination, bias, and disinformation; displace and disempower workers; stifle competition; and pose risks to national security." (EO 14110 - Section 1) EO 14110 as Part of a Broader AI Strategy: Continuity in Biden’s Policy It is important to understand that EO 14110 was not an isolated initiative. It was part of a broader strategy built on several existing frameworks and commitments. Blueprint for an AI Bill of Rights (2022). A foundational document outlining five key principles: safe and effective AI systems, protections against algorithmic discrimination, data privacy, transparency, and human alternatives. Voluntary AI Commitments (2023-2024). Major tech companies, including Google, OpenAI, and Microsoft, agreed to self-regulation measures focusing on AI transparency, security, and ethics. National Security AI Strategy (2024). The Biden administration made AI a priority in cybersecurity, military applications, and critical infrastructure protection. It is worth noting that even after the revocation of EO 14110, these initiatives remain in place, ensuring a degree of continuity in AI governance in the United States. Objectives and Scope of EO 14110 Executive Order 14110 pursued several strategic objectives aimed at regulating AI adoption while promoting innovation. It emphasized the security and reliability of AI systems by requiring robustness testing and risk assessments, particularly in sensitive areas such as cybersecurity and critical infrastructure. It also aimed to ensure fairness and combat bias by implementing protections against algorithmic discrimination and promoting ethical AI use in hiring, healthcare, and justice. EO 14110 included training, reskilling, and protection programs to help workers adapt to AI-driven changes. It also aimed to protect consumers by preventing fraudulent or harmful AI applications, ensuring safe and beneficial use. Finally, the executive order aimed to reinforce international cooperation, particularly with the European Union, to establish common AI governance standards. However, it’s important to note that it did not aim to regulate the entire private sector but rather to set strict ethical and security standards for AI systems used by federal agencies. Summary of EO 14110’s Key Principles To quickly get the essentials, here are the eight fundamental principles it was built on: Ensure AI is safe and secure.Promote responsible innovation and competition.Support workers affected by AI’s deployment.Advance equity and civil rights.Protect consumer interests in AI applications.Safeguard privacy and civil liberties.Enhance AI capabilities within the federal government.Promote U.S. leadership in global AI governance. Why Did the Trump Administration Revoke EO 14110? So, on January 20, 2025, the Trump administration announced the revocation of EO 14110, arguing that it restricted innovation by imposing excessive administrative constraints. The White House justified this decision as part of a broader push to deregulate the sector, boost the economy, and attract AI investment. The administration made clear its preference for a market-driven approach. According to Trump, private companies are better positioned to oversee AI development without federal intervention. Clearly, this shift marks a geopolitical turning point. The United States is moving away from a multilateral approach to assert its dominance in the AI sector. However, this revocation does not mean the end of AI regulation in the United States. Other federal initiatives, such as the NIST AI Risk Management Framework, remain in place. "Republicans support AI development rooted in free speech and human flourishing." (The 2024 Republican Party by Reuters) Consequences of the Revocation in the United States The repeal of EO 14110 has immediate effects and long-term implications. It reshapes the future of AI development in the United States. From the Trump administration’s perspective, this decision removes bureaucratic hurdles, accelerates innovation, and strengthens U.S. competitiveness in AI. Supporters argue that by reducing regulatory constraints, the repeal allows companies to move faster, lowers compliance costs, and attracts greater investment, particularly in automation and biotechnology. But on the other hand, without a federal framework, the risks associated with the development and use of AI technologies are increasing. Algorithmic bias, cybersecurity vulnerabilities, and the potential misuse of AI become harder to control without national oversight. Critics also warn of a weakening of worker and consumer protections, as the end of support programs could further deepen economic inequalities. In practical terms, regulation is becoming more fragmented. Without a federal framework, each state could, and likely will, develop its own AI laws, making compliance more complex for businesses operating nationwide. Some see this as an opportunity for regulatory experimentation, while others see it as a chance for opportunistic players to exploit loopholes or fear legal uncertainty and increased tensions with international partners. Impact on Europe The revocation of EO 14110 also affects global AI governance, particularly in Europe. Transatlantic relations are likely to become strained, as the growing divergence between U.S. and European approaches will make regulatory cooperation more challenging. European companies may tighten their compliance standards to maintain consumer trust, which could influence their strategic decisions. In fact, the European Union may face pressure to adjust its AI Act, although its regulatory framework remains largely independent from that of the United States. Conclusion The revocation of Executive Order 14110 is more than just a policy shift in the United States. It represents a strategic choice, favoring a deregulated model where innovation takes precedence over regulation. While this decision may help accelerate technological progress, it also leaves critical questions unanswered: Who will ensure the ethics, security, and transparency of AI in the United States? For Europe, this shift deepens the divide with the U.S. and strengthens its role as a "global regulator" through the AI Act. The European Union may find itself alone at the forefront of efforts to enforce strict AI regulations, risking a scenario where some companies favor the less restrictive U.S. market. More than a debate on regulation, this revocation raises a fundamental question: In the global AI race, should progress be pursued at all costs, or should every advancement be built on solid and ethical foundations? The choices made today will shape not only the future of the industry but also the role of democracies in the face of tech giants. One More Thing The revocation of EO 14110 highlights a broader debate: who really shapes AI policy, the government or private interests? While the U.S. moves toward deregulation, California’s AI safety bill (SB 1047) is taking the opposite approach, proposing strict oversight for advanced AI models. But as an investigation by Pirate Wires reveals, this push for regulation isn’t without controversy. Dan Hendrycks, a key advocate for AI safety, co-founded Gray Swan, a company developing compliance tools that could directly benefit from SB 1047’s mandates. This raises a crucial question: When policymakers and industry leaders are deeply intertwined, is AI regulation truly about safety, or about controlling the market? In the race to govern AI, transparency may be just as important as regulation itself.

By Frederic Jacquet

CORE

Teradata Performance and Skew Prevention Tips

Understanding Teradata Data Distribution and Performance Optimization Teradata performance optimization and database tuning are crucial for modern enterprise data warehouses. Effective data distribution strategies and data placement mechanisms are key to maintaining fast query responses and system performance, especially when handling petabyte-scale data and real-time analytics. Understanding data distribution mechanisms, workload management, and data warehouse management directly affects query optimization, system throughput, and database performance optimization. These database management techniques enable organizations to enhance their data processing capabilities and maintain competitive advantages in enterprise data analytics. Data Distribution in Teradata: Key Concepts Teradata's MPP (Massively Parallel Processing) database architecture is built on Access Module Processors (AMPs) that enable distributed data processing. The system's parallel processing framework utilizes AMPs as worker nodes for efficient data partitioning and retrieval. The Teradata Primary Index (PI) is crucial for data distribution, determining optimal data placement across AMPs to enhance query performance. This architecture supports database scalability, workload management, and performance optimization through strategic data distribution patterns and resource utilization. Understanding workload analysis, data access patterns, and Primary Index design is essential for minimizing data skew and optimizing query response times in large-scale data warehousing operations. What Is Data Distribution? Think of Teradata's AMPs (Access Module Processors) as workers in a warehouse. Each AMP is responsible for storing and processing a portion of your data. The Primary Index determines how data is distributed across these workers. Simple Analogy Imagine you're managing a massive warehouse operation with 1 million medical claim forms and 10 workers. Each worker has their own storage section and processing station. Your task is to distribute these forms among the workers in the most efficient way possible. Scenario 1: Distribution by State (Poor Choice) Let's say you decide to distribute claims based on the state they came from: Plain Text Worker 1 (California): 200,000 claims Worker 2 (Texas): 150,000 claims Worker 3 (New York): 120,000 claims Worker 4 (Florida): 100,000 claims Worker 5 (Illinois): 80,000 claims Worker 6 (Ohio): 70,000 claims Worker 7 (Georgia): 60,000 claims Worker 8 (Virginia): 40,000 claims Worker 9 (Oregon): 30,000 claims Worker 10 (Montana): 10,000 claims The Problem Worker 1 is overwhelmed with 200,000 formsWorker 10 is mostly idle, with just 10,000 formsWhen you need California data, one worker must process 200,000 forms aloneSome workers are overworked, while others have little to do Scenario 2: Distribution by Claim ID (Good Choice) Now, imagine distributing claims based on their unique claim ID: Plain Text Worker 1: 100,000 claims Worker 2: 100,000 claims Worker 3: 100,000 claims Worker 4: 100,000 claims Worker 5: 100,000 claims Worker 6: 100,000 claims Worker 7: 100,000 claims Worker 8: 100,000 claims Worker 9: 100,000 claims Worker 10: 100,000 claims The Benefits Each worker handles exactly 100,000 formsWork is perfectly balancedAll workers can process their forms simultaneouslyMaximum parallel processing achieved This is exactly how Teradata's AMPs (workers) function. The Primary Index (distribution method) determines which AMP gets which data. Using a unique identifier like claim_id ensures even distribution, while using state_id creates unbalanced workloads. Remember: In Teradata, like in our warehouse, the goal is to keep all workers (AMPs) equally busy for maximum efficiency. The Real Problem of Data Skew in Teradata Example 1: Poor Distribution (Using State Code) SQLite CREATE TABLE claims_by_state ( state_code CHAR(2), -- Only 50 possible values claim_id INTEGER, -- Millions of unique values amount DECIMAL(12,2) -- Claim amount ) PRIMARY INDEX (state_code); -- Creates daily hotspots which will cause skew! Let's say you have 1 million claims distributed across 50 states in a system with 10 AMPs: SQLite -- Query to demonstrate skewed distribution SELECT state_code, COUNT(*) as claim_count, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage FROM claims_by_state GROUP BY state_code ORDER BY claim_count DESC; -- Sample Result: -- STATE_CODE CLAIM_COUNT PERCENTAGE -- CA 200,000 20% -- TX 150,000 15% -- NY 120,000 12% -- FL 100,000 10% -- ... other states with smaller percentages Problems With This Distribution 1. Uneven workload California (CA) data might be on one AMPThat AMP becomes overloaded while others are idleQueries involving CA take longer 2. Resource bottlenecks SQLite -- This query will be slow SELECT COUNT(*), SUM(amount) FROM claims_by_state WHERE state_code = 'CA'; -- One AMP does all the work Example 2: Better Distribution (Using Claim ID) SQLite CREATE TABLE claims_by_state ( state_code CHAR(2), claim_id INTEGER, amount DECIMAL(12,2) ) PRIMARY INDEX (claim_id); -- Better distribution Why This Works Better 1. Even distribution Plain Text -- Each AMP gets approximately the same number of rows -- With 1 million claims and 10 AMPs: -- Each AMP ≈ 100,000 rows regardless of state 2. Parallel processing SQLite -- This query now runs in parallel SELECT state_code, COUNT(*), SUM(amount) FROM claims_by_state GROUP BY state_code; -- All AMPs work simultaneously Visual Representation of Data Distribution Poor Distribution (State-Based) SQLite -- Example demonstrating poor Teradata data distribution CREATE TABLE claims_by_state ( state_code CHAR(2), -- Limited distinct values claim_id INTEGER, -- High cardinality amount DECIMAL(12,2) ) PRIMARY INDEX (state_code); -- Causes data skew Plain Text AMP1: [CA: 200,000 rows] ⚠️ OVERLOADED AMP2: [TX: 150,000 rows] ⚠️ HEAVY AMP3: [NY: 120,000 rows] ⚠️ HEAVY AMP4: [FL: 100,000 rows] AMP5: [IL: 80,000 rows] AMP6: [PA: 70,000 rows] AMP7: [OH: 60,000 rows] AMP8: [GA: 50,000 rows] AMP9: [Other states: 100,000 rows] AMP10: [Other states: 70,000 rows] Impact of Poor Distribution Poor Teradata data distribution can lead to: Unbalanced workload across AMPsPerformance bottlenecksInefficient resource utilizationSlower query response times Good Distribution (Claim ID-Based) SQLite -- Implementing optimal Teradata data distribution CREATE TABLE claims_by_state ( state_code CHAR(2), claim_id INTEGER, amount DECIMAL(12,2) ) PRIMARY INDEX (claim_id); -- Ensures even distribution Plain Text AMP1: [100,000 rows] ✓ BALANCED AMP2: [100,000 rows] ✓ BALANCED AMP3: [100,000 rows] ✓ BALANCED AMP4: [100,000 rows] ✓ BALANCED AMP5: [100,000 rows] ✓ BALANCED AMP6: [100,000 rows] ✓ BALANCED AMP7: [100,000 rows] ✓ BALANCED AMP8: [100,000 rows] ✓ BALANCED AMP9: [100,000 rows] ✓ BALANCED AMP10: [100,000 rows] ✓ BALANCED Performance Metrics from Real Implementation In our healthcare system, changing from state-based to claim-based distribution resulted in: 70% reduction in query response time85% improvement in concurrent query performance60% better CPU utilization across AMPsElimination of processing hotspots Best Practices for Data Distribution 1. Choose High-Cardinality Columns Unique identifiers (claim_id, member_id)Natural keys with many distinct values 2. Avoid Low-Cardinality Columns State codesStatus flagsDate-only values 3. Consider Composite Keys (Advanced Teradata Optimization Techniques) Use when you need: Better data distribution than a single column providesEfficient queries on combinations of columnsBalance between distribution and data locality Plain Text Scenario | Single PI | Composite PI ---------------------------|--------------|------------- High-cardinality column | ✓ | Low-cardinality + unique | | ✓ Frequent joint conditions | | ✓ Simple equality searches | ✓ | SQLite CREATE TABLE claims ( state_code CHAR(2), claim_id INTEGER, amount DECIMAL(12,2) ) PRIMARY INDEX (state_code, claim_id); -- Uses both values for better distribution 4. Monitor Distribution Quality SQLite -- Check row distribution across AMPs SELECT HASHAMP(claim_id) as amp_number, COUNT(*) as row_count FROM claims_by_state GROUP BY 1 ORDER BY 1; /* Example Output: amp_number row_count 0 98,547 1 101,232 2 99,876 3 100,453 4 97,989 5 101,876 ...and so on */ What This Query Tells Us This query is like taking an X-ray of your data warehouse's health. It shows you how evenly your data is spread across your Teradata AMPs. Here's what it does: HASHAMP(claim_id) – this function shows which AMP owns each row. It calculates the AMP number based on your Primary Index (claim_id in this case)COUNT(*) – counts how many rows each AMP is handlingGROUP BY 1 – groups the results by AMP numberORDER BY 1 – displays results in AMP number order Interpreting the Results Good Distribution You want to see similar row counts across all AMPs (within 10-15% variance). Plain Text AMP 0: 100,000 rows ✓ Balanced AMP 1: 98,000 rows ✓ Balanced AMP 2: 102,000 rows ✓ Balanced Poor Distribution Warning signs include large variations. Plain Text AMP 0: 200,000 rows ⚠️ Overloaded AMP 1: 50,000 rows ⚠️ Underutilized AMP 2: 25,000 rows ⚠️ Underutilized This query is essential for: Validating Primary Index choicesIdentifying data skew issuesMonitoring system healthPlanning optimization strategies Conclusion Effective Teradata data distribution is fundamental to achieving optimal database performance. Organizations can significantly improve their data warehouse performance and efficiency by implementing these Teradata optimization techniques.

By Sudheer Kumar Lagisetty

Data Engineering

Functions of Data Engineering

AI/ML

Big Data

Data

Databases

IoT

DZone's Featured Data Engineering Resources

The Latest Data Engineering Topics