Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that, there is a growing need to increase operational efficiency as customer demands arise. AI platforms have become increasingly sophisticated, and a need has emerged to establish guidelines and ownership. In DZone's 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot, and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience on practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
I started evaluating Google's Gemini Code Assist in December 2023, around the time of its launch. The aim of this article is to cover its usage and impact beyond basic code generation, across all the activities a developer is expected to perform day to day (especially with the additional responsibilities entrusted to developers these days with the advent of "Shift-Left" and full-stack development roles).

Gemini Code Assist

Gemini Code Assist can be tried at no cost until November 2024. These are the core features it offered at the time of carrying out this exercise:

AI code assistance
Natural language chat
AI-powered smart actions
Enterprise security and privacy

Refer to the link for more details and pricing: Gemini Code Assist.

Note: Gemini Code Assist was formerly known as Duet AI. The entire content of the study has been divided into two separate articles. Interested readers should go through both of them in sequential order. The second part will be linked following its publication. This review expresses a personal view specific to Gemini Code Assist only. As intelligent code assist is an evolving field, the review points are valid only for the features available at the time of carrying out this study.

Gemini Code Assist Capabilities: What’s Covered in the Study as per Feature Availability

Gemini Pro: ✓ Available for all users; ✓ Chat; ✓ Improve code generations; ✓ Smart Actions
Code Customization: ✓ Local context from relevant files in the local folder; x Remote context from private codebases
Code Transformations: x Use natural language to modify existing code (e.g., Java 8 to Java 21)

Note:
1. Items marked with x will be available in future releases of Gemini Code Assist.
2. Code Transformations is not released publicly and is in preview at the time of writing.

Technical Tools

Below are the technical tools used for different focus areas during the exercise.
The study was done with the tools, languages, and frameworks specified below, but the results should be applicable to other similar modern languages and frameworks with minor variations.

Focus Area: Tools
Language and Framework: Java 11 & 17; Spring Boot 2.2.3 & 3.2.5
Database: Postgres
Testing: JUnit, Mockito
IDE and Plugins: Visual Studio Code with extensions: Cloud Code, Gemini Code Assist
Cloud Platform: GCP with the Gemini API enabled on a project (prerequisite); Docker; Cloud SQL (Postgres); Cloud Run

Development Lifecycle Stages and Activities

For simplicity, the entire development lifecycle has been divided into the stages below, each encompassing a set of activities that developers would normally do. For each lifecycle stage, some activities were selected and tried out in the VS Code editor using Gemini Code Assist.

Stage: Activities
1. Bootstrapping: Gain deeper domain understanding via the enterprise knowledge base (Confluence, Git repos, etc.); generate scaffolding code for microservices (controller, services, repository, models); pre-generated templates for unit and integration tests; database schema (table creation, relationships, scripts, test-data population)
2. Build and Augment: Implement business logic/domain rules; leverage implementation patterns (e.g., configuration management, circuit breaker); exception and error handling; logging/monitoring; optimized code for performance (asynchronous, time-outs, concurrency, non-blocking, remove boilerplate)
3. Testing and Documentation: Debugging (using Postman to test API endpoints); unit/integration tests; OpenAPI specs creation; code coverage, quality, and code smells; test plan creation
4. Troubleshoot: Invalid/no responses or application errors
5. Deployment: Deploy services to the GCP stack (Cloud Run/GKE/App Engine, Cloud SQL)
6. Operate: Get assistance modifying/upgrading existing application code and ensuring smooth operations

Requirements

Let's now consider a fictitious enterprise whose background and some functional requirements are given below. We will see to what extent Gemini Code Assist can help in fulfilling them.

Background
A fictitious enterprise that moved to the cloud or adopted “cloud-native” a few years back
Domain: e-commerce
Let’s keep the discussion centered on microservices using Spring Boot and Java
Grappling with multi-fold technical challenges: green field (new microservices to be created); brown field (breaking the monolith into microservices, integration with legacy systems); iterative development (incremental updates to microservices, upgrades, code optimization, patches)

Functional Requirements
Allow listing products in the catalog
Add, modify, and delete products in the catalog
Recommendation service
Asynchronous implementation to retrieve the latest price
Query affiliated shops for a product and fetch the lowest price for a product
Bulk addition of products and grouping of processed results based on success and failure status
Rules: A product will belong to a single category, and a category may have many products.

Let's start with Stage 1, Bootstrapping, to gain deeper domain understanding.

1. Bootstrapping

During this phase, developers will:

Need more understanding of the domain (i.e., e-commerce, in this case) from Enterprise Knowledge Management (Confluence, Git, Jira, etc.).
Get more details about specific services that will need to be created. Get a viewpoint on the choice of tech stack (i.e., Java and Spring Boot) with steps to follow to develop new services. Let’s see how Gemini Code Assist can help in this regard and to what extent. Prompt: "I want to create microservices for an e-commerce company. What are typical domains and services that need to be created for this business domain" Note: Responses above by Gemini Code Assist: Chat are based on information retrieved from public online/web sources on which it is trained, and not retrieved from the enterprise’s own knowledge sources, such as Confluence. Though helpful, this is generic e-commerce information. In the future when Gemini Code Assist provides information more contextual to the enterprise, it will be more effective. Let’s now try to generate some scaffolding code for the catalog and recommendation service first as suggested by Code Assist. First, we will build a Catalog Service through Gemini Code Assist. A total of 7 steps along with code snippets were generated. Relevant endpoints for REST API methods to test the service are also provided once the service is up. Let's begin with the first recommended step, "Create a new Spring Boot project." Building Catalog Service, Step 1 Generate project through Spring Initializr: Note: Based on user prompts, Gemini Code Assist generates code and instructions to follow in textual form. Direct generation of files and artifacts is not supported yet. Generated code needs to be copied to files at the appropriate location. 
Building Catalog Service, Steps 2 and 3

Add dependency for JPA, and define Product and Category entities:

Building Catalog Service, Step 4

Create Repository interfaces:

Building Catalog Service, Step 5

Update Service layer:

Building Catalog Service, Steps 6 and 7

Update the Controller and run the application:

Building Catalog Service, Additional Step: Postgres Database Specific

This step was not initially provided by Gemini Code Assist, but is part of an extended conversation/prompt by the developer. Some idiosyncrasies had to be corrected before using the generated scripts; for example, the Postgres database name could not contain hyphens.

Building Through Gemini Code Assist vs Code Generators

A counterargument to using Gemini Code Assist is that a seasoned developer may be able to generate scaffolding code with JPA entities just as quickly without it, drawing on past experience and familiarity with the existing codebase and using tools such as Spring Roo, JHipster, etc. However, there may be a learning curve, configuration, or approvals required before such tools can be used in an enterprise setup. The ease of use of Gemini Code Assist and its flexibility to cater to diverse use cases across domains make it a viable option even for a seasoned developer. It can, in fact, complement code-gen tools and be leveraged as the next step after initial scaffolding.

2. Build and Augment

Now let's move to the second stage, Build and Augment, and evolve the product catalog service further by adding, updating, and deleting products through code generated from prompts. Generate a method to save the product by specifying comments at the service layer:

Along similar lines to the product-catalog service, we created a Recommendation service. Each of the steps can be drilled down further, as we did during the product-catalog service creation. Now, let's add some business logic by adding a comment and using Gemini Code Assist Smart Actions to generate code.
Code suggestions can be generated not only from comments; Gemini Code Assist is also intelligent enough to provide suggestions dynamically based on developer keyboard inputs and intent. Re-clicking Smart Actions can give multiple options for code. Another interactive option to generate code is the Gemini Code Assist Chat feature. Let’s now try to change existing business logic. Say we want to return a map of successful and failed product lists instead of a single list, to discern which products were processed successfully and which ones failed. Let's try to improve the existing method with an async implementation using Gemini Code Assist. Next, let's try to refactor existing code by applying the strategy pattern through Gemini Code Assist.

Note: The suggested code builds a PricingStrategy for shops; e.g., RandomPricing and ProductLength pricing. Still, this is too much boilerplate code, so a developer, drawing on experience, should probe further with prompts to reduce it. Let's try to reduce boilerplate code through Gemini Code Assist.

Note: Based on the input prompt, the suggestion is to modify the constructor of the Shop class to accept an additional function parameter for the pricing strategy, using lambdas. Dynamic behavior can then be passed during the instantiation of Shop class objects.

3. Testing and Documentation

Now, let's move to stage 3, Testing and Documentation, and probe Gemini Code Assist on how to test the endpoint. As per the response, Postman, curl, unit tests, and integration tests are some of the testing options suggested by Gemini Code Assist. Now, let's generate the payload from Gemini Code Assist to test the /bulk endpoint via Postman. Let's see how effective the payloads generated by Gemini Code Assist are by hitting the /bulk endpoint. Let's see if we can fix it with Gemini Code Assist so that invalid category IDs can be handled during product creation.
Next, let's generate OpenAPI specifications for our microservices using Gemini Code Assist.

Note: Documenting APIs so that it becomes easy for API consumers to call and integrate them in their applications is a common requirement in microservices projects. However, it is often a time-consuming activity for developers. Swagger/OpenAPI is a common format for documenting REST APIs. Gemini Code Assist generated OpenAPI specs that matched expectations in this regard.

Next, we generated unit test cases at the Controller layer. Following a similar approach, unit test cases can be generated at other layers too; i.e., service and repository. Next, we ran the generated unit test cases and checked whether we encountered any errors.

4. Troubleshooting

While running this application, we encountered an error caused by a table and entity name mismatch, which we were able to rectify with help from Gemini Code Assist. Next, we encountered empty results on the get products call even though data existed in the products table. To overcome this issue, we included Lombok dependencies for the missing getters and setters.

Debugging: An Old Friend to the Developer’s Rescue

The developer's debugging skills will be handy, as there will be situations where generated code does not produce the expected results because of hallucinations. We noted that a developer needs to be aware of concepts such as marshalling, unmarshalling, and annotations such as @RequestBody to troubleshoot such issues and then get more relevant answers from Gemini Code Assist. This is where a sound development background will come in handy. An interesting exploration in this area could be whether Code Assist tools can learn and be trained on issues that other developers in an enterprise have encountered while implementing similar coding patterns. The API call to create a new product finally worked after incorporating the suggestion of adding @RequestBody.
Handling exceptions in a consistent manner is a standard requirement for all enterprise projects. Create a new package for exceptions, a base class to extend, and other steps to implement custom exceptions. Gemini Code Assist does a good job of meeting this requirement. Handling specific exceptions such as "ProductNotFound": Part 1 Conclusion This concludes Part 1 of the article. In Part 2, I will cover the impact of Gemini Code Assist on the remainder of the lifecycle stages, Deployment and Operate; also, productivity improvements in different development lifecycle stages, and the next steps prescribed thereof.
When building ETL data pipelines using Azure Data Factory (ADF) to process huge amounts of data from different sources, you may often run into performance and design-related challenges. This article serves as a guide to building high-performance ETL pipelines that are both efficient and scalable. Below are the major guidelines to consider when building optimized ETL data pipelines in ADF.

Designing Pipelines

Activity Type Selection

ADF provides many types of activities to orchestrate data; choose the ones that best fit your needs and business requirements. For example, when you copy data from an on-premises Oracle data source to an on-premises SQL Server (sink) that has many tables, consider using a combination of Lookup, ForEach, and Copy activities to design the pipeline optimally.

Linked Services and Datasets

A linked service is a connection to a data source that can be created once and reused across multiple pipelines within the same ADF. It is efficient to create one linked service per source for easy maintenance. Similarly, datasets are derived from linked services to fetch the data from the source. Ideally, there should be a single dataset for each linked service, reused across all pipelines in the same ADF. Create separate datasets only when the data format from the source varies.

Resiliency and Reliability

When you build pipelines with numerous activities, troubleshooting becomes complicated when a failure occurs. To achieve reliability, split complex logic into multiple pipelines and create dependencies between them. Additionally, intermittent network issues can cause pipeline failures. To avoid manual intervention, enable the retry policy at the activity level so that an activity automatically reruns when it fails.

Scheduling Using Triggers

In a big data environment, processing large amounts of data often requires developing and scheduling multiple pipelines to run in batches.
To optimize resource utilization, stagger pipeline schedules where possible. If business requirements demand processing within a strict SLA, it is best to scale up resources on the Integration Runtime (IR) and the source/sink to utilize and manage resources effectively.

Expressions and Variables

You can leverage expressions and variables in pipelines to implement dynamic content that adapts to environmental changes without manual intervention and supports varied workflows. This can optimize the pipeline and resource usage across activities. For example, expressions can be used in If Condition and Switch activities to direct execution based on conditions, while variables can be used with Lookup and ForEach activities to capture values and pass them to subsequent activities.

Performance Optimization Techniques

Using Azure Integration Runtime

Azure Integration Runtime (IR) provides a compute environment for your activities that is managed by ADF. Transferring large amounts of data needs more computing power, and this compute is adjustable via Data Integration Units (DIUs) in the Copy activity settings. Using more DIUs can significantly reduce the activity runtime.

Using Self-Hosted Integration Runtime (SHIR)

SHIR lets you use your own compute resources for activity execution and is mostly used to connect to on-premises sources. It supports a maximum of four nodes. For large workloads, consider using high-configuration machines across all nodes (ideally the same size and location) to avoid performance bottlenecks such as slow throughput and high queue time.

Network Throughput

When transferring huge amounts of data using pipelines, high network throughput is necessary. For instance, when connecting on-premises data sources to cloud destinations, consider using a dedicated network connection for efficient, high-throughput data transfer.
Parallel Reads and Writes

Copying large datasets without enabling parallel reads and writes can delay execution. Choose the right partitioning column option in the source or sink settings to enable parallel reads and writes.

Advanced Techniques

Data Transformations Using External Compute

When running transformations on data that is stored externally, it is efficient to use the source's compute for those transformations. For instance, triggering a Databricks notebook activity leverages the compute of the Databricks cluster to perform data processing; this can reduce data movement and optimize performance.

Cost Optimization

Pipeline costs correlate strongly with the number of activities, the number of transformations, and the runtime of each activity. The majority of the cost comes from Azure IR utilization, so applying the optimization techniques discussed in this guide is essential to lowering costs. To reduce costs further, enable Azure IR with a time-to-live setting when running data flow activities in parallel.

Security Considerations

During the execution of a pipeline, data movement typically occurs over public networks. To enhance security, you can enable ADF's native managed private endpoints to ensure that the connection to your sources is private and secure.

Operational Best Practices

Data Lineage

Maintaining an audit trail for each record as data moves through pipelines is an important best practice. Integrating ADF with Microsoft Purview lets you track the lineage of data and document its origin in a data catalog for future reference.

Fault Tolerance

During pipeline execution, data compatibility issues can arise and lead to failed activities. To handle them, implement fault tolerance: log incompatible records so they can be corrected and reprocessed later, while continuing to load compatible data.
To enable this, configure at the activity level by directing incompatible records to the storage account. Backup and Restore Pipelines To be able to safeguard your ADF pipelines, ADF needs to be integrated with GitHub or Azure DevOps. This will ensure that pipelines can be restored from their source control if the ADF resource is accidentally deleted. Note that you need to restore other resources separately that are configured as sources or sinks in the pipelines. ADF can only restore pipeline configuration, not the actual resources that are deleted. For instance, if you are using Azure Blob Storage as a source, then this needs to be restored separately if it was deleted. Monitoring and Alerts Setting up monitoring for pipeline resources is critical for proactive reaction and prevention, ensuring no issues occur that may lead to escalation. Monitor the pipeline runs, integration runtime utilization, and set up alerts to notify you via email or phone. Conclusion There are many components of data pipelines that need to be taken into consideration and planned before their development. These best practices guide you to do the right design of each component in a data pipeline so that it will be high-performing, reliable, cost-effective, and scalable for handling your growing business needs.
Introduction to the Problem Managing concurrency in financial transaction systems is one of the most complex challenges faced by developers and system architects. Concurrency issues arise when multiple transactions are processed simultaneously, which can lead to potential conflicts and data inconsistencies. These issues manifest in various forms, such as overdrawn accounts, duplicate transactions, or mismatched records, all of which can severely undermine the system's reliability and trustworthiness. In the financial world, where the stakes are exceptionally high, even a single error can result in significant financial losses, regulatory violations, and reputational damage to the organization. Consequently, it is critical to implement robust mechanisms to handle concurrency effectively, ensuring the system's integrity and reliability. Complexities in Money Transfer Applications At first glance, managing a customer's account balance might seem like a straightforward task. The core operations — crediting an account, allowing withdrawals, or transferring funds between accounts — are essentially simple database transactions. These transactions typically involve adding or subtracting from the account balance, with the primary concern being to prevent overdrafts and maintain a positive or zero balance at all times. However, the reality is far more complex. Before executing any transaction, it's often necessary to perform a series of checks with other systems. For example, the system must verify that the account in question actually exists, which usually involves querying a central account database or service. Moreover, the system must ensure that the account is not blocked due to issues such as suspicious activity, regulatory compliance concerns, or pending verification processes. These additional steps introduce layers of complexity that go beyond simple debit and credit operations. 
Robust checks and balances are required to ensure that customer balances are managed securely and accurately, adding significant complexity to the overall system. Real-World Requirements (KYC, Fraud Prevention, etc.) Consider a practical example of a money transfer company that allows customers to transfer funds across different currencies and countries. From the customer's perspective, the process is simple: The customer opens an account in the system. A EUR account is created to receive money. The customer creates a recipient in the system. The customer initiates a transfer of €100 to $110 to the recipient. The system waits for the inbound €100. Once the funds arrive, they are converted to $110. Finally, the system sends $110 to the recipient. This process can be visualized as follows: While this sequence appears simple, real-world requirements introduce additional complexity: Payment verification: The system must verify the origin of the inbound payment. The payer's bank account must be valid. The bank's BIC code must be authorized within the system. If the payment originates from a non-bank payment system, additional checks are required. Recipient validation: The recipient's bank account must be active. Customer validation: The recipient must pass various checks, such as identity verification (e.g., a valid passport and a confirmed selfie ID). Source of funds and compliance: Depending on the inbound transfer amount, the source of funds may need to be verified. The fraud prevention system should review the inbound payment. Neither the sender nor the recipient should appear on any sanctions list. Transaction limits and fees: The system should calculate monthly and annual payment limits to determine applicable fees. If the transaction involves currency conversion, the system must handle foreign exchange rates. Audit and compliance: The system must log all transactions for auditing and compliance purposes. 
These requirements add significant complexity to what initially seems like a straightforward process. Additionally, based on the results of these checks, the payment may require manual review, further extending the payment process. Visualization of Data Flow and Potential Failure Points In a financial transaction system, the data flow for handling inbound payments involves multiple steps and checks to ensure compliance, security, and accuracy. However, potential failure points exist throughout this process, particularly when external systems impose restrictions or when the system must dynamically decide on the course of action based on real-time data. Standard Inbound Payment Flow Here's a simplified visualization of the data flow when handling an inbound payment, including the sequence of interactions between various components: Explanation of the Flow Customer initiates payment: The customer sends a payment to their bank. Bank sends payment: The bank forwards the payment to the transfer system. Compliance check: The transfer system checks the sender and recipient against compliance regulations. Verification checks: The system verifies if the sender and recipient have passed necessary identity and document verifications. Fraud detection: A fraud check is performed to ensure the payment is not suspicious. Statistic calculation: The system calculates transaction limits and other relevant metrics. Fee calculation: Any applicable fees are calculated. Confirmation: The system confirms receipt of the payment to the customer. Potential Failure Points and Dynamic Restrictions While the above flow seems straightforward, the process can become complicated due to dynamic changes, such as when an external system imposes restrictions on a customer's account. 
Here's how the process might unfold, highlighting the potential failure points: Explanation of the Potential Failure Points Dynamic restrictions: During the process, the compliance team may decide to restrict all operations for a specific customer due to sanctions or other regulatory reasons. This introduces a potential failure point where the process could be halted or altered mid-way. Database state conflicts: After compliance decides to restrict operations, the transfer system needs to update the state of the transfer in the database. The challenge here lies in managing the state consistency, particularly if multiple operations occur simultaneously or if there are conflicting updates. The system must ensure that the transfer's state is accurately reflected in the database, taking into account the restriction imposed. If not handled carefully, this could lead to inconsistent states or failed transactions. Decision points: The system's ability to dynamically recalculate the state and decide whether to accept or reject an inbound payment is crucial. Any misstep in this decision-making process could result in unauthorized transactions, blocked funds, or legal violations. Visualizing the data flow and identifying potential failure points in financial transaction systems reveals the complexity and risks involved in handling payments. By understanding these risks, system architects can design more robust mechanisms to manage state, handle dynamic changes, and ensure the integrity of the transaction process. Traditional Approaches to Concurrency There are various approaches to addressing concurrency challenges in financial transaction systems. Database Transactions and Their Limitations The most straightforward approach to managing concurrency is through database transactions. To start, let’s define our context: the transfer system stores its data in a Postgres database. 
While the database topology can vary (whether shared across multiple instances, data centers, locations, or regions), our focus here is on a simple, single Postgres database instance handling both reads and writes. To ensure that one transaction does not override another's data, we can lock the row associated with the transfer:

SELECT * FROM transfers WHERE id = 'ABCD' FOR UPDATE;

This command locks the row at the beginning of the process, and the lock is released once the transaction is complete. The following diagram illustrates how this approach addresses the issue of lost updates: While this approach can solve the problem of lost updates in simple scenarios, it becomes less effective as the system scales and the number of active transactions increases.

Scaling Issues and Resource Exhaustion

Let’s consider the implications of scaling this approach. Assume that processing one payment takes 5 seconds, and the system handles 100 inbound payments every second. This results in about 500 active transactions at any given time (100 payments per second × 5 seconds each). Each of these transactions requires a database connection, which can quickly lead to resource exhaustion, increased latency, and degraded system performance, particularly under high load conditions.

Locks: Local and Distributed

Local locks are another common method for managing concurrency within a single application instance. They ensure that critical sections of code are executed by only one thread at a time, preventing race conditions and ensuring data consistency. Implementing local locks is relatively simple using constructs like synchronized blocks or ReentrantLock in Java, which manage access to shared resources effectively within a single system. However, local locks fall short in distributed environments where multiple instances of an application need to coordinate their actions. In such scenarios, a local lock on one instance does not prevent conflicting actions on other instances. This is where distributed locks come into play.
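Before turning to distributed locks, the local-lock idea can be made concrete with a minimal sketch. It uses Python's threading.Lock as a stand-in for the Java constructs named above; the Account class and the run_concurrent_withdrawals helper are hypothetical names introduced only for illustration and do not come from any real transfer system.

```python
import threading


class Account:
    """In-memory account whose balance is guarded by a local (per-process) lock."""

    def __init__(self, balance: int) -> None:
        self.balance = balance         # amount in minor units, e.g. cents
        self._lock = threading.Lock()  # protects self.balance

    def withdraw(self, amount: int) -> bool:
        """Check and debit as one critical section; refuse instead of overdrawing."""
        with self._lock:  # only one thread at a time runs the check-then-update
            if self.balance < amount:
                return False
            self.balance -= amount
            return True


def run_concurrent_withdrawals(account: Account, amount: int, attempts: int) -> int:
    """Start `attempts` withdrawals on separate threads; return how many succeeded."""
    results = []
    results_lock = threading.Lock()

    def worker() -> None:
        ok = account.withdraw(amount)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Because the balance check and the debit happen under the same lock, the account cannot be overdrawn no matter how the threads interleave. The guarantee holds only inside one process, which is exactly the limitation described above.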
Distributed locks ensure that only one instance of an application can access a particular resource at any given time, regardless of which node in the cluster is executing the code. Implementing distributed locks is inherently more complex, often requiring external systems like ZooKeeper, Consul, Hazelcast, or Redis to manage the lock state across multiple nodes. These systems need to be highly available and consistent to prevent the distributed lock mechanism from becoming a single point of failure or a bottleneck. The following diagram illustrates the typical flow of a distributed lock system: The Problem of Ordering In distributed systems, where multiple nodes may request locks simultaneously, ensuring fair processing and maintaining data consistency can be challenging. Achieving an ordered queue of lock requests across nodes involves several difficulties: Network latency: Varying latencies can make strict ordering difficult to maintain Fault Tolerance: The ordering mechanism must be fault-tolerant and not become a single point of failure, which adds complexity to the system. Waiting of Lock Consumers and Deadlocks When multiple nodes hold various resources and wait for each other to release locks, a deadlock can occur, halting system progress. To mitigate this, distributed locks often incorporate timeouts. Timeouts Lock acquisition timeouts: Nodes specify a maximum wait time for a lock. If the lock is not granted within this time, the request times out, preventing indefinite waiting. Lock holding timeouts: Nodes holding a lock have a maximum duration to hold it. If the time is exceeded, the lock is automatically released to prevent resources from being held indefinitely. Timeout handling: When a timeout occurs, the system must handle it gracefully, whether by retrying, aborting, or triggering compensatory actions. Considering these challenges, guaranteeing reliable payment processing in a system that relies on distributed locking is a complex endeavor. 
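To make the lock holding timeout more concrete, here is a minimal in-memory sketch of a lease-based lock table. It is a stand-in for an external lock service, not an implementation of one: the `LeaseLockTable` class, its method names, and the explicit `nowMillis` parameter are all assumptions made for illustration, and a real distributed lock also has to deal with clock skew and network failures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a lock service; time is passed in explicitly.
class LeaseLockTable {
    private record Lease(String owner, long expiresAtMillis) {}

    private final Map<String, Lease> leases = new HashMap<>();

    // Grants the lock if it is free, if the previous lease has expired, or if the caller already owns it.
    synchronized boolean tryAcquire(String resource, String owner, long nowMillis, long ttlMillis) {
        Lease current = leases.get(resource);
        boolean free = current == null || current.expiresAtMillis() <= nowMillis;
        if (free || current.owner().equals(owner)) {
            leases.put(resource, new Lease(owner, nowMillis + ttlMillis));
            return true;
        }
        return false;
    }

    // Releases the lock only if the caller is still the recorded owner.
    synchronized boolean release(String resource, String owner) {
        Lease current = leases.get(resource);
        if (current != null && current.owner().equals(owner)) {
            leases.remove(resource);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        LeaseLockTable table = new LeaseLockTable();
        System.out.println(table.tryAcquire("transfer-1", "node-a", 0, 1_000));
        System.out.println(table.tryAcquire("transfer-1", "node-b", 500, 1_000));
        System.out.println(table.tryAcquire("transfer-1", "node-b", 1_000, 1_000));
    }
}
```

An acquisition timeout can be layered on top by retrying `tryAcquire` until a deadline passes. Note that a node whose lease has expired may still be working under the belief that it holds the lock, which is one reason timeouts alone do not make distributed locking safe.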
Balancing the need for concurrency control with the realities of distributed systems requires careful planning and robust design. A Paradigm Shift: Simplifying Concurrency Let’s take a step back and review our transfer processing approach. By breaking the process into smaller steps, we can simplify each operation, making the entire system more manageable and reducing the risk of concurrency issues. When a payment is received, it triggers a series of checks, each requiring computations from different systems. Once all the results are in, the system decides on the next course of action. These steps resemble transitions in a finite state machine (FSM). Introducing a Message-Based Processing Model As shown in the diagram, payment processing involves a combination of commands and state transitions. For each command, the system identifies the initial state and the possible transition states. For example, if the system receives the [ReceivePayment] command, it checks if the transfer is in the created state. If not, it does nothing. For the [ApplyCheckResult] command, the system transitions the transfer to either checks_approved or checks_rejected based on the results of the checks. These checks are designed to be granular and quick to process, as each check operates independently and does not modify the transfer state directly. It only requires the input data to determine the result of the check. 
Here is how the code for such processing might look:

Java
import java.util.UUID;

// A single, independent check that derives its result from the input alone
interface Check<Input> {
    CheckResult run(Input input);
}

// Applies one command to the current state and returns the next state
interface Processor<State, Command> {
    State process(State initial, Command command);
}

// Delivers a command to the processor responsible for the given transfer
interface CommandSender<Command> {
    void send(UUID transferId, Command command);
}

Let’s see how these components interact to send, receive, and process checks:

Java
import java.util.List;
import java.util.UUID;

enum CheckStatus { NEW, ACCEPTED, REJECTED }

// In this listing, Check is the check record itself rather than the interface above
record Check(UUID transferId, CheckType type, CheckStatus status, Data data) {}

class CheckProcessor {
    void process(Check check) {
        // Run all required calculations
        // Send result to `TransferProcessor`
    }
}

enum TransferStatus { CREATED, PAYMENT_RECEIVED, CHECKS_SENT, CHECKS_PENDING, CHECKS_APPROVED, CHECKS_REJECTED }

record Transfer(UUID id, List<Check> checks) {}

sealed interface Command permits ReceivePayment, SendChecks, ApplyCheckResult {}

class TransferProcessor {
    State process(State state, Command command) {
        // (1) If status == CREATED and command is `ReceivePayment`
        // (2) Write payment details to the state
        // (3) Send command `SendChecks` to self
        // (4) Set status = PAYMENT_RECEIVED

        // (4) If status == PAYMENT_RECEIVED and command is `SendChecks`
        // (5) Calculate all required checks (without processing)
        // (6) Send checks for processing to other processors
        // (7) Set status = CHECKS_SENT

        // (10) If status == CHECKS_SENT or CHECKS_PENDING
        //      and command is `ApplyCheckResult`
        // (11) Update `transfer.checks()`
        // (12) Compute overall status
        // (13) If all checks are accepted - set status = CHECKS_APPROVED
        // (14) If any of the checks is rejected - set status = CHECKS_REJECTED
        // (15) Otherwise - set status = CHECKS_PENDING
    }
}

This approach reduces processing latency by offloading check result calculations to separate processes, leading to fewer concurrent operations. However, it does not entirely solve the problem of ensuring atomic processing for commands.
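The transitions described in the comments above can be reduced to a small, runnable function. The following is a simplified sketch rather than the article's implementation: it tracks only the status, not the full transfer state, and the shape of `ApplyCheckResult` (a list holding the current status of every check) is an assumption made so that the example is self-contained.

```java
import java.util.List;

// Pure transition function: (current status, command) -> next status.
class TransferFsm {
    static TransferStatus next(TransferStatus status, Command command) {
        if (command instanceof ReceivePayment && status == TransferStatus.CREATED) {
            return TransferStatus.PAYMENT_RECEIVED;
        }
        if (command instanceof SendChecks && status == TransferStatus.PAYMENT_RECEIVED) {
            return TransferStatus.CHECKS_SENT;
        }
        if (command instanceof ApplyCheckResult result
                && (status == TransferStatus.CHECKS_SENT || status == TransferStatus.CHECKS_PENDING)) {
            List<CheckStatus> checks = result.checks();
            if (checks.contains(CheckStatus.REJECTED)) {
                return TransferStatus.CHECKS_REJECTED;
            }
            if (!checks.isEmpty() && checks.stream().allMatch(c -> c == CheckStatus.ACCEPTED)) {
                return TransferStatus.CHECKS_APPROVED;
            }
            return TransferStatus.CHECKS_PENDING;
        }
        // Command does not apply to the current status: do nothing
        return status;
    }

    public static void main(String[] args) {
        TransferStatus status = TransferStatus.CREATED;
        status = next(status, new ReceivePayment());
        status = next(status, new SendChecks());
        status = next(status, new ApplyCheckResult(List.of(CheckStatus.ACCEPTED, CheckStatus.NEW)));
        System.out.println(status);
    }
}

enum TransferStatus { CREATED, PAYMENT_RECEIVED, CHECKS_SENT, CHECKS_PENDING, CHECKS_APPROVED, CHECKS_REJECTED }

enum CheckStatus { NEW, ACCEPTED, REJECTED }

sealed interface Command permits ReceivePayment, SendChecks, ApplyCheckResult {}

record ReceivePayment() implements Command {}

record SendChecks() implements Command {}

// Hypothetical payload: the status of every check after the latest result was applied
record ApplyCheckResult(List<CheckStatus> checks) implements Command {}
```

Because `next` has no side effects, replaying the same command against a state that has already moved on simply leaves the status unchanged, which is convenient when messages may be delivered more than once.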
Communication Through Messages In this model, communication between different parts of the system occurs through messages. This approach enables asynchronous communication, decoupling components and enhancing flexibility and scalability. Messages are managed through queues and message brokers, which ensure orderly transmission and reception of messages. The diagram below illustrates this process: One-at-a-Time Message Handling To ensure correct and consistent command processing, it is crucial to order and linearize all messages for a single transfer. This means messages should be processed in the order they were sent, and no two messages for the same transfer should be processed simultaneously. Sequential processing guarantees that each step in the transaction lifecycle occurs in the correct sequence, preventing race conditions, data corruption, or inconsistent states. Here’s how it works: Message queue: A dedicated queue is maintained for each transfer to ensure that messages are processed in the order they are received. Consumer: The consumer fetches messages from the queue, processes them, and acknowledges successful processing. Sequential processing: The consumer processes each message one by one, ensuring that no two messages for the same transfer are processed simultaneously. Durable Message Storage Ensuring message durability is crucial in financial transaction systems because it allows the system to replay a message if the processor fails to handle the command due to issues like external payment failures, storage failures, or network problems. Imagine a scenario where a payment processing command fails due to a temporary network outage or a database error. Without durable message storage, this command could be lost, leading to incomplete transactions or other inconsistencies. By storing messages durably, we ensure that every command and transaction step is persistently recorded. 
If a failure occurs, the system can recover and replay the message once the issue is resolved, ensuring the transaction completes successfully. Durable message storage is also invaluable for dealing with external payment systems. If an external system fails to confirm a payment, we can replay the message to retry the operation without losing critical data, maintaining the integrity and consistency of our transactions. Additionally, durable message storage is essential for auditing and compliance, providing a reliable log of all transactions and actions taken by the system, and making it easier to track and verify operations when needed. The following diagram illustrates how durable message storage works: By using durable message storage, the system becomes more reliable and resilient, ensuring that failures are handled gracefully without compromising data integrity or customer trust. Kafka as a Messaging Backbone Apache Kafka is a distributed streaming platform designed for high-throughput, low-latency message handling. It is widely used as a messaging backbone in complex systems due to its ability to handle real-time data feeds efficiently. Let's explore Kafka's core components, including producers, topics, partitions, and message routing, to understand how it operates within a distributed system. Topics and Partitions Topics In Kafka, a topic is a category or feed name to which records are stored and published. Topics are divided into partitions to facilitate parallel processing and scalability. Partitions Each topic can be divided into multiple partitions, which are the fundamental units of parallelism in Kafka. Partitions are ordered, immutable sequences of records continually appended to a structured commit log. Kafka stores data in these partitions across a distributed cluster of brokers. Each partition is replicated across multiple brokers to ensure fault tolerance and high availability. 
The replication factor determines the number of copies of the data, and Kafka automatically manages the replication process to ensure data consistency and reliability. Each record within a partition has a unique offset, serving as the identifier for the record's position within the partition. This offset allows consumers to keep track of their position and continue processing from where they left off in case of a failure.

Message Routing

Kafka's message routing is a key mechanism that determines how messages are distributed across the partitions of a topic. There are several methods for routing messages:

Round-robin: Messages without a key are spread evenly across all available partitions to ensure a balanced load and efficient use of resources. (Since Kafka 2.4, the default partitioner fills a batch for one partition before moving to the next, a "sticky" strategy, but the distribution still evens out over time.)
Key-based routing: Messages with the same key are routed to the same partition, which is useful for maintaining the order of related messages and ensuring they are processed sequentially. For example, all transactions for a specific account can be routed to the same partition using the account ID as the key.
Custom partitioners: Kafka allows custom partitioning logic to define how messages should be routed based on specific criteria. This is useful for complex routing requirements not covered by the default methods.

This routing mechanism optimizes performance, maintains message order when needed, and supports scalability and fault tolerance.

Producers

Kafka producers are responsible for publishing records to topics. They can specify acknowledgment settings to control when a message is considered successfully sent:

acks=0: No acknowledgment is needed, providing the lowest latency but no delivery guarantees.
acks=1: The leader broker acknowledges the message, ensuring it has been written to the leader's log.
acks=all: All in-sync replicas must acknowledge the message, providing the highest level of durability and fault tolerance.
These configurations allow Kafka producers to meet various application requirements for message delivery and persistence, ensuring that data is reliably stored and available for consumers. Consumers Kafka consumers read data from Kafka topics. A key concept in Kafka's consumer model is the consumer group. A consumer group consists of multiple consumers working together to read data from a topic. Each consumer in the group reads from different partitions of the topic, allowing for parallel processing and increased throughput. When a consumer fails or leaves the group, Kafka automatically reassigns the partitions to the remaining consumers, ensuring fault tolerance and high availability. This dynamic balancing of partition assignments ensures that the workload is evenly distributed among the consumers in the group, optimizing resource utilization and processing efficiency. Kafka's ability to manage high volumes of data, ensure fault tolerance, and maintain message order makes it an ideal choice for serving as a messaging backbone in distributed systems, particularly in environments requiring real-time data processing and robust concurrency management. Messaging System Using Kafka Incorporating Apache Kafka as the messaging backbone into our system allows us to address various challenges associated with message handling, durability, and scalability. Let's explore how Kafka aligns with our requirements and facilitates the implementation of an Actor model-based system. One-at-a-Time Message Handling To ensure that messages for a specific transfer are handled sequentially and without overlap, we can create a Kafka topic named transfer.commands with multiple partitions. Each message's key will be the transferId, ensuring that all commands related to a particular transfer are routed to the same partition. Since a partition can only be consumed by one consumer at a time, this setup guarantees one-at-a-time message handling for each transfer. 
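The reason keying by transferId yields per-transfer ordering can be shown with a few lines of code. This is a simplified illustration only: Kafka's default partitioner hashes the serialized key with murmur2, whereas the hypothetical `partitionFor` function below uses `Arrays.hashCode` as a stand-in, so the partition numbers it produces will not match what a real cluster assigns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative key-to-partition mapping; not Kafka's actual hash function.
class KeyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result in [0, numPartitions) even for negative hashes
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        String transferId = "3f2b8c1e-0000-4000-8000-000000000001"; // hypothetical transfer ID
        // Every command for this transfer maps to the same partition
        System.out.println(partitionFor(transferId, 64));
        System.out.println(partitionFor(transferId, 64));
    }
}
```

The mapping is deterministic for a fixed partition count. If the number of partitions changes, keys may map to different partitions, so commands for an in-flight transfer could be split across two partitions during the transition.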
Durable Message Store

Kafka's architecture is designed to ensure message durability by persisting messages across its distributed brokers. Here are some key Kafka configurations that enhance message durability and reliability:

retention.ms: Specifies how long Kafka retains a record before it is deleted. This is the topic-level setting; the broker-wide default is log.retention.ms. For example, setting log.retention.ms=604800000 retains messages for 7 days.
log.segment.bytes: Controls the size of each log segment; for instance, setting log.segment.bytes=1073741824 creates new segments after 1 GiB.
min.insync.replicas: Defines the minimum number of in-sync replicas that must acknowledge a write before it is considered successful when the producer uses acks=all; setting min.insync.replicas=2 ensures that at least two replicas confirm the write.
acks: A producer setting that specifies the number of acknowledgments required. Setting acks=all ensures that all in-sync replicas must acknowledge the message, providing high durability.

Example configurations for ensuring message durability:

Plain Text
# Example 1: Retention policy (broker settings)
# Retain messages for 7 days
log.retention.ms=604800000
# 1 GiB segment size
log.segment.bytes=1073741824

# Example 2: Replication and acknowledgment
# Broker or topic setting: at least 2 replicas must acknowledge a write
min.insync.replicas=2
# Producer setting: require acknowledgment from all in-sync replicas
acks=all

# Example 3: Producer configuration
# Ensures high durability
acks=all
# Number of retries in case of transient failures
retries=5

Revealing the Model: The Actor Pattern

In our system, the processor we previously discussed will now be referred to as an Actor. The Actor model is well-suited for managing state and handling commands asynchronously, making it a natural fit for our Kafka-based system.

Core Concepts of the Actor Model

Actors as fundamental units: Each Actor is responsible for receiving messages, processing them, and modifying its internal state. This aligns with our use of processors to handle commands for each transfer.
Asynchronous message passing: Communication between Actors occurs through Kafka topics, allowing for decoupled, asynchronous interactions. State isolation: Each Actor maintains its own state, which can only be modified by sending a command to the Actor. This ensures that state changes are controlled and sequential. Sequential message processing: Kafka guarantees that messages within a partition are processed in order, which supports the Actor model's need for sequential handling of commands. Location transparency: Actors can be distributed across different machines or locations, enhancing scalability and fault tolerance. Fault tolerance: Kafka’s built-in fault-tolerance mechanisms, combined with the Actor model’s distributed nature, ensure that the system can handle failures gracefully. Scalability: The system’s scalability is determined by the number of Kafka partitions. For instance, with 64 partitions, the system can handle 64 concurrent commands. Kafka's architecture allows us to scale by adding more partitions and consumers as needed. 
Implementing the Actor Model in the System We start by defining a simple interface for managing the state: Java interface StateStorage<K, S> { S newState(); S get(K key); void put(K key, S state); } Next, we define the Actor interface: Java interface Actor<S, C> { S receive(S state, C command); } To integrate Kafka, we need helper interfaces to read the key and value from Kafka records: Java interface KafkaMessageKeyReader<K> { K readKey(byte[] key); } interface KafkaMessageValueReader<V> { V readValue(byte[] value); } Finally, we implement the KafkaActorConsumer, which manages the interaction between Kafka and our Actor system: Java class KafkaActorConsumer<K, S, C> { private final Supplier<Actor<S, C>> actorFactory; private final StateStorage<K, S> storage; private final KafkaMessageKeyReader<K> keyReader; private final KafkaMessageValueReader<C> valueReader; public KafkaActorConsumer(Supplier<Actor<S, C>> actorFactory, StateStorage<K, S> storage, KafkaMessageKeyReader<K> keyReader, KafkaMessageValueReader<C> valueReader) { this.actorFactory = actorFactory; this.storage = storage; this.keyReader = keyReader; this.valueReader = valueReader; } public void consume(ConsumerRecord<byte[], byte[]> record) { // (1) Read the key and value from the record K messageKey = keyReader.readKey(record.key()); C messageValue = valueReader.readValue(record.value()); // (2) Get the current state from the storage S state = storage.get(messageKey); if (state == null) { state = storage.newState(); } // (3) Get the actor instance Actor<S, C> actor = actorFactory.get(); // (4) Process the message S newState = actor.receive(state, messageValue); // (5) Save the new state storage.put(messageKey, newState); } } This implementation handles the consumption of messages from Kafka, processes them using an Actor, and updates the state accordingly. Additional considerations like error handling, logging, and tracing can be added to enhance the robustness of this system. 
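To see the load, process, and save cycle in isolation, the same steps can be exercised without Kafka. The sketch below repeats the two interfaces so that it compiles on its own; `InMemoryStorage`, `BalanceActor`, and `ActorLoop` are hypothetical names invented for this example, and the integer "balance" state is deliberately trivial.

```java
import java.util.HashMap;
import java.util.Map;

// Mirrors steps (2) to (5) of KafkaActorConsumer.consume, minus Kafka deserialization.
class ActorLoop {
    static <K, S, C> S handle(StateStorage<K, S> storage, Actor<S, C> actor, K key, C command) {
        S state = storage.get(key);
        if (state == null) {
            state = storage.newState();
        }
        S newState = actor.receive(state, command);
        storage.put(key, newState);
        return newState;
    }

    public static void main(String[] args) {
        InMemoryStorage storage = new InMemoryStorage();
        BalanceActor actor = new BalanceActor();
        handle(storage, actor, "transfer-1", 100);
        System.out.println(handle(storage, actor, "transfer-1", 50));
    }
}

interface StateStorage<K, S> {
    S newState();
    S get(K key);
    void put(K key, S state);
}

interface Actor<S, C> {
    S receive(S state, C command);
}

// Hypothetical storage backed by a HashMap; a real system would use a database
class InMemoryStorage implements StateStorage<String, Integer> {
    private final Map<String, Integer> states = new HashMap<>();

    public Integer newState() { return 0; }
    public Integer get(String key) { return states.get(key); }
    public void put(String key, Integer state) { states.put(key, state); }
}

// Hypothetical actor: the state is a running total and each command is an amount to add
class BalanceActor implements Actor<Integer, Integer> {
    public Integer receive(Integer balance, Integer amount) {
        return balance + amount;
    }
}
```

Since the actor is a plain function from state and command to new state, it can be unit tested this way without a broker, and the Kafka-specific parts stay confined to the consumer.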
By combining Kafka’s powerful messaging capabilities with the Actor model’s structured approach to state management and concurrency, we can build a highly scalable, resilient, and efficient system for handling financial transactions. This setup ensures that each command is processed correctly, sequentially, and with full durability guarantees.

Advanced Topics

Outbox Pattern

The Outbox Pattern is a critical design pattern for ensuring reliable message delivery in distributed systems, particularly when integrating PostgreSQL with Kafka. The primary issue it addresses is the risk of inconsistencies where a transaction might be committed in PostgreSQL, but the corresponding message fails to be delivered to Kafka due to a network issue or system failure. This can lead to a situation where the database state and the message stream are out of sync.

The Outbox Pattern solves this problem by storing messages in a local outbox table within the same PostgreSQL transaction. This ensures that the message is only sent to Kafka after the transaction is successfully committed. By doing so, it provides at-least-once delivery: no committed message is lost, and the database and the message stream stay consistent. A message may occasionally be published more than once (for example, if the relay restarts after publishing but before recording its progress), so consumers should handle duplicates idempotently.

Implementing the Outbox Pattern

With the Outbox Pattern in place, the KafkaActorConsumer and Actor implementations can be adjusted to accommodate this pattern:

Java
record OutboxMessage(UUID id, String topic, byte[] key, Map<String, byte[]> headers, byte[] payload) {}

record ActorReceiveResult<S, M>(S newState, List<M> messages) {}

interface Actor<S, C> {
    ActorReceiveResult<S, OutboxMessage> receive(S state, C command);
}

class KafkaActorConsumer<K, S, C> {
    public void consume(ConsumerRecord<byte[], byte[]> record) {
        // ... other steps

        // (5) Process the message
        var result = actor.receive(state, messageValue);

        // (6) Save the new state and the outbox messages in one transaction
        persist(messageKey, result.newState(), result.messages());
    }

    // Note: with proxy-based frameworks such as Spring, @Transactional does not apply
    // to self-invocation, so this method may need to live in a separate bean.
    @Transactional
    public void persist(K stateKey, S state, List<OutboxMessage> messages) {
        // (7) Persist the new state
        storage.put(stateKey, state);

        // (8) Persist the outbox messages
        for (OutboxMessage message : messages) {
            outboxTable.save(message);
        }
    }
}

In this implementation:

The Actor now returns an ActorReceiveResult containing the new state and a list of outbox messages that need to be sent to Kafka.
The KafkaActorConsumer processes these messages and persists both the state and the messages in the outbox table within the same transaction.
After the transaction is committed, an external process (e.g., Debezium) reads from the outbox table and sends the messages to Kafka, ensuring at-least-once delivery.

Toxic Messages and Dead-Letters

In distributed systems, some messages might be malformed or cause errors that prevent successful processing. These problematic messages are often referred to as "toxic messages." To handle such scenarios, we can implement a dead-letter queue (DLQ). A DLQ is a special queue where unprocessable messages are sent for further investigation. This approach ensures that these messages do not block the processing of other messages and allows for the root cause to be addressed without losing data.
Here's a basic implementation for handling toxic messages:

Java
import org.apache.kafka.clients.consumer.ConsumerRecord;

class ToxicMessage extends Exception {}

class LogicException extends ToxicMessage {}

class SerializationException extends ToxicMessage {}

interface ExceptionDecider {
    boolean isToxic(Throwable e);
}

class DefaultExceptionDecider implements ExceptionDecider {
    @Override
    public boolean isToxic(Throwable e) {
        return e instanceof ToxicMessage;
    }
}

interface DeadLetterProducer {
    void send(ConsumerRecord<?, ?> record, Throwable e);
}

class Consumer {
    private final ExceptionDecider exceptionDecider;
    private final DeadLetterProducer deadLetterProducer;

    Consumer(ExceptionDecider exceptionDecider, DeadLetterProducer deadLetterProducer) {
        this.exceptionDecider = exceptionDecider;
        this.deadLetterProducer = deadLetterProducer;
    }

    void consume(ConsumerRecord<String, String> record) throws Exception {
        try {
            // process record
        } catch (Exception e) {
            if (exceptionDecider.isToxic(e)) {
                // Park the message in the DLQ so it does not block other messages
                deadLetterProducer.send(record, e);
            } else {
                // Rethrow the exception to retry the operation
                throw e;
            }
        }
    }
}

In this implementation:

ToxicMessage: A base exception class for any errors deemed "toxic," meaning they should not be retried but rather sent to the DLQ
DefaultExceptionDecider: The default ExceptionDecider, which decides whether an exception is toxic and should trigger sending the message to the DLQ
DeadLetterProducer: Responsible for sending messages to the DLQ
Consumer: Processes messages and uses the ExceptionDecider and DeadLetterProducer to handle errors appropriately

Conclusion

By leveraging Kafka as the messaging backbone and implementing the Actor model, we can build a robust, scalable, and fault-tolerant financial transaction system. The Actor model offers a straightforward approach to managing state and concurrency, while Kafka provides the tools necessary for reliable message handling, durability, and partitioning. The Actor model is not a specialized or complex framework but rather a set of simple abstractions that can significantly increase the scalability and reliability of our system. Kafka’s built-in features, such as message durability, ordering, and fault tolerance, naturally align with the principles of the Actor model, enabling us to implement these concepts efficiently and effectively without requiring additional frameworks.
Incorporating advanced patterns like the Outbox Pattern and handling toxic messages with DLQs further enhances the system's reliability, ensuring that messages are processed consistently and that errors are managed gracefully. This comprehensive approach ensures that our financial transaction system remains reliable, scalable, and capable of handling complex workflows seamlessly.
Moving data from one place to another is conceptually simple. You simply read from one datasource and write to another. However, doing that consistently and safely is another story. There are a variety of mistakes you can make if you overlook important details. We recently discussed the top reasons so many organizations are currently seeking DynamoDB alternatives. Beyond costs (the most frequently mentioned factor), aspects such as throttling, hard limits, and vendor lock-in are frequently cited as motivation for a switch. But what does a migration from DynamoDB to another database look like? Should you dual-write? Are there any available tools to assist you with that? What are the typical do’s and don’ts? In other words, how do you move out from DynamoDB? In this post, let’s start with an overview of how database migrations work, cover specific and important characteristics related to DynamoDB migrations, and then discuss some of the strategies employed to integrate with and migrate data seamlessly to other databases. How Database Migrations Work Most database migrations follow a strict set of steps to get the job done. First, you start capturing all changes made to the source database. This guarantees that any data modifications (or deltas) can be replayed later. Second, you simply copy data over. You read from the source database and write to the destination one. A variation is to export a source database backup and simply side-load it into the destination database. Past the initial data load, the target database will contain most of the records from the source database, except the ones that have changed during the period of time it took for you to complete the previous step. Naturally, the next step is to simply replay all deltas generated by your source database to the destination one. Once that completes, both databases will be fully in sync, and that’s when you may switch your application over. To Dual-Write or Not? 
If you are familiar with Cassandra migrations, then you have probably been introduced to the recommendation of simply “dual-writing” to get the job done. That is, you would proxy every writer mutation from your source database to also apply the same records to your target database. Unfortunately, not every database lets a writer retrieve or manipulate the timestamp of a record the way the CQL protocol does. This prevents you from implementing dual-writes in the application while back-filling the target database with historical data. If you attempt to do that, you will likely end up with an inconsistent migration, where some target Items may not reflect their latest state in your source database.

Wait… Does it mean that dual-writing in a migration from DynamoDB is just wrong? Of course not! Consider that your DynamoDB table expires records (TTL) every 24 hours. In that case, it doesn’t make sense to back-fill your database: simply dual-write and, past the TTL period, switch your readers over. If your TTL is longer (say a year), then waiting for it to expire won’t be the most efficient way to move your data over.

Back-Filling Historical Data

Whether or not you need to back-fill historical data primarily depends on your use case. Yet, it is typically a mandatory step in most migrations. There are 3 main ways for you to back-fill historical data from DynamoDB:

ETL

ETL (extract-transform-load) is essentially what a tool like Apache Spark does. It starts with a Table Scan and reads a single page worth of results. The results are then used to infer your source table’s schema. Next, it spawns readers to consume from your DynamoDB table as well as writer workers to ingest the retrieved data into the destination database. This approach is great for carrying out simple migrations and also lets you transform (the T in the ETL part) your data as you go. However, it is unfortunately prone to some problems.
For example:

Schema inference: DynamoDB tables are schemaless, so it’s difficult to infer a schema. All table attributes (other than your hash and sort keys) might not be present on the first page of the initial scan. Plus, a given Item might not project all the attributes present within another Item.
Cost: Since extracting data requires a full scan of the DynamoDB table, it will inevitably consume RCUs. This will ultimately drive up migration costs, and it can also introduce an upstream impact to your application if DynamoDB runs out of capacity.
Time: The time it takes to migrate the data is proportional to your data set size. This means that if your migration takes longer than 24 hours, you may be unable to directly replay from DynamoDB Streams afterwards, given that this is the period of time that AWS guarantees the availability of its events.

Table Scan

A table scan, as the name implies, involves retrieving all records from your source DynamoDB table and only then loading them into your destination database. Unlike the previous ETL approach, where the “Extract” and “Load” pieces are coupled and data gets written as you go, here each step is carried out in a phased way. The good news is that this method is extremely simple to wrap your head around. You run a single command. Once it completes, you’ve got all your data! For example:

$ aws dynamodb scan --table-name source > output.json

You’ll then end up with a single JSON file containing all existing Items within your source table, which you may then simply iterate through and write to your destination. Unless you are planning to transform your data, you shouldn’t need to worry about the schema (since you already know beforehand that all Key Attributes are present). This method works very well for small to medium-sized tables, but, as with the previous ETL method, it may take considerable time to scan larger tables. And that’s not accounting for the time it will take you to parse it and later load it to the destination.
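When iterating through the scan output, keep in mind that each attribute value is wrapped in a DynamoDB type descriptor, such as {"S": "..."} for strings or {"N": "..."} for numbers (which are encoded as strings). The sketch below shows one possible way to unwrap an Item into plain values before writing it to the destination. It assumes the JSON has already been parsed into nested maps and lists by a JSON library of your choice; the `DynamoJson` class and its method names are hypothetical, and binary and set types are passed through untouched.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Unwraps DynamoDB-typed attribute values (already parsed from JSON) into plain Java values.
class DynamoJson {
    @SuppressWarnings("unchecked")
    static Object unwrap(Map<String, Object> attribute) {
        // Each attribute value is a single-entry map: type descriptor -> value
        Map.Entry<String, Object> entry = attribute.entrySet().iterator().next();
        Object value = entry.getValue();
        switch (entry.getKey()) {
            case "N":
                return new BigDecimal((String) value); // numbers arrive as strings
            case "NULL":
                return null;
            case "L": {
                List<Object> list = new ArrayList<>();
                for (Object element : (List<?>) value) {
                    list.add(unwrap((Map<String, Object>) element));
                }
                return list;
            }
            case "M":
                return unwrapItem((Map<String, Map<String, Object>>) value);
            default:
                // S and BOOL need no conversion; B, SS, NS, BS are left as-is in this sketch
                return value;
        }
    }

    static Map<String, Object> unwrapItem(Map<String, Map<String, Object>> item) {
        Map<String, Object> plain = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> entry : item.entrySet()) {
            plain.put(entry.getKey(), unwrap(entry.getValue()));
        }
        return plain;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> item = new LinkedHashMap<>();
        item.put("id", Map.of("S", "ABCD"));
        item.put("amount", Map.of("N", "12.50"));
        System.out.println(unwrapItem(item));
    }
}
```

Using `BigDecimal` avoids losing precision on large or fractional numbers; whether that is the right target type depends on the schema of the destination database.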
S3 Data Export If you have a large dataset or are concerned with RCU consumption and the impact on live traffic, you might rely on exporting DynamoDB data to Amazon S3. This allows you to easily dump your tables’ entire contents without impacting your DynamoDB table performance. In addition, you can request incremental exports later, in case the back-filling process takes longer than 24 hours. To request a full DynamoDB export to S3, simply run: $ aws dynamodb export-table-to-point-in-time --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME --s3-bucket BUCKET_NAME --s3-prefix PREFIX_NAME --export-format DYNAMODB_JSON The export will then run in the background (assuming the specified S3 bucket exists). To check for its completion, run: Plain Text $ aws dynamodb list-exports --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/source { "ExportSummaries": [ { "ExportArn": "arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME/export/01706834224965-34599c2a", "ExportStatus": "COMPLETED", "ExportType": "FULL_EXPORT" } ] } Once the process is complete, your source table’s data will be available within the S3 bucket/prefix specified earlier. 
Inside it, you will find a directory named AWSDynamoDB, under a structure that resembles something like this: Plain Text $ tree AWSDynamoDB/ AWSDynamoDB/ └── 01706834981181-a5d17203 ├── _started ├── data │ ├── 325ukhrlsi7a3lva2hsjsl2bky.json.gz │ ├── 4i4ri4vq2u2vzcwnvdks4ze6ti.json.gz │ ├── aeqr5obfpay27eyb2fnwjayjr4.json.gz │ ├── d7bjx4nl4mywjdldiiqanmh3va.json.gz │ ├── dlxgixwzwi6qdmogrxvztxzfiy.json.gz │ ├── fuukigkeyi6argd27j25mieigm.json.gz │ ├── ja6tteiw3qy7vew4xa2mi6goqa.json.gz │ ├── jirrxupyje47nldxw7da52gnva.json.gz │ ├── jpsxsqb5tyynlehyo6bvqvpfki.json.gz │ ├── mvc3siwzxa7b3jmkxzrif6ohwu.json.gz │ ├── mzpb4kukfa5xfjvl2lselzf4e4.json.gz │ ├── qs4ria6s5m5x3mhv7xraecfydy.json.gz │ ├── u4uno3q3ly3mpmszbnwtzbpaqu.json.gz │ ├── uv5hh5bl4465lbqii2rvygwnq4.json.gz │ ├── vocd5hpbvmzmhhxz446dqsgvja.json.gz │ └── ysowqicdbyzr5mzys7myma3eu4.json.gz ├── manifest-files.json ├── manifest-files.md5 ├── manifest-summary.json └── manifest-summary.md5 2 directories, 21 files So how do you restore from these files? Well… you need to use the DynamoDB Low-level API. Thankfully, you don’t need to dig through its details since AWS provides the LoadS3toDynamoDB sample code as a way to get started. Simply override the DynamoDB connection with the writer logic of your target database, and off you go! Streaming DynamoDB Changes Whether or not you require back-filling data, chances are you want to capture events from DynamoDB to ensure both will get in sync with each other. DynamoDB Streams can be used to capture changes performed in your source DynamoDB table. But how do you consume from its events? DynamoDB Streams Kinesis Adapter AWS provides the DynamoDB Streams Kinesis Adapter to allow you to process events from DynamoDB Streams via the Amazon Kinesis Client Library (such as the kinesis-asl module in Apache Spark). Beyond the historical data migration, simply stream events from DynamoDB to your target database. After that, both datastores should be in sync. 
Although this approach may introduce a steep learning curve, it is by far the most flexible one. It even lets you consume events from outside the AWS ecosystem (which may be particularly important if you're switching to a different provider). For more details on this approach, AWS provides a walkthrough on how to consume events from a source DynamoDB table to a destination one.

AWS Lambda

Lambda functions are simple to get started with, handle all checkpointing logic on their own, and seamlessly integrate with the AWS ecosystem. With this approach, you simply encapsulate your application logic inside a Lambda function. That lets you write events to your destination database without having to deal with the Kinesis API logic, such as checkpointing or the number of shards in a stream.

When taking this route, you can load the captured events directly into your target database. Or, if the 24-hour retention limit is a concern, you can simply stream and retain these records in another service, such as Amazon SQS, and replay them later. The latter approach is well beyond the scope of this article. For examples of how to get started with Lambda functions, see the AWS documentation.

Final Remarks

Migrating from one database to another requires careful planning and a thorough understanding of all steps involved during the process. Further complicating the matter, there's a variety of different ways to accomplish a migration, and each variation brings its own set of trade-offs and benefits.

This article provided an in-depth look at how a migration from DynamoDB works, and how it differs from other databases. We also discussed different ways to back-fill historical data and stream changes to another database. Finally, we ran through an end-to-end migration, leveraging AWS tools you probably already know. At this point, you should have all the tools and tactics required to carry out a migration on your own.
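To make the Lambda route described above a little more concrete, here is a minimal Python sketch of a handler that applies DynamoDB Streams records to a target database. It is an illustration under assumptions, not AWS sample code: `make_handler`, `upsert`, and `delete` are hypothetical names standing in for your own writer logic, and the sketch assumes the stream is configured to include new images (otherwise `NewImage` is absent).

```python
from decimal import Decimal
from typing import Any, Callable, Dict


def from_dynamodb_json(value: Dict[str, Any]) -> Any:
    """Convert one DynamoDB-typed attribute (e.g. {"S": "a"}) to a plain value."""
    type_tag, raw = next(iter(value.items()))
    if type_tag in ("S", "B", "BOOL"):
        return raw
    if type_tag == "N":
        return Decimal(raw)  # DynamoDB transmits numbers as strings
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb_json(v) for v in raw]
    if type_tag == "M":
        return {k: from_dynamodb_json(v) for k, v in raw.items()}
    if type_tag in ("SS", "BS"):
        return set(raw)
    if type_tag == "NS":
        return {Decimal(n) for n in raw}
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def make_handler(
    upsert: Callable[[Dict[str, Any], Dict[str, Any]], None],
    delete: Callable[[Dict[str, Any]], None],
) -> Callable[..., Dict[str, int]]:
    """Build a Lambda-style handler around hypothetical target-database writers."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, int]:
        applied = 0
        for record in event.get("Records", []):
            change = record["dynamodb"]
            key = {k: from_dynamodb_json(v) for k, v in change["Keys"].items()}
            if record["eventName"] == "REMOVE":
                delete(key)
            else:
                # INSERT or MODIFY; assumes the stream view includes new images.
                item = {
                    k: from_dynamodb_json(v)
                    for k, v in change["NewImage"].items()
                }
                upsert(key, item)
            applied += 1
        return {"applied": applied}

    return handler
```

Because a failed batch may be delivered again, the writers passed to `make_handler` should be idempotent: applying the same insert, update, or delete twice should leave the target in the same state.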
In today's world, data is a key success factor for many information systems. To exploit data, it needs to be moved and collected from many different locations, using many different technologies and tools. It is important to understand the difference between a data pipeline and an ETL pipeline. While both are designed to move data from one place to another, they serve different purposes and are optimized for different tasks. The comparison table below highlights the key differences:

Comparison Table

| Feature | Data Pipeline | ETL Pipeline |
| --- | --- | --- |
| Processing Mode | Real-time or near-real-time processing | Batch processing at scheduled intervals |
| Flexibility | Highly flexible with various data formats | Less flexible, designed for specific data sources |
| Complexity | Complex to set up and maintain because of real-time processing and continuous monitoring | Complex during transformation but easier in batch mode |
| Scalability | Easily scalable for streaming data | Scalable but resource-intensive for large batch tasks |
| Use Cases | Real-time analytics, event-driven applications | Data warehousing, historical data analysis |

What Is a Data Pipeline?

A data pipeline is a systematic process for transferring data from one system to another, often in real time or near real time. It enables the continuous flow and processing of data between systems. The process involves collecting data from multiple sources, processing it as it moves through the pipeline, and delivering it to target systems.

Data pipelines are designed to handle the seamless integration and flow of data across different platforms and applications. They play a crucial role in modern data architectures by enabling real-time analytics, data synchronization, and event-driven processing. By automating the data movement and transformation processes, data pipelines help organizations maintain data consistency and reliability, reduce latency, and ensure that data is always available for critical business operations and decision-making.
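The collect, process, and deliver flow described above can be sketched with Python generators, which handle one event at a time instead of waiting for a full batch. This is only an illustrative sketch: the function names, the `value` field, and the in-memory sources are assumptions, and a production pipeline would read from a message broker or change stream rather than from lists.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]


def collect(*sources: Iterable[Event]) -> Iterator[Event]:
    """Yield events from each source in turn (a real pipeline would interleave them)."""
    for source in sources:
        yield from source


def process(events: Iterable[Event]) -> Iterator[Event]:
    """Clean and structure each event while it is in flight."""
    for event in events:
        if event.get("value") is None:
            continue  # drop incomplete readings
        yield {**event, "value": float(event["value"])}


def deliver(events: Iterable[Event], sink: Callable[[Event], None]) -> int:
    """Hand each processed event to the target system and count deliveries."""
    delivered = 0
    for event in events:
        sink(event)
        delivered += 1
    return delivered
```

For example, `deliver(process(collect(sensor_events, api_events)), target.append)` pushes each event through all three stages as soon as it is read, which is the property that distinguishes this style from scheduled batch jobs.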
Data pipelines manage data from a variety of sources, including:

- Databases
- APIs
- Files
- IoT devices

Processing

Data pipelines can process data in real time or near real time. This involves cleaning, enriching, and structuring the data as it flows through the pipeline. For example, streaming data from IoT devices may require real-time aggregation and filtering before it is ready for analysis or storage.

Delivery

The final stage of a data pipeline is to deliver the processed data to its target systems, such as databases, data lakes, or real-time analytics platforms. This step ensures that the data is immediately accessible to multiple applications and provides instant insights that enable rapid decision making.

Use Cases

Data pipelines are essential for scenarios requiring real-time or continuous data processing. Common use cases include:

- Real-time analytics: Data pipelines enable real-time data analysis for immediate insights and decision making.
- Data synchronization: Ensures data consistency across different systems in real time.
- Event-driven applications: Facilitate the processing of events in real time, such as user interactions or system logs.
- Stream processing: Handles continuous data streams from sources like IoT devices, social media feeds, or transaction logs.

Data pipelines are often used with architectural patterns such as CDC (Change Data Capture) (1), Outbox pattern (2), or CQRS (Command Query Responsibility Segregation) (3).

Pros and Cons

Data pipelines offer several benefits that make them suitable for various real-time data processing scenarios, but they also come with their own set of challenges.

Pros

The most prominent data pipeline advantages include:

- Real-time processing: Provides immediate data availability and insights.
- Scalability: Easily scales to handle large volumes of streaming data.
- Flexibility: Adapts to various data sources and formats in real time.
- Low latency: Minimizes delays in data processing and availability.
Cons

The most common challenges related to data pipelines include:

- Complex setup: Requires intricate setup and maintenance.
- Resource intensive: Continuous processing can demand significant computational resources.
- Potential for data inconsistency: Real-time processing can introduce challenges in ensuring data consistency.
- Monitoring needs: Requires robust monitoring and error handling to maintain reliability.

What Is an ETL Pipeline?

ETL, which stands for "Extract, Transform, and Load", is a process used to extract data from different sources, transform it into a suitable format, and load it into a target system (4).

Extract

An ETL program can collect data from a variety of sources, including databases, APIs, files, and more. The extraction phase is separated from the other phases to make the transformation and loading phases agnostic to changes in the data sources, so only the extraction phase needs to be adapted.

Transform

Once the data extraction phase is complete, the transformation phase begins. In this step, the data is reworked to ensure that it is structured appropriately for its intended use. Because data can come from many different sources and formats, it often needs to be cleaned, enriched, or normalized in order to be useful. For example, data intended for visualization may require a different structure than data collected from web forms (5). The transformation process ensures that the data is suitable for its next stage, whether that is analysis, reporting, or other applications.

Load

The final phase of the ETL process is loading the transformed data into the target system, such as a database or data warehouse. During this phase, the data is written to the target in a form optimized for query performance and retrieval. This ensures that the data is accessible and ready for applications such as business intelligence, analytics, and reporting. The efficiency of the loading process is critical because it affects the availability of data to end users.
Techniques such as indexing and partitioning can be used to improve performance and manageability in the target system.

Use Cases

ETL processes are essential in various scenarios where data needs to be consolidated and transformed for meaningful analysis. Common use cases include:

- Data warehousing: ETL aggregates data from multiple sources into a central repository, enabling comprehensive reporting and analysis.
- Business intelligence: ETL processes extract and transform transactional data to provide actionable insights and support informed decision making.
- Data migration projects: ETL facilitates the seamless transition of data from legacy systems to modern platforms, ensuring consistency and maintaining data quality.
- Reporting and compliance: ETL processes transform and load data into secure, auditable storage systems, simplifying the generation of accurate reports and maintaining data integrity for compliance and auditing purposes.

Pros and Cons

Evaluating the strengths and limitations of ETL pipelines helps in determining their effectiveness for various data integration and transformation tasks.

Pros

The most prominent ETL pipeline advantages include:

- Efficient data integration: Streamlines data from diverse sources.
- Robust transformations: Handles complex data cleaning and structuring.
- Batch processing: Ideal for large data volumes during off-peak hours.
- Improved data quality: Enhances data usability through thorough transformations.

Cons

The most common challenges related to ETL pipelines include:

- High latency: Delays in data availability due to batch processing.
- Resource intensive: Requires significant computational resources and storage.
- Complex development: Difficult to maintain with diverse, changing data sources.
- No real-time processing: Limited suitability for immediate data insights.

Data Pipeline vs. ETL Pipeline: Key Differences

Understanding the key differences between data pipelines and ETL pipelines is essential for choosing the right solution for your data processing needs. Here are the main distinctions:

Processing Mode

Data pipelines operate in real time or near real time, continuously processing data as it arrives, which is ideal for applications that require immediate data insights. In contrast, ETL pipelines process data in batches at scheduled intervals, resulting in delays between data extraction and availability.

Flexibility

Data pipelines are highly flexible, handling multiple data formats and sources while adapting to changing data streams in real time. ETL pipelines, on the other hand, are less flexible, designed for specific data sources and formats, and require significant adjustments when changes occur.

Complexity

Data pipelines are complex to set up and maintain due to the need for real-time processing and continuous monitoring. ETL pipelines are also complex, especially during data transformation, but their batch nature makes them somewhat easier to manage.

Scalability

Data pipelines scale easily to handle large volumes of streaming data and adapt to changing loads in real time. ETL pipelines can scale for large batch tasks, but they often require significant resources and infrastructure, making them more resource intensive.

Common Examples of ETL Pipelines and Data Pipelines

To better understand the practical applications of ETL pipelines and data pipelines, let's explore some common examples that highlight their use in real-world scenarios.

Example of ETL Pipeline

An example of an ETL pipeline is a data warehouse for sales data. In this scenario, the input sources include multiple databases that store sales transactions, CRM systems, and flat files containing historical sales data.
The ETL process involves extracting data from all sources, transforming it to ensure consistency and accuracy, and loading it into a centralized data warehouse. The target system, in this case, is a data warehouse optimized for business intelligence and reporting.

Figure 1: Building a data warehouse around sales data

Example of Data Pipeline

A common example of a data pipeline is real-time sensor data processing, where sensors collect data that often needs to be aggregated with standard database data. In this case, the input sources include sensors that produce continuous data streams and an input database. The data pipeline consists of a listener that collects data from sensors and the database, processes it in real time, and forwards it to the target database. The target system is a real-time analytics platform that monitors sensor data and triggers alerts.

Figure 2: Real-time sensor data processing

How to Determine Which Is Best for Your Organization

Whether an ETL pipeline or a data pipeline is best for your organization depends on several factors. The characteristics of the data are critical to this decision. Data pipelines are ideal for real-time, continuous data streams that require immediate processing and insight. ETL pipelines, on the other hand, are suitable for structured data that can be processed in batches where latency is acceptable.

Business requirements also play an important role. Data pipelines are ideal for use cases that require real-time data analysis, such as monitoring, fraud detection, or dynamic reporting. In contrast, ETL pipelines are best suited to scenarios that require extensive data consolidation and historical analysis, like data warehousing and business intelligence.

Scalability requirements must also be considered. Data pipelines offer high scalability for real-time data processing and can efficiently handle fluctuating data volumes.
ETL pipelines are scalable for large batch processing tasks but may ultimately require more infrastructure and resources.

Bottom Line: Data Pipeline vs. ETL Pipeline

The choice between a data pipeline and an ETL pipeline depends on your specific data needs and business objectives. Data pipelines excel in scenarios that require real-time data processing and immediate insights, making them ideal for dynamic, fast-paced environments. ETL pipelines, by contrast, are designed for batch processing, making them ideal for structured data integration, historical analysis, and comprehensive reporting. Understanding these differences will help you choose the right approach to optimize your data strategy and meet your business objectives.

To learn more about ETL and data pipelines, check out these additional courses:

- ETL and Data Pipelines with Shell, Airflow, and Kafka
- The Path to Insights: Data Models and Pipelines
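As a small illustration of the extract, transform, and load phases in the sales data warehouse example above, the following Python sketch runs one batch from a flat file into a SQLite table. Everything specific here is an assumption made for the example: the `region` and `amount` columns, the `sales` table, and the use of an in-memory SQLite database as a stand-in for a real warehouse.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def extract(csv_text: str) -> List[Dict[str, str]]:
    """Extract: read raw rows from a flat file (passed in as text here)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: List[Dict[str, str]]) -> List[Row]:
    """Transform: drop rows without an amount and normalize region names."""
    cleaned: List[Row] = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append(
            {"region": row["region"].strip().upper(), "amount": float(amount)}
        )
    return cleaned


def load(rows: List[Row], conn: sqlite3.Connection) -> int:
    """Load: write the batch into the target table and index it for queries."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)", rows
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
    conn.commit()
    return len(rows)
```

A scheduled job would call `load(transform(extract(text)), conn)` once per batch. Keeping the three phases as separate functions mirrors the point made earlier: when a source changes, only `extract` needs to be adapted.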
Effective de-identification algorithms that balance data usability and privacy are critical. Industries like healthcare, finance, and advertising rely on accurate and secure data analysis. However, existing de-identification methods often compromise either data usability or privacy protection, which limits advanced applications like knowledge engineering and AI modeling.

To address these challenges, we introduce High Fidelity (HiFi) data, a novel approach to meet the dual objectives of data usability and privacy protection. High-fidelity data maintains the original data's usability while ensuring compliance with stringent privacy regulations.

First, the existing de-identification approaches and their strengths and weaknesses are examined. Then four fundamental features of HiFi data are specified and rationalized: visual integrity, population integrity, statistical integrity, and ownership integrity. Lastly, the balancing of data usage and privacy protection is discussed with examples.

Current Status of De-Identification

De-identification is the process of reducing the informative content in data to decrease the probability of discovering an individual's identity. The growing use of personal information for extended purposes may introduce more risk of privacy leakage. Various metrics and algorithms have been developed to de-identify data. HHS published a detailed guide, "Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule," which describes the Safe Harbor and Expert Determination methods for de-identifying patient health records. Common de-identification approaches are as follows:

Redaction and Suppression

This approach involves removing certain data elements from database records. A common difficulty with this approach is defining what "done properly" means. Removing elements can significantly reduce the usefulness of the data and may discard information that is critical for analysis.
Blurring

Blurring reduces data precision by combining several data elements. The three main approaches are:

- Aggregation: Combining individual data points into larger groups (e.g., summarizing data by region instead of individual address)
- Generalization: Replacing specific data with broader categories (e.g., replacing age with age range)
- Pixelation: Lowering the resolution of data (e.g., less precise geographic coordinates)

Blurring methods are typically used in reports or statistical summaries, where they provide a level of anonymity without fully protecting individual records; they are not a general-purpose de-identification technique.

Masking

Masking involves replacing data elements with a random or made-up value, or with another value in the dataset. In many cases it decreases the accuracy of computations, affecting validity and usability. The main variants in this category include:

- Pseudonymization: Assigning pseudonyms to data elements to mask their original values while maintaining consistency across the dataset
- Perturbation randomization: Adding random noise to data elements to mask their true values without completely distorting the overall dataset
- Swapping/Shuffling: Exchanging values between records to mask identities while preserving the dataset's statistical properties
- Noise differential privacy: Injecting statistical noise into the data to protect privacy while allowing for meaningful aggregate analysis

High Fidelity Data: What and Why

There are several key needs for HiFi Data, including but not limited to:

- Privacy and regulatory compliance: Ensuring data privacy and adhering to associated regulations
- Safe data utilization: Discovering business insight without risking privacy
- AI modeling: Training AI models with real-world data so that the models and their agents behave better and more accurately
- Rapid data access for production issues: Accessing production-quality data during incidents or unexpected network traffic without compromising privacy

Given these complex and multifaceted requirements, a breakthrough solution is necessary that ensures:

- Privacy protection: Private and sensitive data is encoded to prevent privacy leaks.
- Data integrity: The transformed data retains the same structure, size, and logical consistency as the original data.
- Usage for analysis and AI: For analysis, projections, and AI modeling, the transformation should preserve statistical characteristics and population properties, ideally in a lossless fashion.
- Quick access: The transformation should be fast and available on demand so that transformed data is accessible when production issues occur.

High Fidelity Data Specification

High Fidelity Data refers to data that remains faithful to the original features after transformation and/or encoding, including:

- Visual integrity: The transformed data retains its original format, making it "look and feel" the same as the original (e.g., dates still appear as dates, phone numbers as phone numbers).
- Population integrity: The transformed data preserves the population characteristics of the original dataset, ensuring that the distribution and relationships within the data remain intact.
- Statistical integrity: The statistical properties are maintained, ensuring that analyses performed on the encoded data yield results similar to those on the original data.
- Ownership integrity: The data retains information about its origin, ensuring that the ownership and provenance of the data are preserved to avoid unnecessary extended use.

High Fidelity Data maintains privacy, usability, and all four integrities, making it suitable for data analysis, AI modeling, and reliable deployment through testing with production-quality data.
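To give a feel for what visual integrity asks of a transformation, here is a minimal Python sketch that masks the digits of an identifier while keeping its length and separators. It is not the HiFi algorithm, whose details are not given here: `mask_digits` and the key are hypothetical, the keyed-hash substitution is an assumption chosen for brevity, and unlike true format-preserving encryption it is one-way and two inputs can collide.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def mask_digits(value: str, key: bytes) -> str:
    """Replace every ASCII digit with a keyed pseudo-random digit.

    Separators and letters are left in place, so the output has the same
    length and layout as the input. The same value and key always give the
    same output, which keeps the masking consistent across records.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    used = 0
    for ch in value:
        if ch in DIGITS:
            # Slightly biased toward low digits; acceptable for a sketch only.
            out.append(DIGITS[digest[used % len(digest)] % 10])
            used += 1
        else:
            out.append(ch)
    return "".join(out)
```

Calling `mask_digits("555-867-5309", key)` returns another string shaped like `ddd-ddd-dddd`, so schema constraints and format validators keep working. Fields with internal rules, such as dates, would need a domain-aware transform, since substituting digits independently can produce an invalid month or day.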
Visual Integrity

Visual integrity means the transformed data should match the original data in the following ways:

- Length of words and phrases: Transformations should maintain the original length of the data. For instance, encoding names with Base64 or encrypting them with AES would make them 15-30% longer, which is undesirable.
- Data types: Data types should be preserved (e.g., phone numbers should remain dash-separated digits). Extracting the last four digits as an integer, for example, would break or change the validation pipeline.
- Data format: The format should remain consistent with the original.
- Internal structure of composite data: Complex data types, like addresses, should maintain their internal structure.

Although visual integrity might not seem significant at first glance, it profoundly impacts how analysts use the data and how trained LLMs predict outcomes. The following examples illustrate visual integrity:

- Transformed birthdates still appear as dates.
- Transformed phone numbers or SSNs still resemble phone numbers or SSNs, rather than random strings.
- Transformed emails look like valid email addresses but cannot be looked up on a server. Popular domains such as Gmail do not need to be encoded, but less common domains are encoded as well.

Visual integrity is critical in complex software ecosystems, especially production environments. Changes in data type and length could cause database schema changes, which are labor-intensive, time-consuming, and error-prone. Validation failures during QA could restart development sprints, and may even trigger configuration changes in firewalls and security monitoring systems. For instance, invalid email addresses or phone numbers might trigger security alerts. Preserving the "look and feel" of data is essential for data engineers and analysts, leading to less error-prone insights.

Population Integrity

Population integrity ensures that reports and summary statistics remain consistent, in a lossless fashion, before and after transformation.

- Population distributions: The transformed data should mirror the original data's population distribution (e.g., in healthcare, the percentage of patients from different states should remain consistent).
- Correlations and relations: The internal relationships and correlations between data elements should be preserved, which is crucial for analyses that rely on understanding the interplay between different variables. For example, if one "John" had 100 records in the database, the transformed data would still contain 100 records for that same "John," all mapped to a single encoded identity.

Maintaining population integrity is essential to ensure the transformed data remains useful for statistical analysis and modeling for these reasons:

- Accurate analysis: Analysts can rely on the transformed data to provide the same insights as the original data, ensuring that trends and patterns are correctly discovered.
- Reliable data linkage: Encoded data can still be linked across different datasets without loss of information, allowing for comprehensive analyses that require data integration.
- Consistent results: The results of data queries and analyses are consistent with what would be obtained from the original dataset.

In healthcare, maintaining population integrity ensures accurate tracking of patient records and health outcomes even after data de-identification. In finance, it enables precise analysis of transaction histories and customer behavior without compromising privacy. For example, in a region defined by a set of zip codes, the ratio of vaccine takers to non-takers should remain consistent before and after data de-identification.

Preserved population integrity ensures that encoded datasets remain useful and reliable for all analytical purposes without the privacy risk.

Statistical Integrity

Statistical integrity ensures that the statistical properties of the original dataset, such as mean, standard deviation (STD), and entropy, are preserved in the transformed data.
This integrity allows for accurate and meaningful analysis, projection, and deep mining of insight and knowledge. It includes:

- Preservation of statistical properties: Mean, STD, and other statistical measures should be maintained. This ensures that statistical analyses yield consistent outcomes before and after transformation.
- Accuracy of analysis and modeling: This is crucial for applications in machine learning and AI modeling, such as projecting users' pharmacy visits.

Maintaining statistical integrity is essential for several reasons:

- Accurate statistical analysis: Analysts can perform statistical tests and derive insights from the transformed data with confidence, knowing that the results will be reflective of the original data.
- Valid predictive modeling: Machine learning models and other predictive analytics can be trained on the transformed data without losing the accuracy and reliability of the predictions.
- Consistency across studies: Findings from different studies or analyses are consistent, facilitating reliable comparisons and meta-analyses.

For example, in the healthcare industry, preserving statistical integrity allows researchers to accurately assess the prevalence of diseases, the effectiveness of treatments, and the distribution of health outcomes. In finance, it enables the precise evaluation of risk, performance metrics, and market trends.

By ensuring consistent statistical properties, statistical integrity supports robust and reliable data analysis, enabling stakeholders to make informed decisions based on accurate and trustworthy insights.

Ownership Integrity

An owner is an entity that has full control of the original data set. The entity is usually a person, but it can also be a company, an application, or a system. Ownership integrity ensures that the provenance and ownership information of the data is preserved throughout the transformation process.
The data owner can perform additional transformations as needed if the scope or requirements change.

- Data ownership: Retaining ownership is crucial for maintaining data governance and regulatory compliance.
- Provenance: Preserving the origin of the data plays an important role in the traceability and accountability of the transformed data.

Maintaining ownership integrity is crucial for several reasons:

- Regulatory compliance: Helps organizations comply with legal and regulatory requirements by maintaining clear records of data provenance and ownership.
- Data accountability: Since the transformation is project-based, it can be designed to be reusable or not reusable. For example, projects with different analysis or model-training purposes may each transform their own subset of the original data, with no cross-reference between the results.
- Data governance: Supports robust data governance throughout the data lifecycle to avoid unnecessary or unintentional reuse.
- Trust and transparency: Builds trust with stakeholders by demonstrating that the organization maintains high standards of data integrity and accountability. Users of the transformed data can be assured that it comes from the original owner.

In healthcare, ownership integrity allows the tracking of patient records back to the original healthcare provider. In finance, it ensures that transaction data can be traced back to the original financial institution, supporting regulatory compliance and auditability.

Preserved ownership integrity ensures that encoded datasets remain transparent, accountable, and compliant with regulations, providing confidence to all stakeholders involved.

Summary of High-Fidelity Data

High Fidelity Data offers a balanced approach to data transformation, combining privacy protection with the preservation of data usability, making it a valuable asset across various industries.
Specification

The High Fidelity Data (HiFi Data) specification aims to maintain the original data's usability while ensuring privacy and compliance with regulations. HiFi Data should offer the following features:

- Visual integrity: The encoded data retains its original format, ensuring it looks and feels the same as the raw data.
- Population integrity: The transformed data preserves the population characteristics of the original dataset, like distribution and frequency.
- Statistical integrity: The preserved statistical properties ensure accurate analysis and projection.
- Ownership integrity: The ownership and provenance are preserved through the transformation, which prevents unauthorized reuse.

Benefits

- Regulatory compliance: Helps organizations comply with legal and regulatory requirements by maintaining data ownership and provenance.
- Data usability: Encoded data retains its usability for analysis, reporting, and machine learning, without compromising privacy or re-architecting complex process management.
- Data accountability: Population, statistical, and ownership integrity make data governance consistent and accountable.
- Enhanced security: The encoding makes re-identification extremely difficult.
- Consistency: Supports consistent encoding across different data sources and projects, promoting uniformity in data handling.

Usage

- Healthcare: Ensures compliance with HIPAA. HiFi Data can be used for population health research and health services research without risking patient privacy.
- Finance: Financial models and analyses can be conducted accurately without exposing sensitive information.
- Advertising: Enables the use of detailed customer data for targeted advertising while protecting individual identities.
- Data analysis and AI modeling: Provides high-quality data for training models, ensuring they reflect real-world scenarios without compromising privacy-sensitive information.
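One way to check the population integrity property summarized above is to compare frequency profiles and entropy before and after a transformation. The Python sketch below does this for a toy one-to-one pseudonymization. All names are hypothetical, and the sequential tokens are for illustration only: they reveal the order of first appearance and are not a privacy-safe encoding.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def pseudonymize(values: Iterable[str], prefix: str = "ID") -> List[str]:
    """Map each distinct value to exactly one distinct token."""
    mapping: Dict[str, str] = {}
    out: List[str] = []
    for value in values:
        if value not in mapping:
            mapping[value] = f"{prefix}{len(mapping):04d}"
        out.append(mapping[value])
    return out


def frequency_profile(values: Iterable[str]) -> List[int]:
    """Return group sizes, largest first, ignoring the labels themselves."""
    return sorted(Counter(values).values(), reverse=True)


def shannon_entropy(values: Iterable[str]) -> float:
    """Shannon entropy in bits of the empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Because the mapping is one-to-one, a person with three records before the transformation still has three records afterwards, so the frequency profile and the entropy of the column are unchanged. A transformation that merged or split identities would fail this check.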
How do you approach data processing? What aspects are worth special consideration? Learn the difference between relational and non-relational databases so you can make informed decisions and choose a database that fits your project needs.

What Is a Relational vs Non-Relational Database?

That's obviously the first question to address when choosing a database for your project. Knowing the difference between relational and non-relational databases helps you be more specific with your requirements and leverage the right solutions. Databases have been in use for decades and have gone through many changes and advancements. Even so, most of them fall into one of these two types.

Every team commonly faces the choice between a non-relational and relational database. Let's cover the major characteristics of each solution to make more informed decisions. And, of course, we'll start the comparison of relational vs non-relational databases with definitions.

Relational databases store data in a structured, table-based manner. All the data remains easily accessible and linked through defined relations.

Non-relational databases work in a completely different way and are suited to storing semi-structured data. They don't apply a rigid structure, relying instead on more dynamic schemas that can handle unstructured data.

Explained as simply as possible, databases differ in their data structures. Relational solutions rely on predefined schemas to define and manipulate data. In comparison, non-relational ones are known for better flexibility, as they can process many types of data without modifying the architecture.

The distinct characteristic of a relational database is that it always stores data in tables using rows and columns. Therefore, it supports a straightforward and intuitive way of displaying data. At the same time, it allows teams to form relations based on specific entities.
Most relational databases use Structured Query Language; thus, they are often called SQL databases.

Non-relational databases emerged as a viable alternative because not all data can be stored in tabular format. This type embraces all the database types that don't follow the relational structure and traditional SQL syntax. This doesn't mean they never use SQL: many of them offer SQL-like query languages, and efforts such as UnQL (Unstructured Query Language) have aimed at a common one. This is why the type is also referred to as NoSQL ("not only SQL") databases.

While SQL databases fall under the table-based category, NoSQL databases can be divided into several categories. The most common types of NoSQL databases include:

- Document databases collect, process, and retrieve data as JSON-like documents.
- Key-value stores arrange data in a key-value format where keys serve as unique identifiers.
- Graph databases are single-purpose platforms to create and manipulate graphs where data is presented in the form of nodes, edges, and properties.
- Wide-column stores organize data into flexible columns that can be spread across database nodes and multiple servers. The column format can vary from row to row within the same table.

With these differences between relational and non-relational databases in mind, teams can find a reasonable solution for their needs. Today's businesses collect and process a huge amount of data, including dealing with complex queries. Well-outlined project requirements establish the foundation for making informed decisions.

The main idea is that they need to choose a database that can query data efficiently and return results quickly. If the project leverages structured data and requires ACID compliance, relational databases are a good choice. If the data remains unstructured and doesn't fit the predefined criteria, it's better to choose a non-relational database. So let's proceed with other essential details that become decisive for the final choice.
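The contrast between table-based and document-based storage described above can be seen in a few lines of Python. This is a sketch with made-up data: the `customers` and `orders` tables and the `gift_note` field are assumptions for illustration, SQLite stands in for a relational database, and a plain JSON document stands in for what a document database would store.

```python
import json
import sqlite3

# Relational: a fixed schema, with the relation expressed through a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (id),
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 17.5);
    """
)

# Reading a customer together with their orders requires a join.
rows = conn.execute(
    """
    SELECT c.name, o.id, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
    """
).fetchall()

# Document style: the same data nested in one JSON-like document. The second
# order carries an extra field without any schema change.
document = {
    "_id": 1,
    "name": "Ada",
    "orders": [
        {"id": 10, "total": 25.0},
        {"id": 11, "total": 17.5, "gift_note": "Happy birthday"},
    ],
}
encoded = json.dumps(document)
```

Adding `gift_note` to the relational version would require an `ALTER TABLE` statement and a decision about existing rows, while the document simply gains a key. The price is that nothing in the document model stops two orders from disagreeing about which fields exist.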
Relational vs Non-Relational Database Pros and Cons Discussing the difference between relational and non-relational databases, we’d like to draw attention to the main advantages and disadvantages of these database types. It greatly helps teams to make a choice and select a database compatible with set requirements. The main idea is that it allows them to do comprehensive research and remain business-specific. The database selection might be difficult at first sight but considering more details aims to simplify the final decision. So let’s go with the mentioned types of databases to find their pros and cons. Advantages of Relational Databases ACID Compliance ACID properties differentiate a relational database and bring it to the dominant market position. It embraces all the necessary standards to guarantee the reliability of transactions within a database. Simplicity Due to the predefined schema and simple structure, the relational database is quite a straightforward solution. It doesn’t require lots of architectural efforts as the team uses structured query language. Data Accuracy Compared to other database types, the accuracy of data is higher for relational databases. It focuses on preventing data redundancy as there is no repeated or duplicated information. Security The table-based model makes it easier to restrict access to confidential data and reduces the chances of errors significantly. Disadvantages of Relational Databases Scalability Being vertically scalable, the relational database has a distinct disadvantage: low scalability. Strict consistency requirements restrict horizontal scaling, whereas vertical scaling comes with certain limits and greatly depends on supported hardware. Flexibility Rigid schemas and constraints could become pros and cons at the same time. Though it’s easy to interpret the data and identify the relationships, it remains complex to implement changes to the data structure. 
Relational databases are a poor fit for very large or unstructured datasets.
Performance
Relational database performance depends heavily on the amount of data, the number of tables, and their complexity. Any increase in these areas increases the time needed to run queries.
Advantages of Non-Relational Databases
Horizontal Scaling
Handling large datasets became easier with the introduction of non-relational databases. Moreover, horizontal scaling allows a team to accommodate, manage, and store more data while maintaining lower costs.
Flexibility
With a flexible data schema and non-rigid structure, non-relational databases can combine, process, and store any type of data. This is a distinct feature that differentiates them from relational databases, which handle only structured data. Non-relational databases apply dynamic schemas for unstructured data.
Fast Queries
While relational databases are well suited to complex queries, simple lookups in non-relational databases are often faster. The main advantage is that the data is stored in a form already optimized for the queries that will read it. Besides, queries don’t require the joins typical of relational databases.
Easier Maintenance
Non-relational databases are simpler and faster to set up and maintain. Some of them allow developers to map the data structure to structures similar to those in programming languages. This supports faster development and fewer errors.
Disadvantages of Non-Relational Databases
Data Integrity
Maintaining data integrity greatly depends on building relationships between data elements. The lack of integrity mechanisms in non-relational databases can reduce overall data reliability, accuracy, and completeness. It becomes the developers’ responsibility to ensure that data is transferred accurately and without errors from one stage to another.
Consistency
Focusing on scalability and performance, the non-relational database often trades away strict consistency.
It has no required mechanisms to prevent data redundancy and relies on eventual consistency. Thus they aren’t that efficient for handling large amounts of data. Moreover, when database categories vary, achieving all the use cases with one database is hard. Data Analysis In the light of comparing relational vs non-relational databases, the second ones have fewer facilities for data analysis. Besides, it usually requires programming expertise to handle the analysis, even for the simplest query. Also, many of them lack integration with popular BI tools. When To Use Relational vs Non-Relational Databases In the light of comparing relational vs non-relational databases, it’s important to address the common use cases. Learning the good market practices and the experience of others can provide some additional insights on how to choose a database for your project. Obviously, one or another category often suits certain needs and requirements better. The team’s task remains to learn details, referring to the smallest details. At the same time, you won’t find a strict distinction on use cases. Different types of databases were successfully implemented for various types of projects. It’s worth saying that knowing the relational vs non-relational database pros and cons is a must-have there. The informed choice can be supported via the detailed analysis of project specifications and solution availability. So let’s check on some useful advice on where to use relational vs non-relational databases. Use Cases of a Relational Database Highly Structured Data A stable data structure becomes necessary unless the project entails constant changes. It’s a great option to leverage strict, planned, predictable schemas to handle data distributed across different tables. Besides, it increases access to more tools for testing and analyzing data. The organized and specific nature enables easier manipulation and data querying. 
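As a rough illustration of what a strict, predefined schema provides, the sketch below declares constraints once and lets the database reject rows that violate them. It uses Python's built-in sqlite3 module; the `patients` table and the `try_insert` helper are hypothetical names used only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patients (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER NOT NULL CHECK (age >= 0)
    )
    """
)


def try_insert(name, age):
    """Insert a row; return True on success, False if the schema rejects it."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO patients (name, age) VALUES (?, ?)", (name, age))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("Ada", 36)
rejected_negative = try_insert("Bob", -1)   # violates CHECK (age >= 0)
rejected_missing = try_insert(None, 40)     # violates NOT NULL on name
row_count = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
print(accepted, rejected_negative, rejected_missing, row_count)
```

The validation lives in the schema rather than in application code, which is part of why structured data is easier to query and analyze later.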
Secure and Consistent Environment
When security and consistency are top priorities, teams need to make the right decisions. Relational databases have become a reasonable solution here. ACID principles provide the transactional guarantees needed to handle data in line with current compliance regulations. This type is often chosen for healthcare, fintech, enterprise systems, etc.
Support
Wide support availability is explained by the amount of time on the market. It’s often faster to find a team with the required expertise, as most relational databases follow similar principles. Also, they are more efficient for integrating data from other systems and using additional tools. The team has more product choices when utilizing these types of databases, including business intelligence tools.
Use Cases of a Non-Relational Database
Large Amounts of Unstructured Data
One of the main reasons to apply a non-relational database is that not all data can fit into plain tables. For example, the project may need an efficient tool to accommodate various types of data like videos, articles, or social media content. Much of this data remains unstructured, and a non-relational database can store it as is while still scaling horizontally. It helps to cover this diversity and make changes when required.
Flexible Development Environment
Fast accumulation rates are explained by the ability to collect data quickly and easily without defining its structure in advance. The data often isn’t restricted to certain formats and can be processed later. For many teams, a non-relational database is a great option, especially when the project requirements aren’t completely clear or they plan on continuous changes or updates.
Timing Priorities
The fast development environment enables faster and easier product delivery. A less methodical approach removes much of the upfront planning, preparation, and design work for non-relational databases. Teams can proceed with immediate development instead.
It commonly suits the needs of MVP or some urgent product releases. Thanks to the many different database types on the market, there is always a suitable approach to fulfill project needs. Of course, the database selection varies from project to project. Moreover, some teams find it efficient to combine several databases to cover all the use cases. Popular Databases: The Current Market State The question of how to choose a database can’t be fully addressed without checking the market availability. It’s a fact that database selection is also impacted by the market state and popularity of certain databases. Besides, the successful experience of others can become a good practice to follow. As long as the team defines project specifications, they are ready to proceed with learning more details on available databases on the market. Keeping up with market tendencies allows them to stay up-to-date and increase the efficiency of leveraged solutions. The fast growth of the market has brought a great variety of databases to adopt. At present, the number of available databases has reached more than 300 databases. So, in the same way we can diversify databases by types or functionalities, it’s common practice to rank them by popularity. As we proceed with comparing relational vs non-relational databases, it’s worth saying that representatives of both database types have gained strong positions. Based on the latest Stack Overflow Developer Survey results, let’s look at the most popular databases. Popular Relational Databases MySQL MySQL is one of the most known relational databases. Released back in 1995, it has gained considerable popularity due to its functionality and used approaches. The open-source database has great support and is compatible with most libraries and frameworks. It is suitable for delivering cross-platform solutions, and even though mostly used for SQL queries, it also has NoSQL support if required. 
PostgreSQL
PostgreSQL is another powerful open-source object-relational database first released in 1996. One of its distinctive characteristics is that, while it still stores data in tables of rows and columns, it adds object-oriented features such as user-defined types and table inheritance. PostgreSQL is highly extensible; thus, it suits the needs of large software solutions. There’s no need to recompile the database to extend it, as developers can write functions in various programming languages.
SQLite
SQLite is also a relational database management system, released in 2000. Its distinctive difference is that it is a serverless, embedded database: the engine runs inside the application process and stores the whole database in a single file. That often makes it faster for local workloads, as there is no separate server process or network round trip. Also, it has bindings to different programming languages and is used for a variety of solutions, including IoT and embedded systems.
Microsoft SQL Server
Microsoft SQL Server is a well-known relational database management system introduced by Microsoft in 1989. They have greatly improved the solution with many features like customization, in-memory analytics, integrations, etc. Also, it supports different development tools and cloud services. It was Windows-only for most of its history; since SQL Server 2017, it also runs on Linux and in containers.
Popular Non-Relational Databases
MongoDB
MongoDB is classified as a non-relational solution, particularly a document-oriented database released in 2009. It enables storing different types of data as it uses JSON-like objects. It can be faster than relational solutions for some workloads because documents are stored without first being normalized into tables. The data usually remains unstructured, and the database is suitable for handling massive sets of data.
Redis
Redis is a popular in-memory data store, introduced in 2009, that is also used as a key-value database. This open-source non-relational solution keeps its data structures in memory and supports extensibility and clustering. It allows teams to store large data sets without a complex structure. Redis is often combined with other data storage solutions, as it can be applied as a caching layer.
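The caching-layer role mentioned for Redis usually follows the cache-aside pattern: read from the cache first and fall back to the slower primary store on a miss. The sketch below shows that pattern with an in-process dictionary standing in for Redis, so it runs without a server; `CacheAside`, `slow_lookup`, and the 30-second TTL are hypothetical names and values, not part of any Redis client API.

```python
import time


class CacheAside:
    """Cache-aside reads: check the cache, fall back to the slower store on a miss."""

    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # stand-in for Redis: key -> (expires_at, value)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]  # cache hit that has not expired yet
        value = self._load(key)  # miss or expired: read the primary store
        self._cache[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the primary store is updated."""
        self._cache.pop(key, None)


store_reads = []


def slow_lookup(key):
    """Pretend primary-database read that records how often it is called."""
    store_reads.append(key)
    return {"id": key, "name": f"user-{key}"}


users = CacheAside(slow_lookup, ttl_seconds=30.0)
first = users.get(7)
second = users.get(7)  # served from the cache, no second store read
print(first, len(store_reads))
```

With a real Redis deployment the dictionary would be replaced by calls to a Redis client, and the expiry would typically be delegated to Redis key TTLs.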
DynamoDB
DynamoDB is a non-relational database introduced by Amazon in 2012. It is a cloud service that supports both document and key-value data structures. High scalability and performance remain the main advantages of choosing this database, as it enables running high-performance apps at any scale.
Thanks to their solid functionality and being first to the market, relational solutions still hold a considerable share of the market. At the same time, the introduction of new representatives pushes everyone to strengthen existing approaches and keep advancing new solutions.
How To Choose a Database: Relational vs Non-Relational Databases
Gathering all the vital details on different types of databases becomes necessary for making a good choice. With well-defined project requirements, the team looks for a database that corresponds to their needs and supports solution efficiency. The important thing is that both database types are viable options. Awareness of the major differences greatly helps with the selection.
Databases | Relational | Non-relational
Language | Structured Query Language (SQL) | Structured Query Language (SQL), Unstructured Query Language (UnQL)
Data schema | Predefined schemas | Dynamic schemas
Database categories | Table-based | Document, key-value, graph, and wide-column stores
Scalability | Vertical scalability | Horizontal scalability
Performance | Low | High
Security | High | Less secure
Complex queries | Used | Not used
Base properties | ACID (atomicity, consistency, isolation, durability) transactions supported | Follows CAP (consistency, availability, partition tolerance) theorem
Online processing | Used for OLTP | Used for OLAP
Hierarchical data storage | Not suitable | Best suitable
Usage | Better for multi-row transactions | Better for unstructured data like documents or JSON
There isn’t a bad choice; it’s more about the opportunity to meet requirements better and achieve better outcomes.
Considering the above-mentioned aspects, we’ve also decided to focus on key aspects of how to choose a database. Data Schema The main difference between the non-relational and relational databases remains the applied data schemas. If relational solutions use predefined schemas and deal with structured data, non-relational ones apply flexible schemas to process unstructured data in various ways. It’s important to remember that this factor often explains other distinct specifications of the database selection. Data Structure Structuring supports the way to locate and access data. If the team chooses the relational architecture, they proceed with the table-based structure. The tabular format focuses on linking and relating based on common data. The non-relational solutions can differ by several structures, including key-value, document, graph, or wide-column stores. In other words, they bring alternatives to structure data impossible to deal with in relational databases. Scaling The database selection can also be impacted by properties to scale your non-relational vs relational database. The relational database is vertically scalable when the load increase should be completed on a single server. Non-relational solutions are proven more efficient here as horizontal scaling allows adding more servers, thus handling higher traffic. Security It has always been crucial to leverage well-protected and highly secured solutions. ACID compliance for relational databases makes them more secure and easier to restrict access to confidential data. Non-relational types of databases are considered less secure, though known for great performance and scalability. Analytics Capabilities Relational databases are considered more efficient for leveraging data analysis and reporting. Most BI tools won’t let you query non-relational databases but work great with structured data. 
Of course, it is important to check the current database’s functionality, as many of them keep introducing new alternatives.
Integration
Another aspect to consider in choosing a relational database vs a non-relational database is the opportunity to integrate it with other tools and services. Teams always have to check its compatibility with the other tech solutions applied to the project. Integration requirements are growing dramatically to support consistency across all business solutions.
Support Consideration
Let’s draw attention to how each representative is supported. This involves ongoing database development and its popularity on the market. Lack of support often leads to unexpected results and failures. Make sure to choose databases that have gained good market share, have strong community support, and meet the project needs.
Obviously, the database selection varies from project to project, but the main thing is that it should correspond to the outlined needs. There won’t be a bad choice, as every project can be addressed from different perspectives. The main idea is to choose a database that can bring efficiency and meet the outlined project-specific requirements.
Conclusion
An excellent way to compare relational vs non-relational databases relies on a comprehensive analysis of their core aspects, main pros and cons, and typical use cases. Considering all the details gathered in this article, we can conclude that relational databases are a good choice when teams look for dynamic queries, high security, and cross-platform support. If scalability, performance, and flexibility remain the main priorities, it is better to opt for non-relational databases.
Software development and architecture are continuously evolving with artificial intelligence (AI). AI-assisted code generation stands out as a particularly revolutionary advancement, offering developers the ability to create high-quality code more efficiently and accurately than ever before. This innovation not only enhances productivity, but also opens the door to new possibilities in software creation, particularly in the realm of microservices development.
The Evolution of Code Generation: Traditional Coding vs. AI-Assisted Coding
Traditional coding requires developers to write and test extensive lines of code manually. This process is time-consuming and prone to errors. Conversely, AI-assisted code generation leverages machine learning algorithms to analyze patterns in existing codebases, understand programming logic, and generate code snippets or entire programs based on specific requirements. This technology can drastically reduce the time spent on repetitive coding tasks and minimize human errors. It is not a substitute for developers, but rather a productivity tool that eliminates tedious and monotonous infrastructure and plumbing code.
Benefits of AI-Assisted Code Generation
Below is a list of some of the key benefits of leveraging AI-assisted code generation.
Increased Efficiency: AI can quickly generate code, which allows developers to focus on more complex and creative aspects of software development. This leads to faster project completion times and the ability to tackle more projects simultaneously.
Improved Code Quality: By learning from vast datasets of existing code, AI can produce high-quality code that adheres to best practices and industry standards. This results in more robust and maintainable software.
Enhanced Collaboration: AI tools can bridge the gap between different development teams by providing consistent code styles and standards. This facilitates better collaboration and smoother integration of different software components.
Rapid Prototyping: With AI-assisted code generation, developers can quickly create prototypes to test new ideas and functionalities. This accelerates the innovation cycle and helps bring new products to market faster. The Relationship Between AI and Microservices Microservices architecture has gained popularity in recent years because of its ability to break down complex applications into smaller, manageable services. Each service can be developed, deployed, and scaled independently, offering greater flexibility and resilience than a monolithic architecture. AI-assisted code generation is particularly well-suited for creating microservices, as it can handle the intricacies of defining and managing numerous small, interconnected services. A Platform for AI-Generated Microservices One example of AI in practice is ServiceBricks, an open-source platform that uses AI to generate microservices. Users provide human-readable text, which the AI then converts into fully functional microservices, including REST APIs for create, update, delete, get, and query operations. The platform also generates DTO models, source code, project files, class files, unit tests, and integration tests, thereby automating parts of the development process and reducing the time and effort needed to build scalable, maintainable microservices. The Future of AI-Assisted Development As AI technology continues to advance, its role in software development will only expand. Future iterations of AI-assisted code generation tools will likely become even more intuitive and capable, handling more complex programming tasks and integrating seamlessly with various development environments. The ultimate goal is to create a synergistic relationship between human developers and AI, where each leverages their strengths to produce superior software solutions. Conclusion AI-assisted code generation is transforming software development by enhancing efficiency, code quality, and innovation. 
This technology is reshaping how microservices and other essential components are developed, paving the way for greater productivity and creativity. As AI technology continues to evolve, it holds the potential to drive further advancements in software development, enabling developers to reach new heights in excellence and innovation worldwide.
Goal of This Application
In this article, we will build an advanced data model and use it for ingestion and various search options. For the notebook portion, we will run a hybrid multi-vector search, re-rank the results, and display the resulting text and images.
Ingest data fields, enrich data with lookups, and format: Learn to ingest data including JSON and images, then format and transform it to optimize hybrid searches. This is done inside the streetcams.py application.
Store data into Milvus: Learn to store data in Milvus, an efficient vector database designed for high-speed similarity searches and AI applications. In this step, we are optimizing the data model with scalar and multiple vector fields: one for text and one for the camera image. We do this in the streetcams.py application.
Use open source models for data queries in a hybrid multi-modal, multi-vector search: Discover how to use scalars and multiple vectors to query data stored in Milvus and re-rank the final results in this notebook.
Display resulting text and images: Build a quick output for validation and checking in this notebook.
Simple Retrieval-Augmented Generation (RAG) with LangChain: Build a simple Python RAG application (streetcamrag.py) that uses Milvus to answer questions about the current weather via Ollama. While outputting to the screen, we also send the results to Slack formatted as Markdown.
Summary
By the end of this application, you’ll have a comprehensive understanding of using Milvus, ingesting semi-structured and unstructured data, and using open source models to build a robust and efficient data retrieval system. For future enhancements, we can use these results to build prompts for LLMs, Slack bots, streaming data to Apache Kafka, and a street camera search engine.
Milvus: Open Source Vector Database Built for Scale
Milvus is a popular open-source vector database that powers applications with highly performant and scalable vector similarity searches.
Milvus has a distributed architecture that separates compute and storage, and distributes data and workloads across multiple nodes. This is one of the primary reasons Milvus is highly available and resilient. Milvus is optimized for various hardware and supports a large number of indexes. You can get more details in the Milvus Quickstart. For other options for running Milvus, check out the deployment page.
New York City 511 Data
REST feed of street camera information with latitude, longitude, roadway name, camera name, camera URL, disabled flag, and blocked flag:
JSON
{
  "Latitude": 43.004452,
  "Longitude": -78.947479,
  "ID": "NYSDOT-badsfsfs3",
  "Name": "I-190 at Interchange 18B",
  "DirectionOfTravel": "Unknown",
  "RoadwayName": "I-190 Niagara Thruway",
  "Url": "https://nyimageurl",
  "VideoUrl": "https://camera:443/rtplive/dfdf/playlist.m3u8",
  "Disabled": true,
  "Blocked": false
}
We then ingest the camera image from the camera URL endpoint. After we run it through Ultralytics YOLO, we get a marked-up version of that camera image.
NOAA Weather Current Conditions for Lat/Long
We also ingest a REST feed of weather conditions for the latitude and longitude passed in from the camera record. It includes elevation, observation date, wind speed, wind direction, visibility, relative humidity, and temperature.
JSON
"currentobservation": {
  "id": "KLGA",
  "name": "New York, La Guardia Airport",
  "elev": "20",
  "latitude": "40.78",
  "longitude": "-73.88",
  "Date": "27 Aug 16:51 pm EDT",
  "Temp": "83",
  "Dewp": "60",
  "Relh": "46",
  "Winds": "14",
  "Windd": "150",
  "Gust": "NA",
  "Weather": "Partly Cloudy",
  "Weatherimage": "sct.png",
  "Visibility": "10.00",
  "Altimeter": "1017.1",
  "SLP": "30.04",
  "timezone": "EDT",
  "state": "NY",
  "WindChill": "NA"
}
Ingest and Enrichment
We will ingest data from the NY REST feed in our Python loading script. Our streetcams.py Python script does our ingest, processing, and enrichment.
We iterate through the JSON results from the REST call, then enrich and update each record, run YOLO predict on the image, and run a NOAA weather lookup on the latitude and longitude provided.
Build a Milvus Data Schema
We will name our collection: "nycstreetcameras". We add fields for metadata, a primary key, and vectors. We have a lot of varchar variables for things like roadwayname, county, and weathername.
Python
from pymilvus import DataType, FieldSchema

# Scalar metadata fields, a primary key, and two vector fields.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='latitude', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='longitude', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='name', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='roadwayname', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='directionoftravel', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='videourl', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='filepath', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='creationdate', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='areadescription', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='elevation', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='county', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='metar', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='weatherid', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='weathername', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='observationdate', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='temperature', dtype=DataType.FLOAT),
    FieldSchema(name='dewpoint', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='relativehumidity', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='windspeed', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='winddirection', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='gust', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='weather', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='visibility', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='altimeter', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='slp', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='timezone', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='state', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='windchill', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='weatherdetails', dtype=DataType.VARCHAR, max_length=8000),
    FieldSchema(name='image_vector', dtype=DataType.FLOAT_VECTOR, dim=512),
    FieldSchema(name='weather_text_vector', dtype=DataType.FLOAT_VECTOR, dim=384),
]
The two vectors are image_vector and weather_text_vector, which contain an image vector and a text vector. We add an index for the primary key id and for each vector. We have a lot of options for these indexes, and they can greatly improve performance.
Insert Data Into Milvus
We then do a simple insert into our collection, with our scalar fields matching the schema name and type. We have to run an embedding function on our image and weather text before inserting. Once the insert completes, we can check our data with Attu.
Building a Notebook for Report
We will build a Jupyter notebook to query and report on our multi-vector dataset.
Prepare Hugging Face Sentence Transformers for Embedding Sentence Text
We utilize a model from Hugging Face, "all-MiniLM-L6-v2", a sentence transformer, to build our dense embedding for our short text strings. This text is a short description of the weather details for the nearest location to our street camera. See: Integrate with HuggingFace
Prepare Embedding Model for Images
We utilize a standard resnet34 PyTorch feature extractor that we often use for images.
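The enrichment and insert steps described earlier come down to flattening one camera entry and its weather lookup into a record whose keys and types match the schema. Below is a minimal sketch of that mapping, not the actual streetcams.py code. `build_record` is a hypothetical helper, the pairing of feed keys such as "Relh" and "Winds" with schema fields is inferred from the sample payloads, only a subset of the fields is shown, and the two embedding vectors are left out.

```python
def build_record(camera, observation, filepath):
    """Flatten a camera entry and a weather observation into schema-shaped scalars."""
    return {
        # The schema stores coordinates as VARCHAR, so numbers become strings.
        "latitude": str(camera["Latitude"]),
        "longitude": str(camera["Longitude"]),
        "name": camera["Name"],
        "roadwayname": camera["RoadwayName"],
        "directionoftravel": camera["DirectionOfTravel"],
        "videourl": camera["VideoUrl"],
        "url": camera["Url"],
        "filepath": filepath,
        "weatherid": observation["id"],
        "weathername": observation["name"],
        "observationdate": observation["Date"],
        # temperature is the only FLOAT scalar in the schema, so convert it.
        "temperature": float(observation["Temp"]),
        "relativehumidity": observation["Relh"],
        "windspeed": observation["Winds"],
        "winddirection": observation["Windd"],
        "weather": observation["Weather"],
    }


# Sample values taken from the payloads shown in the article.
camera = {
    "Latitude": 43.004452,
    "Longitude": -78.947479,
    "Name": "I-190 at Interchange 18B",
    "DirectionOfTravel": "Unknown",
    "RoadwayName": "I-190 Niagara Thruway",
    "Url": "https://nyimageurl",
    "VideoUrl": "https://camera:443/rtplive/dfdf/playlist.m3u8",
}
observation = {
    "id": "KLGA",
    "name": "New York, La Guardia Airport",
    "Date": "27 Aug 16:51 pm EDT",
    "Temp": "83",
    "Relh": "46",
    "Winds": "14",
    "Windd": "150",
    "Weather": "Partly Cloudy",
}

record = build_record(camera, observation, filepath="camimages/example.jpg")  # hypothetical path
print(record["temperature"], record["latitude"])
```

A full record would also need the remaining scalar fields plus the 512-dimension image vector and the 384-dimension text vector before it could be inserted into the collection.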
Instantiate Milvus
As stated earlier, Milvus is a popular open-source vector database that powers AI applications with highly performant and scalable vector similarity search. For our example, we are connecting to Milvus running in Docker. Setting the URI as a local file, e.g., ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file. If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server URI, e.g., http://localhost:19530, as your uri. If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the URI and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.
Prepare Our Search
We are building two searches (AnnSearchRequest) to combine into a hybrid search that includes a reranker.
Display Our Results
We display the results of our re-ranked hybrid search of two vectors. We show some of the output scalar fields and an image we read from the stored path. The results from our hybrid search can be iterated, and we can easily access all the output fields we choose. filepath contains the link to the locally stored image and can be accessed from key.entity.filepath. The key contains all our results, while key.entity has all of the output fields chosen in our hybrid search in the previous step. We iterate through our re-ranked results and display the image and our weather details.
RAG Application
Since we have loaded a collection with weather data, we can use that as part of a RAG (Retrieval-Augmented Generation) application. We will build a completely open-source RAG application utilizing a local Ollama, LangChain, and Milvus. We set up our vector_store as Milvus with our collection.
Python
vector_store = Milvus(
    embedding_function=embeddings,
    collection_name="CollectionName",
    primary_field="id",
    vector_field="weather_text_vector",
    text_field="weatherdetails",
    # Plain http matches the local server URI used earlier (http://localhost:19530).
    connection_args={"uri": "http://localhost:19530"},
)
We then connect to Ollama.
Python
llm = Ollama(
    model="llama3",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    stop=["<|eot_id|>"],
)
We prompt for interactive questions.
Python
query = input("\nQuery: ")
We set up a RetrievalQA connection between our LLM and our vector store. We pass in our query and get the result.
Python
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=vector_store.as_retriever(collection=SC_COLLECTION_NAME)
)
result = qa_chain({"query": query})
resultforslack = str(result["result"])
We then post the results to a Slack channel.
Python
# Each block is a complete dict: question, divider, then the answer.
response = client.chat_postMessage(
    channel="C06NE1FU6SE",
    text="",
    blocks=[
        {"type": "section", "text": {"type": "mrkdwn", "text": str(query) + " \n\n"}},
        {"type": "divider"},
        {"type": "section", "text": {"type": "mrkdwn", "text": str(resultforslack) + "\n"}},
    ],
)
Below is the output from our chat to Slack. You can find all the source code for the notebook, the ingest script, and the interactive RAG application in GitHub below.
Source Code
Conclusion
In this notebook, you have seen how you can use Milvus to do a hybrid search on multiple vectors in the same collection and re-rank the results. You also saw how to build a complex data model that includes multiple vectors and many scalar fields that represent a lot of metadata related to our data. You learned how to ingest JSON, images, and text into Milvus with Python. And finally, we built a small chat application to check out the weather for locations near traffic cameras. To build your own applications, please check out the resources below.
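Returning to the hybrid search step: the article does not say which reranker the notebook uses, so the sketch below illustrates one common option, reciprocal rank fusion (RRF), in plain Python. It is an assumption for illustration rather than the notebook's code. pymilvus provides its own rankers for `hybrid_search`, and `rrf_fuse`, the camera ids, and `k=60` here are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: each id scores the sum of 1 / (k + rank) over all lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result ids from the image-vector and text-vector searches.
image_hits = ["cam3", "cam1", "cam2"]
text_hits = ["cam1", "cam4", "cam3"]

fused = rrf_fuse([image_hits, text_hits])
print(fused)
```

Ids that rank well in both searches rise to the top, which is the behavior a hybrid search wants when the two vectors capture different signals (image content versus weather text).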
Resources
In the following list, you can find resources helpful in learning more about using pre-trained embedding models for Milvus, performing searches on text data, and a great example notebook for embedding functions.
Milvus Reranking
Milvus Hybrid Search
511NY: GET api/GetCameras
Using PyMilvus's Model To Generate Text Embeddings
HuggingFace: sentence-transformers/all-MiniLM-L6-v2
Pretrained Models
Milvus: SentenceTransformerEmbeddingFunction
Vectorizing JSON Data with Milvus for Similarity Search
Milvus: Scalar Index
Milvus: In-memory Index
Milvus: On-disk Index
GPU Index
Not Every Field is Just Text, Numbers, or Vectors
How good is Quantization in Milvus?
Today, several significant and safety-critical decisions are being made by deep neural networks. These include driving decisions in autonomous vehicles, diagnosing diseases, and operating robots in manufacturing and construction. In all such cases, scientists and engineers claim that these models help make better decisions than humans and hence help save lives. However, how these networks reach their decisions is often a mystery, not just to their users but also to their developers. These changing times thus necessitate that, as engineers, we spend more time unboxing these black boxes so that we can identify the biases and weaknesses of the models that we build. This may also allow us to identify which part of the input is most critical for the model and hence ensure its correctness. Finally, explaining how models make their decisions will not only build trust between AI products and their consumers but also help meet diverse and evolving regulatory requirements. The whole field of explainable AI is dedicated to figuring out the decision-making process of models. In this article, I wish to discuss some of the prominent explanation methods for understanding how computer vision models arrive at a decision. These techniques can also be used to debug models or to analyze the importance of different components of the model. The most common way to understand model predictions is to visualize heat maps of layers close to the prediction layer. These heat maps, when projected onto the image, allow us to understand which parts of the image contribute more to the model’s decision. Heat maps can be generated using either gradient-based methods like CAM or Grad-CAM, or perturbation-based methods like I-GOS or I-GOS++. A bridge between these two approaches, Score-CAM, uses the increase in model confidence scores to provide a more intuitive way of generating heat maps.
In contrast to these techniques, another class of papers argues that these models are too complex for us to expect just a single explanation for their decisions. Most significant among these papers is the Structured Attention Graphs method, which generates a graph to provide multiple possible explanations for how a model reaches its decision.
Class Activation Map (CAM) Based Approaches
1. CAM
Class Activation Map (CAM) is a technique for explaining the decision-making of specific types of image classification models. In such models, the final layers consist of a convolutional layer followed by global average pooling and a fully connected layer that predicts the class confidence scores. This technique identifies the important regions of the image by taking a weighted linear combination of the activation maps of the final convolutional layer. The weight of each channel comes from its associated weight in the following fully connected layer. It's quite a simple technique, but since it works only for a very specific architectural design, its application is limited. Mathematically, the CAM approach for a specific class c can be written as:
$$L^{c}_{\text{CAM}} = \mathrm{ReLU}\left(\sum_{k} \alpha^{c}_{k} A^{k}\right)$$
where $\alpha^{c}_{k}$ is the weight for the activation map $A^{k}$ of the kth channel of the convolutional layer. ReLU is used because only positive contributions of the activation maps are of interest for generating the heat map.
2. Grad-CAM
The next step in CAM evolution came through Grad-CAM, which generalized the CAM approach to a wider variety of CNN architectures. Instead of using the weights of the last fully connected layer, it determines the gradient flowing into the last convolutional layer and uses that as its weight. So for the convolutional layer of interest A and a specific class c, the authors compute the gradient of the score for class c with respect to the feature map activations of that convolutional layer. This gradient is then global-average-pooled over the spatial dimensions to obtain the weight for each activation map.
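As a rough illustration of the Grad-CAM weighting described above, the following minimal sketch works on plain nested lists instead of real network tensors. The function names, the toy activation maps, and the toy gradients are all hypothetical; in practice the activations and gradients would come from a forward and backward pass through the model.

```python
def global_average_pool(feature_map):
    """Mean over all spatial positions of one 2-D map (a list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def grad_cam(activations, gradients):
    """Combine K activation maps into one heat map.

    activations: K maps of shape H x W from the chosen convolutional layer.
    gradients:   K maps of shape H x W holding d(score_c) / d(activation).
    """
    # One weight per channel: the global average of that channel's gradient.
    weights = [global_average_pool(g) for g in gradients]

    height, width = len(activations[0]), len(activations[0][0])
    heat_map = [[0.0] * width for _ in range(height)]
    for weight, act in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heat_map[i][j] += weight * act[i][j]

    # ReLU: keep only positive evidence for the class.
    return [[max(0.0, v) for v in row] for row in heat_map]


# Toy data (assumed values): two channels, each a 2 x 2 map.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],        # channel 0 supports the class
    [[-0.25, -0.25], [-0.25, -0.25]],  # channel 1 argues against it
]

heat_map = grad_cam(activations, gradients)
print(heat_map)  # [[0.5, 0.0], [0.0, 1.0]]
```

Note that the heat map has the same 2 x 2 shape as the activation maps; a real implementation would upsample it to the input image size before overlaying it.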
The resulting heat map has the same shape as the feature map output of that layer, so it can be quite coarse. Grad-CAM maps become progressively worse as we move to earlier layers, because the receptive fields of those layers are smaller. Also, gradient-based methods suffer from vanishing gradients due to the saturation of sigmoid layers or the zero-gradient regions of the ReLU function.
3. Score-CAM
Score-CAM addresses some of these shortcomings of Grad-CAM by using Channel-wise Increase of Confidence (CIC) as the weight for the activation maps. Since it does not use gradients, all gradient-related shortcomings are eliminated. Channel-wise Increase of Confidence is computed by following the steps below:
- Upsample the channel activation maps to the input size and normalize them.
- Compute the pixel-wise product of the normalized maps and the input image.
- Take the difference between the model output for these masked inputs and the output for a base image, which gives the increase in confidence.
- Apply softmax to normalize the activation map weights to [0, 1].
The Score-CAM approach can be applied to any layer of the model and provides one of the most reasonable heat maps among the CAM approaches. To illustrate the heat maps generated by the Grad-CAM and Score-CAM approaches, I selected three images: a bison, a camel, and a school bus. For the model, I used the Convnext-Tiny implementation in TorchVision. I extended the PyTorch Grad-CAM repo to generate heat maps for the layer convnext_tiny.features[7][2].block[5]. From the visualization below, one can observe that Grad-CAM and Score-CAM highlight similar regions for the bison image. However, Score-CAM’s heat map seems to be more intuitive for the camel and school bus examples.
Perturbation-Based Approaches
Perturbation-based approaches work by masking part of the input image and then observing how this affects the model's performance.
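To make the idea of masking concrete, here is a minimal sketch of plain occlusion sensitivity: slide a patch over the image, blank it out, and record how much the confidence drops. This is a much simpler relative of the optimization-based methods in this section, not I-GOS itself, and `toy_model` is a hypothetical stand-in for a real classifier.

```python
def toy_model(image):
    """Hypothetical classifier: confidence is the mean of the top-left 2 x 2 pixels."""
    region = [image[i][j] for i in range(2) for j in range(2)]
    return sum(region) / len(region)


def occlude(image, top, left, size, fill=0.0):
    """Return a copy of the image with a size x size patch replaced by `fill`."""
    masked = [row[:] for row in image]
    for i in range(top, min(top + size, len(image))):
        for j in range(left, min(left + size, len(image[0]))):
            masked[i][j] = fill
    return masked


def occlusion_drops(model, image, size):
    """Map each patch's top-left corner to the confidence drop it causes."""
    base = model(image)
    drops = {}
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            drops[(top, left)] = base - model(occlude(image, top, left, size))
    return drops


# A uniform 4 x 4 "image" (assumed data) split into four 2 x 2 patches.
image = [[1.0] * 4 for _ in range(4)]
drops = occlusion_drops(toy_model, image, size=2)
most_important = max(drops, key=drops.get)
print(most_important, drops[most_important])  # (0, 0) 1.0
```

Only the top-left patch changes the toy model's output, so it is the one flagged as important. The cost of this brute-force scan grows with the number of patches, which is one motivation for solving for a mask directly.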
These techniques directly solve an optimization problem to determine the mask that can best explain the model’s behavior. I-GOS and I-GOS++ are the most popular techniques in this category.
1. Integrated Gradients Optimized Saliency (I-GOS)
The I-GOS paper generates a heat map by finding the smallest and smoothest mask that optimizes for the deletion metric. This involves identifying a mask such that if the masked portions of the image are removed, the model's prediction confidence is significantly reduced. Thus, the masked region is critical for the model’s decision-making. The mask in I-GOS is obtained by solving an optimization problem. One way to solve this problem is to use conventional gradients in the gradient descent algorithm. However, such a method can be very time-consuming and is prone to getting stuck in local optima. Thus, instead of using conventional gradients, the authors recommend using integrated gradients to provide a better descent direction. Integrated gradients are calculated by moving from a baseline image (one that gives very low confidence in the model output) to the original image and accumulating gradients on the images along this line.
2. I-GOS++
I-GOS++ extends I-GOS by also optimizing for the insertion metric. This metric implies that keeping only the highlighted portions of the heat map should be sufficient for the model to retain confidence in its decision. The main argument for incorporating insertion masks is to prevent adversarial masks, which don’t explain the model's behavior but score very well on the deletion metric. In fact, I-GOS++ optimizes three masks: a deletion mask, an insertion mask, and a combined mask. The combined mask is the element-wise product of the insertion and deletion masks and is the output of the I-GOS++ technique. This technique also adds regularization to make masks smooth on image areas with similar colors, thus enabling the generation of better high-resolution heat maps.
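The deletion and insertion metrics can be sketched for a fixed candidate mask as follows. This only scores a given mask; it does not perform the optimization that I-GOS or I-GOS++ run to find one. The mask convention (1 marks a highlighted pixel), the all-zero baseline, and `toy_model` are assumptions made for this example.

```python
def toy_model(image):
    """Hypothetical classifier: confidence is the mean of the top-left 2 x 2 pixels."""
    region = [image[i][j] for i in range(2) for j in range(2)]
    return sum(region) / len(region)


def blend(image, baseline, keep):
    """Pixel-wise mix: keep * image + (1 - keep) * baseline."""
    return [
        [k * p + (1 - k) * b for p, b, k in zip(img_row, base_row, keep_row)]
        for img_row, base_row, keep_row in zip(image, baseline, keep)
    ]


def deletion_score(model, image, baseline, mask):
    """Confidence after the highlighted pixels are removed (lower is better)."""
    keep = [[1 - m for m in row] for row in mask]
    return model(blend(image, baseline, keep))


def insertion_score(model, image, baseline, mask):
    """Confidence when only the highlighted pixels are kept (higher is better)."""
    return model(blend(image, baseline, mask))


# Assumed data: a uniform 4 x 4 image and an all-zero baseline.
image = [[1.0] * 4 for _ in range(4)]
baseline = [[0.0] * 4 for _ in range(4)]

# One mask highlights the region the toy model uses, the other misses it.
good_mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
bad_mask = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]

for name, mask in [("good", good_mask), ("bad", bad_mask)]:
    d = deletion_score(toy_model, image, baseline, mask)
    i = insertion_score(toy_model, image, baseline, mask)
    print(name, "deletion:", d, "insertion:", i)
```

A good explanation should score low on deletion and high on insertion at the same time, which is why optimizing both guards against masks that do well on only one of them.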
Next, we compare the heat maps of I-GOS and I-GOS++ with those of the Grad-CAM and Score-CAM approaches. For this, I made use of the I-GOS++ repo to generate heat maps for the Convnext-Tiny model for the bison, camel, and school bus examples used above. One can notice in the visualization below that the perturbation techniques provide less diffuse heat maps than the CAM approaches. In particular, I-GOS++ provides very precise heat maps.
Structured Attention Graphs for Image Classification
The Structured Attention Graphs (SAG) paper presents the counterview that a single explanation (heat map) is not sufficient to explain a model's decision-making. Rather, multiple explanations may exist that explain the model’s decision equally well. Thus, the authors suggest using beam search to find all such possible explanations and then using SAGs to concisely present this information for easier analysis. SAGs are essentially directed acyclic graphs in which each node is a set of image patches and each edge represents a subset relationship. Each subset is obtained by removing one patch from the parent node’s image. Each root node represents one of the possible explanations for the model’s decision. To build the SAG, we need to solve a subset selection problem to identify a diverse set of candidates that can serve as the root nodes. The child nodes are obtained by recursively removing one patch from the parent node. Then, the score for each node is obtained by passing the image represented by that node through the model. Nodes whose score falls below a certain threshold (40%) are not expanded further. This leads to a meaningful and concise representation of the model's decision-making process. However, the SAG approach is limited to coarser representations, as combinatorial search is very computationally expensive. Some illustrations of Structured Attention Graphs are provided below using the SAG GitHub repo.
For the bison and camel examples with the Convnext-Tiny model, we get only one explanation, but for the school bus example, we get three independent explanations.
Applications of Explanation Methods
Model Debugging
The I-GOS++ paper presents an interesting case study substantiating the need for model explainability. The model in this study was trained to detect COVID-19 cases using chest X-ray images. However, using the I-GOS++ technique, the authors discovered a bug in the decision-making process of the model. The model was paying attention not only to the area in the lungs but also to the text written on the X-ray images. Obviously, the text should not have been considered by the model, which indicates a possible case of overfitting. To alleviate this issue, the authors pre-processed the images to remove the text, and this improved performance on the original diagnosis task. Thus, a model explainability technique, I-GOS++, helped debug a critical model.
Understanding Decision-Making Mechanisms of CNNs and Transformers
Jiang et al., in their CVPR 2024 paper, deployed the SAG, I-GOS++, and Score-CAM techniques to understand the decision-making mechanisms of the most popular types of networks: Convolutional Neural Networks (CNNs) and Transformers. This paper applied explanation methods across a whole dataset instead of a single image and gathered statistics to explain the decision-making of these models. Using this approach, they found that Transformers have the ability to use multiple parts of an image to reach their decisions, in contrast to CNNs, which use several disjoint smaller sets of image patches to reach their decisions.
Key Takeaways
Several heat map techniques like Grad-CAM, Score-CAM, I-GOS, and I-GOS++ can be used to generate visualizations to understand which parts of the image a model focuses on when making its decisions.
Structured Attention Graphs provide an alternate visualization to provide multiple possible explanations for the model’s confidence in its predicted class. Explanation techniques can be used to debug the models and can also help better understand model architectures.