Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly more crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
Enterprise AI
In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that, there is a growing need to increase operational efficiency as customer demands arise. AI platforms have become increasingly sophisticated, and the need to establish guidelines and ownership has emerged. In DZone's 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot, and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience on practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business. This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.
As applications grow in complexity and user base, the demands on their underlying databases increase significantly. Efficient database scaling becomes crucial to maintain performance, ensure reliability, and manage large volumes of data. Scaling a database effectively involves a combination of strategies that optimize both hardware and software resources to handle increasing loads. This cheatsheet provides an overview of essential techniques for database scaling. From optimizing query performance with indexing to distributing data across multiple servers with horizontal scaling, each section covers a critical aspect of database management. Whether you're dealing with a rapidly growing application or preparing for future growth, understanding these strategies will help you make informed decisions to ensure your database remains robust and responsive. This guide will walk you through the key concepts and best practices for: Indexing: Enhancing query performance through efficient data retrieval methods Vertical scaling: Increasing the capacity of a single database server to handle more load Horizontal scaling/sharding: Distributing data across multiple servers to manage larger datasets and higher traffic Denormalization: Improving read performance by reducing the number of joins through strategic data redundancy Caching: Reducing database load by storing frequently accessed data in faster storage layers Replication: Enhancing availability and reliability by copying data across multiple databases By mastering these techniques, you can ensure that your database infrastructure scales efficiently and remains performant as your application and data grow. 1. Indexing What Is Indexing? Indexing is a technique used to improve the speed of data retrieval operations on a database table at the cost of additional storage space. An index creates a data structure (e.g., B-Tree, Hash Table) that allows the database to quickly locate rows without scanning the entire table. Key Concepts Primary index: Automatically created on the primary key of a table, it ensures uniqueness and speeds up query performance on that key. Secondary index: Created on columns that are frequently used in query conditions (WHERE clauses). It helps in speeding up searches but may slow down write operations due to the need to maintain the index. Composite index: An index on multiple columns. It is useful for queries that filter on multiple columns, but the order of columns in the index is crucial. Unique index: Ensures that the indexed columns have unique values, similar to a primary key but can be applied to non-primary columns. Best Practices Index selective columns: Columns with high cardinality (a large number of unique values) benefit most from indexing. Avoid over-indexing: While indexes speed up reads, they slow down writes (INSERT, UPDATE, DELETE) due to the additional overhead of maintaining the index. Use only necessary indexes. Monitor index performance: Regularly analyze query performance to ensure indexes are effectively used. Tools like EXPLAIN (in SQL) can help diagnose issues. Consider covering indexes: A covering index contains all the columns needed for a query, allowing the database to satisfy the query entirely from the index without accessing the table. Challenges Maintenance overhead: Indexes need to be updated as the data changes, which can introduce performance bottlenecks in write-heavy applications. 
Increased storage: Indexes consume additional disk space, which can be significant depending on the size of the data and the number of indexes. Complex queries: In some cases, complex queries may not benefit from indexes, especially if they involve functions or multiple table joins. Conclusion Indexing is a powerful tool for optimizing database performance, particularly for read-heavy workloads. However, it's essential to balance the benefits of fast data retrieval with the potential costs in terms of storage and write performance. Regularly review and optimize indexes to ensure your database scales effectively as your application grows. 2. Vertical Scaling What Is Vertical Scaling? Vertical scaling, also known as "scaling up," involves increasing the capacity of a single database server to handle a higher load. This can be achieved by upgrading the server's hardware, such as adding more CPU cores, increasing RAM, or using faster storage solutions like SSDs. The goal is to boost the server's ability to process more transactions, handle larger datasets, and improve overall performance. Key Concepts CPU upgrades: More powerful processors with higher clock speeds or additional cores can handle more concurrent queries, reducing latency and improving throughput. Memory expansion: Increasing the amount of RAM allows the database to cache more data in memory, reducing the need to access slower disk storage and speeding up query performance. Storage improvements: Moving from traditional hard drives to SSDs or even NVMe drives can drastically reduce data access times, leading to faster read and write operations. Database tuning: Beyond hardware upgrades, tuning the database configuration (e.g., adjusting buffer sizes, and cache settings) to take full advantage of the available resources is crucial for maximizing the benefits of vertical scaling. Advantages Simplicity: Vertical scaling is straightforward since it doesn't require changes to the application or database architecture. Upgrading hardware is often less complex than implementing horizontal scaling or sharding. Consistency: With a single server, there's no need to worry about issues like data consistency across multiple nodes or the complexities of distributed transactions. Maintenance: Managing a single server is simpler, as it involves fewer moving parts than a distributed system. Challenges Cost: High-performance hardware can be expensive, and there is often a diminishing return on investment as you approach the upper limits of server capacity. Single point of failure: Relying on a single server increases the risk of downtime if the server fails. Redundancy and failover mechanisms become critical in such setups. Scalability limits: There's a physical limit to how much you can scale up a single server. Once you reach the maximum hardware capacity, further scaling requires transitioning to horizontal scaling or sharding. Conclusion Vertical scaling is an effective solution for improving database performance in the short term, especially for applications that are not yet experiencing massive growth. However, it's important to recognize its limitations. As your application continues to grow, you may eventually need to combine vertical scaling with other strategies like horizontal scaling or replication to ensure continued performance and availability. Balancing the simplicity and power of vertical scaling with its potential limitations is key to maintaining a scalable database infrastructure. 3. 
Horizontal Scaling/Sharding What Is Horizontal Scaling? Horizontal scaling, often referred to as "scaling out," involves distributing your database across multiple servers to manage larger datasets and higher traffic. Unlike vertical scaling, where you improve a single server's capacity, horizontal scaling adds more servers to handle the load. This approach spreads the data and query load across multiple machines, allowing for virtually unlimited scaling as your application grows. Sharding Sharding is a specific technique used in horizontal scaling where the database is divided into smaller, more manageable pieces called "shards." Each shard is a subset of the overall data and is stored on a separate server. Queries are directed to the appropriate shard based on the data's partitioning logic (e.g., range-based, hash-based). Sharding helps distribute the load evenly across servers and can significantly improve performance and scalability. Key Concepts Partitioning: The process of dividing a database into smaller parts (shards) that can be spread across multiple servers. Partitioning logic determines how the data is divided (e.g., by user ID, geographic region). Replication: In conjunction with sharding, data can be replicated across shards to ensure availability and fault tolerance. Load balancing: Distributing incoming database queries evenly across multiple servers to prevent any single server from becoming a bottleneck. Consistency models: Ensuring data consistency across shards can be challenging. Different consistency models, such as eventual consistency or strong consistency, can be employed based on application requirements. Advantages Scalability: Horizontal scaling offers virtually unlimited scalability by adding more servers as needed. This allows your database infrastructure to grow with your application. Fault tolerance: By distributing data across multiple servers, the failure of a single server has less impact, as other servers can take over the load or provide data redundancy. Cost-effectiveness: Scaling out with multiple commodity servers can be more cost-effective than investing in increasingly expensive high-performance hardware for a single server. Challenges Complexity: Managing a sharded database is more complex than managing a single server. It requires careful planning of partitioning logic, replication strategies, and query routing. Consistency and availability: Ensuring consistency across shards can be difficult, especially in distributed environments. Trade-offs between consistency, availability, and partition tolerance (CAP theorem) need to be considered. Data redistribution: As your application grows, you may need to re-shard or redistribute data across servers, which can be a complex and resource-intensive process. Conclusion Horizontal scaling and sharding are powerful strategies for managing large-scale applications that require high availability and can handle massive amounts of data. While the complexity of managing a distributed system increases, the benefits of improved scalability, fault tolerance, and cost-effectiveness often outweigh the challenges. Proper planning and implementation of horizontal scaling can ensure your database infrastructure remains robust and scalable as your application continues to grow. 4. Denormalization What Is Denormalization? Denormalization is the process of intentionally introducing redundancy into a database to improve read performance. 
It involves restructuring a normalized database (where data is organized to minimize redundancy) by combining tables or adding duplicate data to reduce the number of joins required in queries. This can lead to faster query execution times at the cost of increased storage space and potential complexity in maintaining data consistency. Key Concepts Normalization vs. denormalization: Normalization organizes data to minimize redundancy and dependencies, typically through multiple related tables. Denormalization, on the other hand, merges these tables or adds redundant data to optimize query performance. Precomputed aggregates: Storing aggregated data (e.g., total sales per region) in a denormalized form can significantly speed up queries that require these calculations, reducing the need for complex joins or real-time computations. Data redundancy: By duplicating data across multiple tables or including commonly queried fields directly in related tables, denormalization reduces the need to join tables frequently, which can drastically improve query performance. Advantages Improved read performance: Denormalized databases can execute read-heavy queries much faster by eliminating the need for complex joins and reducing the computational overhead during query execution. Simplified queries: With fewer tables to join, queries become simpler and more straightforward, making it easier for developers to write and maintain efficient queries. Optimized for specific use cases: Denormalization allows you to tailor your database schema to optimize performance for specific, frequently executed queries, making it ideal for read-heavy applications. Challenges Data inconsistency: The primary trade-off in denormalization is the risk of data inconsistency. Since the same data might be stored in multiple places, ensuring that all copies of the data remain synchronized during updates can be challenging. Increased storage costs: Redundant data consumes additional storage space, which can be significant depending on the size of the database and the extent of denormalization. Complex updates: Updating data in a denormalized database can be more complex, as changes must be propagated across all redundant copies of the data, increasing the likelihood of errors and requiring more careful transaction management. Best Practices Selective denormalization: Only denormalize data that is frequently queried together or requires fast read performance. Avoid over-denormalizing, as it can lead to unmanageable complexity. Maintain a balance: Strive to balance the benefits of faster reads with the potential downsides of increased complexity and storage requirements. Regularly review your denormalization strategies as the application's needs evolve. Use case evaluation: Carefully evaluate the use cases where denormalization will have the most impact, such as in read-heavy workloads or where query performance is critical to user experience. Conclusion Denormalization is a powerful tool for optimizing read performance in databases, especially in scenarios where speed is critical. However, it comes with trade-offs in terms of data consistency, storage costs, and update complexity. By carefully applying denormalization where it makes the most sense, you can significantly enhance the performance of your database while managing the associated risks. Properly balancing normalization and denormalization is key to maintaining a scalable and performant database infrastructure. 5. Caching What Is Caching? 
Caching is a technique used to temporarily store frequently accessed data in a fast-access storage layer, such as memory, to reduce the load on the database and improve application performance. By serving data from the cache instead of querying the database, response times are significantly faster, and the overall system scalability is enhanced. Key Concepts In-memory cache: A cache stored in RAM, such as Redis or Memcached, which provides extremely fast data retrieval times. In-memory caches are ideal for storing small, frequently accessed datasets. Database query cache: Some databases offer built-in query caching, where the results of expensive queries are stored and reused for subsequent requests, reducing the need for repeated query execution. Object caching: Storing the results of expensive computations or database queries as objects in memory. This can be used to cache rendered pages, user sessions, or any other data that is expensive to generate or fetch. Cache expiration: A strategy to invalidate or refresh cached data after a certain period (time-to-live or TTL) to ensure that the cache doesn't serve stale data. Cache expiration policies can be time-based, event-based, or based on data changes. Advantages Improved performance: Caching can significantly reduce the load on the database by serving frequently accessed data from a faster cache layer, resulting in faster response times for users. Scalability: By offloading read operations to the cache, the database can handle more simultaneous users and queries, making the application more scalable. Cost efficiency: Reducing the number of database queries lowers the need for expensive database resources and can reduce the overall infrastructure costs. Challenges Cache invalidation: One of the most challenging aspects of caching is ensuring that the cached data remains fresh and consistent with the underlying database. Invalidation strategies must be carefully designed to prevent serving stale data. Cache misses: When data is not found in the cache (a cache miss), the application must fall back to querying the database, which can introduce latency. Proper cache population and management strategies are crucial to minimizing cache misses. Complexity: Implementing and maintaining a caching layer adds complexity to the application architecture. It requires careful planning and monitoring to ensure that the cache is effective and does not introduce additional issues, such as memory overuse or data inconsistency. Best Practices Use caching wisely: Cache data that is expensive to compute or frequently accessed. Avoid caching data that changes frequently unless you have a robust invalidation strategy. Monitor cache performance: Regularly monitor the cache hit rate (the percentage of requests served from the cache) and adjust cache size, expiration policies, and strategies to optimize performance. Layered caching: Consider using multiple layers of caching (e.g., in-memory cache for ultra-fast access and a distributed cache for larger datasets) to balance performance and resource utilization. Conclusion Caching is a critical component of a scalable database architecture, especially for read-heavy applications. It can dramatically improve performance and reduce the load on your database, but it must be implemented with careful consideration of cache invalidation, data consistency, and overall system complexity. By leveraging caching effectively, you can ensure that your application remains fast and responsive, even as the load increases. 6. 
Replication What Is Replication? Replication involves copying and maintaining database objects, such as tables, across multiple database servers. This process ensures that the same data is available across different servers, which can improve availability, fault tolerance, and load distribution. Replication can be set up in various configurations, such as master-slave, master-master, or multi-master, depending on the needs of the application. Key Concepts Master-slave replication: In this model, the master server handles all write operations, while one or more slave servers replicate the data from the master and handle read operations. This setup reduces the load on the master server and increases read performance. Master-master replication: In this configuration, multiple servers (masters) can accept write operations and replicate the changes to each other. This approach allows for high availability and load distribution but requires careful conflict resolution mechanisms. Synchronous vs. asynchronous replication: Synchronous replication ensures that data is written to all replicas simultaneously, providing strong consistency but potentially increasing latency. Asynchronous replication, on the other hand, allows for lower latency but introduces the risk of data inconsistency if a failure occurs before all replicas are updated. Failover and redundancy: Replication provides a failover mechanism where, if the master server fails, one of the slave servers can be promoted to master to ensure continuous availability. This redundancy is crucial for high-availability systems. Advantages High availability: By maintaining multiple copies of the data, replication ensures that the database remains available even if one or more servers fail. This is critical for applications that require 24/7 uptime. Load distribution: Replication allows read operations to be distributed across multiple servers, reducing the load on any single server and improving overall system performance. Fault tolerance: In the event of a hardware failure, replication provides a backup that can be quickly brought online, minimizing downtime and data loss. Challenges Data consistency: Ensuring that all replicas have consistent data can be challenging, especially in asynchronous replication setups where there might be a delay in propagating updates. Conflict resolution strategies are necessary for multi-master configurations. Increased complexity: Managing a replicated database system introduces additional complexity in terms of setup, maintenance, and monitoring. It requires careful planning and execution to ensure that replication works effectively and does not introduce new problems. Latency issues: Synchronous replication can introduce latency in write operations because the system waits for confirmation that all replicas have been updated before proceeding. This can affect the overall performance of the application. Best Practices Choose the right replication strategy: Select a replication model (master-slave, master-master, etc.) based on your application's specific needs for consistency, availability, and performance. Monitor and optimize: Regularly monitor replication lag (the delay between updates to the master and when those updates appear on the replicas) and optimize the replication process to minimize this lag. Plan for failover: Implement automated failover mechanisms to ensure that your system can quickly recover from failures without significant downtime. 
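To make the read/write split behind master-slave replication concrete, below is a small illustrative sketch of application-side query routing between a primary and its read replicas. The DSNs and the round-robin policy are assumptions made for the example, not a prescribed configuration.

```python
import itertools

class ReplicatedRouter:
    """Route writes to the primary and spread reads across replicas (round-robin)."""

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn
        self.replicas = itertools.cycle(replica_dsns)  # simple round-robin over read replicas

    def dsn_for(self, sql: str) -> str:
        # Naive classification: anything that is not a SELECT goes to the primary.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self.replicas) if is_read else self.primary_dsn

# Hypothetical DSNs for one primary and two read replicas.
router = ReplicatedRouter(
    "postgresql://primary:5432/app",
    ["postgresql://replica1:5432/app", "postgresql://replica2:5432/app"],
)
print(router.dsn_for("SELECT * FROM orders WHERE id = 1"))   # goes to a replica
print(router.dsn_for("UPDATE orders SET status = 'paid'"))   # goes to the primary
```

In production this routing is often delegated to a proxy or to driver-level features, but the decision logic stays the same: writes go to the primary, while reads can be spread across replicas.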
Conclusion Replication is a vital strategy for building a robust, high-availability database system. It enhances fault tolerance, improves read performance, and ensures data availability across multiple servers. However, it also introduces challenges related to data consistency and system complexity. By carefully selecting the right replication strategy and continuously monitoring and optimizing the replication process, you can build a scalable and reliable database infrastructure that meets the demands of modern applications.
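To round out the cheatsheet with something executable, here is a minimal sketch of the cache-aside pattern with TTL expiration described in the caching section above. The key format, TTL value, and in-process dictionary store are assumptions for illustration; a production setup would more likely use an external store such as Redis or Memcached.

```python
import time

class CacheAside:
    """Minimal cache-aside helper with TTL expiration (an in-process stand-in for Redis/Memcached)."""

    def __init__(self, ttl_seconds: int = 60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key, load_from_db):
        entry = self._store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                                   # cache hit
        value = load_from_db(key)                             # cache miss: fall back to the database
        self._store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        self._store.pop(key, None)                            # call on every write to keep data fresh

# Hypothetical usage with a fake loader standing in for a real database query.
cache = CacheAside(ttl_seconds=30)
print(cache.get("user:42", lambda k: {"id": 42, "name": "Jane"}))  # loaded, then cached
print(cache.get("user:42", lambda k: {"id": 42, "name": "Jane"}))  # served from the cache
```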
As a solutions architect with over two decades of experience in relational database systems, I recently started exploring MariaDB's new Vector Edition to see if it could address some of the AI data challenges we're facing. A quick look seemed pretty convincing, especially with how it could bring AI magic right into a regular database setup. However, I wanted to test it with a simple use case to see how it performs in practice. In this article, I will share my hands-on experience and observations about MariaDB's vector capabilities by running a simple use case. Specifically, I will be loading sample customer reviews into MariaDB and performing fast similarity searches to find related reviews. Environment Setup Python 3.10 or higher Docker Desktop My experiment started with setting up a Docker container using MariaDB's latest release (11.6) which includes vector capabilities. Shell # Pull the latest release docker pull quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview # Update password docker run -d --name mariadb_vector -e MYSQL_ROOT_PASSWORD=<replace_password> quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview Now, create a table and load it with sample customer reviews that include sentiment scores and embeddings for each review. To generate text embeddings, I am using SentenceTransformer, which lets you use pre-trained models. To be specific, I decided to go with a model called paraphrase-MiniLM-L6-v2 that takes our customer reviews and maps them into a 384-dimensional space. Python import mysql.connector import numpy as np from sentence_transformers import SentenceTransformer model = SentenceTransformer('paraphrase-MiniLM-L6-v2') # I already have a database created with a name vectordb connection = mysql.connector.connect( host="localhost", user="root", password="<password>", # Replace me database="vectordb" ) cursor = connection.cursor() # Create a table to store customer reviews with sentiment score and embeddings. cursor.execute(""" CREATE TABLE IF NOT EXISTS customer_reviews ( id INT PRIMARY KEY AUTO_INCREMENT, product_id INT, review_text TEXT, sentiment_score FLOAT, review_embedding BLOB, INDEX vector_idx (review_embedding) USING HNSW ) ENGINE=ColumnStore; """) # Sample reviews reviews = [ (1, "This product exceeded my expectations. Highly recommended!", 0.9), (1, "Decent quality, but pricey.", 0.6), (2, "Terrible experience. The product does not work.", 0.1), (2, "Average product, ok ok", 0.5), (3, "Absolutely love it! Best purchase I have made this year.", 1.0) ] # Load sample reviews into vector DB for product_id, review_text, sentiment_score in reviews: embedding = model.encode(review_text) cursor.execute( "INSERT INTO customer_reviews (product_id, review_text, sentiment_score, review_embedding) VALUES (%s, %s, %s, %s)", (product_id, review_text, sentiment_score, embedding.tobytes())) connection.commit() connection.close() Now, let's leverage MariaDB's vector capabilities to find similar reviews. This is more like asking "What other customers said similar to this review?". In the below example, I am going to find the top 2 reviews that are similar to a customer review that says "I am super satisfied!". To do this, I am using one of the vector functions (VEC_Distance_Euclidean) available in the latest release. 
Python # Convert the target customer review into vector target_review_embedding = model.encode("I am super satisfied!") # Find top 2 similar reviews using MariaDB's VEC_Distance_Euclidean function cursor.execute(""" SELECT review_text, sentiment_score, VEC_Distance_Euclidean(review_embedding, %s) AS similarity FROM customer_reviews ORDER BY similarity LIMIT %s """, (target_review_embedding.tobytes(), 2)) similar_reviews = cursor.fetchall() Observations It is easy to set up and we can combine both structured data (like product IDs and sentiment scores), unstructured data (review text), and their vector representations in a single table. I like its ability to use SQL syntax alongside vector operations which makes it easy for teams that are already familiar with relational databases. Here is the full list of vector functions supported in this release. The HNSW index improved the performance of the similarity search query for larger datasets that I tried so far. Conclusion Overall, I am impressed! MariaDB's Vector Edition is going to simplify certain AI-driven architectures. It bridges the gap between the traditional database world and the evolving demands of AI tools. In the coming months, I look forward to seeing how this technology matures and how the community adopts it in real-world applications.
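As a quick, optional sanity check on the ranking (not part of the original walkthrough), the same Euclidean ordering can be reproduced client-side with NumPy, assuming the model and sample reviews defined earlier are available:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

reviews = [
    "This product exceeded my expectations. Highly recommended!",
    "Decent quality, but pricey.",
    "Terrible experience. The product does not work.",
    "Average product, ok ok",
    "Absolutely love it! Best purchase I have made this year.",
]

# Smallest Euclidean distance = most similar, mirroring the VEC_Distance_Euclidean ordering.
target = model.encode("I am super satisfied!")
distances = [float(np.linalg.norm(model.encode(text) - target)) for text in reviews]
for dist, text in sorted(zip(distances, reviews))[:2]:
    print(f"{dist:.4f}  {text}")
```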
The goal of this article is to examine the fields of statistics and machine learning and compare their differences, similarities, typical uses, and ways of analyzing data. Both branches allow us to interpret data; however, they rest on different pillars: statistics on mathematics, and machine learning on computer science.

Introduction

Artificial intelligence, together with machine learning, is currently the most technologically advanced means of extracting useful information from the raw data changing around us every day. By contrast, statistics, a field of research more than three centuries old, has long been regarded as a core discipline for interpreting collected data and supporting decision-making. Even though both share the goal of studying data, how that goal is achieved and where the focus lies differ between statistics and machine learning. This article relates the two fields and shows how each addresses the needs of contemporary society as data science expands.

1. Foundations and Definitions

Statistics: A branch of mathematics concerned with the organization, evaluation, analysis, and representation of numerical data. It has developed over roughly three hundred years and is applied in fields such as economics, the health sciences, and social studies.

Machine Learning (ML): An area of computer science concerned with extracting intelligence from data so that systems can make better decisions in the future. It covers algorithms capable of identifying sophisticated patterns and generalizing them to new, unseen data. Machine learning is a much younger field, having developed over roughly the past 30 years.

2. Key Differences Between Statistics and Machine Learning

| Aspect | Statistics | Machine Learning |
| --- | --- | --- |
| Assumptions | Assumes relationships between variables (e.g., alpha, beta) before building models | Makes fewer assumptions and can model complex relationships without prior knowledge |
| Interpretability | Focuses on interpretation: parameters like coefficients provide insight into how variables influence outcomes | Focuses on predictive accuracy: often works with complex algorithms (e.g., neural networks) that act as "black boxes" |
| Data Size | Traditionally works with smaller, structured datasets | Designed to handle large, complex datasets, including unstructured data (e.g., text, images) |
| Applications | Used in areas like social sciences, economics, and medicine for making inferences about populations | Applied in AI, computer vision, NLP, and recommender systems, focusing on predictive modeling |

3. Learning Approaches

Statistics: The approach is deductive: a hypothesis is proposed and a sample is used to either reject or support it. A central concern is controlling for bias in the sample when inferences are drawn from the sample to the population.

Machine Learning: The approach is inductive rather than deductive. The algorithm recognizes the patterns present in the data without a predefined hypothesis; machine learning models are geared toward discovering patterns rather than merely testing hypotheses.
4. Example: Linear Regression in Both Fields

The same linear regression formula, y = mx + b (or y = ax + b), appears in both statistics and machine learning; however, the methodologies differ. In statistics, the model is built for analysis and description: the target variable is expressed as a function of the input variables, and the estimated parameters are examined to understand how those inputs influence the outcome. In machine learning, the same model is fit so as to minimize the error between the predicted output and the actual output; the emphasis is on predictive performance rather than on interpreting the fitted parameters.

5. Applications of Statistics vs. Machine Learning

| Application Area | Statistics | Machine Learning |
| --- | --- | --- |
| Social Sciences | Used for sampling to make inferences about large populations | Predictive models for identifying patterns in survey data |
| Economics and Medicine | Statistical models (e.g., ANOVA, t-tests) to identify significant trends | AI models to predict patient outcomes or stock market trends |
| Quality Control | Applies hypothesis testing for quality assurance | AI-driven automation in manufacturing for predictive maintenance |
| Artificial Intelligence (AI) | Less common in AI due to its focus on smaller datasets | Central to AI, including in computer vision and NLP |

6. Example Algorithms in Each Field

| Statistics Algorithms | Machine Learning Algorithms |
| --- | --- |
| Linear Regression | Decision Trees |
| Logistic Regression | Neural Networks |
| ANOVA (Analysis of Variance) | Support Vector Machines (SVM) |
| t-tests, Chi-square tests | k-Nearest Neighbors (KNN) |
| Hypothesis Testing | Random Forests |

7. Handling Data

Statistics: Most effective with well-defined, clean datasets, where the dependence among variables is linear or otherwise known.

Machine Learning: Works well with big, messy, and unstructured data (such as images and video) that has no predefined format. It can also handle nonlinear relationships that are often difficult to capture with statistical techniques.

Conclusion: Choosing the Right Tool

Both statistics and machine learning are useful for analyzing data; the question is which one to use in which scenario. Statistics is appropriate when the goal is to analyze data and establish how independent and dependent variables are related, especially when working with lower-dimensional, structured data. Machine learning is appropriate when the objective is predictive modeling on vast or unstructured data, and where predictive power takes precedence over explanatory power. In modern practice, the two approaches are usually used together: for example, a data analyst may first explore the data with statistical methods and then apply predictive models to refine the predictions.

Summary Table: Statistics vs. Machine Learning

| Factor | Statistics | Machine Learning |
| --- | --- | --- |
| Approach | Deductive, starts with a hypothesis | Inductive, learns patterns from data |
| Data Type | Structured, smaller datasets | Large, complex, and unstructured datasets |
| Interpretability | High: focuses on insights from models | Low: models often function as "black boxes" |
| Application Areas | Economics, social sciences, medicine | AI, computer vision, natural language processing |

By understanding both fields, data scientists can choose the right method based on their goals, whether that is interpreting data or making predictions.
Ultimately, the integration of statistics and machine learning is the key to unlocking powerful insights from today’s vast and complex datasets.
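To make the linear regression comparison from section 4 concrete, here is a small illustrative sketch on synthetic data (assuming SciPy and scikit-learn are installed): the same y = ax + b model is fit once with the statistical emphasis on estimated parameters and their p-value, and once with the machine learning emphasis on held-out predictive accuracy.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 3x + 5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 200)

# Statistical view: estimate and interpret the parameters of the fitted line.
fit = stats.linregress(x, y)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, p-value={fit.pvalue:.3g}")

# Machine learning view: minimize prediction error and evaluate on unseen data.
X_train, X_test, y_train, y_test = train_test_split(x.reshape(-1, 1), y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"held-out R^2 = {model.score(X_test, y_test):.3f}")
```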
It may happen that we create an index, but PostgreSQL doesn’t use it. What could be the reason, and how can we fix it? We identified 11 distinct scenarios. Read on to find out. Important Things Indexes may be tricky. We already covered how they work in another article. Let’s quickly recap the important parts of how they work. How Indexes Work The B-tree is a self-balancing tree data structure that maintains a sorted order of entries, allowing for efficient searches, insertions, and deletions in logarithmic time — unlike regular heap tables, which operate in linear time. B-trees are an extension of binary search trees, with the key distinction that they can have more than two children. B-trees in PostgreSQL satisfy the following: Every node has at most 3 children. Every node (except for the root and leaves) has at least 2 children. The root node has at least two children (unless the root is a leaf). Leaves are on the same level. Nodes on the same level have links to siblings. Each key in the leaf points to a particular TID. Page number 0 of the index holds metadata and points to the tree root. We can see a sample B-tree below: When inserting a key into a full node, the node must be split into two smaller nodes, with the key being moved up to the parent. This process may trigger further splits at the parent level, causing the change to propagate up the tree. Likewise, during key deletion, the tree remains balanced by merging nodes or redistributing keys among sibling nodes. For searching, we can apply the binary search algorithm. Suppose we're searching for a value that appears in the tree only once. Here's how we would navigate the tree while looking for 62: You can see that we just traverse the nodes top-down, check the keys, and finally land at a value. A similar algorithm would work for a value that doesn’t exist in the tree. This is more obvious when we look for all the values that are greater than 0: How Database Estimates the Cost The index cost consists of the start-up cost and the run cost. The start-up cost depends on the number of tuples in the index and the height of the index. It is specified as cpu_operator_cost multiplied by the ceiling of the base-2 logarithm of the number of tuples plus the height of the tree multiplied by 50. The run cost is the sum of the CPU cost (for both the table and the index) and the I/O cost (again, for the table and the index). The CPU cost for the index is calculated as the sum of constants cpu_index_tuple_cost and qual_op_cost (which are 0.005 and 0.0025 respectively) multiplied by the number of tuples in the index multiplied by the selectivity of the index. Selectivity of the index is the estimate of how many values inside the index would match the filtering criteria. It is calculated based on the histogram that captures the counts of the values. The idea is to divide the column’s values into groups of approximately equal populations to build the histogram. Once we have such a histogram, we can easily specify how many values should match the filter. The CPU cost for the table is calculated as the cpu_tuple_cost (constant 0.01) multiplied by the number of tuples in the table multiplied by the selectivity. Notice that this assumes that we’ll need to access each tuple from the table even if the index is covering. The I/O cost for the index is calculated as the number of pages in the index multiplied by the selectivity multiplied by the random_page_cost which is set to 4.0 by default. 
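As a rough, illustrative summary of the arithmetic just described (not the actual planner code), the start-up, CPU, and index I/O components can be sketched as follows; the inputs passed in at the end are made-up numbers, and the table I/O component, covered next, is omitted:

```python
import math

# Default planner constants referenced above.
cpu_operator_cost = 0.0025
cpu_index_tuple_cost = 0.005
qual_op_cost = 0.0025
cpu_tuple_cost = 0.01
random_page_cost = 4.0

def rough_index_scan_cost(n_index_tuples, n_table_tuples, n_index_pages, tree_height, selectivity):
    """Simplified estimate following the description above; the table I/O term is not included."""
    startup = (math.ceil(math.log2(n_index_tuples)) + tree_height * 50) * cpu_operator_cost
    cpu_index = (cpu_index_tuple_cost + qual_op_cost) * n_index_tuples * selectivity
    cpu_table = cpu_tuple_cost * n_table_tuples * selectivity
    io_index = n_index_pages * selectivity * random_page_cost
    return startup, startup + cpu_index + cpu_table + io_index

# Made-up inputs: 10,000 index tuples and table rows, 30 index pages, height 1, 1% selectivity.
print(rough_index_scan_cost(10_000, 10_000, 30, 1, selectivity=0.01))
```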
Finally, the I/O cost for the table is calculated as the sum of the number of pages multiplied by the random_page_cost and the index correlation multiplied by the difference between the worst-case I/O cost and the best-case I/O cost. The correlation indicates how much the order of tuples in the index is aligned with the order of tuples in the table. It can be between -1.0 and 1.0. All these calculations show that index operations can be beneficial if we extract only a subset of rows. A full index scan may be even more expensive than a sequential scan, but index seeks will be much faster. Index Construction Indexes can be built on multiple columns. The order of columns matters, the same as the order of values in the column (whether it's ascending or descending). The index requires a way to sort values. The B-tree index has an inherent way of organizing values for certain data types, which we can exploit readily for primary keys or GUIDs, since their order is typically immutable and predictable thanks to their "built-in" alignment with increasing values. However, when dealing with complex custom data types — say, an object containing multiple attributes of varying types — it's not always apparent how these should be ordered within a B-tree index, since they lack an inherent ordering, unlike primitive or standardized composite data types such as integers and structured arrays. Consequently, for complex data types — consider two-dimensional points in space, which lack any natural order due to their multi-attribute nature — it may be necessary to incorporate additional components into the indexing strategy or to create multiple indexes that cater to different sorting scenarios. PostgreSQL addresses this through operator families, which let us define the desired ordering of an attribute (or set of attributes) in complex data types — such as two-dimensional Cartesian points ordered by X or Y values separately — giving us the flexibility to establish a custom data arrangement within a B-tree index. How to Check if My Index Is Used We can always check if the index was used with EXPLAIN ANALYZE. We use it like this: EXPLAIN ANALYZE SELECT … This query returns a textual representation of the plan with all the operations listed. If we see the index scan among them, then the index has been used. There are other tools; for instance, Metis shows which indexes were used. Why Isn't My Index Used? Let's now go through the reasons why an index may not be used. Even though we focus on B-tree indexes here, similar issues can apply to other index types (BRIN, GiST, GIN, etc.). The general principle is that either the database can't use the index, or the database thinks that the index will make the query slower. All the scenarios we explain below simply come down to these two principles and only manifest themselves differently. The Index Will Make the Query Slower Let's start with the cases when the database thinks the index will make the query slower. Table Scan Is Cheaper The very first reason is that the table scan may be faster than the index. This can be the case for small tables. A trivial example includes the following.
Let’s create a table: CREATE TABLE people (id INT, name VARCHAR, last_name VARCHAR); Let’s insert two rows: INSERT INTO people VALUES (1, 'John', 'Doe'), (2, 'Mary', 'Alice') Let’s now create an index: CREATE INDEX people_idx ON people(id); If we now try to query the table and get the execution plan, we get the following: EXPLAIN ANALYZE SELECT * FROM people WHERE id = 2 Seq Scan on people (cost=0.00..1.02 rows=1 width=36) (actual time=0.014..0.016 rows=1 loops=1) Filter: (id = 2) Rows Removed by Filter: 1 Planning Time: 0.068 ms Execution Time: 0.123 ms We can see the database decided to scan the table instead of using an index. That is because it was cheaper to scan the table. We can disable scans if possible with: set enable_seqscan=off Let’s now try the query: EXPLAIN ANALYZE SELECT * FROM people WHERE id = 2 Index Scan using people_idx on people (cost=0.13..8.14 rows=1 width=68) (actual time=0.064..0.066 rows=1 loops=1) Index Cond: (id = 2) Planning Time: 0.066 ms Execution Time: 0.170 ms We can see that the sequential scan cost was 1.02 whereas the index scan cost was 8.14. Therefore, the database was right to scan the table. The Query Is Not Selective Enough Another reason is that the query may not be selective enough. This is the same case as before, only it manifests itself differently. This time, we don’t deal with small tables, but we extract too many rows from the table. Let’s add more rows to the table: INSERT INTO people ( WITH RECURSIVE numbers(n) AS ( VALUES (3) UNION ALL SELECT n+1 FROM numbers WHERE n < 1000 ) SELECT n, 'Jack ' || n, 'Dean' FROM numbers ) Let’s now query like this: EXPLAIN ANALYZE SELECT * FROM people WHERE id <= 999 Seq Scan on people (cost=0.00..7.17 rows=5 width=68) (actual time=0.011..0.149 rows=999 loops=1) Filter: (id <= 999) Rows Removed by Filter: 1 Planning Time: 0.072 ms Execution Time: 0.347 ms We can see the sequential scan again. Let’s see the cost when we disable scans: Index Scan using people_idx on people (cost=0.28..47.76 rows=999 width=17) (actual time=0.012..0.170 rows=999 loops=1) Index Cond: (id <= 999) Planning Time: 0.220 ms Execution Time: 0.328 ms So we can see the scan is 7.17 versus 47.76 for the index scan. This is because the query extracts nearly all the rows from the table. It’s not very selective. However, if we try to pick just one row, we should get this: EXPLAIN ANALYZE SELECT * FROM people WHERE id = 2 Index Scan using people_idx on people (cost=0.28..8.29 rows=1 width=17) (actual time=0.018..0.020 rows=1 loops=1) Index Cond: (id = 2) Planning Time: 0.064 ms Execution Time: 0.126 ms Keep in mind that partial indexes may be less or more selective and affect the execution plans. LIMIT Clause Misleads the Database The LIMIT clause may mislead the database and make it think that the sequential scan will be faster. Let’s take this query: EXPLAIN ANALYZE SELECT * FROM people WHERE id <= 50 LIMIT 1 We want to find the rows with id less or equal to fifty. However, we take only one row. The plan looks like this: Limit (cost=0.00..0.39 rows=1 width=17) (actual time=0.016..0.017 rows=1 loops=1) -> Seq Scan on people (cost=0.00..19.50 rows=50 width=17) (actual time=0.015..0.015 rows=1 loops=1) Filter: (id <= 50) Planning Time: 0.073 ms Execution Time: 0.131 ms We can see the database decided to scan the table. 
However, let's change the LIMIT to three rows: EXPLAIN ANALYZE SELECT * FROM people WHERE id <= 50 LIMIT 3 Limit (cost=0.28..0.81 rows=3 width=17) (actual time=0.018..0.019 rows=3 loops=1) -> Index Scan using people_idx on people (cost=0.28..9.15 rows=50 width=17) (actual time=0.017..0.018 rows=3 loops=1) Index Cond: (id <= 50) Planning Time: 0.073 ms Execution Time: 0.179 ms This time we get the index scan. This clearly shows that the LIMIT clause affects the execution plan. However, notice that both queries executed in 0.131 ms and 0.179 ms, respectively. If we disable the scans, we get the following: Limit (cost=0.28..0.81 rows=3 width=17) (actual time=0.019..0.021 rows=3 loops=1) -> Index Scan using people_idx on people (cost=0.28..9.15 rows=50 width=17) (actual time=0.018..0.019 rows=3 loops=1) Index Cond: (id <= 50) Planning Time: 0.074 ms Execution Time: 0.090 ms In this case, the table scan was indeed faster than the index scan. This obviously depends on the query we execute. Be careful with using LIMIT. Inaccurate Statistics Mislead the Database As we saw before, the database uses many heuristics to calculate the query cost. The query planner estimates are based on the number of rows that each operation will return. This is based on the table statistics that may be off if we change the table contents significantly or when columns are dependent on each other (with multivariate statistics). As explained in the PostgreSQL documentation, statistics like reltuples and relpages are not updated on the fly. They get updated periodically or when we run commands like VACUUM, ANALYZE, or CREATE INDEX. Always keep your statistics up to date. Run ANALYZE periodically to make sure that your numbers are not off, and always run ANALYZE after batch-loading multiple rows. You can also read more in the PostgreSQL documentation about multivariate statistics and how they affect the planner. The Database Is Configured Incorrectly The planner uses various constants to estimate the query cost. These constants should reflect the hardware we use and the infrastructure backing our database. One of the prominent examples is random_page_cost. This constant is meant to represent the cost of accessing a data page randomly. It's set to 4 by default whereas the seq_page_cost constant is set to 1. This makes a lot of sense for HDDs, which are much slower for random access than for sequential access. However, SSDs and NVMe drives do not suffer that much with random access. Therefore, we may want to change this constant to a much lower value (like 2 or even 1.1). This heavily depends on your hardware and infrastructure, so do not tweak these values blindly. There Are Better Indexes Another case for not using our index is when there is a better index in place. Let's create the following index: CREATE INDEX people_idx2 ON people(id) INCLUDE (name); Let's now run this query: EXPLAIN ANALYZE SELECT id, name FROM people WHERE id = 123 Index Only Scan using people_idx2 on people (cost=0.28..8.29 rows=1 width=12) (actual time=0.026..0.027 rows=1 loops=1) Index Cond: (id = 123) Heap Fetches: 1 Planning Time: 0.208 ms Execution Time: 0.131 ms We can see it uses people_idx2 instead of people_idx. The database could also use people_idx, but people_idx2 covers all the queried columns, so it can satisfy the query with an index-only scan.
We can drop people_idx2 to see how it affects the query: Index Scan using people_idx on people (cost=0.28..8.29 rows=1 width=12) (actual time=0.010..0.011 rows=1 loops=1) Index Cond: (id = 123) Planning Time: 0.108 ms Execution Time: 0.123 ms We can see that using people_idx had the same estimated cost but was faster. Therefore, always tune your indexes to match your production queries. The Index Can't Be Used Let's now examine cases when the index can't be used for some technical reasons. The Index Uses Different Sorting Each index must keep a specific order of rows to be able to run the binary search. If we query for rows in a different order, we may not be able to use the index (or we would need to sort all the rows afterward). Let's see that. Let's drop all the indexes and create this one: CREATE INDEX people_idx3 ON people(id, name, last_name) Notice that the index covers all the columns in the table and stores them in ascending order for every column. Let's now run this query: EXPLAIN ANALYZE SELECT id, name FROM people ORDER BY id, name, last_name Index Only Scan using people_idx3 on people (cost=0.15..56.90 rows=850 width=68) (actual time=0.006..0.007 rows=0 loops=1) Heap Fetches: 0 Planning Time: 0.075 ms Execution Time: 0.117 ms We can see we used the index to scan the table. However, let's now sort by name DESC: EXPLAIN ANALYZE SELECT id, name FROM people ORDER BY id, name DESC, last_name Sort (cost=66.83..69.33 rows=1000 width=17) (actual time=0.160..0.211 rows=1000 loops=1) Sort Key: id, name DESC, last_name Sort Method: quicksort Memory: 87kB -> Seq Scan on people (cost=0.00..17.00 rows=1000 width=17) (actual time=0.009..0.084 rows=1000 loops=1) Planning Time: 0.120 ms Execution Time: 0.427 ms We can see the index wasn't used. That is because the index stores the rows in a different order than the query requested. Always configure indexes according to your queries to avoid mismatches like this one. The Index Stores Different Columns Another example is when we order by columns in a way the index cannot support. Let's run this query: EXPLAIN ANALYZE SELECT id, name FROM people ORDER BY id, last_name Sort (cost=66.83..69.33 rows=1000 width=17) (actual time=0.157..0.198 rows=1000 loops=1) Sort Key: id, last_name Sort Method: quicksort Memory: 87kB -> Seq Scan on people (cost=0.00..17.00 rows=1000 width=17) (actual time=0.012..0.086 rows=1000 loops=1) Planning Time: 0.081 ms Execution Time: 0.388 ms Notice that the index wasn't used. That is because the requested sort order (id, last_name) skips the name column, so it is not a leading prefix of the index's column order and the index cannot provide the requested ordering. Again, always tune your indexes to store the data in the order you need. The Query Uses Functions Differently Let's now run this query with a function: EXPLAIN ANALYZE SELECT id, name FROM people WHERE abs(id) = 123 Seq Scan on people (cost=0.00..22.00 rows=5 width=12) (actual time=0.019..0.087 rows=1 loops=1) Filter: (abs(id) = 123) Rows Removed by Filter: 999 Planning Time: 0.124 ms Execution Time: 0.181 ms We can see the index wasn't used. That is because the index was created on the raw value of the id column, but the query filters on abs(id). The engine could do some more extensive analysis to understand that the function doesn't change anything in this case, but it decided not to.
To make the query faster, we can either not use the function in the query (recommended) or create an index for this query specifically: CREATE INDEX people_idx4 ON people(abs(id)) And then we get: Bitmap Heap Scan on people (cost=4.31..11.32 rows=5 width=12) (actual time=0.022..0.024 rows=1 loops=1) Recheck Cond: (abs(id) = 123) Heap Blocks: exact=1 -> Bitmap Index Scan on people_idx4 (cost=0.00..4.31 rows=5 width=0) (actual time=0.016..0.017 rows=1 loops=1) Index Cond: (abs(id) = 123) Planning Time: 0.209 ms Execution Time: 0.159 ms The Query Uses Different Data Types Yet another example is when the query uses a different data type than the one stored in the index. Let's run this query: EXPLAIN ANALYZE SELECT id, name FROM people WHERE id = 123::numeric Seq Scan on people (cost=0.00..22.00 rows=5 width=12) (actual time=0.030..0.156 rows=1 loops=1) Filter: ((id)::numeric = '123'::numeric) Rows Removed by Filter: 999 Planning Time: 0.093 ms Execution Time: 0.278 ms Even though 123 and 123::numeric represent the same value, we can't use the index because it stores integers instead of numeric types. To fix the issue, we can create a new index targeting this query or change the cast to match the data type: EXPLAIN ANALYZE SELECT id, name FROM people WHERE id = 123::int Bitmap Heap Scan on people (cost=4.18..12.64 rows=4 width=36) (actual time=0.005..0.006 rows=0 loops=1) Recheck Cond: (id = 123) -> Bitmap Index Scan on people_idx3 (cost=0.00..4.18 rows=4 width=0) (actual time=0.004..0.004 rows=0 loops=1) Index Cond: (id = 123) Planning Time: 0.078 ms Execution Time: 0.088 ms Operators Are Not Supported Yet another example of when an index can't be used is when we query data with an unsupported operator. Let's create such an index: CREATE INDEX people_idx5 ON people(name) Let's now query the table with the following: EXPLAIN ANALYZE SELECT name FROM people WHERE name = 'Jack 123' Index Only Scan using people_idx5 on people (cost=0.28..8.29 rows=1 width=8) (actual time=0.025..0.026 rows=1 loops=1) Index Cond: (name = 'Jack 123'::text) Heap Fetches: 1 Planning Time: 0.078 ms Execution Time: 0.139 ms We can see the index worked. However, let's now change the operator to ILIKE: EXPLAIN ANALYZE SELECT name FROM people WHERE name ILIKE 'Jack 123' Seq Scan on people (cost=0.00..19.50 rows=1 width=8) (actual time=0.075..0.610 rows=1 loops=1) Filter: ((name)::text ~~* 'Jack 123'::text) Rows Removed by Filter: 999 Planning Time: 0.130 ms Execution Time: 0.725 ms We can see the database decided not to use an index. This is because we use the ILIKE operator, which is not supported by a regular B-tree index. Therefore, always use the appropriate operators to use indexes efficiently. Keep in mind that operator behavior can depend on various settings. For instance, a different collation may cause an index to be ignored. Testing Indexes With HypoPG To test various indexes, we don't need to create them. We can use the HypoPG extension to analyze an index without creating it in the database. Let's see that in action. Drop all the indexes and run the following query: EXPLAIN ANALYZE SELECT * FROM people WHERE id = 2 Seq Scan on people (cost=0.00..19.50 rows=1 width=17) (actual time=0.011..0.079 rows=1 loops=1) Filter: (id = 2) Rows Removed by Filter: 999 Planning Time: 0.103 ms Execution Time: 0.163 ms We can see that no index was used as there were no indexes at all. We can now see what would happen if we had an index.
Let's first pretend we created it: SELECT * FROM hypopg_create_index('CREATE INDEX ON people (id)'); 13563 <13563>btree_people_id And let's now ask the database whether it would be used (notice there is no ANALYZE): EXPLAIN SELECT * FROM people WHERE id = 2 Index Scan using "<13563>btree_people_id" on people (cost=0.00..8.01 rows=1 width=17) This way we can test various indexes without paying the cost of building them.

Summary
Indexes may be tricky, and there are many reasons why they are not used. Ultimately, it all comes down to the database either not being able to use the index or estimating that using it would make things slower. Fortunately, we can easily verify whether indexes are used with EXPLAIN ANALYZE, and we can also test new indexes with HypoPG.
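If you want to check index usage programmatically rather than by eyeballing EXPLAIN output, a small helper can inspect the JSON plan. This is a minimal sketch assuming the psycopg2 driver and a locally reachable database; the connection parameters and the people table are illustrative.

Python
import json
import psycopg2  # assumes the psycopg2 driver is installed

def index_used(conn, query, params=None):
    """Return True if the plan for `query` contains any Index* scan node."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (FORMAT JSON) " + query, params)
        plan = cur.fetchone()[0]
    if isinstance(plan, str):          # some setups return the plan as text
        plan = json.loads(plan)

    def walk(node):
        if "Index" in node.get("Node Type", ""):
            return True
        return any(walk(child) for child in node.get("Plans", []))

    return walk(plan[0]["Plan"])

# Usage (connection parameters are illustrative):
# conn = psycopg2.connect(dbname="test", user="postgres")
# print(index_used(conn, "SELECT * FROM people WHERE id = %s", (123,)))

A check like this can be dropped into a test suite so that a schema change silently breaking an index-backed query is caught early.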
Thanks to Cheney Zhang (Zilliz)

Retrieval-Augmented Generation (RAG) techniques, by integrating external knowledge bases, provide additional contextual information for LLMs, effectively alleviating issues such as hallucination and insufficient domain knowledge of LLMs. However, relying solely on general knowledge bases has its limitations, especially when dealing with complex entity relationships and multi-hop questions, where the model often struggles to provide accurate answers. Introducing Knowledge Graphs (KGs) into the RAG system provides a new solution to this problem. KGs present entities and their relationships in a structured manner, offering more refined contextual information during retrieval. By leveraging the abundant relational data of KGs, RAG can not only pinpoint relevant knowledge more accurately but also better handle complex question-answering scenarios, such as comparing entity relationships or answering multi-hop questions.

However, KG-RAG is still at an early, exploratory stage, and the industry has not yet reached a consensus on the technical path: there is currently no unified paradigm for how to effectively retrieve relevant entities and relationships from the knowledge graph, or for how to combine vector similarity search with the graph structure. For example, Microsoft's From Local to Global aggregates subgraph structures into community summaries through a large number of LLM requests, but this process consumes a substantial number of LLM tokens, making the approach expensive and impractical. HippoRAG uses Personalized PageRank to update the weights of graph nodes and identify important entities, but this entity-centered method is easily affected by omissions during Named Entity and Relation (NER) extraction, overlooking other information in the context. IRCoT uses multi-step LLM requests to gradually infer the final answer, but introducing the LLM into the multi-hop search process extends the time needed to answer questions, making it difficult to use in practice.

We found that a simple RAG paradigm with multi-way retrieval followed by reranking can handle complex multi-hop KG-RAG scenarios very well, without requiring excessive LLM overhead or any graph-structure storage or algorithm. Despite using a very simple architecture, our method significantly outperforms current state-of-the-art solutions, such as HippoRAG, and only requires vector storage and a small amount of LLM overhead. We first introduce the theoretical basis of our method and then describe the specific process. Our simple pipeline is not much different from the common multi-way retrieval and rerank architecture, yet it achieves state-of-the-art performance in the multi-hop graph RAG scenario.

Limited Hop Count Theory
In real-life KG-RAG scenarios, we noticed a concept we call the limited hop count. In KG-based RAG, the actual query only requires a limited and relatively small number of hops (usually fewer than four) within the knowledge graph. Our limited hop count theory is based on two critical observations: the limited complexity of queries and the local dense structure of "shortcuts".

1. Limited Complexity of Queries
A user's query is unlikely to involve numerous entities or introduce complex relationships. If it did, the question would seem peculiar and unrealistic. Normal query: "In which year did Einstein win the Nobel Prize?" Query path in the knowledge graph: Find the "Einstein" node.
Jump to the "Nobel Prize" node associated with "Einstein". Return the year the prize was awarded. Hop count: 2 hops Explanation: This is a standard user query, where the user wants to know a single fact directly associated with a specific entity. In this case, the knowledge graph only needs a few hops to complete the task, as all relevant information is directly linked to the central node, Einstein. This type of query is very common in practice, such as querying celebrity background information, award history, event time, etc. Weird query: "What is the relationship between the year the discoverer of the theory of relativity received the Nobel Prize and the number of patents they invented in a country famous for its bank secrecy laws and the magnificent scenery of the Alps?" Query path in the knowledge graph: Find that the "inventor" of "relativity" is "Einstein". Jump to the "Nobel Prize" node associated with "Einstein". Look up the year the "Nobel Prize" was awarded. Identify "Switzerland" through "bank secrecy laws and the Alps". Jump to the "patent" node associated with "Einstein". Look up patent information related to the period in Switzerland. Compare the relationship between the number of patents and the year of the award. Hop count: 7 hops Explanation: This question is complex, requiring not just a single fact query, but also intricate associations between multiple nodes. This type of question is not common in actual scenarios because users generally do not seek such complex cross-information in a single query. Usually, these types of questions are divided into multiple simple queries to gradually obtain information. You may think something about the number of hops sounds familiar, it's because all commonly used information is usually linkable in only a limited number of steps. You can see this in practice in the Six Degrees of Kevin Bacon. 2. Local Dense Structure of “Shortcuts” There are some local dense structures in the knowledge graph, and for some queries, there are "shortcuts" that can quickly connect to entities several hops away from one entity. Suppose we have a family relationship knowledge graph that contains the following entities and relationships: Alex is the child of Brian (Alex - child_of - Brian) Cole is married to Brian (Cole - married_to - Brian) Daniel is the brother of Cole (Daniel - brother_of - Cole) Daniel is the uncle of Alex (Daniel - uncle_of - Alex) This is a dense knowledge graph with redundant information. The last relationship can obviously be derived from the first three relationships. However, there are often some redundant information shortcuts in the knowledge graph. These shortcuts can reduce the number of hops between some entities. Based on these two observations, we find that the routing lookup process within the knowledge graph for a limited number of times only involves local knowledge graph information. Therefore, the process of retrieving information within the knowledge graph for a query can be implemented in the following two steps: The starting point of the route can be found through vector similarity lookup. It can involve the similarity relationship lookup between the query and entities or the query and relationships. The routing process to find other information from the starting point can be replaced with an LLM. Put this alternative information into the prompt, and rely on the powerful self-attention mechanism of LLM to select valuable routes. 
As the length of the prompt is limited, only local knowledge graph information can be included, such as the knowledge graph information within a limited number of hops around the starting point, which is guaranteed by the limited hop count theory. The whole process does not need any other KG storage or complex KG query statements; it only needs a Milvus vector database and a single LLM call. The vector retrieval with LLM reranking is the most critical part of this pipeline, and it explains why we can reach performance far beyond graph-theory-based methods (such as HippoRAG) with a traditional two-way retrieval architecture. It also shows that we do not actually need physical storage of the graph structure or complex graph query statements. We only need to store the logical relationships of the graph structure in the vector database; a traditional retrieval architecture can then perform logical subgraph routing, and the power of modern LLMs helps achieve this.

Method Overview
Our approach solely focuses on the passage retrieval phase within the RAG process, without any novel enhancements or optimizations in chunking or LLM response generation. We assume that we have acquired a set of triplet data from the corpus, incorporating a variety of entity and relationship information. This data can represent the information of a knowledge graph. We vectorize the entity and relationship information individually and store them in vector storage, thus creating a logical knowledge graph. When receiving a query, the relevant entities and relationships are retrieved first. Leveraging these entities and relationships, we perform a limited expansion on the graph structure. These relationships are integrated into the prompt along with the query question, and the LLM's capability is exploited to rerank these relationships. Ultimately, we obtain the top-K vital relationships and fetch the related passages from their metadata, which serve as the final retrieved passages.

Detailed Method

Vector Storage
We establish two vector storage collections: one for entities and one for relationships. Unique entity and relationship information is embedded into vectors via the embedding model and stored in vector storage. Entity information is directly converted into embeddings based on its word description. The original data form of a relationship is a triplet: (Subject, Predicate, Object). We directly combine the parts into a sentence, which is a heuristic method: "Subject Predicate Object". For instance: (Alex, child of, Brian) -> "Alex child of Brian" and (Cole, married to, Brian) -> "Cole married to Brian". Each sentence is then directly transformed into an embedding and stored in the vector database. This approach is straightforward and efficient. Although minor grammatical issues may arise, they do not impact the conveyance of the sentence meaning or its distribution in the vector space. Of course, we also advocate using an LLM to generate succinct sentence descriptions during the initial extraction of triplets.

Vector Similarity Search
For the input query, we adhere to the common paradigms in GraphRAG (such as HippoRAG and Microsoft GraphRAG): we extract entities from the query, transform each query entity into an embedding, and conduct a vector similarity search on the entity collection. Subsequently, we merge the results obtained from all query entities' searches.
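As a rough illustration of the two logical collections and the triplet-to-sentence heuristic, here is a minimal, self-contained sketch. The embed() function is a stand-in for a real embedding model such as facebook/contriever, and plain NumPy arrays stand in for a vector database such as Milvus; the toy hashing trick exists only to make the sketch runnable, so its similarity scores are not meaningful.

Python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    # Placeholder embedding: deterministic per text, but NOT semantically
    # meaningful. A real pipeline would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

triplets = [
    ("Alex", "child of", "Brian"),
    ("Cole", "married to", "Brian"),
    ("Daniel", "brother of", "Cole"),
]

# Two logical "collections": one for entities, one for relationships.
entities = sorted({t[0] for t in triplets} | {t[2] for t in triplets})
relationships = [" ".join(t) for t in triplets]   # "Subject Predicate Object"

entity_vectors = np.stack([embed(e) for e in entities])
relation_vectors = np.stack([embed(r) for r in relationships])

def search(query: str, vectors: np.ndarray, items: list, k: int = 2):
    scores = vectors @ embed(query)               # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [(items[i], float(scores[i])) for i in top]

print(search("Who is Brian's child?", relation_vectors, relationships))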
For the vector search of relationships, we directly transform the query string into an embedding and perform a vector similarity search on the relationship collection. Expanding Subgraph We take the discovered entities and relationships as starting points in the knowledge graph and expand a certain degree outward. For the initial entities, we expand a certain number of hops outward and include their adjacent relationships, denoted as $$Set(rel1)$$. For the initial relationships, we expand a certain number of hops to obtain $$Set(rel2)$$. We then unite these two sets, $$Set(merged)=Set(rel1) \cup Set(rel2) $$. Given the limited hop count theory, we only need to expand a smaller number of degrees (like 1, 2, etc.) to encompass most of the relationships that could potentially assist in answering. Please note: the concept of the expansion degree in this step differs from the concept of the total hops required to answer a question. For instance, if answering a query involves two entities that are n hops apart, typically only an expansion of ⌈n / 2⌉ degree is necessary, as these two entities are the two starting endpoints recalled by the vector similarity. As illustrated in the figure below, the vector retrieval stage returns two red entities, and starting from them, expanding 2 degrees in opposite directions can cover a 4-hop distance, which is sufficient to answer a 4-hop question involving these two entities. Large Language Model (LLM) Reranker In this stage, we deploy the powerful self-attention mechanism of LLM to further filter and refine the candidate set of relationships. We employ a one-shot prompt, incorporating the query and the candidate set of relationships into the prompt, and instruct LLM to select potential relationships that could assist in answering the query. Given that some queries may be complex, we adopt the Chain-of-Thought approach, allowing LLM to articulate its thought process in its response. We have noted that this strategy provides some assistance to weaker models. We stipulate that LLM's response is in JSON format for convenient parsing. The specific prompt is as follows: One shot input prompt: Plain Text I will provide you with a list of relationship descriptions. Your task is to select 3 relationships that may be useful to answer the given question. Please return a JSON object containing your thought process and a list of the selected relationships in order of their relevance. **Question:** When was the mother of the leader of the Third Crusade born? **Relationship descriptions:** [1] Eleanor was born in 1122. [2] Eleanor married King Louis VII of France. [3] Eleanor was the Duchess of Aquitaine. [4] Eleanor participated in the Second Crusade. [5] Eleanor had eight children. [6] Eleanor was married to Henry II of England. [7] Eleanor was the mother of Richard the Lionheart. [8] Richard the Lionheart was the King of England. [9] Henry II was the father of Richard the Lionheart. [10] Henry II was the King of England. [11] Richard the Lionheart led the Third Crusade. One shot output prompt: JSON { "thought_process": "To answer the question about the birth of the mother of the leader of the Third Crusade, I first need to identify who led the Third Crusade and then determine who his mother was. 
After identifying his mother, I can look for the relationship that mentions her birth.", "useful_relationships": [ "[11] Richard the Lionheart led the Third Crusade", "[7] Eleanor was the mother of Richard the Lionheart", "[1] Eleanor was born in 1122" ] }

This prompt serves as an illustrative reference. In reality, transforming the triplets in relationships into coherent sentences can be a challenging task. However, you can certainly employ the heuristic method mentioned above to directly assemble the triplets. For instance, (Eleanor, born in, 1122) can be directly transformed into "Eleanor was born in 1122". While this method may occasionally lead to certain grammatical issues, it is the quickest and most straightforward approach, and it will not mislead the LLM.

Retrieving the Final Passages
For the aforementioned example, it is feasible to directly return the final response during the LLM rerank phase, for instance, by adding a field such as "final answer" to the JSON of the one-shot output prompt. However, the information in that prompt is limited to the relationships, and not every query can yield a final answer at this point; other specific details should therefore be obtained from the original passages. The LLM returns the precisely sorted relationships. All we need to do is extract the corresponding relationship data previously stored and retrieve the relevant metadata, where the corresponding passage IDs reside. This passage data represents the final retrieved passages. The subsequent process of generating responses is identical to naive RAG: we incorporate the passages into the context of the prompt and use the LLM to generate the final answer.

Results
We employ the dense embedding model that aligns with HippoRAG, facebook/contriever, as our embedding model. The results show that our approach significantly surpasses both naive RAG and HippoRAG on three multi-hop datasets. All methods use the same embedding model setting. We use Recall@2 as our evaluation metric, defined as Recall = (number of relevant documents retrieved) / (total number of relevant documents in the database). On the multi-hop datasets, our method outperforms naive RAG and HippoRAG across all datasets, all compared using the same facebook/contriever embedding model. These results suggest that even the simplest multi-way retrieval and reranking RAG paradigm, when applied in the graph RAG context, can deliver state-of-the-art performance. They further imply that appropriate vector retrieval and LLM adoption are crucial in the multi-hop QA scenario.

Reflecting on our approach, the process of transforming entities and relationships into vectors and then retrieving them is like discovering the starting point of a subgraph, akin to uncovering "clues" at a crime scene. The subsequent subgraph expansion and LLM reranking resemble the process of analyzing those "clues". The LLM has a "bird's-eye view" and can intelligently select beneficial and crucial relationships from a multitude of candidates. These two stages fundamentally correspond to the naive vector retrieval + LLM reranking paradigm. In practice, we recommend using open source Milvus, or its fully managed version Zilliz Cloud, to store and search a large volume of entities and relationships in graph structures. For the LLM, you can opt for open-source models like Llama-3.1-70B or the proprietary GPT-4o mini, as mid-to-large-scale models are well equipped to handle these tasks. For the full code, see Graph RAG with Milvus.
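For readers who want to see the reranking step in code, here is a minimal sketch of the LLM rerank call, assuming the OpenAI Python SDK and a JSON-mode-capable model such as gpt-4o-mini; the prompt is abbreviated compared to the one-shot prompt above, and the function name is illustrative.

Python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-style client works

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rerank_relationships(question: str, candidates: list[str], top_k: int = 3) -> list[str]:
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Select the {top_k} relationships most useful for answering the question. "
        'Return JSON with keys "thought_process" and "useful_relationships".\n'
        f"Question: {question}\nRelationship descriptions:\n{numbered}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["useful_relationships"]

# Example:
# rerank_relationships("When was the mother of the leader of the Third Crusade born?",
#                      ["Richard the Lionheart led the Third Crusade",
#                       "Eleanor was the mother of Richard the Lionheart",
#                       "Eleanor was born in 1122"])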
In today's world of software development, systems containing some sort of event construct are increasingly popular. While this is primarily driven by message-based communication mechanisms, events are also used in different scopes and contexts. The frequent use of the term "event" leads to confusion, which is often observed in discussions about various software architectures among people who are new to these concepts. The terms "event-driven" and "event-sourced" are often used interchangeably, while in reality the two are very different concepts. In this article, we are going to explore the key characteristics of each, explain how they differ, and show how they complement each other. We will focus on clarifying the key differences, not on a deep dive into each concept. Before we dive in, let's clarify the definition of an "event" in both event-driven and event-sourced systems. An event is an immutable record describing something that has happened in the past. Therefore, the data that an event contains cannot be changed. Immutability and description of the past are fundamental characteristics of events.

Event-Driven
Event-driven architecture (EDA) is an architectural style that primarily uses events as a communication mechanism between the components of a larger system, or between systems. Events are handled asynchronously in a fire-and-forget manner, i.e., the publisher does not wait for a response from consumers. Other parts of the system that subscribe to the event handle it and can trigger larger workflows composed of further events. For example, a microservice-based system may use events to communicate the occurrence of a fact in one microservice. Those events can be handled by consumers in other microservices, which in turn can emit their own events to communicate their changes. Events are published in a publish-subscribe manner, allowing multiple consumers to handle the same event simultaneously; each consumer receives its own copy of the event. The asynchronous nature of EDA can often lead to eventual consistency, meaning that the state of the system might not be instantaneously reflected across components. EDA systems typically, but not always, leverage a message broker to transmit events from the publisher to consumers. Components often use two types of events: domain and integration. Domain events are internal to the components that emit them, while integration events are used for communication between components. EDA is typically used by most if not all components in a particular system, with each publishing and subscribing to a subset of events.

Event-Sourced
Unlike EDA, which is an architectural style leveraging events for communication, event sourcing (ES) is a persistence mechanism. Events are used to reflect changes to the state of an entity (or other constructs, such as an aggregate in Domain-Driven Design) and to reconstitute that state. Upon completing a domain operation, a sequence of events is stored as a stream linked to a particular entity. For example, a set of actions performed on a shopping basket may result in the following events: shopping started, item added, item added, item removed, item added. Upon reconstituting the basket's state, these events are loaded in chronological order and replayed to update the state one by one. The resulting basket will contain two shopping items.
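To make the replay mechanics concrete, here is a minimal, illustrative sketch; the event names and basket shape are assumptions rather than a prescribed design.

Python
from dataclasses import dataclass, field

@dataclass(frozen=True)              # events are immutable records of the past
class Event:
    kind: str                        # "shopping_started", "item_added", "item_removed"
    item: str = ""

@dataclass
class Basket:
    items: list = field(default_factory=list)

    def apply(self, event: Event) -> None:
        if event.kind == "item_added":
            self.items.append(event.item)
        elif event.kind == "item_removed":
            self.items.remove(event.item)

def reconstitute(stream: list) -> Basket:
    basket = Basket()
    for event in stream:             # replay in chronological order
        basket.apply(event)
    return basket

stream = [
    Event("shopping_started"),
    Event("item_added", "socks"),
    Event("item_added", "shirt"),
    Event("item_removed", "socks"),
    Event("item_added", "hat"),
]
print(reconstitute(stream).items)    # ['shirt', 'hat'] -- two items remain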
Event sourcing can be described by the following key characteristics:

After completing an operation, the entity contains new event(s) reflecting the nature of the change(s), which are then saved to an event store (database).
Events are strictly linked to a specific entity.
Events are typically persisted in a limited-in-scope, strongly consistent manner: a sequence of new events will not persist if another process appends events to the stream simultaneously, and new events are persisted transactionally, so either all or none get saved.
Events are loaded from the stream to reconstitute the entity's state.
Event sourcing is limited in scope to a single component. A larger system can contain heterogeneous persistence mechanisms, where only a subset uses ES.

Different Yet Complementary
When comparing the characteristics of EDA and ES, it becomes clear that the two are very different. Indeed, they address fundamentally different aspects of software design. While EDA is an architectural style for communication, ES is a persistence mechanism for an entity's state. However, these differences do not mean that the two approaches are mutually exclusive. On the contrary, EDA and ES often go hand in hand, complementing each other. In such a setup, a component using ES processes business logic and stores its entity's state in an event-sourced manner. At some point, an event is published indicating the completion of an action that other components participating in EDA might be interested in. It may be that one particular event, or an aggregation of data from multiple events, needs to be handled elsewhere. In either case, a publisher maps the event(s) to an integration event and publishes it, thus allowing its consumption by other components. The distinct nature of EDA and ES is also their strength: each approach focuses on a different area, but together they contribute to a broader system, enabling an audit trail and fine detail of changes in one component, and leveraging EDA to communicate selected changes to other components.
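A small sketch of the hand-off point described above, where an internal domain event is mapped to a leaner integration event and published fire-and-forget; the event fields and topic name are illustrative, and publish() stands in for a real broker client.

Python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class OrderPlaced:                 # domain event, internal to the ordering component
    order_id: str
    total_cents: int
    internal_notes: str            # detail we do NOT want to leak to other components

@dataclass(frozen=True)
class OrderPlacedIntegration:      # integration event shared with other components
    order_id: str
    total_cents: int

def publish(topic: str, payload: dict) -> None:
    # Stand-in for a real broker client (Kafka, RabbitMQ, ...): fire-and-forget.
    print(f"publish to {topic}: {json.dumps(payload)}")

def on_order_placed(event: OrderPlaced) -> None:
    integration = OrderPlacedIntegration(event.order_id, event.total_cents)
    publish("orders.placed", asdict(integration))

on_order_placed(OrderPlaced("o-42", 1999, "VIP customer"))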
In previous articles, we built a basic RAG application. We also learned to introduce more advanced techniques to improve a RAG application. Today, we will explore how to tie those advanced techniques together. Those techniques might do different — sometimes opposite — things. Still, sometimes we need to use all of them, to cover all possibilities. So let's see how we can link different techniques together. In this article, we will take a look at a technique called Query Routing. The Problem With Advanced RAG Applications When our Generative AI application receives a query, we have to decide what to do with it. For simple Generative AI applications, we send the query directly to the LLM. For simple RAG applications, we use the query to retrieve context from a single data source and then query the LLM. But, if our case is more complex, we can have multiple data sources or different queries that need different types of context. So do we build a one-size-fits-all solution, or do we make the application adapt to take different actions depending on the query? What Is Query Routing? Query routing is about giving our RAG app the power of decision-making. It is a technique that takes the query from the user and uses it to make a decision on the next action to take, from a list of predefined choices. Query routing is a module in our Advanced RAG architecture. It is usually found after any query rewriting or guardrails. It analyzes the input query and it decides the best tool to use from a list of predefined actions. The actions are usually retrieving context from one or many data sources. It could also decide to use a different index for a data source (like parent-child retrieval). Or it could even decide to search for context on the Internet. Which Are the Choices for the Query Router? We have to define the choices that the query router can take beforehand. We must first implement each of the different strategies, and accompany each one with a nice description. It is very important that the description explains in detail what each strategy does since this description will be what our router will base its decision on. The choices a query router takes can be the following: Retrieval From Different Data Sources We can catalog multiple data sources that contain information on different topics. We might have a data source that contains information about a product that the user has questions about. And another data source with information about our return policies, etc. Instead of looking for the answers to the user’s questions in all data sources, the query router can decide which data source to use based on the user query and the data source description. Data sources can be text stored in vector databases, regular databases, graph databases, etc. Retrieval From Different Indexes Query routers can also choose to use a different index for the same data source. For example, we could have an index for keyword-based search and another for semantic search using vector embeddings. The query router can decide which of the two is best for getting the relevant context for answering the question, or maybe use both of them at the same time and combine the contexts from both. We could also have different indexes for different retrieval strategies. For example, we could have a retrieval strategy based on summaries, a sentence window retrieval strategy, or a parent-child retrieval strategy. The query router can analyze the specificity of the question and decide which strategy is best to use to get the best context. 
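As a toy illustration of querying two indexes over the same data source and combining their contexts, consider the following sketch; the documents, the keyword scoring, and the placeholder vector search are all illustrative stand-ins for real BM25 and embedding-based retrieval.

Python
documents = [
    "Our return policy allows refunds within 30 days.",
    "The X200 vacuum ships with two spare filters.",
]

def keyword_search(query: str, k: int = 2) -> list:
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.lower().split())), d) for d in documents]
    return [d for score, d in sorted(scored, reverse=True)[:k] if score > 0]

def vector_search(query: str, k: int = 2) -> list:
    # Placeholder for an embedding-based similarity search over the same documents.
    return documents[:k]

def retrieve_combined(query: str) -> list:
    merged, seen = [], set()
    for passage in keyword_search(query) + vector_search(query):
        if passage not in seen:            # deduplicate before building the context
            seen.add(passage)
            merged.append(passage)
    return merged

print(retrieve_combined("how do I return the X200?"))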
Other Data Sources
The decision that the query router takes is not limited to databases and indexes. It can also decide to use a tool to look for the information elsewhere. For example, it can decide to use a search engine to look for the answer online. It can also call the API of a specific service (for example, weather forecasting) to obtain the data it needs as relevant context.

Types of Query Routers
An important part of our query router is how it makes the decision to choose one path or another. The decision can vary depending on the type of query router. The following are a few of the most commonly used types:

LLM Selector Router
This solution gives a prompt to an LLM. The LLM completes the prompt with the solution, which is the selection of the right choice. The prompt includes all the different choices, each with its description, as well as the input query to base its decision on. The response to this query is used to programmatically decide which path to take.

LLM Function Calling Router
This solution leverages the function-calling (or tool-using) capabilities of LLMs. Some LLMs have been trained to decide to use tools to get to an answer if those tools are provided in the prompt. Using this capability, each of the different choices is phrased like a tool in the prompt, prompting the LLM to choose which of the provided tools is best for retrieving the right context to answer the query.

Semantic Router
This solution uses a similarity search on the vector embedding representation of the user query. For each choice, we have to write a few examples of queries that would be routed to that path. When a user query arrives, an embeddings model converts it to a vector representation, which is compared to the example queries for each router choice. The choice whose example is nearest to the user query in the vector space is the path the router routes to.

Zero-Shot Classification Router
For this type of router, a small LLM is selected to act as a router. This LLM is fine-tuned using a dataset of example user queries and the correct routing for each of them. The fine-tuned LLM's sole purpose is to classify user queries. Small LLMs are more cost-effective and more than good enough for a simple classification task.

Language Classification Router
In some cases, the purpose of the query router is to redirect the query to a specific database or model depending on the language the user wrote the query in. Language can be detected in many ways, such as using an ML classification model or a Generative AI LLM with a specific prompt.

Keyword Router
Sometimes the use case is extremely simple. In this case, the solution could be to route one way or another depending on whether certain keywords are present in the user query. For example, if the query contains the word "return", we could use a data source with useful information about how to return a product. For this solution, a simple code implementation is enough, and therefore no expensive model is needed.

Single Choice Routing vs Multiple Choice Routing
Depending on the use case, it will make sense for the router to just choose one path and run it. However, in some cases it can also make sense to use more than one choice for answering the same query. To answer a question that spans many topics, the application needs to retrieve information from many data sources. Or the response might be different based on each data source.
Then, we can use all of them to answer the question and consolidate the results into a single final answer. We have to design the router taking these possibilities into account.

Example Implementation of a Query Router
Let's get into the implementation of a query router within a RAG application. You can follow the implementation step by step and run it yourself in the Google Colab notebook. For this example, we will showcase a RAG application with a query router. The application can decide to answer questions based on two documents. The first document is a paper about RAG and the second is a recipe for chicken gyros. The application can also decide to answer based on a Google search. We will implement a single-source query router using an LLM function calling router.

Load the Paper
First, we will prepare the two documents for retrieval. Let's first load the paper about RAG.

Load the Recipe
We will also load the recipe for chicken gyros. This recipe from Mike Price is hosted on tasty.co. We will use a simple web page reader to read the page and store it as text.

Save the Documents in a Vector Store
After getting the two documents we will use for our RAG application, we will split them into chunks and convert them to embeddings using BGE small, an open-source embeddings model. We will store those embeddings in two vector stores, ready to be queried.

Search Engine Tool
Besides the two documents, the third option for our router will be to search for information using Google Search. For this example, I have created my own Google Search API keys. If you want this part to work, you should use your own API keys.

Create the Query Router
Next, using the LlamaIndex library, we create a Query Engine Tool for each of the three options that the router will choose between. We provide a description for each of the tools, explaining what it is useful for. This description is very important, since it is the basis on which the query router decides which path to choose. Finally, we create a Router Query Engine, also with LlamaIndex. We give the three query engine tools to this router. We also define the selector, the component that makes the choice of which tool to use. For this example, we are using an LLM selector. It is also a single selector, meaning it will only choose one tool, never more than one, to answer the query.

Run Our RAG Application!
Our query router is now ready. Let's test it with a question about RAG. We provided a vector store loaded with information from a paper on RAG techniques. The query router should choose to retrieve context from that vector store in order to answer the question. Let's see what happens: our RAG application answers correctly. Along with the answer, we can see that it provides the sources it got the information from. As we expected, it used the vector store with the RAG paper. We can also see a "selector_result" attribute in the result, where we can inspect which of the tools the query router chose, as well as the reason the LLM gave for choosing that option. Now let's ask a culinary question. The recipe used to create the second vector store is for chicken gyros. Our application should be able to answer which ingredients are needed for that recipe based on that source. As we can see, the chicken gyros recipe vector store was correctly chosen to answer that question. Finally, let's ask a question that can be answered with a Google Search.
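Before wrapping up, here is a condensed sketch of the router setup described in this walkthrough, assuming a recent llama-index release (import paths vary between versions), an OpenAI API key for the default LLM and embeddings, and illustrative file names; the Google Search tool is omitted for brevity.

Python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

# File names are illustrative.
rag_docs = SimpleDirectoryReader(input_files=["rag_paper.pdf"]).load_data()
recipe_docs = SimpleDirectoryReader(input_files=["chicken_gyros.txt"]).load_data()

rag_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(rag_docs).as_query_engine(),
    description="Answers questions about Retrieval-Augmented Generation techniques.",
)
recipe_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(recipe_docs).as_query_engine(),
    description="Answers questions about the chicken gyros recipe and its ingredients.",
)

router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),   # single selector: one tool per query
    query_engine_tools=[rag_tool, recipe_tool],
)
print(router.query("Which ingredients do I need for chicken gyros?"))

The tool descriptions carry most of the weight here: the selector sees only the descriptions and the query when it decides which engine to call.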
Conclusion
In conclusion, query routing is a great step towards a more advanced RAG application. It allows us to set up the base for a more complex system, where our app can better plan how to answer questions. Query routing can also be the glue that ties together other advanced techniques for your RAG application and makes them work together as a whole system. However, the complexity of better RAG systems doesn't end with query routing. Query routing is just the first stepping stone for orchestration within RAG applications. The next stepping stone for making our RAG applications better reason, decide, and take action based on the needs of users is Agents. In later articles, we will dive deeper into how Agents work within RAG and Generative AI applications in general.
Organizations are fully adopting Artificial Intelligence (AI) and proving that AI is valuable. Enterprises are looking for the valuable AI use cases that abound in their industry and functional areas to reap more benefits. Organizations are responding to opportunities and threats, gaining improvements in sales, and lowering costs. They are recognizing the special requirements of AI workloads and enabling them with purpose-built infrastructure that supports the consolidated demands of multiple teams across the organization. Organizations that adopt a shift-left paradigm by planning for good governance early in the AI process will minimize the effort spent on data movement and accelerate model development. In an era of rapidly evolving AI, data scientists should choose platforms that provide flexibility, collaboration, and governance to maximize adoption and productivity. Let's dive into the world of workflow automation and pipeline orchestration. Recently, two prominent terms have appeared in the artificial intelligence and machine learning world: MLOps and LLMOps.

What Is MLOps?
MLOps (Machine Learning Operations) is a set of practices and technologies to standardize and streamline the construction and deployment of machine learning systems. It covers the entire lifecycle of a machine learning application, from data collection to model management. MLOps provides the means to handle large workloads and accelerate time-to-value. MLOps principles are architected on DevOps principles to manage applications built with ML (machine learning). An ML model is created by applying an algorithm to a mass of training data, which will affect the behavior of the model in different environments. Machine learning is not just code; its workflows include three key assets: code, model, and data.

Figure 1: ML solution is comprised of Data, Code, and Model

These assets in the development environment will have the least restrictive access controls and weaker quality guarantees, while those in production will be of the highest quality and tightly controlled. In production, the data comes from the real world, where you cannot control how it changes, and this raises several challenges that need to be resolved, for example: slow, fragmented, and inconsistent deployment; lack of reproducibility; and performance reduction (training-serving skew). To resolve these issues, MLOps combines practices from DevOps and data engineering with practices unique to machine learning.

Figure 2: MLOps is the intersection of Machine Learning, DevOps, and Data Engineering - LLMOps rooted in MLOps

Hence, MLOps is a set of practices that combines machine learning, DevOps, and data engineering, and aims to deploy and maintain ML systems in production reliably and efficiently.

What Is LLMOps?
The recent rise of Generative AI, with its most common form of large language models (LLMs), prompted us to consider how MLOps processes should be adapted to this new class of AI-powered applications. LLMOps (Large Language Model Operations) is a specialized subset of MLOps tailored for the efficient development and deployment of large language models. LLMOps ensures that model quality remains high and that data quality is maintained throughout data science projects by providing infrastructure and tools. Use a consolidated MLOps and LLMOps platform to enable close interaction between data science and IT DevOps, increasing productivity and deploying a greater number of models into production faster.
Both MLOps and LLMOps bring agility to AI innovation in a project. LLMOps tools include MLOps tools and platforms, LLMs that offer LLMOps capabilities, and other tools that can help with fine-tuning, testing, and monitoring. Explore more on LLMOps tools.

Differentiating Tasks Between MLOps and LLMOps
MLOps and LLMOps use different processes and techniques for their primary tasks. Table 1 shows a few key tasks and a comparison between the two methodologies:

| Task | MLOps | LLMOps |
| --- | --- | --- |
| Primary focus | Developing and deploying machine-learning models | Specifically focused on LLMs |
| Model adaptation | If employed, it typically focuses on transfer learning and retraining | Centers on fine-tuning pre-trained models like GPT with efficient methods and enhancing model performance through prompt engineering and retrieval-augmented generation (RAG) |
| Model evaluation | Evaluation relies on well-defined performance metrics | Evaluating text quality and response accuracy often requires human feedback due to the complexity of language understanding (e.g., using techniques like RLHF) |
| Model management | Teams typically manage their models, including versioning and metadata | Models are often externally hosted and accessed via APIs |
| Deployment | Deploy models through pipelines, typically involving feature stores and containerization | Models are part of chains and agents, supported by specialized tools like vector databases |
| Monitoring | Monitor model performance for data drift and model degradation, often using automated monitoring tools | Expands traditional monitoring to include prompt-response efficacy, context relevance, hallucination detection, and security against prompt injection threats |

Table 1: Key tasks of MLOps and LLMOps methodologies

Adapting these implications into MLOps requires minimal changes to existing tools and processes. Moreover, many aspects do not change:

The separation of development, staging, and production remains the same.
The version control tool and the model registry in the catalog remain the primary channels for promoting pipelines and models toward production.
The data architecture for managing data remains valid and essential for efficiency.
Existing CI/CD infrastructure should not require changes.
The modular structure of MLOps remains the same, with pipelines for model training, model inference, etc.

A summary of the key properties of LLMs and their implications for MLOps is listed in Table 2.

| Key Properties of LLMs | Implications for MLOps |
| --- | --- |
| LLMs are available in many forms: proprietary models behind paid APIs, pre-trained models, and fine-tuned models. | Projects often develop incrementally, starting from existing, third-party, or open-source models and ending with custom fine-tuned models. This has an impact on the development process. |
| Prompt engineering: Many LLMs take queries and instructions as input in the form of natural language. Those queries can contain carefully engineered "prompts" to elicit the desired responses. | Designing text templates for querying LLMs is often an important part of developing new LLM pipelines. Many LLM pipelines will use existing LLMs or LLM serving endpoints; the ML logic developed for those pipelines may focus on prompt templates, agents, or "chains" instead of the model itself. The ML artifacts packaged and promoted to production may frequently be these pipelines, rather than models. |
| Context-based prompt engineering: Many LLMs can be given prompts with examples and context, or additional information to help answer the query. | When augmenting LLM queries with context, it is valuable to use previously uncommon tooling, such as vector databases, to search for relevant context. |
| Model size: LLMs are very large deep-learning models, often ranging from gigabytes to hundreds of gigabytes. | Many LLMs may require GPUs for real-time model serving. Since larger models require more computation and are thus more expensive to serve, techniques for reducing model size and computation may be required. |
| Model evaluation: LLMs are hard to evaluate via traditional ML metrics since there is often no single "right" answer. | Since human feedback is essential for evaluating and testing LLMs, it must be incorporated more directly into the MLOps process, both for testing and monitoring and for future fine-tuning. |

Table 2: Key properties of LLMs and implications for MLOps

Semantics of Development, Staging, and Production
An ML solution comprises data, code, and models. These assets are developed, tested, and moved to production through deployments. For each of these stages, we also need to operate within an execution environment. Each of the data, code, models, and execution environments is ideally divided into development, staging, and production.

Data: Some organizations label data as development, staging, or production, depending on which environment it originated in.
Code: Machine learning project code is often stored in a version control repository, with most organizations using branches corresponding to the lifecycle phases of development, staging, or production.
Model: The model and code lifecycle phases often operate asynchronously, and model lifecycles do not correspond one-to-one with code lifecycles. Hence, it makes sense for model management to have its own model registry to manage model artifacts directly. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases.
Semantics: When it comes to MLOps, there should always be an operational separation between development, staging, and production environments. More importantly, observe that data, code, and models, which we call assets, will have the least restrictive access controls and weakest quality guarantees in development, while those in production will be of the highest quality and tightly controlled.

Deployment Patterns
Two major patterns can be used to manage model deployment. In the deploy-code pattern (Figure 3), the training code that can produce the model is promoted toward the production environment after the code is developed in the dev environment and tested in the staging environment using a subset of data.

Figure 3: Deploy pattern code

In the deploy-model pattern (Figure 4), the packaged model is promoted through the different environments and finally to production. Model training is executed in the dev environment. The produced model artifact is then moved to the staging environment for model validation checks before the model is deployed to the production environment. This approach requires a separate path for deploying ancillary code, such as inference and monitoring code: a "deploy code" path in which the code for these components is tested in staging and then deployed to production. This pattern is typically used when deploying a one-off model, or when model training is expensive and read access to production data from the development environment is possible.
Figure 4: Deploy pattern model

The choice of process will also depend on the business use case, the maturity of the machine learning infrastructure, compliance and security guidelines, the resources available, and what is most likely to succeed for that particular use case. Therefore, it is a good idea to use standardized project templates and strict workflows. Your decisions around packaging ML logic as version-controlled code vs. registered models will help inform your choice between the deploy-model, deploy-code, and hybrid architectures. With LLMs, it is common to package machine-learning logic in new forms. MLflow can be used to package LLMs and LLM pipelines for deployment, with built-in model flavors including PyTorch and TensorFlow, Hugging Face Transformers (relatedly, see Hugging Face Transformers' MLflowCallback), LangChain, and the OpenAI API. MLflow can also package LLM pipelines via the MLflow Pyfunc capability, which can store arbitrary Python code. Figure 5 shows a machine learning operations architecture and process that uses Azure Databricks.

Figure 5: MLOps Architecture (Image source, Azure Databricks)

Key Components of LLM-Powered Applications
The field of LLMOps is quickly evolving. Here are key components and considerations to bear in mind. Some, but not necessarily all, of the following approaches make up a single LLM-based application, and any of them can be used to leverage your data with LLMs.

Prompt engineering is the practice of adjusting the text prompts given to an LLM to extract more accurate or relevant responses from the model. It is very important to craft effective and specialized prompt templates to guide LLM behavior and mitigate risks such as model hallucination and data leakage. This approach is fast and cost-effective and requires no training, but it offers less control than fine-tuning.

Retrieval-Augmented Generation (RAG) combines an LLM with external knowledge retrieval. It requires an external knowledge base or database (e.g., a vector database) and moderate preparation time (e.g., for computing embeddings). Its primary benefit is dynamically updated context and enhanced accuracy, but it significantly increases prompt length and inference computation. RAG LLMs use two systems to obtain external data:

Vector databases: Vector databases help find relevant documents using similarity searches. They can either work independently or be part of the LLM application.
Feature stores: These are systems or platforms to manage and store structured data features used in machine learning and AI applications. They provide organized and accessible data for training and inference processes in machine learning models like LLMs.

Fine-tuning LLMs: Fine-tuning is the process of adapting a pre-trained LLM on a comparatively smaller dataset that is specific to an individual domain or task. During the fine-tuning process, only a small number of weights are updated, allowing the model to learn new behaviors and specialize in certain tasks. The advantage of this approach is granular control and high specialization, but it requires labeled data and comes with a computational cost. The term "fine-tuning" can refer to several concepts, with the two most common forms being:

Supervised instruction fine-tuning: This approach involves continuing the training of a pre-trained LLM on a dataset of input-output training examples, typically conducted with thousands of training examples.
Instruction fine-tuning is effective for question-answering applications, enabling the model to learn new specialized tasks such as information retrieval or text generation. The same approach is often used to tune a model for a single specific task (e.g., summarizing medical research articles), where the desired task is represented as an instruction in the training examples.

Continued pre-training: This fine-tuning method does not rely on input and output examples but instead uses domain-specific unstructured text to continue the same pre-training process (e.g., next-token prediction, masked language modeling). This approach is effective when the model needs to learn new vocabulary or a language it has not encountered before.

Pre-training a model from scratch refers to the process of training a language model on a large corpus of data (e.g., text, code) without using any prior knowledge or weights from an existing model. This is in contrast to fine-tuning, where an already pre-trained model is further adapted to a specific task or dataset. The output of full pre-training is a base model that can be used directly or further fine-tuned for downstream tasks. The advantage of this approach is maximum control and a model tailored to specific needs, but it is extremely resource-intensive and requires longer training, from days to weeks.

A good rule of thumb is to start with the simplest approach possible, such as prompt engineering with a third-party LLM API, to establish a baseline. Once this baseline is in place, you can incrementally integrate more sophisticated strategies like RAG or fine-tuning to refine and optimize performance. The use of standard MLOps tools such as MLflow is equally crucial in LLM applications to track performance over different approach iterations.

Model Evaluation Challenges
Evaluating LLMs is a challenging and evolving domain, primarily because LLMs often demonstrate uneven capabilities across different tasks. LLMs can be sensitive to prompt variations, demonstrating high proficiency in one task but faltering with slight deviations in prompts. Since most LLMs output natural language, it is very difficult to evaluate the outputs via traditional Natural Language Processing metrics. For domain-specific fine-tuned LLMs, popular generic benchmarks may not capture their nuanced capabilities, as such models are tailored for specialized tasks, making traditional metrics less relevant. It is often the case that LLM performance is evaluated in domains where text is scarce or where there is a reliance on subject matter expert knowledge. In such scenarios, evaluating LLM output can be costly and time-consuming. Some prominent benchmarks used to evaluate LLM performance include:

BIG-bench (Beyond the Imitation Game Benchmark): A dynamic benchmarking framework, currently hosting over 200 tasks, with a focus on adapting to future LLM capabilities
EleutherAI LM Evaluation Harness: A holistic framework that assesses models on over 200 tasks, merging evaluations like BIG-bench and MMLU, and promoting reproducibility and comparability
Mosaic Model Gauntlet: An aggregated evaluation approach, categorizing model competency into six broad domains (shown below) rather than distilling it into a single monolithic metric

LLMOps Reference Architecture
A well-defined LLMOps architecture is essential for managing machine learning workflows and operationalizing models in production environments.
Here is an illustration of the production architecture, with key adjustments to the reference architecture from traditional MLOps, followed by the reference production architecture for LLM-based applications.

RAG workflow using a third-party API:

Figure 6: RAG workflow using a third-party API (Image Source: Databricks)

RAG workflow using a self-hosted fine-tuned model and an existing base model from the model hub that is then fine-tuned in production:

Figure 7: RAG workflow using a self-hosted fine-tuned model (Image Source: Databricks)

LLMOps: Pros and Cons

Pros
Minimal changes to the base model: Most LLM applications make use of existing, pre-trained models, and an internal or external model hub becomes a valuable part of the infrastructure. Adopting it is easy and requires only simple changes.
Easy to model and deploy: LLMOps removes much of the complexity of model construction, testing, and fine-tuning, enabling quicker development cycles. Deploying, monitoring, and enhancing models is also made hassle-free, and you can leverage expansive language models directly as the engine for your AI applications.
Advanced language models: By utilizing advanced pre-trained models from Hugging Face (e.g., meta-llama/Llama-2-7b, google/gemma-7b) or from OpenAI (e.g., GPT-3.5-turbo or GPT-4), LLMOps enables you to harness the power of billions or trillions of parameters, delivering natural and coherent text generation across various language tasks.

Cons
Human feedback: Human feedback in monitoring and evaluation loops may be used in traditional ML but becomes essential in most LLM applications. Human feedback should be managed like other data, ideally incorporated into monitoring based on near real-time streaming.
Limitations and quotas: LLMOps comes with constraints such as token limits, request quotas, response times, and output length, which affect its operational scope.
Risky and complex integration: The LLM pipeline makes external API calls, from the model serving endpoint to internal or third-party LLM APIs. This adds complexity, potential latency, and another layer of credential management. Integrating large language models as APIs also requires technical skills and understanding, and scripting and tool utilization become integral components, adding to the complexity.

Conclusion
Workload automation is variable and intensive, and it helps fill the gap between the data science team and the IT operations team. Planning for good governance early in the AI process will minimize the effort spent on data movement and accelerate model development. The emergence of LLMOps highlights the rapid advancement and specialized needs of the field of Generative AI, yet LLMOps is still rooted in the foundational principles of MLOps. In this article, we have looked at key components, practices, tools, and reference architectures, including:

The major similarities and differences between MLOps and LLMOps
The major deployment patterns for migrating data, code, and models
The semantics of the development, staging, and production environments
The major approaches to building LLM applications, such as prompt engineering, RAG, fine-tuned, and pre-trained models, and key comparisons
LLM serving and observability, including tools and practices for monitoring LLM performance

The end-to-end architecture integrates all components across the dev, staging, and production environments, and CI/CD pipelines automate deployment upon branch merges.
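As a concrete illustration of the MLflow Pyfunc packaging mentioned earlier for LLM pipelines, here is a minimal sketch; the QAPipeline class, the prompt template, and the stubbed generate_answer() call are illustrative assumptions, not part of any specific reference architecture.

Python
import mlflow.pyfunc  # assumes mlflow is installed

def generate_answer(prompt: str) -> str:
    return f"stub answer for: {prompt}"        # placeholder for a real LLM API call

class QAPipeline(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # model_input is conventionally a pandas DataFrame; one prompt per row.
        template = "Answer concisely: {question}"
        return [generate_answer(template.format(question=q))
                for q in model_input["question"]]

# Package the pipeline (prompt template + calling logic) as a versionable artifact.
mlflow.pyfunc.save_model(path="qa_pipeline_model", python_model=QAPipeline())
loaded = mlflow.pyfunc.load_model("qa_pipeline_model")
# loaded.predict(pd.DataFrame({"question": ["What is LLMOps?"]}))

Packaging the chain rather than the model matches the earlier observation that, with LLMs, the promoted artifact is frequently the pipeline itself.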
The Oracle WITH clause is one of the most commonly used techniques to simplify SQL source code and improve performance. In Oracle SQL, the WITH clause, also known as a Common Table Expression (CTE), is a powerful tool that also enhances code readability. WITH is commonly used to define temporary named result sets, also referred to as subqueries or CTEs. These temporary named result sets can be referenced multiple times within the main SELECT query. CTEs are like virtual tables and are very helpful for organizing and modularizing SQL code.

Understanding the WITH Clause Syntax
The usage of the WITH clause is very simple: create a named subquery with the AS keyword followed by a SELECT query, and add as many named subqueries as you want, separated by commas. It is good practice to use meaningful names so the subqueries are easy to distinguish in the main SELECT. In terms of internal execution, Oracle executes each named subquery individually and caches the results in memory, which are then utilized by the main SELECT. This mimics a materialized view with intermediate results and reduces redundant calculations. In other words, Oracle optimizes SQL queries with CTEs by storing the results of the subqueries temporarily, allowing for faster retrieval and processing in subsequent parts of the query.

SQL
WITH cte_name1 AS (SELECT * FROM Table1),
     cte_name2 AS (SELECT * FROM Table2),
     ...
SELECT ...
FROM cte_name1, cte_name2
WHERE ...;

Use Case
In this use case, I am going to talk specifically about how you can effectively utilize inner joins alongside a WITH clause, which can tremendously help in performance tuning. Let's take a look at the dataset and the problem statement before we delve into the solution. The scenario is an e-commerce retail chain for which bulk product sales price data needs to be loaded for a particular e-store location. Imagine that a product can have several price lines meant for regular prices, promotional prices, and BOGO offer prices. In this case, the user is trying to create multiple promotional price lines and is unaware of the possible mistakes he/she could commit. Through this process, we will detect duplicate data that is functionally redundant and prevent poor data quality from entering the pricing system. By doing so, we will avoid interface program failures in the pricing repository staging layer, which acts as a bridge between the pricing computation engine and the pricing repository accessed by the e-commerce platform.

TABLE: e_promotions

| Price_LINE | UPC_code | Description | Price | Start_DT | End_dt | Row_num | flag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10001 | 049000093322 | Coca-Cola 12 OZ | $6.86 | 01/01/2024 | 09/30/2024 | 1 | 0 |
| 10001 | 049000093322 | Coca-Cola 12 OZ | $5.86 | 01/31/2024 | 03/30/2024 | 2 | 0 |
| 10001 | 049000028201 | Fanta Pineapple Soda, 20 OZ | $2.89 | 01/01/2024 | 09/30/2024 | 3 | 0 |
| 10001 | 054000150296 | Scott 1000 | $1.19 | 01/01/2024 | 09/30/2024 | 4 | 0 |

PS: This is sample data; in the real world, there could be thousands or millions of price lines being updated to mark prices down or up on a weekly basis. The table above captures the UPC codes and the respective items within price line 10001. The issue with this data set is that the back-office user is trying to create a duplicate line as part of the same price line through an upload process, and the user does not know about the duplicate data he/she may be creating.
The intent here is to catch the duplicate record and reject both entries 1 and 2, so that the user can decide which of the two should go into the pricing system and be reflected on the website. The code below simplifies error detection and also optimizes the stored procedure for better performance.

PLSQL:
WITH price_lines AS (
    SELECT rowid, price_line, UPC, flag, start_dt, end_dt
      FROM e_promotions
     WHERE price_line = 10001
       AND flag = 0
)
SELECT MIN(a.rowid) AS row_id, a.price_line, a.UPC, a.start_dt, a.end_dt
  FROM price_lines a, price_lines b
 WHERE a.price_line = b.price_line
   AND a.flag = b.flag
   AND a.UPC = b.UPC
   AND a.rowid <> b.rowid
   AND (a.start_dt BETWEEN b.start_dt AND b.end_dt
     OR a.end_dt   BETWEEN b.start_dt AND b.end_dt
     OR b.start_dt BETWEEN a.start_dt AND a.end_dt
     OR b.end_dt   BETWEEN a.start_dt AND a.end_dt)
 GROUP BY a.price_line, a.UPC, a.start_dt, a.end_dt;

With the code above we did two things in parallel:

Queried the table only once for the dataset we need to process, using the WITH clause
Added an inner join to detect duplicates without having to query the table a second time, optimizing the performance of the stored procedure

This is one of many use cases that have given me significant performance gains in my PL/SQL and SQL coding. Have fun, and post your comments if you have any questions!
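As a usage example, here is a small, illustrative sketch of how the duplicate check above might be run from a batch validation job before the upload is committed. It is not part of the original solution: it assumes the python-oracledb driver, placeholder credentials and DSN, and a bind variable for the price line.

Python:
# Runs the overlap/duplicate check for one price line and returns the offending rows.
import oracledb

DUPLICATE_CHECK_SQL = """
WITH price_lines AS (
    SELECT rowid AS rid, price_line, UPC, start_dt, end_dt
      FROM e_promotions
     WHERE price_line = :price_line
       AND flag = 0
)
SELECT MIN(a.rid) AS row_id, a.price_line, a.UPC, a.start_dt, a.end_dt
  FROM price_lines a, price_lines b
 WHERE a.price_line = b.price_line
   AND a.UPC = b.UPC
   AND a.rid <> b.rid
   AND (a.start_dt BETWEEN b.start_dt AND b.end_dt
     OR a.end_dt   BETWEEN b.start_dt AND b.end_dt
     OR b.start_dt BETWEEN a.start_dt AND a.end_dt
     OR b.end_dt   BETWEEN a.start_dt AND a.end_dt)
 GROUP BY a.price_line, a.UPC, a.start_dt, a.end_dt
"""

def find_duplicates(price_line: int) -> list[tuple]:
    # Connection details are placeholders; substitute your own user, password, and DSN.
    with oracledb.connect(user="app_user", password="app_password",
                          dsn="localhost/FREEPDB1") as conn:
        with conn.cursor() as cur:
            cur.execute(DUPLICATE_CHECK_SQL, {"price_line": price_line})
            return cur.fetchall()

if __name__ == "__main__":
    for row in find_duplicates(10001):
        print("Overlapping price line detected:", row)

In the sketch, the rowid is aliased inside the CTE and the flag filter lives only in the CTE, but the overlap logic is the same as in the query above; a wrapper like this lets the upload process reject the offending lines before they ever reach the pricing repository staging layer.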
In today's data-driven world, organizations increasingly rely on sophisticated data pipelines to manage the vast volumes of data generated daily. A data pipeline is more than just a conduit for data: it is a complex system that handles the extraction, transformation, and loading (ETL) of data from various sources so that the data is clean, consistent, and ready for analysis. This article explores advanced strategies, the process of building data pipelines, and the pillars of a successful modern data strategy, with a focus on both real-time and batch data processing.

Steps for Building Data Pipelines

Building modern data pipelines that are efficient, scalable, and resilient requires advanced strategies and a deep understanding of the underlying technologies. Let's dive into the key steps for building out your data pipelines.

1. Planning Phase

The foundation of any successful data pipeline is a robust planning phase. This stage involves understanding the data requirements, identifying the sources of data, and determining the desired outcomes. During this phase, it is crucial to define clear objectives, such as the type of data to be processed, the frequency of data updates, and the level of granularity required for analysis. Key considerations include:

Data sources: Identify the sources of data, such as databases, APIs, logs, or third-party data providers. Determine whether these sources provide structured, semi-structured, or unstructured data, as this will influence the design of the pipeline.
Data volume and velocity: Estimate the volume of data and the speed at which it needs to be processed. For example, real-time data from IoT devices requires immediate processing, while batch data, such as daily sales reports, can be processed at scheduled intervals.
Data quality: Define the standards for data quality, including accuracy, consistency, and completeness. Consider implementing data validation checks early in the pipeline to catch issues before they propagate.

2. Staging Phase

The staging phase involves setting up the environment where data will be temporarily stored before being processed. This is a critical step, as it ensures that data is properly organized and ready for transformation. During this phase, raw data is ingested from various sources and stored in a staging area, often a data lake or a cloud-based storage service. Key activities include:

Data ingestion: Implement data ingestion mechanisms, such as batch processing or real-time streaming, to bring data into the staging area.
Data organization: Organize the data in a way that facilitates easy access and transformation. This might involve partitioning the data by time, source, or other relevant dimensions.
Data backup: Ensure that raw data is backed up before any transformation processes begin. This provides a safety net in case of errors during subsequent stages.

3. Execution Phase

In the execution phase, the data is transformed, enriched, and loaded into its final destination, usually a data warehouse or a data mart. This phase is where the bulk of the computational work happens, including data cleaning, filtering, aggregation, and enrichment. Key tasks include:

Data transformation: Apply transformations to clean and normalize the data. This may involve removing duplicates, handling missing values, or converting data types (a minimal sketch of such a step appears at the end of this article).
Data enrichment: Enhance the data by integrating additional information from other sources.
For instance, in real-time pipelines, data from social media might be combined with customer transaction data to provide timely insights.
Data loading: Load the transformed data into its final storage location, ensuring that it is properly indexed and optimized for querying. Real-time data might be fed directly into dashboards or applications, while batch data could be used for deeper analytics and reporting.

4. Monitoring Phase

The final phase of building a data pipeline is monitoring and maintenance. Once the pipeline is up and running, it is essential to continuously monitor its performance and address any issues that arise. Key aspects include:

Performance monitoring: Track the performance of the pipeline, including data processing times, throughput, and resource usage. Real-time pipelines require constant monitoring to ensure low-latency processing, while batch pipelines may need scheduled checks to ensure timely completion.
Error handling: Implement robust error handling and logging mechanisms to detect and address issues promptly. Ensure that any errors are logged with sufficient detail to facilitate debugging.
Pipeline optimization: Continuously review and optimize the pipeline to improve efficiency, reduce costs, and accommodate changes in data volume or requirements. This is especially important in real-time pipelines, where performance bottlenecks can significantly impact the timeliness of data delivery.

Real-Time vs. Batch Data Processing

A critical component of modern data pipelines is the ability to handle both real-time and batch data processing. Understanding the differences and the appropriate use cases for each is essential for building efficient data pipelines.

Real-time data processing: Involves processing data as soon as it is generated or received. This approach is crucial for use cases requiring immediate action, such as fraud detection, stock trading, or real-time analytics on streaming data.
Batch data processing: Involves processing large volumes of data at scheduled intervals. This approach is suitable for use cases where immediate processing is not critical, such as end-of-day reports, large-scale data transformations, and historical data analysis.

Integration of Real-Time and Batch Processing

In many modern data architectures, real-time and batch processing are integrated to provide a comprehensive data solution. This hybrid approach allows organizations to leverage the strengths of both methods.

Lambda architecture: Combines real-time processing with batch processing by maintaining two separate paths for data. The real-time layer provides immediate results, while the batch layer ensures data accuracy and completeness. Results from both layers are merged to deliver a complete view.
Kappa architecture: Simplifies the approach by using a single stream processing engine for both real-time and batch data, often by replaying streams for batch processing. This architecture is particularly suited for scenarios where real-time data processing is a priority.
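To make the Lambda-style integration concrete, here is a minimal, illustrative sketch using only the Python standard library. The batch and speed views, product names, and numbers are invented for the example; a real system would back these layers with a warehouse job and a stream processor.

Python:
# Illustrative Lambda-architecture merge: the batch layer recomputes complete,
# accurate aggregates on a schedule, the speed layer maintains incremental
# aggregates for events that arrived after the last batch run, and the serving
# layer merges both views to answer queries.
from collections import Counter

# Batch view: totals computed by the nightly batch job (complete but stale).
batch_view = Counter({"product_a": 120, "product_b": 75})

# Speed layer: counts for events streamed in since the last batch run.
speed_view = Counter()

def handle_stream_event(product_id: str, quantity: int) -> None:
    """Real-time path: update the incremental view as each event arrives."""
    speed_view[product_id] += quantity

def serve_totals() -> dict[str, int]:
    """Serving layer: merge batch accuracy with real-time freshness."""
    return dict(batch_view + speed_view)

# Simulated real-time events arriving after the nightly batch.
handle_stream_event("product_a", 3)
handle_stream_event("product_c", 1)

print(serve_totals())  # {'product_a': 123, 'product_b': 75, 'product_c': 1}

A Kappa-style variant would drop the separate batch view and instead replay the event stream through the same handler whenever a full recomputation is needed.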
Conclusion

Building modern data pipelines requires a strategic approach, advanced tools, and a strong data culture. By carefully planning each phase of the pipeline, leveraging the right tools, and fostering collaboration across teams, organizations can create data pipelines that are not only robust and scalable but also capable of delivering valuable insights. The integration of real-time and batch data processing, supported by best practices and a deep understanding of evolving technologies, ensures that your data pipeline remains efficient and effective in meeting the demands of today's data-driven landscape.
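As referenced in the execution phase above, here is a minimal, illustrative sketch of a batch transformation step. It assumes pandas, and the column names (order_id, amount, order_ts) and cleaning rules are placeholders rather than part of any specific pipeline.

Python:
# Illustrative batch transformation: deduplicate, handle missing values,
# and normalize types before loading into the warehouse.
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Remove duplicate orders, keeping the first occurrence.
    df = df.drop_duplicates(subset=["order_id"], keep="first")

    # Handle missing values: drop rows without an order_id, default missing amounts to 0.
    df = df.dropna(subset=["order_id"])
    df["amount"] = df["amount"].fillna(0)

    # Convert data types so the downstream load is consistent.
    df["order_id"] = df["order_id"].astype("int64")
    df["amount"] = df["amount"].astype("float64")
    df["order_ts"] = pd.to_datetime(df["order_ts"], errors="coerce")

    return df

# Tiny sample input with a duplicate row, a missing amount, and a missing order_id.
raw = pd.DataFrame({
    "order_id": [1, 1, 2, None],
    "amount": [19.99, 19.99, None, 5.00],
    "order_ts": ["2024-01-01 10:00", "2024-01-01 10:00", "2024-01-02 11:30", "2024-01-03 09:15"],
})

print(transform(raw))

The same cleaning rules would typically be expressed in whatever transformation engine the pipeline uses (SQL, Spark, or a dataframe library); the point is that deduplication, missing-value handling, and type normalization happen before the load, not after.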