A database is a collection of structured data stored in a computer system; it can be hosted on-premises or in the cloud. Because databases are designed to make data easy to access, the articles below cover everything from database management systems to database languages.
User-defined functions (UDFs) are a very useful feature supported in SQL++ (see the UDF documentation). Couchbase 7.6 introduces improvements that provide greater debuggability and visibility into UDF execution. This blog will explore two new UDF-related features in Couchbase 7.6:

- Profiling for SQL++ statements executed in JavaScript UDFs
- EXPLAIN FUNCTION to access query plans of SQL++ statements within UDFs

The examples in this blog require the travel-sample dataset to be installed (see the documentation on installing sample buckets).

Profiling SQL++ Executed in JavaScript UDFs

Query profiling is a debuggability feature that SQL++ offers. When profiling is enabled for a statement's execution, the result of the request includes a detailed execution tree with timing and metrics for each step of the statement's execution. In addition to being returned in the statement's results, the profiling information can also be accessed for the request in the system:active_requests and system:completed_requests system keyspaces. To dive deeper, see request profiling in SQL++.

In Couchbase 7.0, profiling was extended to subqueries, including subqueries within inline UDFs. However, in versions before Couchbase 7.6, profiling did not cover SQL++ statements within JavaScript UDFs. To profile statements within a JavaScript UDF, the user had to open the function's definition, run each statement within the UDF individually, and collect the profiles. This additional step is no longer needed in 7.6.0! Now, when profiling is enabled and a statement executes a JavaScript UDF, profiles for all SQL++ statements executed in the UDF are collected as well. This UDF-related profiling information is available in the request output and in the system:active_requests and system:completed_requests system keyspaces.

Example 1

Create a JavaScript UDF "js1" in a global library "lib1" via the REST endpoint or via the UI.

```javascript
function js1() {
  var query = SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1;
  var res = [];
  for (const row of query) {
    res.push(row);
  }
  query.close();
  return res;
}
```

Create the corresponding SQL++ function.

```sql
CREATE FUNCTION js1() LANGUAGE JAVASCRIPT AS "js1" AT "lib1";
```

Execute the UDF with profiling enabled.

```sql
EXECUTE FUNCTION js1();
```
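How profiling gets enabled is a request-level setting. A hedged sketch of two common ways to do that — via the cbq shell's \SET command or the profile REST parameter, both covered in the request-profiling documentation (host and credentials are placeholders):

```shell
# From the cbq shell, enable profiling for subsequent statements:
#   \SET -profile "timings";
#   EXECUTE FUNCTION js1();

# Or per request, via the Query REST API:
curl -u Administrator:password http://localhost:8093/query/service \
  -d 'statement=EXECUTE FUNCTION js1();' \
  -d 'profile=timings'
```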
JavaScript { "requestID": "2c5576b5-f01d-445f-a35b-2213c606f394", "signature": null, "results": [ [ { "airline": { "callsign": "MILE-AIR", "country": "United States", "iata": "Q5", "icao": "MLA", "id": 10, "name": "40-Mile Air", "type": "airline" } } ] ], "status": "success", "metrics": { "elapsedTime": "20.757583ms", "executionTime": "20.636792ms", "resultCount": 1, "resultSize": 310, "serviceLoad": 2 }, "profile": { "phaseTimes": { "authorize": "12.835µs", "fetch": "374.667µs", "instantiate": "27.75µs", "parse": "251.708µs", "plan": "9.125µs", "primaryScan": "813.249µs", "primaryScan.GSI": "813.249µs", "project": "5.541µs", "run": "27.925833ms", "stream": "26.375µs" }, "phaseCounts": { "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1 }, "phaseOperators": { "authorize": 2, "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1, "project": 1, "stream": 1 }, "cpuTime": "468.626µs", "requestTime": "2023-12-04T20:30:00.369+05:30", "servicingHost": "127.0.0.1:8091", "executionTimings": { "#operator": "Authorize", "#planPreparedTime": "2023-12-04T20:30:00.369+05:30", "#stats": { "#phaseSwitches": 4, "execTime": "1.918µs", "servTime": "1.125µs" }, "privileges": { "List": [] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "2.208µs" }, "~children": [ { "#operator": "ExecuteFunction", "#stats": { "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "22.375µs", "kernTime": "20.271708ms" }, "identity": { "name": "js1", "namespace": "default", "type": "global" } }, { "#operator": "Stream", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 2, "execTime": "26.375µs" }, "serializable": true } ] }, "~udfStatements": [ { "executionTimings": { "#operator": "Authorize", "#stats": { "#phaseSwitches": 4, "execTime": "2.626µs", "servTime": "7.166µs" }, "privileges": { "List": [ { "Priv": 7, "Props": 0, "Target": "default:travel-sample.inventory.airline" } ] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "4.375µs" }, "~children": [ { "#operator": "PrimaryScan3", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "22.082µs", "kernTime": "1.584µs", "servTime": "791.167µs" }, "bucket": "travel-sample", "index": "def_inventory_airline_primary", "index_projection": { "primary_key": true }, "keyspace": "airline", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 45.28617059639748, "fr_cost": 12.1780009122802, "size": 12 }, "scope": "inventory", "using": "gsi" }, { "#operator": "Fetch", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 10, "execTime": "18.376µs", "kernTime": "797.542µs", "servTime": "356.291µs" }, "bucket": "travel-sample", "keyspace": "airline", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 192.01699202888378, "fr_cost": 24.89848658838975, "size": 204 }, "scope": "inventory" }, { "#operator": "InitialProject", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "5.541µs", "kernTime": "1.1795ms" }, "discard_original": true, "optimizer_estimates": { "cardinality": 187, "cost": 194.6878862611588, "fr_cost": 24.912769445246838, "size": 204 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] }, { "#operator": "Limit", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "6.25µs", "kernTime": "333ns" }, "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 24.927052302103924, "fr_cost": 24.927052302103924, "size": 204 } }, { "#operator": 
"Receive", "#stats": { "#phaseSwitches": 3, "execTime": "10.324833ms", "kernTime": "792ns", "state": "running" } } ] } }, "statement": "SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1;", "function": "default:js1" } ], "~versions": [ "7.6.0-N1QL", "7.6.0-1847-enterprise" ] } } } Query Plans With EXPLAIN FUNCTION SQL++ offers another wonderful capability to access the plan of a statement with the EXPLAIN statement. However, the EXPLAIN statement does not extend to plans of statements within UDFs, neither inline nor JavaScript UDFs. In earlier versions, to analyze the query plans for SQL++ within a UDF, it would require the user to open the function’s definition and individually run an EXPLAIN on all the statements within the UDF. These extra steps will be minimized in Couchbase 7.6 with the introduction of a new statement: EXPLAIN FUNCTION. This statement does exactly what EXPLAIN does, but for SQL++ statements within a UDF. Let’s explore how to use the EXPLAIN FUNCTION statement! Syntax explain_function ::= 'EXPLAIN' 'FUNCTION' function function refers to the name of the function. For more detailed information on syntax, please check out the documentation. Prerequisites To execute EXPLAIN FUNCTION, the user requires the correct RBAC permissions. To run EXPLAIN FUNCTION on a UDF, the user must have sufficient RBAC permissions to execute the function. The user must also have the necessary RBAC permissions to execute the SQL++ statements within the UDF function body as well. For more information, refer to the documentation regarding roles supported in Couchbase. Inline UDF EXPLAIN FUNCTION on an inline UDF will return the query plans of all the subqueries within its definition (see inline function documentation). Example 2: EXPLAIN FUNCTION on an Inline Function Create an inline UDF and run EXPLAIN FUNCTION on it. 
```sql
CREATE FUNCTION inline1() {
  (SELECT * FROM default:`travel-sample`.inventory.airport WHERE city = "Zachar Bay")
};
```

```sql
EXPLAIN FUNCTION inline1();
```

The results of the statement above will contain:

- function: The name of the function on which EXPLAIN FUNCTION was run
- plans: An array of plan information with an entry for every subquery within the inline UDF

```json
{ "function": "default:inline1", "plans": [ { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "plan": { "#operator": "Sequence", "~children": [ { "#operator": "IndexScan3", "bucket": "travel-sample", "index": "def_inventory_airport_city", "index_id": "2605c88c115dd3a2", "index_projection": { "primary_key": true }, "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 12.200561852726496, "fr_cost": 12.179450078755286, "size": 12 }, "scope": "inventory", "spans": [ { "exact": true, "range": [ { "high": "\"Zachar Bay\"", "inclusion": 3, "index_key": "`city`", "low": "\"Zachar Bay\"" } ] } ], "using": "gsi" }, { "#operator": "Fetch", "bucket": "travel-sample", "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.082370508382763, "fr_cost": 24.96843677065826, "size": 249 }, "scope": "inventory" }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Filter", "condition": "((`airport`.`city`) = \"Zachar Bay\")", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.100006681495888, "fr_cost": 24.98421650449632, "size": 249 } }, { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "fr_cost": 24.99999623833438, "size": 249 }, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, "statement": "select self.* from `default`:`travel-sample`.`inventory`.`airport` where ((`airport`.`city`) = \"Zachar Bay\")" } ] }
```

JavaScript UDFs

SQL++ statements within JavaScript UDFs can be written in two ways, and EXPLAIN FUNCTION works differently depending on how the SQL++ statement is called. Refer to the documentation on calling SQL++ in JavaScript functions to learn more.

1. Embedded SQL++

Embedded SQL++ is "embedded" in the function body, and its detection is handled by the JavaScript transpiler. EXPLAIN FUNCTION can return query plans for embedded SQL++ statements.

2. SQL++ Executed by the N1QL() Function Call

SQL++ can also be executed by passing a statement, in the form of a string, as an argument to the N1QL() function. When parsing the function for potential SQL++ statements to run EXPLAIN on, it is difficult to resolve such a dynamic string in a function argument — this can only be done reliably at runtime. For this reason, EXPLAIN FUNCTION does not return query plans for SQL++ statements executed via N1QL() calls; instead, it returns the line numbers where the N1QL() calls appear, counted from the beginning of the function definition. The user can then map these line numbers to the actual function definition and investigate further.

Example 3: EXPLAIN FUNCTION on an External JavaScript Function

Create a JavaScript UDF "js2" in a global library "lib1" via the REST endpoint or via the UI.
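If you take the REST route, a hedged sketch of the library-management call is shown below; the /evaluator/v1/libraries endpoint path, port, and credentials should be verified against the JavaScript UDF documentation for your deployment. Here, js2.js is assumed to contain the function source shown next.

```shell
# Create (or replace) the global library "lib1" with the source of js2
curl -X POST http://localhost:8093/evaluator/v1/libraries/lib1 \
  -u Administrator:password \
  -H 'content-type: application/json' \
  --data-binary @js2.js
```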
```javascript
function js2() {
  // SQL++ executed by a N1QL() function call
  var query1 = N1QL("UPDATE default:`travel-sample` SET test = 1 LIMIT 1");
  // Embedded SQL++
  var query2 = SELECT * FROM default:`travel-sample` LIMIT 1;
  var res = [];
  for (const row of query2) {
    res.push(row);
  }
  query2.close();
  return res;
}
```

Create the corresponding SQL++ function.

```sql
CREATE FUNCTION js2() LANGUAGE JAVASCRIPT AS "js2" AT "lib1";
```

Run EXPLAIN FUNCTION on the SQL++ function.

```sql
EXPLAIN FUNCTION js2;
```

The results of the statement above will contain:

- function: The name of the function on which EXPLAIN FUNCTION was run
- line_numbers: An array of line numbers, counted from the beginning of the JavaScript function definition, where N1QL() calls appear
- plans: An array of plan information with an entry for every embedded SQL++ statement within the JavaScript UDF

```json
{ "function": "default:js2", "line_numbers": [ 4 ], "plans": [ { "cardinality": 1, "cost": 25.51560885530435, "plan": { "#operator": "Authorize", "privileges": { "List": [ { "Target": "default:travel-sample", "Priv": 7, "Props": 0 } ] }, "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "PrimaryScan3", "index": "def_primary", "index_projection": { "primary_key": true }, "keyspace": "travel-sample", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 5402.279801258844, "fr_cost": 12.170627071041082, "size": 11 }, "using": "gsi" }, { "#operator": "Fetch", "keyspace": "travel-sample", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 46269.39474997121, "fr_cost": 25.46387878667884, "size": 669 } }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 31591, "cost": 47086.49704894546, "fr_cost": 25.489743820991595, "size": 669 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, { "#operator": "Limit", "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 } } ] }, { "#operator": "Stream", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 }, "serializable": true } ] } }, "statement": "SELECT * FROM default:`travel-sample` LIMIT 1 ;" } ] }
```

Constraints

- If the N1QL() function has been aliased in a JavaScript function definition, EXPLAIN FUNCTION will not be able to return the line numbers where the aliased function was called. An example of such a function definition:

```javascript
function js3() {
  var alias = N1QL;
  var q = alias("SELECT 1");
}
```

- If the UDF contains nested UDF executions, EXPLAIN FUNCTION does not support generating the query plans of SQL++ statements within the nested UDFs.

Summary

Couchbase 7.6 introduces new features for debugging UDFs that let users peek into UDF execution easily.

Helpful References

1. JavaScript UDFs: A guide to JavaScript UDFs; Creating an external UDF
2. The EXPLAIN statement
The Paramount+ streaming platform has outdone itself this NFL season, shattering viewership records during the AFC Championship Game and now the Super Bowl — hailed as the "most-watched telecast in history" with 123.4 million average viewers. Over 200 million people tuned in to the game at some point — roughly two-thirds of the population of the United States — and it set a new benchmark as the most-streamed Super Bowl ever.

In anticipation of this level of interest, Paramount+ finalized its migration to a multi-region architecture early in 2023. Since then, the streaming platform has been operating across multiple regions in Google Cloud, running on a distributed SQL database that functions across multiple distant locations. Before this, the database tier posed the biggest architectural challenge, prompting the team to search for a multi-master distributed database:

"Paramount+ was hosted on a single master (aka read/write) database. A single vertically scaled master database can only carry us forward so far. While the team considered sharding the data and spreading it out, our past experience taught us that this would be a laborious process. We started looking for a new multi-master capable database with the criteria that we had to make sure that we stick to a relational database due to the existing nature of the application. This narrowed down the criteria, and after some internal research and POCs we narrowed it down to a new player in the database space called YugabyteDB." — Quote from a Paramount+ team member

So how can you achieve this level of app scalability and high availability across multiple regions? In this blog, I'll use a sample application to analyze how services like Paramount+ can scale in a multi-region setup.

The Key Component to Getting a Multi-Region Architecture Right

Scaling the application tier across multiple regions is usually a no-brainer: pick the most suitable cloud regions, deploy application instances there, and use a global load balancer to automatically route and load-balance user requests.

Things get more complicated with a multi-region database deployment, especially for transactional applications requiring low latency and data consistency. It is possible to achieve global data consistency by deploying a database instance with a single primary that handles all user read and write requests. However, with this approach only users near the cloud region hosting the database (US East above) experience low latency for read-write requests. Users farther from the database's cloud region face higher latency, since their requests travel longer distances. Additionally, an outage of the server, data center, or region hosting the database can make the application unavailable. Therefore, getting the database right is crucial when designing a multi-region service or application.

Now, let's experiment using YugabyteDB, the distributed database Paramount+ used for the Super Bowl and its global streaming platform.

Two YugabyteDB Design Patterns for Multi-Region Applications

YugabyteDB is a distributed SQL database built on PostgreSQL — essentially a distributed version of PostgreSQL. The database is usually deployed in a multi-node configuration spanning several servers, availability zones, data centers, or regions. YugabyteDB shards data across all nodes and distributes the load by having all nodes process read and write requests.
Transactional consistency is ensured with the Raft consensus protocol, which replicates changes synchronously among the cluster nodes.

In multi-region database deployments, the latency between regions has the biggest impact on application performance. While there is no one-size-fits-all solution for multi-region deployments (with YugabyteDB or any other distributed transactional database), you can pick from several design patterns for global applications and configure your database so that it works best for your application workloads. YugabyteDB offers eight commonly used design patterns to balance read-write latency with two key aspects of highly available systems: the recovery time objective (RTO) and the recovery point objective (RPO).

Now, let's review two of the design patterns from that list of eight — global database and follower reads — by looking into the latency of our sample multi-region application.

Design Pattern #1: Global Database

The global database design pattern assumes that a database is spread across multiple (i.e., three or more) regions or zones. If there's a failure in one zone/region, the nodes in the other regions/zones detect the outage within seconds (RTO) and continue serving application workloads without any loss of data (RPO = 0).

With YugabyteDB, you can reduce the number of cross-region requests by defining a preferred region. All the shard/Raft leaders are then located in the preferred region, delivering low-latency reads for users near that region and predictable latency for those farther away.

I provisioned a three-node YugabyteDB cluster (below) across the US East, Central, and West, with US East configured as the preferred region. Each region hosts an application instance that is connected to the node in the preferred region (US East).

In this configuration, the round-trip latency between an application instance and the database varies with distance from the preferred region. For example, the app instance in US East is 5 ms away from the preferred region, while the instance in US West is 65 ms away. The US West and Central app instances are not connected directly to the database nodes in their local regions, because those nodes would still automatically route all requests to the leaders in the preferred region.

Our sample application is a movie recommendation service that takes user questions in plain English and uses a generative AI stack (OpenAI, Spring AI, and the PostgreSQL pgvector extension) to provide users with relevant movie recommendations.

Suppose you are in the mood for a space adventure movie with an unexpected ending. You connect to the movie recommendation service and send the following API request:

```shell
http GET {app_instance_address}:80/api/movie/search \
  prompt=='a movie about a space adventure with an unexpected ending' \
  rank==7 \
  X-Api-Key:superbowl-2024
```

The application performs a vector similarity search by comparing an embedding generated for the prompt parameter to the embeddings of the movie overviews stored in the database. It then identifies the most relevant movies and sends back the following response in JSON format:

```json
{ "movies": [ { "id": 157336, "overview": "Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.", "releaseDate": "2014-11-05", "title": "Interstellar", "voteAverage": 8.1 }, { "id": 49047, "overview": "Dr. Ryan Stone, a brilliant medical engineer on her first Shuttle mission, with veteran astronaut Matt Kowalsky in command of his last flight before retiring. But on a seemingly routine spacewalk, disaster strikes. The Shuttle is destroyed, leaving Stone and Kowalsky completely alone-tethered to nothing but each other and spiraling out into the blackness of space. The deafening silence tells them they have lost any link to Earth and any chance for rescue. As fear turns to panic, every gulp of air eats away at what little oxygen is left. But the only way home may be to go further out into the terrifying expanse of space.", "releaseDate": "2013-09-27", "title": "Gravity", "voteAverage": 7.3 }, { "id": 13475, "overview": "The fate of the galaxy rests in the hands of bitter rivals. One, James Kirk, is a delinquent, thrill-seeking Iowa farm boy. The other, Spock, a Vulcan, was raised in a logic-based society that rejects all emotion. As fiery instinct clashes with calm reason, their unlikely but powerful partnership is the only thing capable of leading their crew through unimaginable danger, boldly going where no one has gone before. The human adventure has begun again.", "releaseDate": "2009-05-06", "title": "Star Trek", "voteAverage": 7.4 } ], "status": { "code": 200, "success": true } }
```

The response speed of the application — the read latency of this API call — depends on which application instance received and processed your request. If the request originates from US East, the latency can be as low as 9 ms, since the database leaders are just a few milliseconds away from the US East-based application instance. Latency is much higher for the application instances in US Central and West, because they must perform the vector similarity search on the US East database leaders and then receive and process a large result set with detailed information about the suggested movies.

Note: The numbers above are not meant as the baseline of a performance benchmark. I ran a simple experiment on commodity VMs with a handful of shared vCPUs and didn't perform any optimizations for the software stack components. The results were just a quick functional test of this multi-region deployment.

Now, what if you want the application to generate movie recommendations at low latency regardless of the user's location? How can you achieve low-latency reads across all regions? YugabyteDB supports several design patterns that can achieve this, including follower reads.

Design Pattern #2: Follower Reads

The follower reads pattern lets application instances in secondary regions read from local nodes/followers instead of going to the database leaders in the preferred region. Reads become as fast as reads from the leaders, although the followers may not hold the most current data at the time of the request. To use this pattern, I had to:

1. Connect the application instances in US Central and West to the database nodes in their respective regions.
2. Allow follower reads by setting the following flags for the database session:

```sql
SET session characteristics as transaction read only;
SET yb_read_from_followers = true;
```
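As a side note, how stale a follower read is allowed to be is itself tunable per session. A minimal sketch, assuming the documented yb_follower_read_staleness_ms setting (which defaults to 30 seconds):

```sql
-- Allow followers to serve reads that are at most 15 seconds behind the leader
SET yb_follower_read_staleness_ms = 15000;
```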
With this configuration, read latency is similar across all the regions, but there might be a 30 ms data lag on the database node in US Central and a 65 ms lag on the US West node. Why? My multi-region cluster is configured with a replication factor of 3, which means a transaction is considered committed once two nodes out of three confirm the changes. So, if the US East and Central nodes have acknowledged a transaction, the US West node might still be recording the change — which explains the lag during follower reads.

Despite potential data lags, the entire data set on the followers always remains in a consistent state (across all tables and other database objects). YugabyteDB ensures data consistency through its transactional subsystem and the Raft consensus protocol, which replicates changes synchronously across the entire multi-region cluster.

Now, let's use follower reads to send the same HTTP request to the US Central and West instances:

```shell
http GET {app_instance_address}:80/api/movie/search \
  prompt=='a movie about a space adventure with an unexpected ending' \
  rank==7 \
  X-Api-Key:superbowl-2024
```

Now, the read latency across all the regions is consistently low and comparable.

Note: The application instance in US East doesn't need to use the follower reads pattern as long as it can work directly with the leaders in the preferred region.

A Quick Note on Multi-Region Writes

So far, we've used the global database with preferred region and follower reads design patterns to ensure low-latency reads across distant locations. This configuration can tolerate region-level outages with an RTO measured in seconds and RPO = 0 (no data loss).

The tradeoff in this configuration is write latency. If YugabyteDB has to keep a consistent copy of the data across all regions, cross-region latency affects the time the Raft consensus protocol needs to synchronize changes across all locations.

For example, suppose you want to watch the movie "Interstellar." You add it to your watch list with the following API call to the movie recommendation service (note: 157336 is Interstellar's internal ID):

```shell
http PUT {app_instance_address}:80/api/library/add/157336 X-Api-Key:superbowl-2024
```

In my application setup, write latency was lowest for requests originating from the US East-based application instance directly connected to the database node in the preferred region (US East). Latency for writes from the other locations was higher, because their requests had to travel to the leaders in the preferred region before a transaction could be executed and replicated across the entire cluster.

Does this mean that write latency is always high in a multi-region configuration? Not necessarily. YugabyteDB offers several design patterns that allow you to achieve low-latency reads and writes in a multi-region setting. One such pattern is latency-optimized geo-partitioning, where user data is pinned to the locations closest to the users, resulting in single-digit-millisecond latency for reads and writes (a minimal sketch follows below).
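Here is that sketch of the geo-partitioning pattern, using YugabyteDB's documented replica_placement tablespace option; the table, tablespace, and region names are hypothetical:

```sql
-- Pin replicas of a tablespace to a single region
CREATE TABLESPACE us_east_ts WITH (
  replica_placement = '{"num_replicas": 3, "placement_blocks": [
    {"cloud": "gcp", "region": "us-east1", "zone": "us-east1-b", "min_num_replicas": 3}]}'
);

-- Route each user's rows to the partition (and region) closest to them
CREATE TABLE watch_list (
  user_id  bigint NOT NULL,
  movie_id int    NOT NULL,
  region   text   NOT NULL
) PARTITION BY LIST (region);

CREATE TABLE watch_list_us_east PARTITION OF watch_list
  FOR VALUES IN ('us-east') TABLESPACE us_east_ts;
```

With this layout, reads and writes for a user whose rows live in the local partition never have to leave the region.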
Video Summary

Paramount+'s successful transition to a multi-region architecture shows that, with the right design patterns, you can build applications that tolerate region-level outages, scale, and perform at low latency across distant locations. The Paramount+ tech team learned the art of scaling by creating a streaming platform that accommodates millions of users during peak periods with low latency and uninterrupted service. Implementing a multi-region setup correctly is essential: if you pick the right design pattern, you, too, can build multi-region applications that scale and tolerate all sorts of possible outages.

Docker has become an essential tool for developers, offering consistent and isolated environments without installing full-fledged products locally. The ideal setup for microservice development using Spring Boot with MySQL as the backend often involves a remotely hosted database. For rapid prototyping or local development, however, running a MySQL container through Docker offers a more streamlined approach. I encountered a couple of issues while attempting to set up this configuration with Docker Desktop for a proof of concept, and an online search revealed a lack of straightforward guides on integrating Spring Boot microservices with MySQL in Docker Desktop; most resources focus primarily on containerizing the Spring Boot application. Recognizing this gap, I decided to write this short article.

Prerequisites

Before diving in, we need the following:

- A foundational understanding of Spring Boot and microservices architecture
- Familiarity with Docker containers
- Docker Desktop installed on our machine

Docker Desktop Setup

We can install Docker Desktop using this link. Installation is straightforward, with steps that can be navigated efficiently, as illustrated in the accompanying screenshots.

Configuring the MySQL Container

When we launch Docker Desktop for the first time after installation, we are walked through some standard questions (the registration part can be skipped). Once the desktop app is ready, we search for the MySQL container image, as shown below, click Pull, and then run the container. When the container starts, the settings dialog pops up; enter the settings as follows:

- MYSQL_ROOT_PASSWORD: This environment variable specifies the password for the MySQL root superuser account.
- MYSQL_DATABASE: This environment variable specifies the name of a database to be created on image startup. If a user/password was supplied (see below), that user is granted superuser access (corresponding to GRANT ALL) to this database.
- MYSQL_USER, MYSQL_PASSWORD: These variables create a new user and set that user's password. The user is granted superuser permissions for the database specified by the MYSQL_DATABASE variable.

Upon running the container, Docker Desktop displays logs indicating the container's status. We can now connect to the MySQL instance using tools like MySQL Workbench to manage database objects.

Spring Application Configuration

In the Spring application, we add the following to application.properties:

```properties
spring.esign.datasource.jdbc-url=jdbc:mysql://localhost:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false
spring.esign.datasource.username=e-sign
spring.esign.datasource.password=Password1
```

We opted for a custom prefix, spring.esign, over the default spring.datasource for our database configuration. This approach shines in scenarios where the application requires connections to multiple databases.
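For example, a second database could be wired up under its own prefix alongside the first — the names below are hypothetical — with each prefix then getting its own configuration class like the one shown next:

```properties
# Hypothetical second datasource, registered under its own prefix
spring.audit.datasource.jdbc-url=jdbc:mysql://localhost:3306/audit?allowPublicKeyRetrieval=true&useSSL=false
spring.audit.datasource.username=audit
spring.audit.datasource.password=Password2
```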
To enable this custom configuration, we need to define the Spring Boot configuration class ESignDbConfig:

```java
@Configuration
@EnableTransactionManagement
@EnableJpaRepositories(
        entityManagerFactoryRef = "eSignEntityManagerFactory",
        transactionManagerRef = "eSignTransactionManager",
        basePackages = "com.icw.esign.repository")
public class ESignDbConfig {

    @Bean("eSignDataSource")
    @ConfigurationProperties(prefix = "spring.esign.datasource")
    public DataSource geteSignDataSource() {
        return DataSourceBuilder.create().type(HikariDataSource.class).build();
    }

    @Bean(name = "eSignEntityManagerFactory")
    public LocalContainerEntityManagerFactoryBean eSignEntityManagerFactory(
            EntityManagerFactoryBuilder builder,
            @Qualifier("eSignDataSource") DataSource dataSource) {
        return builder.dataSource(dataSource).packages("com.icw.esign.dao").build();
    }

    @Bean(name = "eSignTransactionManager")
    public PlatformTransactionManager eSignTransactionManager(
            @Qualifier("eSignEntityManagerFactory") EntityManagerFactory entityManagerFactory) {
        return new JpaTransactionManager(entityManagerFactory);
    }
}
```

- @Bean("eSignDataSource"): Defines a Spring bean for the eSign module's data source. The @ConfigurationProperties(prefix="spring.esign.datasource") annotation automatically binds all configuration properties starting with spring.esign.datasource from the application's configuration files (such as application.properties or application.yml) to this DataSource object. The method uses DataSourceBuilder to create and configure a HikariDataSource, a highly performant JDBC connection pool. The eSign module thus uses a dedicated database whose connection parameters are isolated from other modules or the main application database.
- @Bean(name = "eSignEntityManagerFactory"): Creates a LocalContainerEntityManagerFactoryBean, which is responsible for creating the EntityManagerFactory that manages the JPA entities specific to the eSign module. The EntityManagerFactory is configured to use the eSignDataSource for its database operations and to scan the package com.icw.esign.dao for entity classes. Only entities in this package or its subpackages are managed by this EntityManagerFactory, and only they can access the eSign database.
- @Bean(name = "eSignTransactionManager"): Defines a PlatformTransactionManager tied to the eSign module's EntityManagerFactory. This transaction manager ensures that all database operations performed by entities managed by the eSignEntityManagerFactory are wrapped in transactions. It lets the application manage transaction boundaries, roll back operations on failure, and commit changes when operations succeed.
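The article doesn't show the entity classes that eSignEntityManagerFactory scans for in com.icw.esign.dao, so here is a hypothetical minimal sketch of one; the field names are illustrative, and the jakarta.persistence namespace assumes Spring Boot 3:

```java
package com.icw.esign.dao;

import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Hypothetical entity mapped to the result of the stored procedure used in the repository below
@Entity
@Table(name = "doc_esign_master")
public class DocESignMaster {

    @Id
    private Long id;

    private String docUuid;
    private String status;

    // Getters and setters omitted for brevity
}
```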
Repository

Now that we have defined the configurations, we can create the repository classes and build the other objects required for the API endpoint.

```java
@Repository
public class ESignDbRepository {

    private static final Logger logger = LoggerFactory.getLogger(ESignDbRepository.class);

    @Qualifier("eSignEntityManagerFactory")
    @Autowired
    private EntityManager entityManager;

    @Autowired
    ObjectMapper objectMapper;

    String P_GET_DOC_ESIGN_INFO = "p_get_doc_esign_info";

    public List<DocESignMaster> getDocumentESignInfo(String docUUID) {
        StoredProcedureQuery proc =
                entityManager.createStoredProcedureQuery(P_GET_DOC_ESIGN_INFO, DocESignMaster.class);
        proc.registerStoredProcedureParameter("v_doc_uuid", String.class, ParameterMode.IN);
        proc.setParameter("v_doc_uuid", docUUID);
        try {
            return (List<DocESignMaster>) proc.getResultList();
        } catch (PersistenceException ex) {
            logger.error("Error while fetching document eSign info for docUUID: {}", docUUID, ex);
        }
        return Collections.emptyList();
    }
}
```

- @Qualifier("eSignEntityManagerFactory"): Specifies which EntityManagerFactory should be used to create the EntityManager, ensuring that the correct database configuration is used for eSign operations.

Conclusion

Integrating Spring Boot microservices with Docker Desktop streamlines microservice development and testing. This guide walked through the essential steps of setting up a Spring Boot application and ensuring seamless communication with a MySQL container hosted in Docker Desktop. This quick setup is useful for a proof of concept or for standing up an isolated local development environment.
Performance tuning in Snowflake is the process of optimizing configuration and SQL queries to improve the efficiency and speed of data operations. It involves adjusting various settings and writing queries so as to reduce execution time and resource consumption, ultimately leading to cost savings and enhanced user satisfaction. Performance tuning is crucial in Snowflake for several reasons:

- Cost efficiency
- Improved query performance
- Enhanced user experience
- Scalability
- Resource optimization
- Concurrency and workload management

Performance Tuning in Snowflake: Techniques

Some of the techniques used for performance tuning in Snowflake are:

- Warehouse sizing and scaling: Adjusting the size of virtual warehouses to fit the workload, and leveraging multi-cluster warehouses for high concurrency
- Query optimization: Writing efficient SQL queries, minimizing the data scanned, and optimizing joins and aggregations
- Data clustering: Organizing data to align with common query patterns to minimize the amount of data scanned
- Caching strategies: Leveraging Snowflake's automatic caching to reduce the need for repeated computations
- Materialized views: Pre-computing and storing complex query results for faster access
- Resource monitors and alerts: Setting up monitors to track and manage compute and storage usage, avoiding unexpected costs
- Features like the search optimization and query acceleration services: Taking advantage of Snowflake-specific features designed to improve performance for specific queries or data scenarios

Performance tuning is an ongoing process as workloads, data volumes, and business needs evolve. Regular monitoring, testing, and adjustment ensure that your Snowflake environment remains efficient, cost-effective, and responsive to the needs of your users and applications. Let us take a look at a few practical examples.

Practical Examples

Performance tuning in Snowflake involves optimizing storage and computation to improve efficiency and reduce costs. Here's an overview of each technique, with code examples where applicable.

1. Minimize Data Scanning

Reducing the amount of data scanned by queries can significantly decrease execution time and costs. This can be achieved by using filters in your queries to limit the rows and columns being read.

```sql
-- Only select the columns and rows you need
SELECT column1, column2
FROM your_table
WHERE your_condition = 'specific_value';
```

2. Clustering

Snowflake does not use traditional indexing (like B-trees in other databases). Instead, it automatically creates micro-partitions and uses metadata about these partitions to optimize query performance. You can influence this process indirectly by clustering your data (a quick way to check the effect is shown below).

```sql
-- Create a clustering key
ALTER TABLE your_table CLUSTER BY (your_column);
```
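To verify that a clustering key is actually paying off, you can inspect the table's clustering statistics; SYSTEM$CLUSTERING_INFORMATION is the documented function, and the table and column names below are placeholders:

```sql
-- Returns JSON with average clustering depth and partition-overlap statistics
SELECT SYSTEM$CLUSTERING_INFORMATION('your_table', '(your_column)');
```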
3. Optimize Joins

Prefer joining on columns with the same data types, and consider approximate joins if exact matches are not necessary. Structuring your SQL to filter data before joining can also reduce the computation needed.

```sql
-- Efficient join with filtering before joining
SELECT *
FROM table1
INNER JOIN (SELECT * FROM table2 WHERE your_condition = 'value') AS filtered_table2
  ON table1.id = filtered_table2.id;
```

4. Utilize Materialized Views

Materialized views store the result of a query and can significantly speed up queries that are run frequently with the same criteria.

```sql
CREATE MATERIALIZED VIEW your_view AS
SELECT columns
FROM your_table
WHERE your_condition = 'value'
GROUP BY columns;
```

5. Partitioning

Snowflake automatically partitions data into micro-partitions. While explicit partitioning is not necessary, you can influence partitioning through clustering.

6. Warehouse Sizing

Adjusting the size of your virtual warehouse can improve performance for larger queries or workloads.

```sql
-- Resize warehouse
ALTER WAREHOUSE your_warehouse SET WAREHOUSE_SIZE = 'X-LARGE';
```

7. Query Caching

Snowflake caches the results of queries for 24 hours, which can be leveraged to speed up repeated queries.

8. Bind Variables

Bind variables can improve query performance by reducing parsing time, especially for repeated queries with different parameters.

```sql
-- Using a session variable
SET my_variable = 'value';
SELECT * FROM your_table WHERE your_column = $my_variable;
```

9. Monitoring

Monitor your queries and warehouses to identify and optimize inefficient operations.

```sql
-- View query history
SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY());
```

10. Enable Auto-Suspend

Automatically suspend a warehouse when it's not in use to save costs.

```sql
ALTER WAREHOUSE your_warehouse SET AUTO_SUSPEND = 300; -- Auto-suspend after 300 seconds of inactivity
```

11. Enable Auto-Resume

Automatically resume a suspended warehouse when a query requests its resources.

```sql
ALTER WAREHOUSE your_warehouse SET AUTO_RESUME = TRUE;
```

12. Drop Unused Tables

Remove tables that are no longer needed to save on storage costs.

```sql
DROP TABLE IF EXISTS your_table;
```

13. Purge Dormant Users

Identify and remove users who are no longer active.

```sql
-- Manual review and action required
SHOW USERS;
```

14. Apply Resource Monitors

Set up resource monitors to track and control compute costs.

```sql
CREATE RESOURCE MONITOR your_monitor
  WITH CREDIT_QUOTA = 100
  TRIGGERS ON 90 PERCENT DO NOTIFY;
```

15. Monitor Warehouses Approaching the Cloud Services Billing Threshold

Keep an eye on warehouse usage to avoid unexpected charges.

```sql
-- Use the ACCOUNT_USAGE schema to monitor warehouse costs
SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY;
```

16. Set Timeouts Appropriately for Workloads

Specify query timeouts to prevent long-running queries from consuming excessive resources.

```sql
ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 1200; -- Set query timeout to 20 minutes
```

17. Search Optimization Service

The Search Optimization Service in Snowflake is designed to improve the performance of queries that filter on one or more columns, and it is especially beneficial for large tables with billions of rows or more. The service reduces the time it takes to retrieve filtered results without requiring any changes to your queries, and it is particularly useful for queries with equality and range conditions. When enabled, Snowflake maintains additional structures to speed up access to filtered data, making it an excellent choice for scenarios requiring frequent access to specific rows of large datasets. However, it incurs extra costs, so it should be enabled only for tables where the performance gains justify the expense.

Example

Consider a large table, sales_data, with billions of rows, against which you frequently run queries to retrieve sales for a specific day. Without search optimization:

```sql
SELECT * FROM sales_data WHERE sale_date = '2023-01-01';
```

This query might take significant time to execute, because Snowflake has to scan a large portion of the table to find the rows that match the condition.
With search optimization, you first enable the service on the sales_data table:

```sql
ALTER TABLE sales_data ADD SEARCH OPTIMIZATION;
```

Then, running the same query can result in faster execution times, as Snowflake can locate the relevant rows more efficiently.

18. Query Acceleration Service

The Query Acceleration Service in Snowflake accelerates specific queries that might otherwise perform poorly due to the nature of the data or the complexity of the query. The service dynamically provides additional compute resources for eligible portions of a query, enhancing performance without manual optimization or tuning. It's especially beneficial for ad hoc, complex analytical queries that involve large datasets and require significant compute resources. The service automatically identifies opportunities to improve query performance and applies acceleration without user intervention.

Example

Consider an analytical query that joins several large tables and computes window functions.

```sql
SELECT DISTINCT a.customer_id,
       SUM(b.transaction_amount) OVER (PARTITION BY a.customer_id) AS total_spent,
       AVG(b.transaction_amount) OVER (PARTITION BY a.customer_id) AS avg_spent
FROM customers a
JOIN transactions b ON a.customer_id = b.customer_id
WHERE b.transaction_date BETWEEN '2023-01-01' AND '2023-01-31';
```

This query might run slowly if the tables involved are large and the computations are complex. By leveraging the Query Acceleration Service, Snowflake can automatically apply optimizations to improve the execution time of such queries without requiring any modifications to the query itself. The Query Acceleration Service is enabled on a per-warehouse basis, and its availability depends on your Snowflake edition. Additional costs may apply when using this service, so evaluate the performance benefits against the costs for your specific use cases.

19. Multi-Cluster Virtual Warehouses

Setting up a multi-cluster virtual warehouse in Snowflake allows you to scale compute resources horizontally to manage varying concurrency demands efficiently. This feature enables multiple clusters of compute resources to operate simultaneously, providing additional processing power when needed and ensuring that multiple users or jobs can run without significant delays or performance degradation. Here's how to set up and configure a multi-cluster warehouse in practice.

Example 1: Creating a Multi-Cluster Warehouse

When you create a multi-cluster warehouse, you specify the minimum and maximum number of clusters it can scale out to, along with the scaling policy.

```sql
CREATE WAREHOUSE my_multi_cluster_warehouse
  WITH WAREHOUSE_SIZE = 'X-SMALL'  -- Size of each cluster
  AUTO_SUSPEND = 300               -- Auto-suspend after 5 minutes of inactivity
  AUTO_RESUME = TRUE               -- Automatically resume when a query is submitted
  MIN_CLUSTER_COUNT = 1            -- Minimum number of clusters
  MAX_CLUSTER_COUNT = 4            -- Maximum number of clusters, allowing scaling up to 4 based on demand
  SCALING_POLICY = 'STANDARD';     -- 'STANDARD' (default) balances queries across available clusters; 'ECONOMY' minimizes the number of clusters used
```

This command sets up a multi-cluster warehouse named my_multi_cluster_warehouse. It starts with a single cluster and can automatically scale up to four clusters depending on the workload.
Each cluster uses an 'X-SMALL' size and implements auto-suspend for cost efficiency.

Example 2: Altering an Existing Warehouse to Multi-Cluster

If you already have a single-cluster warehouse and want to turn it into a multi-cluster warehouse to handle higher concurrency, you can alter its configuration.

```sql
ALTER WAREHOUSE my_warehouse SET
  MIN_CLUSTER_COUNT = 2,       -- Adjust the minimum number of clusters
  MAX_CLUSTER_COUNT = 6,       -- Adjust the maximum number of clusters to allow more scaling
  SCALING_POLICY = 'ECONOMY';  -- Opt for the 'ECONOMY' scaling policy to conserve resources
```

This alters my_warehouse to operate with between 2 and 6 clusters, adapting to workload demands while aiming to conserve resources: under the 'ECONOMY' scaling policy, Snowflake prefers fewer, fuller clusters over more, emptier ones.

Managing workloads: In practical terms, a multi-cluster warehouse can significantly improve how you handle different types of workloads:

- For high concurrency: If many users execute queries simultaneously, the warehouse can scale out to more clusters to accommodate the increased demand, ensuring all users get the resources they need without long wait times.
- For varying workloads: During periods of low activity, the warehouse can scale in to fewer clusters or even suspend entirely, helping manage costs while remaining ready to scale up as demand increases.

Using multi-cluster warehouses effectively requires monitoring, and potentially adjusting configurations, as your workload patterns evolve (a starting point for such monitoring is sketched below). Snowflake's ability to automatically scale and manage compute resources makes it a powerful tool for managing diverse and dynamic workloads with varying concurrency requirements.
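A hedged sketch of that monitoring, using the documented INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY table function (which retains roughly two weeks of data); the warehouse name is a placeholder:

```sql
-- Inspect queued vs. running load to decide whether MIN/MAX_CLUSTER_COUNT needs adjusting
SELECT *
FROM TABLE(INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
    WAREHOUSE_NAME   => 'MY_MULTI_CLUSTER_WAREHOUSE'));
```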
Conclusion

Implementing performance-tuning strategies in Snowflake involves carefully weighing the trade-offs between achieving optimal performance, managing costs effectively, and keeping the data platform versatile and adaptable to changing business needs. This balancing act is crucial: overly aggressive optimization can lead to increased complexity or higher costs, while insufficient optimization results in poor performance and user dissatisfaction. Settings such as warehouse size, and features like auto-suspend and auto-resume, must be adjusted with an understanding of your specific workload patterns and requirements. For instance, selecting the right size for a virtual warehouse means predicting the computational power needed for typical workloads while avoiding the over-provisioning that leads to unnecessary expense. Similarly, data clustering and materialized views should align with common query patterns, so that the query-performance benefits outweigh the additional storage costs and maintenance overhead.

In Part 1 of this series, we looked at MongoDB, one of the most reliable and robust document-oriented NoSQL databases. Here in Part 2, we'll examine another quite unavoidable NoSQL database: Elasticsearch.

More than just a popular and powerful open-source distributed NoSQL database, Elasticsearch is first of all a search and analytics engine. It is built on top of Apache Lucene, the most famous search-engine Java library, and is able to perform real-time search and analysis operations on structured and unstructured data. It is designed to handle large amounts of data efficiently.

Once again, we need to disclaim that this short post is by no means an Elasticsearch tutorial. Accordingly, the reader is strongly advised to make extensive use of the official documentation, as well as the excellent book "Elasticsearch in Action" by Madhusudhan Konda (Manning, 2023), to learn more about the product's architecture and operations. Here, we're just reimplementing the same use case as before, this time using Elasticsearch instead of MongoDB. So, here we go!

The Domain Model

The diagram below shows our *customer-order-product* domain model. This diagram is the same as the one presented in Part 1. Like MongoDB, Elasticsearch is a document data store and, as such, it expects documents to be presented in JSON notation. The only difference is that Elasticsearch needs its data to be indexed in order to handle it.

There are several ways data can be indexed in an Elasticsearch data store: piping it from a relational database, extracting it from a filesystem, streaming it from a real-time source, etc. Whatever the ingestion method, it eventually consists of invoking the Elasticsearch RESTful API via a dedicated client. There are two categories of such dedicated clients:

- REST-based clients like curl, Postman, HTTP modules for Java, JavaScript, Node.js, etc.
- Programming-language SDKs (Software Development Kits): Elasticsearch provides SDKs for the most widely used programming languages, including but not limited to Java and Python.

Indexing a new document with Elasticsearch means creating it using a POST request against a special RESTful API endpoint named _doc. For example, the following request will create a new Elasticsearch index and store a new customer instance in it.

```plaintext
POST customers/_doc/
{
  "id": 10,
  "firstName": "John",
  "lastName": "Doe",
  "email": {
    "address": "john.doe@gmail.com",
    "personal": "John Doe",
    "encodedPersonal": "John Doe",
    "type": "personal",
    "simple": true,
    "group": true
  },
  "addresses": [
    { "street": "75, rue Véronique Coulon", "city": "Coste", "country": "France" },
    { "street": "Wulfweg 827", "city": "Bautzen", "country": "Germany" }
  ]
}
```

Running the request above using curl or the Kibana console (as we'll see later) produces the following result:

```plaintext
{
  "_index": "customers",
  "_id": "ZEQsJI4BbwDzNcFB0ubC",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "_seq_no": 1,
  "_primary_term": 1
}
```

This is Elasticsearch's standard response to a POST request. It confirms having created the index named customers, holding a new customer document identified by an automatically generated ID (in this case, ZEQsJI4BbwDzNcFB0ubC). Other interesting parameters appear here, like _version and especially _shards. Without going into too much detail, Elasticsearch creates indexes as logical collections of documents.
Just like keeping paper documents in a filing cabinet, Elasticsearch keeps documents in an index. Each index is composed of shards, which are physical instances of Apache Lucene, the engine behind the scenes responsible for getting data in and out of storage. They may be either primaries, storing documents, or replicas, storing, as the name suggests, copies of primary shards. More on that in the Elasticsearch documentation; for now, notice that our index named customers is composed of two shards, one of which, of course, is the primary.

A final note: the POST request above doesn't mention an ID value, as it is automatically generated. While this is probably the most common use case, we could have provided our own ID value. In that case, the HTTP request to use isn't POST anymore, but PUT.
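For example, to index the same customer under an explicit ID of 10, a minimal sketch (reusing a subset of the fields shown earlier) would be:

```plaintext
PUT customers/_doc/10
{
  "id": 10,
  "firstName": "John",
  "lastName": "Doe"
}
```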
This library is based on fluent API builder design patterns and provides both synchronous and asynchronous processing models. It requires Java 8 at a minimum. So, what do our fluent API builder-based data repositories look like? Below is an excerpt from the CustomerServiceImpl class, which acts as a data repository for the Customer document. Java @ApplicationScoped public class CustomerServiceImpl implements CustomerService { private static final String INDEX = "customers"; @Inject ElasticsearchClient client; @Override public String doIndex(Customer customer) throws IOException { return client.index(IndexRequest.of(ir -> ir.index(INDEX).document(customer))).id(); } ... As we can see, our data repository implementation must be a CDI bean with application scope. The Elasticsearch Java client is simply injected, thanks to the quarkus-elasticsearch-java-client Quarkus extension. This spares us a lot of boilerplate that we would otherwise have had to write. The only thing we need in order to be able to inject the client is to declare the following property: Properties files quarkus.elasticsearch.hosts = elasticsearch:9200 Here, elasticsearch is the DNS (Domain Name System) name that we associate with the Elasticsearch database server in the docker-compose.yaml file, and 9200 is the TCP port the server uses to listen for connections. The method doIndex() above creates a new index named customers if it doesn't exist and indexes (stores) into it a new document representing an instance of the class Customer. The indexing process is performed based on an IndexRequest accepting as input arguments the index name and the document body. As for the document ID, it is automatically generated and returned to the caller for further reference. The following method allows us to retrieve the customer identified by the ID given as an input argument: Java ... @Override public Customer getCustomer(String id) throws IOException { GetResponse<Customer> getResponse = client.get(GetRequest.of(gr -> gr.index(INDEX).id(id)), Customer.class); return getResponse.found() ? getResponse.source() : null; } ... The principle is the same: using the fluent API builder pattern, we construct a GetRequest instance in a similar way to how we built the IndexRequest, and we run it against the Elasticsearch Java client. The other endpoints of our data repository, allowing us to perform search operations or to update and delete customers, are designed the same way. Please take some time to look at the code to understand how things work. The REST API Our MongoDB REST API interface was simple to implement, thanks to the quarkus-mongodb-rest-data-panache extension, in which the annotation processor automatically generated all the required endpoints. With Elasticsearch, we don't yet benefit from the same comfort and, hence, we need to implement it manually.
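The contract we have to implement can be captured in a plain JAX-RS interface. Here is a minimal, illustrative sketch of what CustomerResource might look like; the method signatures are inferred from the implementation shown next, so treat this as an assumption rather than the project's actual code:
Java
import jakarta.ws.rs.DELETE;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.PUT;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.core.Context;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.core.UriInfo;
import java.io.IOException;

public interface CustomerResource {

    // Create a new customer document
    @POST
    Response createCustomer(Customer customer, @Context UriInfo uriInfo) throws IOException;

    // Retrieve a customer by its Elasticsearch document ID
    @GET
    @Path("{id}")
    Response findCustomerById(@PathParam("id") String id) throws IOException;

    // Update an existing customer
    @PUT
    Response updateCustomer(Customer customer) throws IOException;

    // Remove a customer by its document ID
    @DELETE
    @Path("{id}")
    Response deleteCustomerById(@PathParam("id") String id) throws IOException;
}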
That's not a big deal, as we can inject the data repositories presented previously, as shown below: Java @Path("customers") @Produces(APPLICATION_JSON) @Consumes(APPLICATION_JSON) public class CustomerResourceImpl implements CustomerResource { @Inject CustomerService customerService; @Override public Response createCustomer(Customer customer, @Context UriInfo uriInfo) throws IOException { return Response.accepted(customerService.doIndex(customer)).build(); } @Override public Response findCustomerById(String id) throws IOException { return Response.ok().entity(customerService.getCustomer(id)).build(); } @Override public Response updateCustomer(Customer customer) throws IOException { customerService.modifyCustomer(customer); return Response.noContent().build(); } @Override public Response deleteCustomerById(String id) throws IOException { customerService.removeCustomerById(id); return Response.noContent().build(); } } This is the customer's REST API implementation. The other ones, associated with orders, order items, and products, are similar. Let's now see how to run and test the whole thing. Running and Testing Our Microservices Now that we have looked at the details of our implementation, let's see how to run and test it. We chose to do it by means of the docker-compose utility. Here is the associated docker-compose.yml file: YAML version: "3.7" services: elasticsearch: image: elasticsearch:8.12.2 environment: node.name: node1 cluster.name: elasticsearch discovery.type: single-node bootstrap.memory_lock: "true" xpack.security.enabled: "false" path.repo: /usr/share/elasticsearch/backups ES_JAVA_OPTS: -Xms512m -Xmx512m hostname: elasticsearch container_name: elasticsearch ports: - "9200:9200" - "9300:9300" ulimits: memlock: soft: -1 hard: -1 volumes: - node1-data:/usr/share/elasticsearch/data networks: - elasticsearch kibana: image: docker.elastic.co/kibana/kibana:8.6.2 hostname: kibana container_name: kibana environment: - elasticsearch.url=http://elasticsearch:9200 - csp.strict=false ulimits: memlock: soft: -1 hard: -1 ports: - 5601:5601 networks: - elasticsearch depends_on: - elasticsearch links: - elasticsearch:elasticsearch docstore: image: quarkus-nosql-tests/docstore-elasticsearch:1.0-SNAPSHOT depends_on: - elasticsearch - kibana hostname: docstore container_name: docstore links: - elasticsearch:elasticsearch - kibana:kibana ports: - "8080:8080" - "5005:5005" networks: - elasticsearch environment: JAVA_DEBUG: "true" JAVA_APP_DIR: /home/jboss JAVA_APP_JAR: quarkus-run.jar volumes: node1-data: driver: local networks: elasticsearch: This file instructs the docker-compose utility to run three services: A service named elasticsearch running the Elasticsearch 8.12.2 database A service named kibana running the multipurpose web console providing different options such as executing queries, creating aggregations, and developing dashboards and graphs A service named docstore running our Quarkus microservice Now, you may check that all the required processes are running: Shell $ docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 005ab8ebf6c0 quarkus-nosql-tests/docstore-elasticsearch:1.0-SNAPSHOT "/opt/jboss/containe…" 3 days ago Up 3 days 0.0.0.0:5005->5005/tcp, :::5005->5005/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 8443/tcp docstore 9678c0a04307 docker.elastic.co/kibana/kibana:8.6.2 "/bin/tini -- /usr/l…" 3 days ago Up 3 days 0.0.0.0:5601->5601/tcp, :::5601->5601/tcp kibana 805eba38ff6c elasticsearch:8.12.2 "/bin/tini -- /usr/l…" 3 days ago Up 3 days 0.0.0.0:9200->9200/tcp, :::9200->9200/tcp,
0.0.0.0:9300->9300/tcp, :::9300->9300/tcp elasticsearch $ To confirm that the Elasticsearch server is available and able to run queries, you can connect to Kibana at http://localhost:5601. After scrolling down the page and selecting Dev Tools in the preferences menu, you can run queries against the Elasticsearch server (for example, GET customers/_search). In order to test the microservices, proceed as follows: 1. Clone the associated GitHub repository: Shell $ git clone https://github.com/nicolasduminil/docstore.git 2. Go to the project: Shell $ cd docstore 3. Check out the right branch: Shell $ git checkout elastic-search 4. Build: Shell $ mvn clean install 5. Run the integration tests: Shell $ mvn -DskipTests=false failsafe:integration-test This last command will run the 17 provided integration tests, which should all succeed. You can also use the Swagger UI interface for testing purposes by pointing your preferred browser at http://localhost:8080/q/swagger-ui. Then, in order to test endpoints, you can use the payload in the JSON files located in the src/resources/data directory of the docstore-api project.
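You could also exercise the API directly from the command line; for instance, a quick smoke test with curl might look like this (the payload is hypothetical and abbreviated; the full documents live in the JSON files mentioned above):
Shell
$ curl -X POST http://localhost:8080/customers \
    -H "Content-Type: application/json" \
    -d '{"id": 10, "firstName": "John", "lastName": "Doe"}'
Enjoy!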
In the ever-evolving world of software development, staying up-to-date with the latest tools and frameworks is crucial. One such framework that has been making waves in the NoSQL database space is Eclipse JNoSQL. This article will dive deeply into the latest release, version 1.1.0, and explore its compatibility with Oracle NoSQL. Understanding Eclipse JNoSQL Eclipse JNoSQL is a Java-based framework that facilitates seamless integration between Java applications and NoSQL databases. It leverages Java enterprise standards, specifically Jakarta NoSQL and Jakarta Data, to simplify working with NoSQL databases. The primary objective of this framework is to reduce the cognitive load associated with using NoSQL databases while harnessing the full power of Jakarta EE and Eclipse MicroProfile. With Eclipse JNoSQL, developers can easily integrate NoSQL databases into their projects using WildFly, Payara, Quarkus, or other Java platforms. This framework bridges the Java application layer and various NoSQL databases, making it easier to work with these databases without diving deep into their intricacies. What’s New in Eclipse JNoSQL Version 1.1.0 The latest version of Eclipse JNoSQL, 1.1.0, comes with several enhancements and upgrades to make working with NoSQL databases smoother. Let’s explore some of the notable changes: Jakarta Data Version Upgrade One of the significant updates in Eclipse JNoSQL version 1.1.0 is the upgrade of the Jakarta Data version to M2. To better understand the importance of this upgrade, let’s delve into what Jakarta Data is and how it plays a crucial role in simplifying data access across various database types. Jakarta Data Jakarta Data is a specification that provides a unified API for simplified data access across different types of databases, including both relational and NoSQL databases. This specification is part of the broader Jakarta EE ecosystem, which aims to offer a standardized and consistent programming model for enterprise Java applications. Jakarta Data empowers Java developers to access diverse data repositories straightforwardly and consistently. It achieves this by introducing concepts like Repositories and custom query methods, making data retrieval and manipulation more intuitive and developer-friendly. One of the key features of Jakarta Data is its flexibility in allowing developers to compose custom query methods on Repository interfaces. This flexibility means that developers can craft specific queries tailored to their application’s needs without manually writing complex SQL or NoSQL queries. This abstraction simplifies the interaction with databases, reducing the development effort required to access and manipulate data. The Goal of Jakarta Data The primary goal of Jakarta Data is to provide a familiar and consistent programming model for data access while preserving the unique characteristics and strengths of the underlying data stores. In other words, Jakarta Data aims to abstract away the intricacies of interacting with different types of databases, allowing developers to focus on their application logic rather than the specifics of each database system. By upgrading Eclipse JNoSQL to use Jakarta Data version M2, the framework aligns itself with the latest Jakarta EE standards and leverages the newest features and improvements introduced by Jakarta Data. It ensures that developers using Eclipse JNoSQL can benefit from the enhanced capabilities and ease of data access that Jakarta Data brings.
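To make the notion of custom query methods concrete, here is a minimal, purely illustrative sketch of a Jakarta Data repository; the Book entity and method name are hypothetical (the real repository used in this article appears later):
Java
import jakarta.data.repository.DataRepository;
import jakarta.data.repository.Repository;
import java.util.List;

@Repository
public interface Books extends DataRepository<Book, String> {

    // Derived query method: the query is composed from the method name,
    // so no SQL or NoSQL query string has to be written by hand
    List<Book> findByAuthor(String author);
}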
Enhanced Support for Inheritance One of the standout features of Eclipse JNoSQL is its support for Object-Document Mapping (ODM). It allows developers to work with NoSQL databases in an object-oriented manner, similar to how they work with traditional relational databases. In version 1.1.0, the framework has enhanced its support for inheritance, making it even more versatile when dealing with complex data models. Oracle NoSQL Database: A Brief Overview Before we dive into the code, let’s take a moment to understand the database we’ll work with – Oracle NoSQL Database. Oracle NoSQL Database is a distributed key-value and document database developed by Oracle Corporation. It offers robust transactional capabilities for data manipulation, horizontal scalability to handle large workloads, and simplified administration and monitoring. It is particularly well-suited for applications that require low-latency access to data, flexible data models, and elastic scaling to accommodate dynamic workloads. The Oracle NoSQL Database Cloud Service provides a managed cloud platform for deploying applications that require the capabilities of Oracle NoSQL Database, making it even more accessible and convenient for developers. Show Me the Code We’ll create a simple demo application to showcase the new features of Eclipse JNoSQL 1.1.0 and its compatibility with Oracle NoSQL. This demo will help you understand how to set up the environment, configure dependencies, and interact with Oracle NoSQL using Eclipse JNoSQL. Prerequisites Before we begin, ensure you have an Oracle NoSQL instance running. You can use either the on-premises or the cloud flavor. For local development, you can run Oracle NoSQL in a Docker container with the following command: Shell docker run -d --name oracle-instance -p 8080:8080 ghcr.io/oracle/nosql:latest-ce Setting up the Project We’ll create a Java SE project using the Maven Quickstart Archetype to keep things simple. It will give us a basic project structure to work with. Project Dependencies For Eclipse JNoSQL to work with Oracle NoSQL, we must include specific dependencies. Additionally, we’ll need Jakarta CDI, Eclipse MicroProfile, Jakarta JSONP, and the Eclipse JNoSQL driver for Oracle NoSQL. We’ll also include “datafaker” for generating sample data. Here are the project dependencies: XML <dependencies> <dependency> <groupId>org.eclipse.jnosql.databases</groupId> <artifactId>jnosql-oracle-nosql</artifactId> <version>1.1.0</version> </dependency> <dependency> <groupId>net.datafaker</groupId> <artifactId>datafaker</artifactId> <version>2.0.2</version> </dependency> </dependencies> Configuration Eclipse JNoSQL relies on configuration properties to establish a connection to the database. This is where the flexibility of Eclipse MicroProfile Config shines. You can conveniently define these properties in your application.properties or application.yml file, allowing for easy customization of your database settings. Eclipse JNoSQL caters to both key-value and document databases, and since Oracle NoSQL supports both data models, the integration and configuration options provided by Eclipse JNoSQL ensure a smooth experience, empowering developers to switch effortlessly between these database paradigms to meet their specific application needs.
Properties files # Oracle NoSQL Configuration jnosql.keyvalue.database=cars jnosql.document.database=cars jnosql.oracle.nosql.host=http://localhost:8080 Creating a Model for Car Data in Eclipse JNoSQL After setting up the database configuration, the next step is to define the data model to be stored in it. The process of defining a model is consistent across all databases in Eclipse JNoSQL. For instance, in this example, we will define a data model for cars using a basic Car class. Java @Entity public class Car { @Id private String id; @Column private String vin; @Column private String model; @Column private String make; @Column private String transmission; // Constructors, getters, and setters @Override public boolean equals(Object o) { if (this == o) { return true; } if (o == null || getClass() != o.getClass()) { return false; } Car car = (Car) o; return Objects.equals(id, car.id); } @Override public int hashCode() { return Objects.hashCode(id); } @Override public String toString() { return "Car{" + "id='" + id + '\'' + ", vin='" + vin + '\'' + ", model='" + model + '\'' + ", make='" + make + '\'' + ", transmission='" + transmission + '\'' + '}'; } // Factory method to create a Car instance public static Car of(Faker faker) { Vehicle vehicle = faker.vehicle(); Car car = new Car(); car.id = UUID.randomUUID().toString(); car.vin = vehicle.vin(); car.model = vehicle.model(); car.make = vehicle.make(); car.transmission = vehicle.transmission(); return car; } } In the Car class, we use annotations to define how the class and its fields should be persisted in the database: @Entity: This annotation marks the class as an entity to be stored in the database. @Id: Indicates that the id field will serve as the unique identifier for each Car entity. @Column: Specifies that a field should be persisted as a column in the database. In this case, we annotate each field we want to store in the database. Additionally, we provide getters, setters, equals, hashCode, and toString methods for better encapsulation and compatibility with database operations. We also include a factory method, Car.of(Faker faker), to generate random car data using the “datafaker” library. This data model encapsulates the structure of a car entity, making it easy to persist and retrieve car-related information in your Oracle NoSQL database using Eclipse JNoSQL. Simplifying Database Operations With Jakarta Data Annotations In Eclipse JNoSQL’s latest version, 1.1.0, developers can harness the power of Jakarta Data annotations to streamline and clarify database operations. These annotations allow you to express your intentions in a more business-centric way, making your code more expressive and closely aligned with the actions you want to perform on the database. Here are some of the Jakarta Data annotations introduced in this version: @Insert: Effortless Data Insertion The @Insert annotation signifies the intent to perform an insertion operation in the database. When applied to a method, it indicates that it aims to insert data into the database. This annotation provides clarity and conciseness, making it evident that the method is responsible for adding new records. @Update: Seamless Data Update The @Update annotation is used to signify an update operation. It is beneficial when you want to modify existing records in the database. Eclipse JNoSQL will check if the information to be updated is already present and proceed accordingly. This annotation simplifies the code by explicitly stating its purpose.
@Delete: Hassle-Free Data Removal When you want to delete data from the database, the @Delete annotation comes into play. It communicates that the method’s primary function is to remove information. Like the other annotations, it enhances code readability by conveying the intended action. @Save: Combining Insert and Update The @Save annotation serves a dual purpose. It behaves like the save method in BasicRepository but with added intelligence. It checks if the information is already in the database, and if so, it updates it; otherwise, it inserts new data. This annotation provides a convenient way to handle insertion and updating without needing separate methods. With these Jakarta Data annotations, you can express database operations in a more intuitive and business-centric language. In the context of a car-centric application, such as managing a garage or car collection, you can utilize these annotations to define operations like parking a car and unparking it: Java @Repository public interface Garage extends DataRepository<Car, String> { @Save Car parking(Car car); @Delete void unpark(Car car); Page<Car> findByTransmission(String transmission, Pageable page); } In this example, the @Save annotation is used for parking a car, indicating that this method handles both inserting new cars into the “garage” and updating existing ones. The @Delete annotation is employed for unparking, making it clear that this method is responsible for removing cars from the “garage.” These annotations simplify database operations and enhance code clarity and maintainability by aligning your code with your business terminology and intentions. Executing Database Operations With Oracle NoSQL Now that our entity and repository are set up, let’s create classes to execute the application. These classes will initialize a CDI container to inject the necessary template classes for interacting with the Oracle NoSQL database. Interacting With the Document Database As a first step, we’ll interact with the document database. We’ll inject the DocumentTemplate interface to perform various operations: Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { DocumentTemplate template = container.select(DocumentTemplate.class).get(); // Insert 10 random cars into the database for (int index = 0; index < 10; index++) { Car car = Car.of(faker); template.insert(car); } // Retrieve and print all cars template.select(Car.class).stream().toList().forEach(System.out::println); // Retrieve and print cars with Automatic transmission, ordered by model (descending) template.select(Car.class).where("transmission").eq("Automatic").orderBy("model").desc() .stream().forEach(System.out::println); // Retrieve and print cars with CVT transmission, ordered by make (descending) template.select(Car.class).where("transmission").eq("CVT").orderBy("make").desc() .stream().forEach(System.out::println); } System.exit(0); } In this code, we use the DocumentTemplate to insert random cars into the document database, retrieve and print all cars, and execute specific queries based on transmission type and ordering.
Interacting With the Key-Value Database Oracle NoSQL also supports a key-value model, and we can interact with it as follows: Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { KeyValueTemplate template = container.select(KeyValueTemplate.class).get(); // Create a random car and put it in the key-value database Car car = Car.of(faker); template.put(car); // Retrieve and print the car based on its unique ID System.out.println("The query result: " + template.get(car.id(), Car.class)); // Delete the car from the key-value database template.delete(car.id()); // Attempt to retrieve the deleted car (will return null) System.out.println("The query result: " + template.get(car.id(), Car.class)); } System.exit(0); } In this code, we utilize the KeyValueTemplate to put a randomly generated car into the key-value database, retrieve it by its unique ID, delete it, and attempt to retrieve it again (resulting in null since it’s been deleted). These examples demonstrate how to execute database operations seamlessly with Oracle NoSQL, whether you’re working with a document-oriented or key-value database model, using Eclipse JNoSQL’s template classes. In this final example, we will demonstrate how to interact with the repository using our custom repository interface. This approach simplifies database operations and makes them more intuitive, allowing you to work with your custom-defined terminology and actions. Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { Garage repository = container.select(Garage.class, DatabaseQualifier.ofDocument()).get(); // Parking 10 random cars in the repository for (int index = 0; index < 10; index++) { Car car = Car.of(faker); repository.parking(car); } // Park a car and then unpark it Car car = Car.of(faker); repository.parking(car); repository.unpark(car); // Retrieve the first page of cars with CVT transmission, ordered by model (descending) Pageable page = Pageable.ofPage(1).size(3).sortBy(Sort.desc("model")); Page<Car> page1 = repository.findByTransmission("CVT", page); System.out.println("The first page"); page1.forEach(System.out::println); // Retrieve the second page of cars with CVT transmission System.out.println("The second page"); Pageable secondPage = page.next(); Page<Car> page2 = repository.findByTransmission("CVT", secondPage); page2.forEach(System.out::println); System.out.println("The query result: "); } System.exit(0); } In this code, we obtain a Garage instance through the custom repository interface. We then demonstrate various operations such as parking and unparking cars and querying for cars with specific transmission types, sorted by model and paginated. By utilizing the repository interface with custom annotations like @Save and @Delete, you can express database operations in a business-centric language. This approach enhances code clarity and aligns with your domain-specific terminology, providing a more intuitive and developer-friendly way to interact with the database. Conclusion Eclipse JNoSQL 1.1.0, with its support for Oracle NoSQL databases, simplifies and streamlines the interaction between Java applications and NoSQL data stores.
With the introduction of Jakarta Data annotations and custom repositories, developers can express database operations in a more business-centric language, making code more intuitive and easier to maintain. This article has covered the critical aspects of Eclipse JNoSQL’s interaction with Oracle NoSQL, including setting up configurations, creating data models, and executing various database operations. Whether you are working with document-oriented or key-value databases, Eclipse JNoSQL provides the necessary tools and abstractions to make NoSQL data access a breeze. To dive deeper into the capabilities of Eclipse JNoSQL and explore more code samples, check out the official repository. There, you will find a wealth of information, examples, and resources to help you leverage the power of Eclipse JNoSQL in your Java applications. Eclipse JNoSQL empowers developers to harness the flexibility and scalability of NoSQL databases while adhering to Java enterprise standards, making it a valuable tool for modern application development.
For the next 20 days (don’t ask me why I chose that number), I will be publishing a DynamoDB quick tip per day with code snippets. The examples use the DynamoDB packages from AWS SDK for Go V2 but should be applicable to other languages as well. Day 20: Converting Between Go and DynamoDB Types Posted: 13/Feb/2024 The attributevalue package in the AWS SDK for Go V2 can save you a lot of time, thanks to the Marshal and Unmarshal family of utility functions that can be used to convert between Go types (including structs) and AttributeValues. Here is an example using a Go struct: MarshalMap converts a Customer struct into the map[string]types.AttributeValue that's required by PutItem UnmarshalMap converts the map[string]types.AttributeValue returned by GetItem into a Customer struct Go type Customer struct { Email string `dynamodbav:"email"` Age int `dynamodbav:"age,omitempty"` City string `dynamodbav:"city"` } customer := Customer{Email: "abhirockzz@gmail.com", City: "New Delhi"} item, _ := attributevalue.MarshalMap(customer) client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: item, }) resp, _ := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{"email": &types.AttributeValueMemberS{Value: "abhirockzz@gmail.com"}}, }) var cust Customer attributevalue.UnmarshalMap(resp.Item, &cust) log.Println("item info:", cust.Email, cust.City) Recommended reading: MarshalMap API doc UnmarshalMap API doc AttributeValue API doc Day 19: PartiQL Batch Operations Posted: 12/Feb/2024 You can use batched operations with PartiQL as well, thanks to BatchExecuteStatement. It allows you to batch read as well as write requests. Here is an example (note that you cannot mix both reads and writes in a single batch): Go //read statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) //separate batch for write statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }, { Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) Just like BatchWriteItem, BatchExecuteStatement is limited to 25 statements (operations) per batch. Recommended reading: BatchExecuteStatement API docs Build faster with Amazon DynamoDB and PartiQL: SQL-compatible operations (thanks, Pete Naylor!)
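A caveat worth adding to Day 19: a PartiQL batch does not fail as a whole when one statement fails; each statement reports its own outcome in the Responses field of the output, in the same order as the request. Here is a minimal, self-contained sketch of checking per-statement results (reusing the url_metadata table from the examples above):
Go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	resp, err := client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{
		Statements: []types.BatchStatementRequest{
			{
				Statement:  aws.String("SELECT * FROM url_metadata where shortcode=?"),
				Parameters: []types.AttributeValue{&types.AttributeValueMemberS{Value: "abcd1234"}},
			},
		},
	})
	if err != nil {
		log.Fatal(err) // request-level failure (e.g., validation error)
	}

	// one response per statement, in request order
	for i, r := range resp.Responses {
		if r.Error != nil {
			fmt.Printf("statement %d failed: %s - %s\n", i, r.Error.Code, aws.ToString(r.Error.Message))
			continue
		}
		fmt.Printf("statement %d item: %v\n", i, r.Item)
	}
}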
Day 18: Using a SQL-Compatible Query Language Posted: 6/Feb/2024 DynamoDB supports PartiQL to execute SQL-like select, insert, update, and delete operations. Here is an example of how you would use PartiQL-based queries for a simple URL shortener application. Notice how it uses a (generic) ExecuteStatement API to execute INSERT, SELECT, UPDATE, and DELETE: Go _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }) _, err = client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("SELECT * FROM url_metadata where shortcode=? AND active=true"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err = client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err = client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) Recommended reading: Amazon DynamoDB documentation on PartiQL support ExecuteStatement API docs Day 17: BatchGetItem Operation Posted: 5/Feb/2024 You can combine multiple (up to 100) GetItem requests in a single BatchGetItem operation - this can be done across multiple tables. Here is an example that includes four GetItem calls across two different tables: Go resp, err := client.BatchGetItem(context.Background(), &dynamodb.BatchGetItemInput{ RequestItems: map[string]types.KeysAndAttributes{ "customer": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, { "email": &types.AttributeValueMemberS{Value: "c2@foo.com"}, }, }, }, "Thread": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "ForumName": &types.AttributeValueMemberS{Value: "Amazon DynamoDB"}, "Subject": &types.AttributeValueMemberS{Value: "DynamoDB Thread 1"}, }, { "ForumName": &types.AttributeValueMemberS{Value: "Amazon S3"}, "Subject": &types.AttributeValueMemberS{Value: "S3 Thread 1"}, }, }, ProjectionExpression: aws.String("Message"), }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Just like an individual GetItem call, you can include Projection Expressions and return RCUs. Note that BatchGetItem can only retrieve up to 16 MB of data. Recommended reading: BatchGetItem API doc Day 16: Enhancing Write Performance With Batching Posted: 2/Feb/2024 The DynamoDB BatchWriteItem operation can provide a performance boost by allowing you to squeeze in 25 individual PutItem and DeleteItem requests in a single API call — this can be done across multiple tables.
Here is an example that combines PutItem and DeleteItem operations for two different tables (customer, orders): Go _, err := client.BatchWriteItem(context.Background(), &dynamodb.BatchWriteItemInput{ RequestItems: map[string][]types.WriteRequest{ "customer": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c3@foo.com"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, }, }, }, "orders": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_1234"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_4321"}, }, }, }, }, }, }) Be aware of the following constraints: The total request size cannot exceed 16 MB BatchWriteItem cannot update items Recommended reading: BatchWriteItem API doc Day 15: Using the DynamoDB Expression Package To Build Update Expressions Posted: 31/Jan/2024 The expression package in the DynamoDB Go SDK supports the programmatic creation of Update expressions. Here is an example of how you can build an expression to execute a SET operation of the UpdateItem API and combine it with a Condition expression (update criteria): Go updateExpressionBuilder := expression.Set(expression.Name("category"), expression.Value("standard")) conditionExpressionBuilder := expression.AttributeNotExists(expression.Name("account_locked")) expr, _ := expression.NewBuilder(). WithUpdate(updateExpressionBuilder). WithCondition(conditionExpressionBuilder). Build() resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, UpdateExpression: expr.Update(), ConditionExpression: expr.Condition(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), ReturnValues: types.ReturnValueAllOld, }) Recommended reading: WithUpdate method in the package API docs. Day 14: Using the DynamoDB Expression Package To Build Key Condition and Filter Expressions Posted: 30/Jan/2024 You can use the expression package in the AWS Go SDK for DynamoDB to programmatically build key condition and filter expressions and use them with the Query API. Here is an example: Go keyConditionBuilder := expression.Key("ForumName").Equal(expression.Value("Amazon DynamoDB")) filterExpressionBuilder := expression.Name("Views").GreaterThanEqual(expression.Value(3)) expr, _ := expression.NewBuilder(). WithKeyCondition(keyConditionBuilder). WithFilter(filterExpressionBuilder). Build() _, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String("Thread"), KeyConditionExpression: expr.KeyCondition(), FilterExpression: expr.Filter(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), }) Recommended reading: Key and NameBuilder in the package API docs Day 13: Using the DynamoDB Expression Package To Build Condition Expressions Posted: 25/Jan/2024 Thanks to the expression package in the AWS Go SDK for DynamoDB, you can programmatically build Condition expressions and use them with write operations.
Here is an example of the DeleteItem API: Go conditionExpressionBuilder := expression.Name("inactive_days").GreaterThanEqual(expression.Value(20)) conditionExpression, _ := expression.NewBuilder().WithCondition(conditionExpressionBuilder).Build() _, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: conditionExpression.Condition(), ExpressionAttributeNames: conditionExpression.Names(), ExpressionAttributeValues: conditionExpression.Values(), }) Recommended reading: WithCondition method in the package API docs Day 12: Using the DynamoDB Expression Package To Build Projection Expressions Posted: 24/Jan/2024 The expression package in the AWS Go SDK for DynamoDB provides a fluent builder API with types and functions to create expression strings programmatically along with corresponding expression attribute names and values. Here is an example of how you would build a Projection Expression and use it with the GetItem API: Go projectionBuilder := expression.NamesList(expression.Name("first_name"), expression.Name("last_name")) projectionExpression, _ := expression.NewBuilder().WithProjection(projectionBuilder).Build() _, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String("customer"), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, ProjectionExpression: projectionExpression.Projection(), ExpressionAttributeNames: projectionExpression.Names(), }) Recommended reading: expression package API docs. Day 11: Using Pagination With Query API Posted: 22/Jan/2024 The Query API limits the result set size to 1 MB. Use the ExclusiveStartKey and LastEvaluatedKey elements to paginate over large result sets. You can also reduce the page size by limiting the number of items in the result set with the Limit parameter of the Query operation. Go func paginatedQuery(searchCriteria string, pageSize int32) { currPage := 1 var exclusiveStartKey map[string]types.AttributeValue for { resp, _ := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: searchCriteria}, }, Limit: aws.Int32(pageSize), ExclusiveStartKey: exclusiveStartKey, }) if resp.LastEvaluatedKey == nil { return } currPage++ exclusiveStartKey = resp.LastEvaluatedKey } } Recommended reading: Query Pagination Day 10: Query API With Filter Expression Posted: 19/Jan/2024 With the DynamoDB Query API, you can use Filter Expressions to discard specific query results based on criteria. Note that the filter expression is applied after a Query finishes but before the results are returned. Thus, it has no impact on the RCUs (read capacity units) consumed by the query.
Here is an example that filters out forum discussion threads that have fewer than a specific number of views: Go resp, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), FilterExpression: aws.String("#v >= :num"), ExpressionAttributeNames: map[string]string{ "#v": "Views", }, ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":num": &types.AttributeValueMemberN{Value: numViews}, }, }) Recommended reading: Filter Expressions Day 9: Query API Posted: 18/Jan/2024 The Query API is used to model one-to-many relationships in DynamoDB. You can search for items based on (composite) primary key values using Key Condition Expressions. The value for the partition key attribute is mandatory - the query returns all items with that partition key value. Additionally, you can also provide a sort key attribute and use a comparison operator to refine the search results. With the Query API, you can also: Switch to strongly consistent reads (eventually consistent being the default) Use a projection expression to return only some attributes Return the consumed Read Capacity Units (RCU) Here is an example that queries for a specific thread based on the forum name (partition key) and subject (sort key). It only returns the Message attribute: Go resp, err = client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name and Subject = :sub"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":sub": &types.AttributeValueMemberS{Value: subject}, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("Message"), }) Recommended reading: API Documentation Item Collections Key Condition Expressions Composite primary key Day 8: Conditional Delete Operation Posted: 17/Jan/2024 All the DynamoDB write APIs, including DeleteItem, support criteria-based (conditional) execution. You can use the DeleteItem operation with a condition expression — it must evaluate to true for the operation to succeed. Here is an example that verifies the value of the inactive_days attribute: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("inactive_days >= :val"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":val": &types.AttributeValueMemberN{Value: "20"}, }, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { return } else { log.Fatal(err) } } Recommended reading: Conditional deletes documentation Day 7: DeleteItem API Posted: 16/Jan/2024 The DynamoDB DeleteItem API does what it says - deletes an item.
But it can also: Return the content of the old item (at no additional cost) Return the consumed Write Capacity Units (WCU) Return the item attributes for an operation that failed a condition check (again, no additional cost) Retrieve statistics about item collections, if any, that were affected during the operation Here is an example: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ReturnValues: types.ReturnValueAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) Recommended reading: DeleteItem API doc Day 6: Atomic Counters With UpdateItem Posted: 15/Jan/2024 Need to implement an atomic counter using DynamoDB? If you have a use case that can tolerate over-counting or under-counting (for example, visitor count), use the UpdateItem API. Here is an example that uses the SET operator in an update expression to increment the num_logins attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET num_logins = num_logins + :num"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":num": &types.AttributeValueMemberN{ Value: num, }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Note that every invocation of UpdateItem will increment (or decrement) the counter — hence, it is not idempotent. Recommended reading: Atomic Counters Day 5: Avoid Overwrites When Using DynamoDB UpdateItem API Posted: 12/Jan/2024 The UpdateItem API creates a new item or modifies an existing item's attributes. If you want to avoid overwriting an existing attribute, make sure to use the SET operation with the if_not_exists function. Here is an example that sets the category of an item only if the item does not already have a category attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET category = if_not_exists(category, :category)"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":category": &types.AttributeValueMemberS{ Value: category, }, }, }) Note that the if_not_exists function can only be used in the SET action of an update expression. Recommended reading: DynamoDB documentation Day 4: Conditional UpdateItem Posted: 11/Jan/2024 Conditional operations are helpful in cases when you want a DynamoDB write operation (PutItem, UpdateItem, or DeleteItem) to be executed based on certain criteria. To do so, use a condition expression - it must evaluate to true in order for the operation to succeed. Here is an example that demonstrates a conditional UpdateItem operation.
It uses the attribute_not_exists function: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET first_name = :fn"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":fn": &types.AttributeValueMemberS{ Value: firstName, }, }, ConditionExpression: aws.String("attribute_not_exists(account_locked)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: ConditionExpressions Day 3: UpdateItem Add-On Benefits Posted: 10/Jan/2024 The DynamoDB UpdateItem operation is quite flexible. In addition to supporting many types of update actions, you can: Combine multiple actions in a single update expression (as the example below does with SET and REMOVE) Get the item attributes as they appear before or after they are successfully updated Understand which item attributes failed the condition check (no additional cost) Retrieve the consumed Write Capacity Units (WCU) Here is an example (using AWS Go SDK v2): Go resp, err = client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET last_name = :ln REMOVE category"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":ln": &types.AttributeValueMemberS{ Value: lastName, }, }, ReturnValues: types.ReturnValueAllOld, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: UpdateItem API Update Expressions Day 2: GetItem Add-On Benefits Posted: 9/Jan/2024 Did you know that the DynamoDB GetItem operation also gives you the ability to: Switch to strongly consistent reads (eventually consistent being the default) Use a projection expression to return only some of the attributes Return the consumed Read Capacity Units (RCU) Here is an example (DynamoDB Go SDK): Go resp, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ //email - partition key "email": &types.AttributeValueMemberS{Value: email}, }, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("first_name, last_name"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: GetItem API doc link Projection expressions Day 1: Conditional PutItem Posted: 8/Jan/2024 The DynamoDB PutItem API overwrites the item in case an item with the same primary key already exists. To avoid (or work around) this behavior, use PutItem with an additional condition.
Here is an example that uses the attribute_not_exists function: Go _, err := client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("attribute_not_exists(email)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValues: types.ReturnValueAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { log.Println("failed pre-condition check") return } else { log.Fatal(err) } } With the PutItem operation, you can also: Return the consumed Write Capacity Units (WCU) Get the item attributes as they appeared before (in case an existing item was overwritten during the operation) Retrieve statistics about item collections, if any, that were modified during the operation Recommended reading: API Documentation Condition Expressions Comparison Functions
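One refinement applicable to the error handling shown in several of these tips: instead of matching on the error string, the AWS SDK for Go V2 exposes a typed error that you can detect with errors.As. A minimal, self-contained sketch (the customer table and email value mirror the earlier examples):
Go
package main

import (
	"context"
	"errors"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := dynamodb.NewFromConfig(cfg)

	_, err = client.PutItem(context.Background(), &dynamodb.PutItemInput{
		TableName:           aws.String("customer"),
		Item:                map[string]types.AttributeValue{"email": &types.AttributeValueMemberS{Value: "c1@foo.com"}},
		ConditionExpression: aws.String("attribute_not_exists(email)"),
	})

	// typed check instead of strings.Contains on the error message
	var ccf *types.ConditionalCheckFailedException
	if errors.As(err, &ccf) {
		log.Println("failed pre-condition check")
		return
	}
	if err != nil {
		log.Fatal(err)
	}
}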
Understanding the structures within a Relational Database Management System (RDBMS) is critical to optimizing performance and managing data effectively. Here's a breakdown of the concepts, with examples. RDBMS Structures 1. Partition Partitioning in an RDBMS is a technique to divide a large database table into smaller, more manageable pieces, called partitions, without changing the application's SQL queries. Example Consider a table sales_records that contains sales data over several years. Partitioning this table by year (YEAR column) means that data for each year is stored in a separate partition. This can significantly speed up queries that filter on the partition key, e.g., SELECT * FROM sales_records WHERE YEAR = 2021, as the database only searches the relevant partition. 2. Subpartition Subpartitioning is dividing a partition into smaller pieces, called subpartitions. This is essentially a second level of partitioning and can be used to further organize data within each partition based on another column. Example Using the sales_records table, you might partition the data by year and then subpartition each year's data by quarter. This way, data for each quarter of each year is stored in its own subpartition, potentially improving query performance for searches within a specific quarter of a particular year. 3. Local Index A local index is an index on a partitioned table where each partition has its own independent index. The scope of a local index is limited to its partition, meaning that each index contains only the keys from that partition. Example If the sales_records table is partitioned by year, a local index on the customer_id column will create separate indexes for each year's partition. Queries filtering on both customer_id and year can be very efficient, as the database can quickly locate the partition by year and then use the local index to find records within that partition. 4. Global Index A global index is an index on a partitioned table that is not partition-specific. It includes keys from all partitions of the table, providing a way to search across all partitions quickly. Example A global index on the customer_id column in the sales_records table would enable fast searches for a particular customer's records across all years without needing to access each partition's local index. 5. Create Deterministic Functions for the Same Input and a Known Output A deterministic function in SQL returns the same result every time it's called with the same input. This consistency can be leveraged for optimization purposes, such as function-based indexes. Function Example CREATE OR REPLACE FUNCTION get_discount_category(price NUMBER) RETURN VARCHAR2 DETERMINISTIC IS BEGIN IF price < 100 THEN RETURN 'Low'; ELSIF price BETWEEN 100 AND 500 THEN RETURN 'Medium'; ELSE RETURN 'High'; END IF; END; This function returns a discount category based on the price. Since it's deterministic, the database can optimize calls to this function within queries. 6. Create Bulk Load for Heavy Datasets Bulk loading is the process of efficiently importing large volumes of data into a database. This is crucial for initializing databases with existing data or integrating large datasets periodically. Example In Oracle, you can use SQL*Loader for bulk-loading data. Here's a simple command to load data from a CSV file into the sales_records table.
Shell sqlldr userid=username/password@database control=load_sales_records.ctl direct=true The control file (load_sales_records.ctl) defines how the data in the CSV file maps to the columns in the sales_records table. The direct=true option specifies that SQL*Loader should use a direct path load, which is faster and uses fewer database resources than a conventional path load. SQL Tuning Techniques SQL tuning methodologies are essential for optimizing query performance in relational database management systems. Here's an explanation of the methods, with examples to illustrate each: 1. Explain Plan Analysis An explain plan shows how the database executes a query, including the access paths and methods it uses to reach the data. Analyzing an explain plan helps identify potential performance issues, such as full table scans or inefficient joins. Example EXPLAIN PLAN FOR SELECT * FROM employees WHERE department_id = 10; Analyzing the output might reveal whether the query uses an index or a full table scan, guiding optimization efforts. 2. Gather Statistics Gathering statistics involves collecting data about table size, column distribution, and other characteristics that the query optimizer uses to determine the most efficient query execution plan. Full statistics: Collect statistics for the entire table Incremental statistics: Collect statistics for the parts of the table that have changed since the last collection Example -- Gather full statistics EXEC DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'MY_TABLE'); -- Gather incremental statistics EXEC DBMS_STATS.SET_TABLE_PREFS('MY_SCHEMA', 'MY_TABLE', 'INCREMENTAL', 'TRUE'); EXEC DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'MY_TABLE'); 3. Structure Your Queries for Efficient Joins Structuring your SQL queries to take advantage of the most efficient join methods, based on your data characteristics and access patterns, is critical to query optimization. This strategy involves understanding the nature of your data, the relationships between different data sets, and how your application accesses this data. You can significantly improve query performance by aligning your query design with these factors. Here's a deeper dive into what this entails: Understanding Your Data and Access Patterns Data volume: The size of the data sets you're joining affects which join method will be most efficient. For instance, hash joins might be preferred for joining two large data sets, while nested loops could be more efficient for smaller data sets or when an indexed access path exists. Data distribution and skew: Knowing how your data is distributed and whether it is skewed (e.g., some values are far more common than others) can influence the join strategy. For skewed data, certain optimizations might be necessary to avoid performance bottlenecks. Indexes: The presence of indexes on the join columns can make nested loop joins more efficient, especially if one of the tables involved in the join is significantly smaller than the other. Choosing the right join type: Use inner joins, outer joins, cross joins, etc., based on the logical requirements of your query and the characteristics of your data. Each join type has its performance implications. Order of tables in the join: In certain databases and scenarios, the order in which tables are joined can influence performance, especially for nested loop joins, where the outer table should ideally have fewer rows than the inner table.
Filter early: Apply filters as early as possible in your query to reduce the size of the data sets that need to be joined. This can involve subqueries, CTEs (Common Table Expressions), or WHERE clause optimizations to narrow down the data before it is joined. Use indexes effectively: Design your queries to take advantage of indexes on join columns, where possible. This might involve structuring your WHERE clauses or JOIN conditions to use indexed columns efficiently. Practical Examples For large data set joins: If you're joining two large data sets and you know the join will involve scanning large portions of both tables, structuring your query to use a hash join can be beneficial. Ensure that neither table has a filter that could significantly reduce its size before the join, as this could make a nested loops join more efficient if one of the tables becomes much smaller after filtering. For indexed access: If you're joining a small table to a large table and the large table has an index on the join column, structuring your query to encourage a nested loops join can be advantageous. The optimizer will likely pick this join method, but careful query structuring and hinting can ensure it. Join order and filtering: Consider how the join order and placement of filter conditions can impact performance in complex queries involving multiple joins. Placing the most restrictive filters early in the query can reduce the amount of data being joined in later steps. By aligning your query structure with your data's inherent characteristics and your application's specific access patterns, you can guide the SQL optimizer to choose the most efficient execution paths. This often involves a deep understanding of both the theoretical aspects of how different join methods work and practical knowledge gained from observing the performance of your queries on your specific data sets. Continuous monitoring and tuning are essential for maintaining optimal performance based on changing data volumes and usage patterns. Example: If you're joining a large table with a small table and there's an index on the join column of the large table, structuring the query to ensure the optimizer chooses a nested loop join can be more efficient. 4. Use Common Table Expressions (CTEs) CTEs make your queries more readable and can improve performance by breaking down complex queries into simpler parts. Example SQL WITH RegionalSales AS ( SELECT region, SUM(sales) AS total_sales FROM sales GROUP BY region ) SELECT * FROM RegionalSales WHERE total_sales > 1000000; 5. Use Global Temporary Tables and Indexes Global temporary tables store intermediate results for the duration of a session or transaction, and they can be indexed for faster access. Note that in Oracle, a temporary table created without an ON COMMIT clause defaults to deleting its rows at commit, so we preserve them explicitly here. Example SQL CREATE GLOBAL TEMPORARY TABLE temp_sales ON COMMIT PRESERVE ROWS AS SELECT * FROM sales WHERE year = 2021; CREATE INDEX idx_temp_sales ON temp_sales(sales_id); 6. Multiple Indexes With Different Column Ordering Creating multiple indexes on the same set of columns but in different orders can optimize different query patterns. Example SQL CREATE INDEX idx_col1_col2 ON my_table(col1, col2); CREATE INDEX idx_col2_col1 ON my_table(col2, col1); 7. Use Hints Hints are instructions embedded in SQL statements that guide the optimizer to choose a particular execution plan. Example SQL SELECT /*+ INDEX(my_table my_index) */ * FROM my_table WHERE col1 = 'value'; 8. Joins Using Numeric Values Numeric joins are generally faster than string joins because numeric comparisons are faster than string comparisons.
4. Use Common Table Expressions (CTEs)

CTEs make your queries more readable and can improve performance by breaking down complex queries into simpler parts.

Example

SQL
WITH RegionalSales AS (
    SELECT region, SUM(sales) AS total_sales
    FROM sales
    GROUP BY region
)
SELECT *
FROM RegionalSales
WHERE total_sales > 1000000;

5. Use Global Temporary Tables and Indexes

Global temporary tables store intermediate results for the duration of a session or transaction, and they can be indexed for faster access.

Example

SQL
CREATE GLOBAL TEMPORARY TABLE temp_sales AS
SELECT * FROM sales WHERE year = 2021;

CREATE INDEX idx_temp_sales ON temp_sales(sales_id);

6. Multiple Indexes With Different Column Ordering

Creating multiple indexes on the same set of columns, but in different orders, can optimize different query patterns.

Example

SQL
CREATE INDEX idx_col1_col2 ON my_table(col1, col2);
CREATE INDEX idx_col2_col1 ON my_table(col2, col1);

7. Use Hints

Hints are instructions embedded in SQL statements that guide the optimizer to choose a particular execution plan.

Example

SQL
SELECT /*+ INDEX(my_table my_index) */ *
FROM my_table
WHERE col1 = 'value';

8. Joins Using Numeric Values

Numeric joins are generally faster than string joins because numeric comparisons are cheaper than string comparisons.

Example

Instead of joining on string columns, join on numeric columns, such as IDs, that represent the same data whenever possible.

9. Full Table Scan vs. Partition Pruning

Use a full table scan when you need to access a significant portion of the table or when there's no suitable index. Rely on partition pruning when you're querying partitioned tables and your query can be limited to specific partitions.

Example

SQL
-- Likely results in partition pruning
SELECT * FROM sales_partitioned
WHERE sale_date BETWEEN '2021-01-01' AND '2021-01-31';

10. SQL Tuning Advisor

The SQL Tuning Advisor analyzes SQL statements and provides recommendations for improving performance, such as creating indexes, restructuring the query, or gathering statistics.

Example

In Oracle, you can use the DBMS_SQLTUNE package to run the SQL Tuning Advisor:

SQL
DECLARE
    l_tune_task_id VARCHAR2(100);
BEGIN
    l_tune_task_id := DBMS_SQLTUNE.create_tuning_task(sql_id => 'your_sql_id_here');
    DBMS_SQLTUNE.execute_tuning_task(task_name => l_tune_task_id);
    DBMS_OUTPUT.put_line(DBMS_SQLTUNE.report_tuning_task(l_tune_task_id));
END;

Conclusion

Each of these tuning methodologies targets a specific aspect of SQL performance, from how queries are structured to how the database's optimizer interprets and executes them. Together, these structures and techniques optimize data storage, retrieval, and manipulation in an RDBMS, enabling efficient handling of large datasets and complex queries. By applying them, you can significantly improve the efficiency and speed of your database operations.
In this article, we will analyze how a composite index can help us get better query response times from the DB. In SQL, a composite index is an index built on multiple columns of a table, instead of just one. If you go through the first part of this series, SQL Query Performance Tuning in MySQL - Part 1, you will have a better understanding of the setup. Let us start troubleshooting.

First, we have to insert data in bulk. Create the DB and tables. Here, the employee1 table is created without any primary key, while employee2 has a primary key.

Shell
mysql> create database testdb;
mysql> show databases;
mysql> use testdb;
mysql> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE employee2 (id int primary key,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.02 sec)

mysql> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| employee1        |
| employee2        |
+------------------+
2 rows in set (0.00 sec)

Insert data into these tables using the following stored procedure. I have used multiple if-else blocks simply to avoid inserting the data in one monotonous sequence.

Shell
CREATE DEFINER=`root`@`localhost` PROCEDURE `testdb`.`BulkInsert`()
BEGIN
    DECLARE i INT DEFAULT 1;
    DECLARE j INT DEFAULT 1;
    DECLARE k INT DEFAULT 25;
    truncate table employee1;
    truncate table employee2;
    WHILE (i <= 40000) DO
        INSERT INTO testdb.employee1 (id,LastName,FirstName,Address,profile)
        VALUES(i, CONCAT("last","-",j),CONCAT("first","-",j), CONCAT("address","-",k), CONCAT("engineer","-",i));
        INSERT INTO testdb.employee2 (id,LastName,FirstName,Address,profile)
        VALUES(i, CONCAT("last","-",j),CONCAT("first","-",j), CONCAT("address","-",k), CONCAT("engineer","-",i));
        SET i = i+1;
        IF j < 100 then
            SET j = j+1;
        else
            SET j = 1;
        END IF;
        IF k > 1 then
            SET k = k-1;
        else
            SET k = 25;
        END IF;
    END WHILE;
END

Call this stored procedure:

CALL testdb.BulkInsert();

After a while, it will finish execution, having inserted the records into both tables. Check that the records exist:

Shell
select count(*) from employee1 e;
count(*)|
--------+
   40000|

select count(*) from employee2 e;
count(*)|
--------+
   40000|

Create a composite index on three columns of employee1. We will not create any composite index on the employee2 table, so that we can compare performance.

Shell
mysql> CREATE INDEX compositeindex1 ON employee1 (FirstName,Address,profile);

# Check indexes on the employee1 table
mysql> SHOW INDEXES FROM employee1 \G;
*************************** 1. row ***************************
        Table: employee1
   Non_unique: 1
     Key_name: compositeindex1
 Seq_in_index: 1
  Column_name: FirstName
    Collation: A
  Cardinality: 101
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL
*************************** 2. row ***************************
        Table: employee1
   Non_unique: 1
     Key_name: compositeindex1
 Seq_in_index: 2
  Column_name: Address
    Collation: A
  Cardinality: 101
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL
*************************** 3. row ***************************
        Table: employee1
   Non_unique: 1
     Key_name: compositeindex1
 Seq_in_index: 3
  Column_name: profile
    Collation: A
  Cardinality: 39920
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL
3 rows in set (0.01 sec)

# Check indexes on the employee2 table
mysql> SHOW INDEXES FROM employee2 \G;
*************************** 1. row ***************************
        Table: employee2
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: id
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null:
   Index_type: BTREE
      Comment:
Index_comment:
      Visible: YES
   Expression: NULL
1 row in set (0.00 sec)

Now run select queries on these tables. What we find is that there is some delay in query execution against the employee2 table, which is expected, as we have not created an index on employee2.

Shell
mysql> select * from employee1 where FirstName = "first-96" and Address ="address-5" and profile = "engineer-12196";
+-------+----------+-----------+-----------+----------------+
| id    | LastName | FirstName | Address   | profile        |
+-------+----------+-----------+-----------+----------------+
| 12196 | last-96  | first-96  | address-5 | engineer-12196 |
+-------+----------+-----------+-----------+----------------+
1 row in set (0.00 sec)

mysql> select * from employee2 where FirstName = "first-96" and Address ="address-5" and profile = "engineer-12196";
+-------+----------+-----------+-----------+----------------+
| id    | LastName | FirstName | Address   | profile        |
+-------+----------+-----------+-----------+----------------+
| 12196 | last-96  | first-96  | address-5 | engineer-12196 |
+-------+----------+-----------+-----------+----------------+
1 row in set (0.04 sec)

mysql> select * from employee1 where FirstName = "first-95" and Address ="address-6" and profile = "engineer-39295";
+-------+----------+-----------+-----------+----------------+
| id    | LastName | FirstName | Address   | profile        |
+-------+----------+-----------+-----------+----------------+
| 39295 | last-95  | first-95  | address-6 | engineer-39295 |
+-------+----------+-----------+-----------+----------------+
1 row in set (0.00 sec)

mysql> select * from employee2 where FirstName = "first-95" and Address ="address-6" and profile = "engineer-39295";
+-------+----------+-----------+-----------+----------------+
| id    | LastName | FirstName | Address   | profile        |
+-------+----------+-----------+-----------+----------------+
| 39295 | last-95  | first-95  | address-6 | engineer-39295 |
+-------+----------+-----------+-----------+----------------+
1 row in set (0.04 sec)

Now let us analyze the performance of these queries on the employee1 and employee2 tables. For employee1, data is looked up directly from the index; hence, the actual time spent fetching data is quite low.

Shell
mysql> explain analyze select * from employee1 where FirstName = "first-95" and Address ="address-6" and profile = "engineer-39295";
EXPLAIN
-> Index lookup on employee1 using compositeindex1 (FirstName='first-95', Address='address-6', profile='engineer-39295') (cost=0.35 rows=1) (actual time=0.0858..0.0913 rows=1 loops=1)
1 row in set (0.01 sec)

Query Plan:
-----------
Index Lookup: The database is using an index named compositeindex1 to efficiently find rows in the employee1 table.
Columns Covered: The index covers three columns: FirstName, Address, and profile.
Search Criteria: The query is looking for rows where:
    FirstName = 'first-95'
    Address = 'address-6'
    profile = 'engineer-39295'

Performance Metrics:
--------------------
Cost: The estimated cost of only 0.35 units indicates a very efficient query.
Actual Time: The query took between 0.0858 and 0.0913 milliseconds to execute, which is very fast.
Rows Returned: The query found and returned only 1 matching row.

Explanation:
------------
The database leverages the composite index to directly locate the matching row without needing to scan the entire employee1 table. This is why the query is so efficient. The low cost and execution time confirm that the query is well-optimized and that the index is being used effectively.

For employee2, the index doesn't exist, so the complete table is scanned, which leads to the delay.

Shell
mysql> explain analyze select * from employee2 where FirstName = "first-95" and Address ="address-6" and profile = "engineer-39295";
EXPLAIN
-> Filter: ((employee2.`profile` = 'engineer-39295') and (employee2.Address = 'address-6') and (employee2.FirstName = 'first-95')) (cost=4061 rows=40.1) (actual time=42.7..43.5 rows=1 loops=1)
    -> Table scan on employee2 (cost=4061 rows=40051) (actual time=0.0694..30.6 rows=40000 loops=1)
1 row in set (0.05 sec)

Query Plan Breakdown:
---------------------
Filter: The query aims to retrieve rows from the employee2 table that meet all three conditions:
    profile = 'engineer-39295'
    Address = 'address-6'
    FirstName = 'first-95'
Table Scan: The database is executing a full scan of the entire employee2 table to locate the matching rows. This means it examines every single row in the table, even though only a small fraction might satisfy the filter conditions.

Performance Analysis:
---------------------
Cost: The estimated cost of 4061 units indicates a relatively expensive query in terms of database resources.
Actual Time: The query took 42.7 to 43.5 milliseconds to execute, which is a significant amount of time for a single-row lookup.
Rows Scanned: The database scanned all 40,000 rows in the employee2 table, even though it ultimately found only 1 matching row. This highlights a clear inefficiency.

Optimization Suggestions:
-------------------------
Create a Composite Index: The most impactful optimization would be to create a composite index on the three columns involved in the filter: (profile, Address, FirstName). This index would allow the database to efficiently locate matching rows without having to scan the entire table, leading to a dramatic performance improvement.

Evaluate Maintenance Overhead:
------------------------------
Indexes do create some additional overhead for updates and inserts, so it's essential to consider the trade-off between performance gains and maintenance costs.

Additional Considerations:
--------------------------
Index Specificity: Composite indexes are most beneficial for queries that frequently filter or join data based on the exact column combinations they cover.
Order of Columns: The order of columns in a composite index matters. The database will only use the index effectively if the query's conditions match the leading columns of the index.
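As a sketch of the index suggestion above (the index name is arbitrary, and we mirror employee1's column order here, which for this all-equality filter works just as well as the suggested (profile, Address, FirstName) order):

SQL
-- Sketch: add the missing composite index on employee2
CREATE INDEX compositeindex2 ON employee2 (FirstName, Address, profile);

-- Re-running the earlier query should now show an index lookup instead of a table scan
EXPLAIN ANALYZE SELECT * FROM employee2
WHERE FirstName = 'first-95' AND Address = 'address-6' AND profile = 'engineer-39295';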
Now let us check whether this composite index on three columns can help us even when we run a select query with only two filters.

Shell
mysql> select count(*) from employee1 where FirstName = "first-96" and Address ="address-5";
+----------+
| count(*) |
+----------+
|      400 |
+----------+
1 row in set (0.00 sec)

mysql> explain analyze select count(*) from employee1 where FirstName = "first-96" and Address ="address-5";
EXPLAIN
-> Aggregate: count(0) (cost=214 rows=1) (actual time=0.731..0.731 rows=1 loops=1)
    -> Covering index lookup on employee1 using compositeindex1 (FirstName='first-96', Address='address-5') (cost=174 rows=400) (actual time=0.0337..0.697 rows=400 loops=1)
1 row in set (0.00 sec)

Query Plan:
-----------
Aggregate: count(0): The query counts the number of rows that match the criteria.
Covering Index Lookup: The database efficiently locates the matching rows using the composite index compositeindex1 on the employee1 table.
Columns Covered: FirstName and Address
Search Criteria:
    FirstName = 'first-96'
    Address = 'address-5'

Performance Metrics:
--------------------
Cost: The estimated cost is 214 units (low), indicating good efficiency.
Actual Time: The query took 0.731 milliseconds to execute (fast).
Rows Returned: It found 400 matching rows and returned the count (1 row).

But if we have filters only on columns B and C (here, Address and profile), then the complete table is scanned, and the composite index we created is not sufficient.

Shell
mysql> explain analyze select * from employee1 where Address = "address-5" and profile = "engineer-12196";
EXPLAIN
-> Filter: ((employee1.`profile` = 'engineer-12196') and (employee1.Address = 'address-5')) (cost=4048 rows=399) (actual time=21.1..68.7 rows=1 loops=1)
    -> Table scan on employee1 (cost=4048 rows=39920) (actual time=0.0445..53.9 rows=40000 loops=1)
1 row in set (0.07 sec)

However, if we have filters on columns A and C (FirstName and profile), then the composite index that we created is utilized.
Shell
mysql> explain analyze select count(*) from employee1 where FirstName = "first-96" and profile ="engineer-12196";
EXPLAIN
-> Aggregate: count(0) (cost=31.2 rows=1) (actual time=0.0679..0.068 rows=1 loops=1)
    -> Filter: ((employee1.`profile` = 'engineer-12196') and (employee1.FirstName = 'first-96')) (cost=27.2 rows=40) (actual time=0.0466..0.0631 rows=1 loops=1)
        -> Covering index skip scan on employee1 using compositeindex1 over FirstName = 'first-96', profile = 'engineer-12196' (cost=27.2 rows=40) (actual time=0.043..0.0594 rows=1 loops=1)

Summary

We can summarize as follows: if an SQL query on a large table has more than one filter, and no composite index is available on the filtered columns, we can expect performance problems. We also found that a composite index created on three or more columns helps even when fewer filters are used. Say we created a composite index on columns A, B, and C: it will help query performance not only when the filters are on A, B, and C, but also when they are on A and B, or on A and C. However, when the filters are only on columns B and C, the complete table will be scanned and this composite index will not be utilized. The key points here are:

1. Index Ordering

Composite indexes are ordered based on the column order specified during creation. In this case, the index is ordered by A, then B, then C. The database can use the index to quickly locate rows when the query filters on the leftmost columns of the index. However, if the query filters on columns that are not at the beginning of the index, it might not be able to use the index effectively.

2. Missing Leading Column

Because the query filters only on B and C, it is missing a filter on the leading column A. This means the database can't directly use the index to narrow down the search space effectively. It would potentially need to scan through large portions of the index to find the relevant rows, which could be less efficient than a full table scan in some cases.

To potentially improve performance in this scenario, consider creating a separate index on columns B and C: this would allow the query to use that index directly for filtering, potentially avoiding a table scan, as shown in the sketch below.
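A minimal sketch of that fix on our test table (the index name is arbitrary):

SQL
-- Sketch: a separate index covering the B and C columns (Address, profile)
CREATE INDEX idx_address_profile ON employee1 (Address, profile);

-- The two-filter query from earlier can now be answered via this index
EXPLAIN ANALYZE SELECT * FROM employee1
WHERE Address = 'address-5' AND profile = 'engineer-12196';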
That's it, guys! I hope you will find this article helpful and interesting. I used Bard for query analysis.

MongoDB is one of the most reliable and robust document-oriented NoSQL databases. It allows developers to build feature-rich applications and services with various modern built-in functionalities, like machine learning, streaming, full-text search, etc. While not a classical relational database, MongoDB is nevertheless used by a wide range of business sectors, and its use cases cover all kinds of architecture scenarios and data types.

Document-oriented databases are inherently different from traditional relational ones, where data are stored in tables and a single entity might be spread across several such tables. In contrast, document databases store data in separate, unrelated collections, which eliminates the intrinsic heaviness of the relational model. However, given that real-world domain models are never so simplistic as to consist of unrelated, separate entities, document databases (including MongoDB) provide several ways to define multi-collection connections similar to classical database relationships, but much lighter, more economical, and more efficient.

Quarkus, the "supersonic and subatomic" Java stack, is the new kid on the block that the most trendy and influential developers are desperately grabbing and fighting over. Its modern cloud-native facilities, as well as its design (compliant with best-of-breed standard libraries), together with its ability to build native executables, have seduced Java developers, architects, engineers, and software designers for a couple of years.

We cannot go into further detail here on either MongoDB or Quarkus: the reader interested in learning more is invited to check the documentation on the official MongoDB website or Quarkus website. What we are trying to achieve here is to implement a relatively complex use case consisting of CRUDing a customer-order-product domain model using Quarkus and its MongoDB extension. In an attempt to provide a real-world inspired solution, we're trying to avoid simplistic and caricatural examples based on a zero-connections, single-entity model (there are dozens of those nowadays). So, here we go!

The Domain Model

The diagram below shows our customer-order-product domain model:

As you can see, the central document of the model is Order, stored in a dedicated collection named Orders. An Order is an aggregate of OrderItem documents, each of which points to its associated Product. An Order document also references the Customer who placed it. In Java, this is implemented as follows:

Java
@MongoEntity(database = "mdb", collection = "Customers")
public class Customer
{
    @BsonId
    private Long id;
    private String firstName, lastName;
    private InternetAddress email;
    private Set<Address> addresses;
    ...
}

The code above shows a fragment of the Customer class. This is a POJO (Plain Old Java Object) annotated with the @MongoEntity annotation, whose parameters define the database name and the collection name. The @BsonId annotation is used to configure the document's unique identifier. While the most common use case is to implement the document's identifier as an instance of the ObjectID class, this would introduce a useless tight coupling between the MongoDB-specific classes and our document. The other properties are the customer's first and last name, the email address, and a set of postal addresses. Let's look now at the Order document.
Java
@MongoEntity(database = "mdb", collection = "Orders")
public class Order
{
    @BsonId
    private Long id;
    private DBRef customer;
    private Address shippingAddress;
    private Address billingAddress;
    private Set<DBRef> orderItemSet = new HashSet<>();
    ...
}

Here we need to create an association between an order and the customer who placed it. We could have embedded the associated Customer document in our Order document, but this would have been a poor design because it would have redundantly defined the same object twice. Instead, we use a reference to the associated Customer document, and we do this using the DBRef class. The same thing happens for the set of associated order items where, instead of embedding the documents, we use a set of references.

The rest of our domain model is quite similar and based on the same normalization ideas; for example, the OrderItem document:

Java
@MongoEntity(database = "mdb", collection = "OrderItems")
public class OrderItem
{
    @BsonId
    private Long id;
    private DBRef product;
    private BigDecimal price;
    private int amount;
    ...
}

Here we need to reference the Product that the current order item refers to. Last but not least, we have the Product document:

Java
@MongoEntity(database = "mdb", collection = "Products")
public class Product
{
    @BsonId
    private Long id;
    private String name, description;
    private BigDecimal price;
    private Map<String, String> attributes = new HashMap<>();
    ...
}

That's pretty much all as far as our domain model is concerned. There are, however, some additional packages that we need to look at: serializers and codecs.

In order to be exchanged on the wire, all our objects, be they business or purely technical ones, have to be serialized and deserialized. These operations are the responsibility of specially designated components called serializers/deserializers. As we have seen, we're using the DBRef type to define the associations between different collections. Like any other object, a DBRef instance should be able to be serialized/deserialized. The MongoDB driver provides serializers/deserializers for the majority of the data types likely to be used in the most common cases. However, for some reason, it doesn't provide any for the DBRef type. Hence, we need to implement our own, and this is what the serializers package does. Let's look at these classes:

Java
public class DBRefSerializer extends StdSerializer<DBRef>
{
    public DBRefSerializer()
    {
        this(null);
    }

    protected DBRefSerializer(Class<DBRef> dbrefClass)
    {
        super(dbrefClass);
    }

    @Override
    public void serialize(DBRef dbRef, JsonGenerator jsonGenerator, SerializerProvider serializerProvider) throws IOException
    {
        if (dbRef != null)
        {
            jsonGenerator.writeStartObject();
            jsonGenerator.writeStringField("id", (String) dbRef.getId());
            jsonGenerator.writeStringField("collectionName", dbRef.getCollectionName());
            jsonGenerator.writeStringField("databaseName", dbRef.getDatabaseName());
            jsonGenerator.writeEndObject();
        }
    }
}

This is our DBRef serializer and, as you can see, it's a Jackson serializer. This is because the quarkus-mongodb-panache extension that we're using here relies on Jackson. Perhaps a future release will use JSON-B but, for now, we're stuck with Jackson. It extends the StdSerializer class, as usual, and serializes its associated DBRef object by using the JSON generator, passed as an input argument, to write the DBRef components on the output stream; i.e., the object ID, the collection name, and the database name.
For more information concerning the DBRef structure, please see the MongoDB documentation. The deserializer performs the complementary operation, as shown below:

Java
public class DBRefDeserializer extends StdDeserializer<DBRef>
{
    public DBRefDeserializer()
    {
        this(null);
    }

    public DBRefDeserializer(Class<DBRef> dbrefClass)
    {
        super(dbrefClass);
    }

    @Override
    public DBRef deserialize(JsonParser jsonParser, DeserializationContext deserializationContext) throws IOException, JacksonException
    {
        JsonNode node = jsonParser.getCodec().readTree(jsonParser);
        return new DBRef(node.findValue("databaseName").asText(), node.findValue("collectionName").asText(), node.findValue("id").asText());
    }
}

This is pretty much all that may be said as far as the serializers/deserializers are concerned. Let's move on to see what the codecs package brings us.

Java objects are stored in a MongoDB database using the BSON (Binary JSON) format. In order to store information, the MongoDB driver needs the ability to map Java objects to their associated BSON representations. It does that by means of the Codec interface, which contains the abstract methods required to map Java objects to BSON and the other way around. By implementing this interface, one can define the conversion logic between Java and BSON, and conversely. The MongoDB driver includes the required Codec implementations for the most common types but, again, when it comes to DBRef, this implementation is only a dummy one that raises UnsupportedOperationException. Having contacted the MongoDB driver implementers, I didn't succeed in finding any solution other than implementing my own Codec mapper, as shown by the class DocstoreDBRefCodec. For brevity's sake, we won't reproduce this class' source code here.

Once our dedicated Codec is implemented, we need to register it with the MongoDB driver, such that the driver uses it when it comes to mapping DBRef types to Java objects and conversely. In order to do that, we need to implement the interface CodecProvider which, as shown by the class DocstoreDBRefCodecProvider, returns, via its abstract get() method, the concrete class responsible for performing the mapping; i.e., in our case, DocstoreDBRefCodec. And that's all we need to do here, as Quarkus will automatically discover and use our customized CodecProvider implementation. Please have a look at these classes to see and understand how things are done.
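To give an idea of what such a provider can look like, here is a minimal sketch (this is not the author's actual DocstoreDBRefCodecProvider; only the class names are taken from the article, and the body is an assumption based on the standard CodecProvider contract):

Java
import com.mongodb.DBRef;
import org.bson.codecs.Codec;
import org.bson.codecs.configuration.CodecProvider;
import org.bson.codecs.configuration.CodecRegistry;

// Minimal sketch of a CodecProvider that serves a custom DBRef codec
public class DBRefCodecProvider implements CodecProvider
{
    @Override
    @SuppressWarnings("unchecked")
    public <T> Codec<T> get(Class<T> clazz, CodecRegistry registry)
    {
        if (DBRef.class.isAssignableFrom(clazz))
        {
            // Serve the custom codec described above for DBRef instances
            return (Codec<T>) new DocstoreDBRefCodec();
        }
        // Returning null tells the driver to try the next registered provider
        return null;
    }
}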
The Data Repositories

Quarkus Panache greatly simplifies the data persistence process by supporting both the active record and the repository design patterns. Here, we'll be using the second one. As opposed to similar persistence stacks, Panache relies on compile-time bytecode enhancement of the entities. It includes an annotation processor that automatically performs these enhancements. All that this annotation processor needs in order to do its enhancement job is a declaration like the one below:

Java
@ApplicationScoped
public class CustomerRepository implements PanacheMongoRepositoryBase<Customer, Long> {}

The code above is all that you need to define a complete service able to persist Customer document instances. Your class needs to implement the PanacheMongoRepositoryBase interface, parameterized with your entity type and its object ID type; in our case, a Long. The Panache annotation processor will generate all the methods required to perform the most common CRUD operations, including but not limited to saving, updating, deleting, querying, paging, sorting, transaction handling, etc. All these details are fully explained here. Another possibility is to extend PanacheMongoRepository instead of PanacheMongoRepositoryBase and to use the provided ObjectID keys instead of customizing them as Long, as we did in our example. Whether you choose the first or the second alternative is purely a matter of preference.

The REST API

In order for our Panache-generated persistence service to become effective, we need to expose it through a REST API. In the most common case, we would have to manually craft this API, together with its implementation, consisting of the full set of required REST endpoints. This tedious and repetitive operation may be avoided by using the quarkus-mongodb-rest-data-panache extension, whose annotation processor is able to automatically generate the required REST endpoints out of interfaces having the following pattern:

Java
public interface CustomerResource extends PanacheMongoRepositoryResource<CustomerRepository, Customer, Long> {}

Believe it or not: this is all you need to generate a full REST API implementation, with all the endpoints required to invoke the persistence service generated previously by the mongodb-panache extension's annotation processor.

Now we are ready to build our REST API as a Quarkus microservice. We chose to build this microservice as a Docker image, by means of the quarkus-container-image-jib extension, simply by including the following Maven dependency:

XML
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-container-image-jib</artifactId>
</dependency>

The quarkus-maven-plugin will then create a local Docker image to run our microservice. The parameters of this Docker image are defined in the application.properties file, as follows:

Properties files
quarkus.container-image.build=true
quarkus.container-image.group=quarkus-nosql-tests
quarkus.container-image.name=docstore-mongodb
quarkus.mongodb.connection-string = mongodb://admin:admin@mongo:27017
quarkus.mongodb.database = mdb
quarkus.swagger-ui.always-include=true
quarkus.jib.jvm-entrypoint=/opt/jboss/container/java/run/run-java.sh

Here we define the name of the newly created Docker image as quarkus-nosql-tests/docstore-mongodb. This is the concatenation of the parameters quarkus.container-image.group and quarkus.container-image.name, separated by a "/". The property quarkus.container-image.build having the value true instructs the Quarkus plugin to bind the build operation to the package phase of Maven. This way, simply executing a mvn package command generates a Docker image able to run our microservice, which may be verified by running the docker images command. The property quarkus.jib.jvm-entrypoint defines the command to be run by the newly generated Docker image; quarkus-run.jar is the standard Quarkus microservice startup file, used when the base image is ubi8/openjdk-17-runtime, as in our case. The properties quarkus.mongodb.connection-string and quarkus.mongodb.database = mdb define the MongoDB connection string and the name of the database. Last but not least, the property quarkus.swagger-ui.always-include includes the Swagger UI in our microservice space such that we can test it easily. Let's see now how to run and test the whole thing.
Running and Testing Our Microservices

Now that we've looked at the details of our implementation, let's see how to run and test it. We chose to do it by means of the docker-compose utility. Here is the associated docker-compose.yml file:

YAML
version: "3.7"
services:
  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: admin
      MONGO_INITDB_DATABASE: mdb
    hostname: mongo
    container_name: mongo
    ports:
      - "27017:27017"
    volumes:
      - ./mongo-init/:/docker-entrypoint-initdb.d/:ro
  mongo-express:
    image: mongo-express
    depends_on:
      - mongo
    hostname: mongo-express
    container_name: mongo-express
    links:
      - mongo:mongo
    ports:
      - 8081:8081
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: admin
      ME_CONFIG_MONGODB_URL: mongodb://admin:admin@mongo:27017/
  docstore:
    image: quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT
    depends_on:
      - mongo
      - mongo-express
    hostname: docstore
    container_name: docstore
    links:
      - mongo:mongo
      - mongo-express:mongo-express
    ports:
      - "8080:8080"
      - "5005:5005"
    environment:
      JAVA_DEBUG: "true"
      JAVA_APP_DIR: /home/jboss
      JAVA_APP_JAR: quarkus-run.jar

This file instructs the docker-compose utility to run three services:

A service named mongo running the MongoDB 7 database
A service named mongo-express running the MongoDB administrative UI
A service named docstore running our Quarkus microservice

We should note that the mongo service uses an initialization script mounted on the docker-entrypoint-initdb.d directory of the container. This initialization script creates the MongoDB database named mdb such that it can be used by the microservices.

JavaScript
db = db.getSiblingDB(process.env.MONGO_INITDB_ROOT_USERNAME);
db.auth(
    process.env.MONGO_INITDB_ROOT_USERNAME,
    process.env.MONGO_INITDB_ROOT_PASSWORD,
);
db = db.getSiblingDB(process.env.MONGO_INITDB_DATABASE);
db.createUser(
    {
        user: "nicolas",
        pwd: "password1",
        roles: [{ role: "dbOwner", db: "mdb" }]
    });
db.createCollection("Customers");
db.createCollection("Products");
db.createCollection("Orders");
db.createCollection("OrderItems");

This initialization JavaScript creates a user named nicolas and a new database named mdb. The user has administrative privileges on the database. Four new collections, respectively named Customers, Products, Orders, and OrderItems, are created as well.

In order to test the microservices, proceed as follows:

Clone the associated GitHub repository:
$ git clone https://github.com/nicolasduminil/docstore.git

Go to the project:
$ cd docstore

Build the project:
$ mvn clean install

Check that all the required Docker containers are running:
$ docker ps
CONTAINER ID   IMAGE                                               COMMAND                  CREATED         STATUS         PORTS                                                                                            NAMES
7882102d404d   quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT   "/opt/jboss/containe…"   8 seconds ago   Up 6 seconds   0.0.0.0:5005->5005/tcp, :::5005->5005/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 8443/tcp   docstore
786fa4fd39d6   mongo-express                                       "/sbin/tini -- /dock…"   8 seconds ago   Up 7 seconds   0.0.0.0:8081->8081/tcp, :::8081->8081/tcp                                                        mongo-express
2e850e3233dd   mongo                                               "docker-entrypoint.s…"   9 seconds ago   Up 7 seconds   0.0.0.0:27017->27017/tcp, :::27017->27017/tcp                                                    mongo

Run the integration tests:
$ mvn -DskipTests=false failsafe:integration-test

This last command will run all the integration tests, which should succeed. These integration tests are implemented using the REST Assured library.
The listing below shows one of these integration tests, located in the docstore-domain project:

Java
@QuarkusIntegrationTest
@TestMethodOrder(MethodOrderer.OrderAnnotation.class)
public class CustomerResourceIT
{
    private static Customer customer;

    @BeforeAll
    public static void beforeAll() throws AddressException
    {
        customer = new Customer("John", "Doe", new InternetAddress("john.doe@gmail.com"));
        customer.addAddress(new Address("Gebhard-Gerber-Allee 8", "Kornwestheim", "Germany"));
        customer.setId(10L);
    }

    @Test
    @Order(10)
    public void testCreateCustomerShouldSucceed()
    {
        given()
            .header("Content-type", "application/json")
            .and().body(customer)
            .when().post("/customer")
            .then()
            .statusCode(HttpStatus.SC_CREATED);
    }

    @Test
    @Order(20)
    public void testGetCustomerShouldSucceed()
    {
        assertThat(given()
            .header("Content-type", "application/json")
            .when().get("/customer")
            .then()
            .statusCode(HttpStatus.SC_OK)
            .extract().body().jsonPath().getString("firstName[0]")).isEqualTo("John");
    }

    @Test
    @Order(30)
    public void testUpdateCustomerShouldSucceed()
    {
        customer.setFirstName("Jane");
        given()
            .header("Content-type", "application/json")
            .and().body(customer)
            .when().pathParam("id", customer.getId()).put("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NO_CONTENT);
    }

    @Test
    @Order(40)
    public void testGetSingleCustomerShouldSucceed()
    {
        assertThat(given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).get("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_OK)
            .extract().body().jsonPath().getString("firstName")).isEqualTo("Jane");
    }

    @Test
    @Order(50)
    public void testDeleteCustomerShouldSucceed()
    {
        given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).delete("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NO_CONTENT);
    }

    @Test
    @Order(60)
    public void testGetSingleCustomerShouldFail()
    {
        given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).get("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NOT_FOUND);
    }
}

You can also use the Swagger UI for testing purposes by pointing your preferred browser at http://localhost:8080/q/swagger-ui. Then, in order to test the endpoints, you can use the payloads in the JSON files located in the src/resources/data directory of the docstore-api project. You can also use the MongoDB administrative UI by going to http://localhost:8081 and authenticating yourself with the default credentials (admin/pass).

You can find the project's source code in my GitHub repository. Enjoy!
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Principal Sales Engineer,
Snowflake
Otavio Santana
Award-winning Software Engineer and Architect,
OS Expert
Denis Magda
Head of DevRel,
Yugabyte