Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
Are you looking to get away from proprietary instrumentation? Are you interested in open-source observability but lack the knowledge to just dive right in? If so, this workshop is for you, designed to expand your knowledge and understanding of the open-source observability tooling that is available to you today. Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open-source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability data today. Over the course of this workshop, you will learn what Prometheus is and is not, install it, start collecting metrics, and learn all the things you need to know to become effective at running Prometheus in your observability stack.

In this article, you'll be introduced to some basic concepts and learn what Prometheus is and is not before you start getting hands-on with it in the rest of the workshop.

Introduction to Prometheus

I'm going to get you started on your learning path with this first lab, which provides a quick introduction to everything needed for metrics monitoring with Prometheus. Note that this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself. The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal. In this case, it is fairly simple: this lab introduces you to the Prometheus project and provides you with an understanding of its role in the cloud-native observability community.

The lab starts with background on the beginnings of the Prometheus project and how it came to be part of the Cloud Native Computing Foundation (CNCF) as a graduated project. This leads to a basic outline of what a data point is, how data points are gathered, and what makes them a metric, all using a high-level metaphor. You are then walked through what Prometheus is, why we are looking at this project as an open-source solution for your cloud-native observability needs, and, more importantly, what Prometheus cannot do for you. A basic architecture is presented, walking you through the most common usage and components of a Prometheus metrics deployment. Below you see the final overview of the Prometheus architecture.

You are then presented with an overview of all the powerful features and tools you'll find in your new Prometheus toolbox:

- Dimensional data model - For multi-faceted tracking of metrics
- Query language - PromQL provides a powerful syntax to gather flexible answers across your gathered metrics data
- Time series processing - Integration of metrics time series data processing and alerting
- Service discovery - Integrated discovery of systems and services in dynamic environments
- Simplicity and efficiency - Operational ease combined with implementation in the Go language

Finally, you'll touch on the fact that Prometheus has a very simple design and functioning principle, and that this has an impact on running it as a highly available (HA) component in your architecture. This aspect is only briefly touched upon, but don't worry: we cover it in more depth later in the workshop. At the end of each lab, including this one, you are presented with the end state (in this case, we have not yet done anything), a list of references for further reading, a list of ways to contact me with questions, and a link to the next lab.

Missed Previous Labs?

This is one lab in the more extensive free online workshop.
Feel free to start from the very beginning of this workshop here if you missed anything previously. You can always proceed at your own pace and return any time you like as you work your way through this workshop; just stop and later restart your lab environment to pick up where you left off.

Coming up Next

I'll be taking you through the next lab in this workshop, where you'll learn how to install and set up Prometheus on your own local machine. Stay tuned for more hands-on material to help you with your cloud-native observability journey.
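Before you head into the labs, here is a small taste of where the workshop is going. This is not part of the lab material; it is a hedged sketch that assumes a Prometheus server is already running on its default address (localhost:9090) and uses Prometheus's standard HTTP API to run a simple PromQL query.

```python
import json
import urllib.parse
import urllib.request

# Assumes a Prometheus server on its default port; adjust as needed.
PROMETHEUS_URL = "http://localhost:9090"

def instant_query(promql: str) -> dict:
    """Run an instant PromQL query against Prometheus's HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(f"{PROMETHEUS_URL}/api/v1/query?{params}") as resp:
        return json.load(resp)

if __name__ == "__main__":
    # 'up' reports 1 for every target Prometheus is successfully scraping.
    result = instant_query("up")
    for series in result["data"]["result"]:
        print(series["metric"].get("instance"), "=>", series["value"][1])
```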
Effective resource management is essential to ensure that no single client or task monopolizes resources and causes performance issues for others. Shuffle-sharding is a valuable technique to achieve this. By dividing resources into equal segments and periodically shuffling them, shuffle-sharding can distribute resources evenly and prevent any client or task from relying on a specific segment for too long. This technique is especially useful in scenarios with a risk of bad actors or misbehaving clients or tasks. In this article, we'll explore shuffle-sharding in depth, discussing how it balances resources and improves overall system performance.

Model

Before implementing shuffle-sharding, it's important to understand its key dimensions, parameters, trade-offs, and potential outcomes. Building a model and simulating different scenarios can help you develop a deeper understanding of how shuffle-sharding works and how it may impact your system's performance and availability. That's why we'll explore shuffle-sharding in more detail, using a Colab notebook as our playground. We'll discuss its benefits, limitations, and the factors to consider before implementing it. By the end of this post, you'll have a better idea of what shuffle-sharding can and can't do and whether it's a suitable technique for your specific use case.

In practical applications, shuffle-sharding is often used to distribute available resources evenly among different queries or tasks. This can involve mapping different clients or connections to subsets of nodes or containers, or assigning specific cores to different query types (or 'queries' for short). In our simulation, we linked queries to CPU cores. The goal is to ensure that the available CPU resources are shared fairly among all queries, preventing any query from taking over the resources and negatively impacting the performance of others. To achieve this, each query is limited to only 25% of the available cores, and no two queries have more than one core in common. This helps to minimize overlap between queries and prevent any one query from consuming more than its fair share of resources. Here is a visualization of how the cores (columns) are allocated to each query type (rows) and how overlap between them is minimized (each query has exactly three cores assigned).

The maximum overlap between rows is just one bit (i.e., 33% of the assigned cores), and the average overlap is ~0.5 bits (less than 20% of assigned cores). This means that even if one query type were to take over 100% of its allocated cores, the others would still have enough capacity to run, unlike uniform assignment, where a rogue query could monopolize the whole node's CPU.

To evaluate the impact of different factors on the performance of the system, we conducted four simulations, each with different dimensions:

- Uniform query assignment, where any query type can be assigned to any core, vs. shuffle-sharding assignment, where queries are assigned based on shuffle-sharding principles.
- Baseline, where all queries are well-behaved, vs. the presence of a bad query type that takes 100% of the CPU resources and never completes.

Let's take a look at the error rate (which doesn't include the bad query type, as it fails in 100% of cases). Looking at the error rate plot, we can observe that the Baseline Uniform scenario has a slightly higher saturation point than the Baseline Shuffle-Sharding scenario, reaching around a 5% higher query rate before the system starts to degrade.
This is expected, as shuffle-sharding partitions the CPU cores into smaller sections, which can reduce the efficiency of the resource allocation when the system is near its full capacity. However, when comparing the performance of Uniform vs. Shuffle-Sharding in the presence of a noisy neighbor that seizes all the available resources, we see that Shuffle-Sharding outperforms Uniform by approximately 25%. This demonstrates that the benefits of shuffle-sharding in preventing resource takeover and ensuring fair resource allocation outweigh the minor reduction in efficiency under normal operating conditions. In engineering, trade-offs are a fact of life, and shuffle-sharding is no exception. While it may decrease the saturation point during normal operations, it significantly reduces the risk of outages when things don't go as planned — which is inevitable sooner or later.

System Throughput

In addition to error rates, another key metric for evaluating the performance of a system is throughput, which measures the number of queries the system can handle depending on the QPS rate. To analyze the system's throughput, we looked at the same data from a different angle. In the plot below, we can see a slight difference between the Baseline Uniform and Baseline Shuffle-Sharding scenarios, where Uniform slightly outperforms Sharding at low QPS rates. However, the difference becomes much more significant when we introduce a faulty client/task/query that monopolizes all the available resources. In this scenario, Shuffle-Sharding outperforms Uniform by a considerable margin.

Latency

Now let's look at the latency graphs, which show the average, median (p50), and p90 latency of the different scenarios. In the Uniform scenario, we can see that the latency of all requests approaches the timeout threshold pretty quickly at all levels. This demonstrates that resource monopolization can have a significant impact on the performance of the entire system, even for well-behaved queries. In the Sharding scenario, we can observe that the system handles the situation much more effectively and keeps the latency of well-behaved queries steady, as if nothing had happened, until it reaches a saturation point, which is very close to the total system capacity. This is an impressive result, highlighting the benefits of shuffle-sharding in isolating the latency impact of a noisy or misbehaving neighbor.

CPU Utilization

At the heart of shuffle-sharding is the idea of distributing resources so that the whole ship doesn't sink; at worst, only one section becomes flooded. To illustrate this concept, let's look at the simulated CPU data. In the Uniform simulation, CPU saturation occurs almost instantly, even with low QPS rates. This highlights how resource monopolization can significantly impact system performance, even under minimal load. However, in the Sharding simulation, the system maintains consistent and reliable performance, even under challenging conditions. These simulation results align with the latency and error graphs we saw earlier — the bad actor was isolated and only impacted 25% of the system's capacity, leaving the remaining 75% available for well-behaved queries.

Closing Thoughts

In conclusion, shuffle-sharding is a valuable technique for balancing limited resources between multiple clients or tasks in distributed systems.
Its ability to prevent resource monopolization and ensure fair resource allocation can improve system stability and maintain consistent and reliable performance, even in the presence of faulty clients, tasks, or queries. Additionally, shuffle-sharding can help reduce the blast radius of faults and improve system isolation, highlighting its importance in designing more stable and reliable distributed systems. Of course, in the event of outages, other measures should be applied, such as rate-limiting the offending client/task or moving it to dedicated capacity to minimize system impact. Effective operational practices are critical to maximize the benefits of shuffle-sharding. For other techniques that can be used in conjunction with shuffle-sharding, check out the links below. Also, feel free to play around with the simulation and change the parameters such as the number of query types, cores, etc. to get a sense of the model and how different parameters may affect it. This post continues the theme of improving service performance/availability touched on in previous posts Ensuring Predictable Performance in Distributed Systems, Navigating the Benefits and Risks of Request Hedging for Network Services, and FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency?.
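If you'd like a feel for how such an assignment can be generated before opening the notebook, here is a minimal Python sketch (not the notebook's actual code) of building shards with bounded overlap. The core count (12) and shard size (3) are assumptions chosen to match the "~25% of cores, exactly three cores per query type" setup described above; everything else is illustrative.

```python
import itertools
import random

NUM_CORES = 12   # total cores on the node (assumed for illustration)
SHARD_SIZE = 3   # cores per query type (~25% of the node)
MAX_OVERLAP = 1  # no two query types may share more than one core

def build_shards(num_query_types: int, seed: int = 42) -> list[set[int]]:
    """Pick a random core subset per query type, keeping pairwise overlap low."""
    rng = random.Random(seed)
    candidates = list(itertools.combinations(range(NUM_CORES), SHARD_SIZE))
    rng.shuffle(candidates)
    shards: list[set[int]] = []
    for combo in candidates:
        shard = set(combo)
        # Accept the candidate only if it overlaps every existing shard in at most one core.
        if all(len(shard & existing) <= MAX_OVERLAP for existing in shards):
            shards.append(shard)
        if len(shards) == num_query_types:
            return shards
    raise ValueError("Not enough low-overlap shards; relax the constraints.")

if __name__ == "__main__":
    for i, shard in enumerate(build_shards(num_query_types=8)):
        print(f"query type {i}: cores {sorted(shard)}")
```

A greedy rejection scheme like this is the simplest way to get the "at most one shared core" property; production systems typically derive the shard deterministically from a client identifier instead of a random seed so that assignments stay stable across restarts.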
Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing. The importance of SRE in improving user experience, system efficiency, scalability, and reliability, and achieving better business outcomes is also discussed. Site Reliability Engineering (SRE) is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It combines the principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. SRE is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement. SRE is a combination of software engineering and IT operations, combining the principles of DevOps with a focus on reliability. The goal of SRE is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers. SRE 5 Why's 1. Why Is SRE Important for Organizations? SRE is important for organizations because it ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes. 2. Why Is SRE Necessary in Today's Technology Landscape? SRE is necessary for today's technology landscape because systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems. 3. Why Does SRE Involve Combining Software Engineering and Systems Administration? SRE involves combining software engineering and systems administration because both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production. 4. Why Is Infrastructure Readiness Testing a Critical Component of SRE? Infrastructure Readiness Testing is a critical component of SRE because it ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance. 5. Why Is Chaos Engineering an Important Aspect of SRE? Chaos Engineering is an important aspect of SRE because it tests the system's ability to handle and recover from failures in real-world conditions. 
By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures.

Key Elements of SRE

- Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets.
- Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing.
- Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning.
- Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability.
- Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system.

Reliability Metrics in SRE

Reliability metrics are used in SRE to measure the quality and stability of systems, as well as to guide continuous improvement efforts.

- Availability: This metric measures the proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running.
- Response Time: This measures the time it takes for the infrastructure to respond to a user request.
- Throughput: This measures the number of requests that can be processed in a given time period.
- Resource Utilization: This measures the utilization of the infrastructure's resources, such as CPU, memory, network, heap, caching, and storage.
- Error Rate: This measures the number of errors or failures that occur during the testing process.
- Mean Time to Recovery (MTTR): This metric measures the average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs.
- Mean Time Between Failures (MTBF): This metric measures the average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades.

Reliability Testing in SRE

- Performance Testing: This involves evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under a business-as-usual (BAU) 1X load scenario.
- Load Testing: This technique involves simulating real-world user traffic and measuring the performance of the infrastructure under a heavy 2X load.
- Stress Testing: This technique involves applying more load than the expected maximum to test the infrastructure's ability to handle unexpected traffic spikes (3X load).
- Chaos or Resilience Testing: This involves simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating.
- Security Testing: This involves evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks.
- Capacity Planning: This involves evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet growing demand.

Workload Modeling in SRE

Workload modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems.
Little's Law is a key principle in this area. It states that the average number of items in a system (W) equals the average arrival rate (λ) multiplied by the average time each item spends in the system (T): W = λ × T. This formula can be used to determine the expected number of requests a system can handle under different conditions.

Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. To apply Little's Law, the units must be consistent, so first convert the arrival rate to requests per second:

W = λ × T = (200 requests/minute ÷ 60 seconds/minute) × 2 seconds ≈ 6.7 requests

This result indicates that, on average, about 7 requests are in flight in the system at any given moment. If the system can serve fewer concurrent requests than that, work starts to queue, latency grows, and reliability degrades. By using the right workload modeling, organizations can determine the maximum workload that their systems can handle and take proactive steps to scale their infrastructure and improve reliability, and it allows them to identify potential issues and design solutions to improve system performance before they become real problems.

Tools and techniques used for modeling and simulation:

- Performance Profiling: This technique involves monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits.
- Load Testing: This is the process of simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads.
- Traffic Modeling: This involves creating a mathematical model of the expected traffic patterns on a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
- Resource Utilization Modeling: This involves creating a mathematical model of the expected resource utilization of a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
- Capacity Planning Tools: There are various tools available that automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools.
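To make the Little's Law arithmetic above easy to reproduce (and to avoid the unit mix-ups that often creep into these calculations), here is a small, hedged Python helper; the numbers mirror the example in the text but are otherwise illustrative.

```python
def littles_law_in_flight(arrival_rate_per_min: float, response_time_sec: float) -> float:
    """Average number of requests in the system: W = lambda * T, with consistent units."""
    arrivals_per_sec = arrival_rate_per_min / 60.0  # convert to requests/second
    return arrivals_per_sec * response_time_sec

if __name__ == "__main__":
    # The example from the text: 200 requests/minute with a 2-second response time.
    w = littles_law_in_flight(arrival_rate_per_min=200, response_time_sec=2)
    print(f"~{w:.1f} requests in flight on average")  # ~6.7
```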
Chaos Engineering and Infrastructure Readiness in SRE

Chaos engineering and infrastructure readiness are important components of a successful SRE strategy. Both involve intentionally inducing failures and stress into systems to assess their strength and identify weaknesses. Infrastructure readiness testing is done to verify the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions. The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures. By proactively identifying and fixing weaknesses, organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI/CD) ensures systems are thoroughly tested and validated before deployment.

Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. This is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response.

Example Scenarios for Chaos Testing

- Random Instance Termination: Selecting and terminating an instance from a cluster to test the system's response to the failure.
- Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover.
- Increased Load: Increasing the load on the system to test its response to stress and observing any performance degradation or resource exhaustion.
- Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors.
- Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior.

By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability.
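As a concrete illustration of the first scenario above (random instance termination), here is a deliberately small, hedged Python sketch. It assumes you have kubectl access to a non-production cluster and that the pods are managed by a controller (such as a Deployment) that will recreate them; it is a toy experiment, not a full chaos-engineering tool.

```python
import random
import subprocess

def terminate_random_pod(namespace: str = "default") -> None:
    """Toy chaos experiment: delete one randomly chosen pod and let the
    controller restore it. Run only against a cluster you are allowed to break."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not pods:
        print("No pods found; nothing to terminate.")
        return
    victim = random.choice(pods)
    print(f"Terminating {victim} ...")
    subprocess.run(["kubectl", "delete", "-n", namespace, victim], check=True)

if __name__ == "__main__":
    terminate_random_pod()
```

Watching how quickly the replacement pod becomes ready, and whether clients noticed, is the actual experiment; the deletion itself is the easy part.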
Conclusion

In conclusion, SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems. By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.

Connectivity is all around us. By now, we are all used to instant connectivity that puts the world at our fingertips. We can purchase, post, and pick anything, anywhere, with the aid of desktops and devices. But how does it happen? How do different applications on different devices connect with each other, allowing us to place an order, plan a vacation, or make a reservation with just a few clicks? The answer is the API (Application Programming Interface), the often underrated, unsung hero of the modern world.

What Is an API?

APIs are the building blocks of online connectivity. They are a medium through which multiple applications, data sources, and devices interact with each other. Simply put, an API is a messenger that takes a request, tells the system what you want to do, and then returns the response to the user. Documentation is drafted for every API, including specifications for how information gets transferred between the two systems.

Why Is an API Important?

APIs can interact with third-party applications publicly, ultimately upscaling the reach of an organization's business. So, when we book a ticket via Bookmyshow.com, we fill in details regarding the movie we plan to watch, like:

- Movie name
- Locality
- 3D/2D
- Language

These details are fetched by the API and taken to the servers associated with different movie theatres, which return a collected response from multiple third-party servers, giving users the convenience of choosing which theatre fits best. This is how different applications interact with each other.

Instead of building one large application and adding ever more functionality to its code, the present time demands a microservice architecture, wherein we create multiple individually focused modules with well-defined interfaces and then combine them to make a scalable, testable product. A product or piece of software that might have taken a year to deliver can now be delivered in weeks with the help of microservice architecture, and APIs are a necessity for microservice architecture. Consider an application that delivers music, shopping, and bill payment services to end users under a single hood. The user needs to log into the app and select the service for consumption. An API is needed to make the different services of such an application work together, contributing to an overall enhanced UX.

An API also adds an extra layer of security to the data: the user's data is not overexposed to the server, nor is the server's data overexposed to the user. Say, in the case of movies, the API tells the server what the user would like to watch and then tells the user what they have to give to redeem the service. Ultimately, you get to watch your movie, and the service provider is credited accordingly.

API Performance Monitoring and Application Performance Monitoring Differences

As similar as these two terms sound, they perform distinct checks on the overall application connectivity.

Application performance monitoring is necessary for high-level analytics regarding how well the app is executing internally. It facilitates a check on the internal connectivity of the software. The following are key data factors that must be monitored:

- Server loads
- User adoption
- Market share
- Downloads
- Latency
- Error logging

API performance monitoring is required to check whether there are any bottlenecks outside the server; they could be in the cloud or in a load-balancing service.
These bottlenecks are not dependent on your application performance monitoring but are still considered catastrophic, as they may disrupt the service for end users. API performance monitoring facilitates a check on the external connectivity of the software, aiding its core functionalities:

- Back-end business operations
- Alert operations
- Web services

Why Is API Performance Monitoring a Necessity?

1. Functionality: With the emergence of modern agile practices, organizations are adopting a virtuous cycle of developing, testing, delivering, and maintaining by monitoring the response. It is integral to involve API monitoring as part of this practice. A script must be maintained against the appropriate and latest versions of the functional tests to ensure a flawless experience of services for the end user. Simply put, if your API goes south, your app goes with it. For instance, in January 2016, the Twitter API suffered a worldwide outage. The outage lasted more than an hour, and within that period, it impacted thousands of websites and applications.

2. Performance: Organizations leave themselves open to performance problems if they neglect to thoroughly understand the process behind every API call. API monitoring also helps identify which APIs are performing better and how to improve the APIs with weaker performance.

3. Speed/Responsiveness: Users can specify the critical API calls in the performance monitoring tool and set their threshold (acceptable response time) to ensure they get alerted if the response time deteriorates.

4. Availability: With the help of monitoring, we can verify whether all the services hosted by our applications are accessible 24×7.

Why Monitor an API When We Can Test It?

An API test can be highly complex, considering the large number of steps involved. This makes it difficult to run such tests as frequently as needed. This is where monitoring steps in, allowing regular (for example, hourly) checks of the indispensable aspects and helping us focus on what's most vital to our organization.

How To Monitor API Performance

- Identify the APIs you depend on: Recognize which APIs you employ, whether they are third-party or partner APIs, and whether they connect internally or externally.
- Comprehend the functional and transactional use cases to provide transparency into the services being hosted; this improves performance and MTTR (Mean Time to Repair).
- Determine whether you have the test cases required for monitoring: do existing test cases need to be altered, or do new ones urgently need to be developed?
- Know the right tool: API performance monitoring is highly dependent on the tool being used. You need an intuitive, user-friendly, result-optimizing tool with everything packed in. Some commonly known platforms for API performance testing are CA Technologies (now Broadcom Inc.), AlertSite, Rigor, and Runscope.

One more factor to keep note of is API browser compatibility, i.e., how well your API works across different browsers. To know more about this topic, follow our blog about "API and Browser Compatibility."

Conclusion

API performance monitoring is a need of modern times that gives you a check on the internal as well as the external impact of the services hosted by a product. Not everyone bothers to care about APIs, but we are glad you did! We hope this article helps expand your understanding of the topic. Cheers!
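Returning to the Speed/Responsiveness point above, here is a minimal, hedged Python sketch of the kind of threshold check a monitoring tool performs behind the scenes. The endpoint and threshold are placeholders, and a real monitor would run this on a schedule and send alerts rather than print.

```python
import time
import urllib.request

# Placeholder values for illustration only.
ENDPOINT = "https://api.example.com/health"
THRESHOLD_SECONDS = 0.5

def check_response_time(url: str) -> float:
    """Return the elapsed wall-clock time for a single GET request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = check_response_time(ENDPOINT)
    if elapsed > THRESHOLD_SECONDS:
        print(f"ALERT: {ENDPOINT} took {elapsed:.2f}s (threshold {THRESHOLD_SECONDS}s)")
    else:
        print(f"OK: {ENDPOINT} responded in {elapsed:.2f}s")
```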
Most engineers don't want to spend more time than necessary keeping their clusters highly available, secure, and cost-efficient. How do you make sure your Google Kubernetes Engine cluster is ready for the storms ahead? Here are fourteen optimization tactics divided into three core areas of your cluster. Use them to build a resource-efficient, highly available GKE cluster with airtight security. Here are the three core sections in this article:

- Resource Management
- Security
- Networking

Resource Management Tips for a GKE Cluster

1. Autoscaling

Use the autoscaling capabilities of Kubernetes to make sure your workloads perform well during peak load and to control costs in times of normal or low loads. Kubernetes gives you several autoscaling mechanisms. Here's a quick overview to get you up to speed:

- Horizontal pod autoscaler: HPA adds or removes pod replicas automatically based on utilization metrics. It works great for scaling stateless and stateful applications. Use it with Cluster Autoscaler to shrink the number of active nodes when the pod count decreases. HPA also comes in handy for handling workloads with short, high utilization spikes.
- Vertical pod autoscaler: VPA increases and lowers the CPU and memory resource requests of pod containers to make sure the allocated and actual cluster usage match. If your HPA configuration doesn't use CPU or memory to identify scaling targets, it's best to use it with VPA.
- Cluster autoscaler: It dynamically scales the number of nodes to match the current GKE cluster utilization. It works great with workloads designed to meet dynamically changing demand.

Best Practices for Autoscaling in a GKE Cluster

- Use HPA, VPA, and Node Auto Provisioning (NAP): By using HPA, VPA, and NAP together, you let GKE efficiently scale your cluster horizontally (pods) and vertically (nodes). VPA sets values for CPU and memory requests and limits for containers, while NAP manages node pools and eliminates the default limitation of starting new nodes only from the set of user-created node pools.
- Check if your HPA and VPA policies clash: Make sure the VPA and HPA policies don't interfere with each other. For example, if HPA relies only on CPU and memory metrics, HPA and VPA cannot work together. Also, review your bin packing density settings when designing a new GKE cluster for a business- or purpose-class tier of service.
- Use instance weighted scores: This allows you to determine how much of your chosen resource pool will be dedicated to a specific workload and ensure that your machine is best suited for the job.
- Slash costs with a mixed-instance strategy: Using mixed instances helps achieve high availability and performance at a reasonable cost. It's basically about choosing from various instance types, some of which may be cheaper and good enough for lower-throughput or low-latency workloads. Or you could run a smaller number of machines with higher specs. This can bring costs down because each node requires Kubernetes to be installed on it, which always adds a little overhead.

2. Choose the Topology for Your GKE Cluster

You can choose from two types of clusters:

- Regional topology: In a regional Kubernetes cluster, Google replicates the control plane and nodes across multiple zones in a single region.
- Zonal topology: In a zonal cluster, they both run in a single compute zone specified upon cluster creation.

If your application depends on the availability of the cluster API, pick a regional cluster topology, which offers higher availability for the cluster's control plane API.
Since it's the control plane that does jobs like scaling, replacing, and scheduling pods, if it becomes unavailable, you're in for reliability trouble. On the other hand, regional clusters have nodes spread across multiple zones, which may increase your cross-zone network traffic and, thus, costs.

3. Bin Pack Nodes for Maximum Utilization

This is a smart approach to GKE cost optimization shared by the engineering team at Delivery Hero. To maximize node utilization, it's best to add pods to nodes in a compacted way. This opens the door to reducing costs without any impact on performance. This strategy is called bin packing, and it goes against the Kubernetes default behavior, which favors an even distribution of pods across nodes (source: Delivery Hero). A simple sketch of the packing idea appears at the end of this resource-management section.

The team at Delivery Hero used GKE Autopilot, but its limitations made the engineers build bin packing on their own. To achieve the highest node utilization, the team defines one or more node pools in a way that allows nodes to include pods in the most compacted way (but leaving some buffer for the shared CPU). By merging node pools and performing bin packing, pods fit into nodes more efficiently, helping that team at Delivery Hero decrease its total number of nodes by ~60%.

4. Implement Cost Monitoring

Cost monitoring is a big part of resource management because it lets you keep an eye on your expenses and act instantly on cost spike alerts. To understand your Google Kubernetes Engine costs better, implement a monitoring solution that gathers data about your cluster's workload, total cost, costs divided by labels or namespaces, and overall performance. GKE usage metering enables you to monitor resource usage, map workloads, and estimate resource consumption. Enable it to quickly identify the most resource-intensive workloads or spikes in resource consumption. This step is the bare minimum you can do for cost monitoring. Tracking these three metrics is what really makes a difference in how you manage your cloud resources: daily cloud spend, cost per provisioned and requested CPU, and historical cost allocation.

5. Use Spot VMs

Spot VMs are an incredible cost-saving opportunity—you can get a discount of up to 91% off the pay-as-you-go pricing. The catch is that Google may reclaim the machine at any time, so you need to have a strategy in place to handle the interruption. That's why many teams use spot VMs for workloads that are fault- and interruption-tolerant, like batch processing jobs, distributed databases, CI/CD operations, or microservices.

Best Practices for Running Your GKE Cluster on Spot VMs

- How to choose the right spot VM: Pick a slightly less popular spot VM type—it's less likely to get interrupted. You can also check its frequency of interruption (the rate at which this instance reclaimed capacity within the trailing month).
- Set up spot VM groups: This increases your chances of snatching the machines you want. Managed instance groups can request multiple machine types at the same time, adding new spot VMs when extra resources become available.
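To make the bin-packing idea from tip 3 a bit more concrete, here is a small, hedged Python sketch of first-fit-decreasing packing. It only illustrates the scheduling concept (on GKE the real work is done through node pool sizing, pod resource requests, and scheduler configuration), and the pod requests and node capacity below are made-up numbers.

```python
# Toy illustration of bin packing pods onto nodes (first-fit decreasing).
# Capacities and requests are in millicores and purely illustrative.

NODE_CPU_M = 4000  # hypothetical allocatable CPU per node, minus a small buffer
POD_REQUESTS_M = [1500, 1200, 900, 800, 700, 600, 500, 400, 300, 250]

def first_fit_decreasing(requests: list[int], capacity: int) -> list[list[int]]:
    """Pack each pod into the first node with room, placing the largest pods first."""
    nodes: list[list[int]] = []
    for req in sorted(requests, reverse=True):
        for node in nodes:
            if sum(node) + req <= capacity:
                node.append(req)
                break
        else:
            nodes.append([req])  # no existing node fits; "provision" a new one
    return nodes

if __name__ == "__main__":
    packed = first_fit_decreasing(POD_REQUESTS_M, NODE_CPU_M)
    for i, node in enumerate(packed):
        print(f"node {i}: pods {node} -> {sum(node)}m / {NODE_CPU_M}m")
    print(f"{len(packed)} node(s) needed for {len(POD_REQUESTS_M)} pods")
```

The point of the sketch is simply that packing larger workloads first and filling nodes before opening new ones tends to need fewer nodes than spreading pods evenly, which is where the cost savings come from.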
Security Best Practices for GKE Clusters

The Red Hat 2022 State of Kubernetes and Container Security report found that almost 70% of incidents happen due to misconfigurations. GKE secures your Kubernetes cluster in many layers, including the container image, its runtime, the cluster network, and access to the cluster API server. Google generally recommends implementing a layered approach to GKE cluster security. The most important security aspects to focus on are:

- Authentication and authorization
- Control plane
- Node
- Network security

1. Follow CIS Benchmarks

All of the key security areas are part of the Center for Internet Security (CIS) Benchmarks, a globally recognized collection of best practices that gives you a helping hand in structuring security efforts. When you use a managed service like GKE, you don't have control over all the CIS Benchmark items. But some things are definitely within your control, like auditing, upgrading, and securing the cluster nodes and workloads. You can either go through the CIS Benchmarks manually or use a tool that does the benchmarking job for you. We recently introduced a container security module that scans your GKE cluster to check for any benchmark discrepancies and prioritizes issues to help you take action.

2. Implement RBAC

Role-Based Access Control (RBAC) is an essential component for managing access to your GKE cluster. It lets you establish more granular access to Kubernetes resources at the cluster and namespace levels and develop detailed permission policies. CIS GKE Benchmark 6.8.4 emphasizes that teams should give preference to RBAC over the legacy Attribute-Based Access Control (ABAC). Another CIS GKE Benchmark (6.8.3) suggests using groups for managing users. This makes controlling identities and permissions simpler and means you don't need to update the RBAC configuration whenever you add or remove users from a group.

3. Follow the Principle of Least Privilege

Make sure to grant user accounts only the privileges that are essential for them to do their jobs. Nothing more than that. CIS GKE Benchmark 6.2.1 states: prefer not running GKE clusters using the Compute Engine default service account. By default, nodes get access to the Compute Engine service account. This comes in handy for multiple applications but opens the door to more permissions than necessary to run your GKE cluster. Create and use a minimally privileged service account instead of the default—and follow the same principle everywhere else.

4. Boost Your Control Plane's Security

Google implements the Shared Responsibility Model to manage the GKE control plane components. Still, you're the one responsible for securing nodes, containers, and pods. The Kubernetes API server uses a public IP address by default. You can secure it with the help of authorized networks and private Kubernetes clusters that let you assign a private IP address. Another way to improve your control plane's security is to perform regular credential rotation. The TLS certificates and cluster certificate authority get rotated automatically when you initiate the process.

5. Protect Node Metadata

CIS GKE Benchmarks 6.4.1 and 6.4.2 point out two critical factors that may compromise your node security—and they fall on your plate. The v0.1 and v1beta1 Compute Engine metadata server endpoints were deprecated in 2020 because they didn't enforce metadata query headers. Some attacks against Kubernetes clusters rely on access to the metadata server of virtual machines, with the goal of extracting credentials. You can fight such attacks with workload identity or metadata concealment.

6. Upgrade GKE Regularly

Kubernetes often releases new security features and patches, so keeping your deployment up to date is a simple but powerful approach to improving your security posture. The good news about GKE is that it patches and upgrades the control plane automatically.
The node auto-upgrade feature also upgrades cluster nodes, and CIS GKE Benchmark 6.5.3 recommends that you keep this setting on. If you want to disable auto-upgrade for any reason, Google suggests performing upgrades on a monthly basis and following the GKE security bulletins for critical patches.

Networking Optimization Tips for Your GKE Cluster

1. Avoid Overlaps With IP Addresses From Other Environments

When designing a larger Kubernetes cluster, keep in mind to avoid overlaps with IP addresses used in your other environments. Such overlaps might cause issues with routing if you need to connect the cluster's VPC network to on-premises environments or other cloud service provider networks via Cloud VPN or Cloud Interconnect.

2. Use GKE Dataplane V2 and Network Policies

If you want to control traffic flow at OSI layer 3 or 4 (the IP address or port level), you should consider using network policies. Network policies allow you to specify how a pod can communicate with other network entities (pods, services, certain subnets, etc.). To bring your cluster networking to the next level, GKE Dataplane V2 is the right choice. It's based on eBPF and provides an extended, integrated network security and visibility experience. Adding to that, if the cluster uses GKE Dataplane V2, you don't need to enable network policies explicitly, as Dataplane V2 manages service routing, network policy enforcement, and logging.

3. Use Cloud DNS for GKE

Pod and Service DNS resolution can be done without the additional overhead of managing a cluster-hosted DNS provider. Cloud DNS for GKE requires no additional monitoring, scaling, or other management activities, as it's a fully hosted Google service.

Conclusion

In this article, you have learned how to optimize your GKE cluster with fourteen tactics across resource management, security, and networking for high availability and optimal cost. Hopefully, you have taken away some helpful information that will help you in your career as a developer.
Kubernetes is an open-source container orchestration platform that helps manage and deploy applications in a cloud environment. It is used to automate the deployment, scaling, and management of containerized applications. Kubernetes probes are an efficient way to manage application health on this platform. This article will discuss Kubernetes probes, the different types available, and how to implement them in your Kubernetes environment.

What Are Kubernetes Probes?

Kubernetes probes are health checks that are used to monitor the health of applications and services in a Kubernetes cluster. They are used to detect any potential problems with applications or services and identify potential resource bottlenecks. Probes are configured to run at regular intervals and send a signal to the Kubernetes control plane if they detect any issues with the application or service. Kubernetes probes are typically implemented using the Kubernetes API, which allows them to query the application or service for information. This information can then be used to determine the application's or service's health. Kubernetes probes can also be used to detect changes in the application or service and send a notification to the Kubernetes control plane, which can then take corrective action.

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can be used to detect potential problems before they become serious, allowing you to take corrective action quickly.

A successful readiness probe indicates the container is ready to receive traffic. If a readiness probe succeeds, the container is considered ready and can begin receiving requests from other containers, services, or external clients. A successful liveness probe indicates the container is still running and functioning properly. If a liveness probe succeeds, the container is considered alive and healthy; if it fails, the container is considered to be in a failed state, and Kubernetes will attempt to restart the container to restore its functionality.

Both readiness and liveness probes report success when an HTTP check returns a response code in the 200-399 range or when a TCP socket connection succeeds. If the probe fails (an HTTP response code outside that range, or a failed TCP connection), the container is considered not ready or not alive, depending on the probe type.
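To make the HTTP success criterion concrete, here is a hedged sketch of the kind of health endpoint an HTTP probe might call. The tutorial later in this article uses a Node.js app; this standalone Python version is only an illustration, and /healthz is a conventional (not mandatory) path.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Any status in the 200-399 range counts as a probe success.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            # Anything outside 200-399 (for example, 500) makes the probe fail.
            self.send_response(500)
            self.end_headers()

if __name__ == "__main__":
    # A liveness or readiness probe would then be pointed at http://<pod-ip>:3000/healthz
    HTTPServer(("", 3000), HealthHandler).serve_forever()
```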
Types of Kubernetes Probes

There are three types of probes:

- Startup probes
- Readiness probes
- Liveness probes

1. Startup Probes

A startup probe is used to determine if a container has started successfully. This type of probe is typically used for applications that take longer to start up, or for containers that perform initialization tasks before they become ready to receive traffic. The startup probe is run only once, after the container has been created, and it delays the start of the readiness and liveness probes until it succeeds. If the startup probe fails, the container is considered to have failed to start, and Kubernetes will attempt to restart the container.

2. Readiness Probes

A readiness probe is used to determine if a container is ready to receive traffic. This type of probe is used to ensure a container is fully up and running and can accept incoming connections before it is added to the service load balancer. A readiness probe can be used to check the availability of an application's dependencies or perform any other check that indicates the container is ready to serve traffic. If the readiness probe fails, the container is removed from the service load balancer until the probe succeeds again.

3. Liveness Probes

A liveness probe is used to determine if a container is still running and functioning properly. This type of probe is used to detect and recover from container crashes or hang-ups. A liveness probe can be used to check the responsiveness of an application or perform any other check that indicates the container is still alive and healthy. If the liveness probe fails, Kubernetes will attempt to restart the container to restore its functionality.

Each type of probe has its own configuration options, such as the endpoint to check, the probe interval, and the success and failure thresholds. By using these probes, Kubernetes can ensure containers are running and healthy and can take appropriate action if a container fails to respond.

How To Implement Kubernetes Probes

Kubernetes probes can be implemented in a few different ways:

- The first way is to use the Kubernetes API to query the application or service for information. This information can then be used to determine the application's or service's health.
- The second way is to use the HTTP protocol to send a request to the application or service. This request can be used to detect if an application or service is responsive, or if it is taking too long to respond.
- The third way is to use custom probes to detect specific conditions in an application or service. Custom probes can be used to detect things such as resource usage, slow responses, or changes in the application or service.

Once you have decided which type of probe you will be using, you can configure the probe using the Kubernetes API. You can specify the frequency of the probe, the type of probe, and the parameters of the probe. Once the probe is configured, you can deploy it to the Kubernetes cluster.

Today, I'll show how to configure health checks for your application deployed on Kubernetes, using the HTTP protocol to check whether the application is ready, live, and starting as per our requirements.

Prerequisites

- A Kubernetes cluster from any cloud provider. You can even use Minikube or Kind to create a single-node cluster.
- Docker Desktop to containerize the application.
- Docker Hub to push the container image to the Docker registry.
- Node.js installed, as we will use a sample Node.js application.

Tutorial

Fork the sample application here. Get into the main application folder with the command:

```shell
cd Kubernetes-Probes-Tutorial
```

Install the dependencies with the command:

```shell
npm install
```

Run the application locally using the command:

```shell
node app.js
```

You should see the application running on port 3000. In the application folder, you should see the Dockerfile with the following content:

```dockerfile
# Use an existing node image as base image
FROM node:14-alpine

# Set the working directory in the container
WORKDIR /app

# Copy package.json and package-lock.json to the container
COPY package*.json ./

# Install required packages
RUN npm install

# Copy all files to the container
COPY . .

# Expose port 3000
EXPOSE 3000

# Start the application
CMD [ "npm", "start" ]
```

This Dockerfile is used to create a container image of our application and push it to Docker Hub.
Next, build and push your image to Docker Hub using the following command:

```shell
docker buildx build --platform=linux/arm64 --platform=linux/amd64 -t docker.io/<Docker Hub username>/<image name>:<tag> --push -f ./Dockerfile .
```

You can see the pushed image on your Docker Hub account under repositories.

Next, deploy the manifest files. In the application folder, you will notice a deployment.yaml file with health checks/probes included, such as readiness and liveness probes. Note: we have used our pushed image name in the YAML file:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notes-app-deployment
  labels:
    app: note-sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: note-sample-app
  template:
    metadata:
      labels:
        app: note-sample-app
    spec:
      containers:
        - name: note-sample-app-container
          image: pavansa/note-sample-app
          resources:
            requests:
              cpu: "100m"
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /
              port: 3000
          livenessProbe:
            httpGet:
              path: /
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
```

You can see the image used and the health checks configured in the above YAML file. We are all set with our YAML file. Assuming you have a running cluster ready, let's deploy the above-mentioned manifest file with the command:

```shell
kubectl apply -f deployment.yaml
```

You should see the successful deployment of the file: "deployment.apps/notes-app-deployment created." Let's check the pod status with the following command to make sure the pods are running:

```shell
kubectl get pods
```

Let's describe a pod using the following command:

```shell
kubectl describe pod notes-app-deployment-7fb6f5d74b-hw5fn
```

You can see the Liveness and Readiness status when you describe the pods. Next, let's check the events section. You can see the different events, such as "scheduled," "pulled," "created," and "started." All the pod events were successful.

Conclusion

Kubernetes probes are an important part of the Kubernetes platform, as they help ensure applications and services run smoothly. They can be used to detect potential problems before they become serious, allowing you to take corrective action quickly. Kubernetes probes come in three types: startup probes, readiness probes, and liveness probes, along with custom checks that can be used to detect specific conditions in an application or service. Implementing Kubernetes probes is a straightforward process that can be done using the Kubernetes API. If you are looking for a way to ensure the health of your applications and services, Kubernetes probes are the way to go. So, make sure to implement Kubernetes probes in your Kubernetes environment today!
If you use Windows, you will want to monitor Windows Events. A recently contributed distribution of the OpenTelemetry (OTel) Collector makes it much easier to monitor Windows Events with OTel. You can utilize this receiver in conjunction with any OTel collector, including the upstream OpenTelemetry Collector. In this article, we will be using observIQ's distribution of the collector. Below are steps to get up and running quickly with the distribution. We will be shipping Windows Event logs to a popular backend: Google Cloud Ops. You can find out more on the GitHub page here.

What Signals Matter?

Windows Event logs record many different operating system processes, application activity, and account activity. Some relevant log types you will want to monitor include:

- Application status: This contains information about applications installed or running on the system. If an application crashes, these logs may contain an explanation for the crash.
- Security logs: These logs contain information about the system's audit and authentication processes. For example, if a user attempts to log into the system or use administrator privileges.
- System logs: These logs contain information about Windows-specific processes, such as driver activity.

All of the above categories can be gathered with the Windows Events receiver, so let's get started.

Before You Begin

If you don't already have an OpenTelemetry collector built with the latest Windows Events receiver installed, you'll need to do that first. The distribution of the OpenTelemetry Collector we're using today includes the Windows Events receiver (and many others) and can be installed with the one-line installer here.

For Linux

To install using the installation script, run:

```shell
sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_unix.sh)" install_unix.sh
```

To install directly with the appropriate package manager, head here.

For Windows

To install the collector on Windows, run the PowerShell command below to install the MSI with no UI:

```powershell
msiexec /i "https://github.com/observIQ/observiq-otel-collector/releases/latest/download/observiq-otel-collector.msi" /quiet
```

Alternatively, for an interactive installation, download the latest MSI. After downloading the MSI, double-click the downloaded file to open the installation wizard and follow the instructions to configure and install the collector. For more installation information, see installing on Windows.

For macOS

To install using the installation script, run:

```shell
sudo sh -c "$(curl -fsSlL https://github.com/observiq/observiq-otel-collector/releases/latest/download/install_macos.sh)" install_macos.sh
```

For more installation guidance, see installing on macOS.

For Kubernetes

To deploy the collector on Kubernetes, further documentation can be found at our observiq-otel-collector-k8s repository.

Configuring the Windows Events Receiver

Now that the distribution is installed, let's navigate to your OpenTelemetry configuration file.
If you're using the observIQ Collector on Windows, you'll find it at the following location: C:\Program Files\observIQ OpenTelemetry Collector\config.yaml

Edit the configuration file to include the Windows Events receiver as shown below:

```yaml
receivers:
  windowseventlog:
    channel: application
```

Each event collected by the receiver is parsed into a structured body with fields like the following:

```json
{
  "channel": "Application",
  "computer": "computer name",
  "event_id": {
    "id": 10,
    "qualifiers": 0
  },
  "keywords": "[Classic]",
  "level": "Information",
  "message": "Test log",
  "opcode": "Info",
  "provider": {
    "event_source": "",
    "guid": "",
    "name": "otel"
  },
  "record_id": 12345,
  "system_time": "2022-04-15T15:28:08.898974100Z",
  "task": ""
}
```

Configuring the Log Fields

You can adjust the following fields in the configuration to control what types of logs you want to ship and how they are read:

- channel (required): The Windows event log channel to monitor.
- max_reads (default: 100): The maximum number of records read into memory before beginning a new batch.
- start_at (default: end): On first startup, where to start reading logs from the API. Options are beginning or end.
- poll_interval (default: 1s): The interval at which the channel is checked for new log entries. This check begins again after all new bodies have been read.
- attributes (default: {}): A map of key: value pairs to add to the entry's attributes.
- resource (default: {}): A map of key: value pairs to add to the entry's resource.
- operators (default: []): An array of operators. See below for more details.
- converter (default: {max_flush_count: 100, flush_interval: 100ms, worker_count: max(1, runtime.NumCPU()/4)}): A map of key: value pairs to configure the converter that turns log entries (entry.Entry) into OpenTelemetry log records (pdata.LogRecord).

Operators

Each operator performs a simple responsibility, such as parsing a timestamp or JSON. Chain operators together to process logs into the desired format:

- Every operator has a type.
- Every operator can be given a unique id. If you use the same type of operator more than once in a pipeline, you must specify an id. Otherwise, the id defaults to the value of type.
- Operators output to the next operator in the pipeline. The last operator in the pipeline emits from the receiver. Optionally, the output parameter can be used to specify the id of another operator to which logs will be passed directly.
- Only parsers and general-purpose operators should be used.

Conclusion

As you can see, this distribution makes it much simpler to work with the OpenTelemetry Collector—with a single-line installer, integrated receivers, exporters, and a processor pool—and will help you implement OpenTelemetry standards wherever they are needed in your systems.
Apache Flink monitoring support is now available in the open-source OpenTelemetry collector. You can check out the OpenTelemetry repo here! You can utilize this receiver in conjunction with any OTel collector, including the OpenTelemetry Collector and other distributions of the collector. Today we'll use observIQ’s OpenTelemetry distribution and ship Apache Flink telemetry to a popular backend: Google Cloud Ops. You can find out more on the GitHub page: https://github.com/observIQ/observiq-otel-collector

What Signals Matter?

Apache Flink is an open-source, unified batch-processing and stream-processing framework. The Apache Flink receiver records 29 unique metrics, so there is a lot of data to pay attention to. Some specific metrics that users find valuable are:

- Uptime and restarts: Two different metrics that record the duration a job has run uninterrupted and the number of full restarts a job has committed, respectively.
- Checkpoints: A number of checkpoint metrics can tell you the number of active checkpoints, the number of completed and failed checkpoints, and the duration of ongoing and past checkpoints.
- Memory usage: Memory-related metrics are often relevant to monitor. The Apache Flink receiver ships metrics that can tell you about total memory usage, both present and over time, minimums and maximums, and how memory is divided between different processes.

All of the above categories can be gathered with the Apache Flink receiver, so let’s get started.

Before You Begin

If you don’t already have an OpenTelemetry collector built with the latest Apache Flink receiver installed, you’ll need to do that first. The collector distribution we're using includes the Apache Flink receiver (and many others) and is simple to install with a one-line installer.

Configuring the Apache Flink Receiver

Navigate to your OpenTelemetry configuration file. If you’re following along, you’ll find it in the following location:

/opt/observiq-otel-collector/config.yaml (Linux)

For the Collector, edit the configuration file to include the Apache Flink receiver as shown below:

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  # resourcedetection is used to add a unique host.name
  # to the metric resource(s); nop is a placeholder here
  nop:

exporters:
  # Add the exporter for your preferred destination(s)
  nop:

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [nop]
      exporters: [nop]
```

If you’re using the Google Ops Agent instead, you can find the relevant config file here.

Viewing the Metrics Collected

If you followed the steps detailed above, the following Apache Flink metrics will now be delivered to your preferred destination.

| Metric | Description |
|---|---|
| flink.jvm.cpu.load | The CPU usage of the JVM for a jobmanager or taskmanager. |
| flink.jvm.cpu.time | The CPU time used by the JVM for a jobmanager or taskmanager. |
| flink.jvm.memory.heap.used | The amount of heap memory currently used. |
| flink.jvm.memory.heap.committed | The amount of heap memory guaranteed to be available to the JVM. |
| flink.jvm.memory.heap.max | The maximum amount of heap memory that can be used for memory management. |
| flink.jvm.memory.nonheap.used | The amount of non-heap memory currently used. |
| flink.jvm.memory.nonheap.committed | The amount of non-heap memory guaranteed to be available to the JVM. |
| flink.jvm.memory.nonheap.max | The maximum amount of non-heap memory that can be used for memory management. |
| flink.jvm.memory.metaspace.used | The amount of memory currently used in the Metaspace memory pool. |
| flink.jvm.memory.metaspace.committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool. |
| flink.jvm.memory.metaspace.max | The maximum amount of memory that can be used in the Metaspace memory pool. |
| flink.jvm.memory.direct.used | The amount of memory used by the JVM for the direct buffer pool. |
| flink.jvm.memory.direct.total_capacity | The total capacity of all buffers in the direct buffer pool. |
| flink.jvm.memory.mapped.used | The amount of memory used by the JVM for the mapped buffer pool. |
| flink.jvm.memory.mapped.total_capacity | The number of buffers in the mapped buffer pool. |
| flink.memory.managed.used | The amount of managed memory currently used. |
| flink.memory.managed.total | The total amount of managed memory. |
| flink.jvm.threads.count | The total number of live threads. |
| flink.jvm.gc.collections.count | The total number of collections that have occurred. |
| flink.jvm.gc.collections.time | The total time spent performing garbage collection. |
| flink.jvm.class_loader.classes_loaded | The total number of classes loaded since the start of the JVM. |
| flink.job.restart.count | The total number of restarts since this job was submitted, including full restarts and fine-grained restarts. |
| flink.job.last_checkpoint.time | The end-to-end duration of the last checkpoint. |
| flink.job.last_checkpoint.size | The total size of the last checkpoint. |
| flink.job.checkpoint.count | The number of checkpoints completed or failed. |
| flink.job.checkpoint.in_progress | The number of checkpoints in progress. |
| flink.task.record.count | The number of records a task has. |
| flink.operator.record.count | The number of records an operator has. |
| flink.operator.watermark.output | The last watermark this operator has emitted. |

This OpenTelemetry collector can help companies looking to implement OpenTelemetry standards.
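If Google Cloud Ops is the destination, as in this walkthrough, the nop placeholders shown earlier can be swapped for concrete components. The resourcedetection processor and googlecloud exporter below are illustrative assumptions; use whatever processors and exporter your environment and backend require.

```yaml
receivers:
  flinkmetrics:
    endpoint: http://localhost:8081
    collection_interval: 10s

processors:
  # Adds resource attributes such as host.name detected from the system;
  # shown as an example of the kind of processor you might use here.
  resourcedetection:
    detectors: [system]

exporters:
  # Assumes Google Cloud credentials are available on the host.
  googlecloud:

service:
  pipelines:
    metrics:
      receivers: [flinkmetrics]
      processors: [resourcedetection]
      exporters: [googlecloud]
```

Once the collector restarts with this configuration, the metrics listed in the table above should begin to appear in your backend.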
The concept of observability involves understanding a system’s internal states through the examination of logs, metrics, and traces. This approach provides a comprehensive system view, allowing for thorough investigation and analysis. While incorporating observability into a system may seem daunting, the benefits are significant. One well-known example is PhonePe, which experienced 2000% growth in its data infrastructure and a 65% reduction in data management costs after implementing a data observability solution. This helped mitigate performance issues and minimize downtime. The impact of Observability-Driven Development (ODD) is not limited to PhonePe. Numerous organizations have experienced the benefits of ODD, with a 2.1 times higher likelihood of issue detection and a 69% improvement in mean time to resolution.

What Is ODD?

Observability-Driven Development (ODD) is an approach that shifts observability left, to the earliest stage of the software development life cycle. It uses trace-based testing as a core part of the development process. In ODD, developers write code while declaring the desired outputs and the specifications needed to view the system’s internal state and processes. It applies both at the component level and to the system as a whole. ODD also serves to standardize instrumentation across programming languages, frameworks, SDKs, and APIs.

What Is TDD?

Test-Driven Development (TDD) is a widely adopted software development methodology that emphasizes writing automated tests prior to writing code. The TDD process involves defining the desired behavior of software through the creation of a test case, running the test to confirm its failure, writing the minimum necessary code to make the test pass, and refining the code through refactoring. This cycle is repeated for each new feature or requirement, and the resulting tests serve as a safeguard against future regressions. The philosophy behind TDD is that writing tests compels developers to consider the problem at hand and produce focused, well-structured code. Adherence to TDD improves software quality and requirement compliance and facilitates the early detection and correction of bugs. TDD is recognized as an effective method for enhancing the quality, reliability, and maintainability of software systems.

Comparison of Observability-Driven and Test-Driven Development

Similarities

Observability-Driven Development (ODD) and Test-Driven Development (TDD) both strive to enhance the quality and reliability of software systems. Both methodologies aim to ensure that software operates as intended, minimizing downtime and user-facing issues while promoting a commitment to continuous improvement and monitoring.

Differences

- Focus: The focus of ODD is to continuously monitor the behavior of software systems and their components in real time to identify potential issues and understand system behavior under different conditions. TDD, on the other hand, prioritizes detecting and correcting bugs before they cause harm to the system or users and verifies that software functionality meets requirements.
- Time and resource allocation: Implementing ODD requires a substantial investment of time and resources for setting up monitoring and logging tools and infrastructure. TDD, in contrast, demands a significant investment of time and resources during the development phase for writing and executing tests.
- Impact on software quality: ODD can significantly impact software quality by providing real-time visibility into system behavior, enabling teams to detect and resolve issues before they escalate. TDD also has the potential to significantly impact software quality by detecting and fixing bugs before they reach production. However, if tests are not comprehensive, bugs may still evade detection, potentially affecting software quality.

Moving From TDD to ODD in Production

Moving from a Test-Driven Development (TDD) methodology to an Observability-Driven Development (ODD) approach is a significant change. For several years, TDD has been the established method for testing software before its release to production. While TDD provides consistency and accuracy through repeated tests, it cannot provide insight into the performance of the entire application or the customer experience in a real-world scenario. The tests conducted through TDD are isolated and do not guarantee the absence of errors in the live application. Furthermore, TDD relies on a consistent environment for conducting automated tests, which is not representative of real-world scenarios.

Observability, on the other hand, is an evolution of TDD that offers full-stack visibility into the infrastructure, application, and production environment. It identifies the root cause of issues affecting the user experience and product releases through telemetry data such as logs, traces, and metrics. This continuous monitoring and tracking helps predict how end users will perceive the application. Additionally, with observability it is possible to write and ship better code before it reaches source control, because observability is part of the team's set of tools, processes, and culture.

Best Practices for Implementing ODD

Here are some best practices for implementing Observability-Driven Development (ODD):

- Prioritize observability from the outset: Start incorporating observability considerations in the development process right from the beginning. This will help you identify potential issues early and make necessary changes in real time.
- Embrace an end-to-end approach: Ensure observability covers all aspects of the system, including the infrastructure, application, and end-user experience.
- Monitor and log everything: Gather data from all sources, including logs, traces, and metrics, to get a complete picture of the system’s behavior.
- Use automated tools: Utilize automated observability tools to monitor the system in real time and alert you to any anomalies.
- Collaborate with other teams: Work with teams such as DevOps, QA, and production to ensure observability is integrated into the development process.
- Continuously monitor and improve: Regularly monitor the system, analyze data, and make improvements as needed to ensure optimal performance.
- Embrace a culture of continuous improvement: Encourage the development team to embrace a culture of continuous improvement and to continuously monitor and improve the system.

Conclusion

Both Observability-Driven Development (ODD) and Test-Driven Development (TDD) play an important role in ensuring the quality and reliability of software systems. TDD focuses on detecting and fixing bugs before they can harm the system or its users, while ODD focuses on monitoring the behavior of the software system in real time to identify potential problems and understand its behavior in different scenarios. Did I miss anything important?
Let me know in the comments section below.
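As a concrete illustration of the trace-based testing idea behind ODD, here is a sketch of what a declarative trace test might look like. The file format and every field name in it are hypothetical and shown only to make the concept tangible; they are not the syntax of any specific tool.

```yaml
# Hypothetical trace-test specification; illustrative only, not a real tool's format.
trace_test:
  name: checkout-emits-payment-span
  trigger:
    http:
      method: POST
      url: http://localhost:8080/checkout   # assumed local service endpoint
  assertions:
    # The request should produce a successful, fast span from the payment service.
    - span: "payment-service: charge"
      expect:
        status: OK
        max_duration_ms: 250
    # Every span in the trace should carry the attributes the team has declared
    # as part of its instrumentation standard.
    - all_spans:
        required_attributes: [service.name, deployment.environment]
```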
It's midnight in the dim and cluttered office of The New York Times, currently serving as the "situation room." A powerful surge of traffic is inevitable. During every major election, the wave would crest and crash against our overwhelmed systems before receding, allowing us to assess the damage. We had been in the cloud for years, which helped some. Our main systems would scale, and our articles were always served, but integration points across backend services would eventually buckle and burst under the sustained pressure of insane traffic levels. However, this night in 2020 differed from similar election nights in 2014, 2016, and 2018. That's because this traffic surge was simulated, and an election wasn't happening.

Pushing to the Point of Failure

Simulation or not, this was prod, so the stakes were high. There was suppressed horror as J-Kidd, our system that brought ad targeting parameters to the front end, went down hard. It was as if all the ligaments had been ripped from the knees of the pass-first point guard for which it had been named. Ouch. I'm sorry, Jason; it was for the greater good. J-Kidd wasn't the only system that found its way to the disabled list. That was the point of the whole exercise: to push our systems until they failed. We succeeded. Or failed, depending on your point of view. The next day the team made adjustments. We decoupled systems, implemented failsafes, and returned to the court for game 2. As a result, the 2020 election was the first I can remember where the on-call engineers weren't on the edge of their seats, white-knuckling their keyboards... at least not for system reliability reasons.

Pre-Mortems and Chaos Engineering

We referred to that exercise as a "pre-mortem." Its conceptual roots can be traced back to the idea of chaos engineering introduced by site reliability engineers. For those unfamiliar, chaos engineering is a disciplined methodology for intentionally introducing points of failure within systems to better understand their thresholds and improve resilience. It was largely popularized by the success of Netflix's Simian Army, a suite of programs that would automatically introduce chaos by removing servers and regions and introducing other points of failure into production, all in the name of reliability and resiliency. While this idea isn't completely foreign to data engineering, it can certainly be described as an extremely uncommon practice. No data engineer in their right mind has looked at their to-do list, the unfilled roles on their team, and the complexity of their pipelines, and then said: "This needs to be harder. Let's introduce some chaos." That may be part of the problem. Data teams need to think beyond providing snapshots of data quality to the business and start thinking about how to build and maintain reliable data systems at scale. We cannot afford to overlook data quality management, as it plays an increasingly large role in critical operations. For example, just this year we witnessed how deleting one file and an out-of-sync legacy database could ground more than 4,000 flights. Of course, you can't just copy and paste software engineering concepts straight into data engineering playbooks. Data is different. DataOps adapts DevOps methodology, just as data observability adapts observability. Consider this manifesto a proposal for taking the proven concepts of chaos engineering and applying them to the eccentric world of data reliability.
The 5 Laws of Data Chaos Engineering

The principles and lessons of chaos engineering are a good place to start defining the contours of a data chaos engineering discipline. Our first law combines two of the most important.

1. Have a Bias for Production, But Minimize the Blast Radius

There is a maxim among site reliability engineers that will ring true for every data engineer who has had the pleasure of the same SQL query returning two different results across staging and production environments: "Nothing acts like prod except for prod." To that, I would add, "production data too." Data is just too creative and fluid for humans to anticipate. Synthetic data has come a long way, and don't get me wrong, it can be a piece of the puzzle, but it's unlikely to simulate key edge cases. Like me, the mere thought of introducing points of failure into production systems probably makes your stomach churn. It's terrifying. Some data engineers justifiably wonder, "Is this even necessary within a modern data stack where so many tools abstract the underlying infrastructure?" I'm afraid so. Remember, as the opening anecdote and J-Kidd's snapped ligaments illustrated, the elasticity of the cloud is not a cure-all. In fact, it's that abstraction and opacity, along with the multiple integration points, that makes it so important to stress test a modern data stack. An on-premises database may be more limiting, but data teams tend to understand its thresholds because they hit them more regularly during day-to-day operations.

Let's move past the philosophical objections for the moment and dive into the practical. Data is different. Introducing fake data into a system won't be helpful because the input changes the output. It's going to get really messy, too. That's where the second part of the law comes into play: minimize the blast radius. There is a spectrum of chaos, and a range of tools that can be used:

- In words only: "Let's say this failed; what would we do?"
- Synthetic data in production.
- Techniques like data diff that allow you to test snippets of SQL code on production data.
- Solutions like LakeFS that allow you to do this on a bigger scale by creating "chaos branches," or complete snapshots of your production environment, where you can use production data with complete isolation.
- Do it in prod, and practice your backfilling skills. After all, nothing acts like prod but prod.

Starting with less chaotic scenarios is probably a good idea and will help you understand how to minimize the blast radius in production. Deep diving into real production incidents is also a great place to start. But does everyone really understand what exactly happened? Production incidents are chaos experiments that you've already paid for, so make sure you are getting the most out of them. Mitigating the blast radius may also include strategies like backing up applicable systems or having a data observability or data quality monitoring solution in place to assist with the detection and resolution of data incidents.

2. Understand It's Never a Perfect Time (Within Reason)

Another chaos engineering principle is to observe and understand "steady state behavior." There is wisdom in this principle, but it is also important to understand that the field of data engineering isn't quite ready to be measured by the standard of "five nines," or 99.999% uptime. Data systems are constantly in flux, and there is a wider range of "steady state behavior."
As a result, there will be a temptation to delay the introduction of chaos until you've reached the mythical point of "readiness." Unfortunately, you can't out-architect bad data; no one is ever ready for chaos. The Silicon Valley cliche of failing fast is applicable here. Or, to paraphrase Reid Hoffman, if you aren't embarrassed by the results of your first post-mortem, fire drill, or chaos-introducing event, you introduced it too late. Introducing fake data incidents while you are dealing with real ones may seem silly. Still, ultimately this can help you get ahead by better understanding where you have been putting bandaids on larger issues that may need to be refactored.

3. Formulate Hypotheses and Identify Variables at the System, Code, and Data Levels

Chaos engineering encourages forming hypotheses of how systems will react in order to understand what thresholds to monitor. It also encourages leveraging or mimicking past real-world incidents or likely incidents. We'll dive deeper into the details of this in the next article, but the important modification here is to ensure these span the system, code, and data levels. Variables at each level can create data incidents. Some quick examples:

- System: You didn't have the right permissions set in your data warehouse.
- Code: A bad LEFT JOIN.
- Data: A third party sent you garbage columns with a bunch of NULLs.

Simulating increased traffic levels and shutting down servers impact data systems, and those are important tests, but don't neglect some of the more unique and fun ways data systems can break badly.

4. Everyone in One Room (Or at Least on a Zoom Call)

This law is based on the experience of my colleague, site reliability engineer and chaos practitioner Tim Tischler. "Chaos engineering is just as much about people as it is about systems. They evolve together, and they can't be separated. Half of the value from these exercises comes from putting all the engineers in a room and asking, 'What happens if we do X or if we do Y?' You are guaranteed to get different answers. Once you simulate the event and see the result, everyone's mental maps are aligned. That is incredibly valuable," he said. Also, the interdependence of data systems and responsibilities creates blurry lines of ownership, even on the most well-run teams. As a result, breaks often happen, and are overlooked, in those overlaps and gaps in responsibility where the data engineer, analytics engineer, and data analyst point at each other. In many organizations, the product engineers creating the data and the data engineers managing it are separated and siloed by team structures. They also often have different tools and models of the same system and data. Feel free to pull these product engineers in as well, especially when the data has been generated from internally built systems. Good incident management and triage can often involve multiple teams, and having everyone in one room can make the exercise more productive. I'll also add from personal experience that these exercises can be fun (in the same weird way putting all your chips on red is fun). I'd encourage data teams to consider a data chaos engineering fire drill or pre-mortem event at the next offsite. It makes for a much more practical team bonding exercise than getting out of an escape room.

5. Hold Off on the Automation for Now

Truly mature chaos engineering programs like Netflix's Simian Army are automated and even unscheduled.
While this may create a more accurate simulation, the reality is that automated chaos tools don't currently exist for data engineering. Furthermore, if they did, I'm not sure I would be brave enough to use them. To this point, one of the original Netflix chaos engineers has described how they didn't always use automation, as the chaos could create more problems than they could fix (especially in collaboration with those running the system) in a reasonable period. Given where data engineering currently sits in its reliability evolution, and the greater potential for an unintentionally large blast radius, I would recommend data teams lean more toward scheduled, carefully managed events.

Practice as You Play

The important takeaway from the concept of chaos engineering is that practice and simulations are vital to performance and reliability. In my next article, I'll discuss specific things that can be broken at the system, code, and data levels and what teams may find out about those systems by pushing them to their limits.
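To show how the laws above might be written down in practice, here is a purely hypothetical sketch of a scheduled data chaos experiment definition. The format and every field name are invented for illustration; the point is simply that the hypothesis, the level (system, code, or data), and the blast radius are stated explicitly before anything is broken.

```yaml
# Hypothetical experiment definition; illustrative only, not a real tool's format.
experiment:
  name: warehouse-write-permission-revoked
  level: system                  # one of: system, code, data
  hypothesis: >
    If the loader role loses write access to the analytics schema, the
    pipeline alerts within 15 minutes and no downstream dashboard serves
    data more than one hour stale.
  blast_radius:
    environment: production
    isolation: chaos-branch      # e.g., an isolated snapshot of production data
  schedule: quarterly-fire-drill # scheduled and managed, not automated (law 5)
  rollback: restore the original grants and backfill any missed partitions
  participants: [data-engineering, analytics-engineering, sre]
```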
Joana Carvalho
Performance Engineer,
Postman
Greg Leffler
Observability Practitioner, Director,
Splunk
Ted Young
Director of Open Source Development,
LightStep
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere