Delphix deploys as a virtual appliance in VMware, AWS, or OpenStack. It uses storage assigned to it like any other VM, and relies on that storage for persistence and performance. Before deploying Delphix into a customer environment, we help the customer understand the performance characteristics of the storage provisioned to Delphix. This helps ensure a successful deployment and helps determine any course-correction needed in case the storage turns out to be inadequate for the load that is expected in the environment. As any DBA or Storage admin will tell you, it is a lot cheaper to address storage issues before deployment than after.
IO Report Card
To evaluate the performance of the storage and its ability to handle a Delphix workload, we built a tool called the “IO Report Card”. This tool is run before every Delphix deployment and has proven invaluable since its inception a little over two years ago. With our latest release, Delphix 4.2, the IO Report Card is now integrated into the product, and customers can run it in a self-service model using the Delphix CLI. Note that the tool can only be run at the beginning of a deployment, on newly provisioned storage, before any data is added to the Delphix Engine.
The tool generates a variety of synthetic workloads so that we can measure how the storage will perform under each scenario. To generate that load we use fio, a trusted open-source tool that is in active development and has a large user community.
The IO Report Card uses four synthetic tests that we purpose-built to mimic specific DB workloads.
- Small Block (8KB) Random Reads
  - Relevant workload for most OLTP-style applications
  - An important workload for understanding the I/O operations per second (IOPS) and latency characteristics of the storage
- Large Block (1MB) Sequential Reads
  - Representative workload for batch-processing-style applications
  - Helps understand the read bandwidth available from the storage
- Small Block (1KB) Sequential Writes
  - Representative of responsiveness to Write Ahead Logging activity
- Large Block (128KB) Sequential Writes
  - Typical workload for data migration activities
  - Helps understand peak write throughput
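As a rough illustration of how the four workloads above could be expressed as fio invocations, here is a minimal Python sketch. Only the block sizes and access patterns come from the list above. The workload names, target file path, file size, runtime, and job count are assumptions made for illustration and are not the settings the IO Report Card actually uses. Options such as direct I/O are deliberately left out, since their suitability depends on the device.

```python
import shlex

# Hypothetical mapping of the four report-card workloads to fio options.
# Block sizes and access patterns follow the list above; the dictionary
# keys are illustrative names, not identifiers used by Delphix.
WORKLOADS = {
    "small_random_reads": {"rw": "randread", "bs": "8k"},
    "large_sequential_reads": {"rw": "read", "bs": "1m"},
    "small_sequential_writes": {"rw": "write", "bs": "1k"},
    "large_sequential_writes": {"rw": "write", "bs": "128k"},
}


def fio_command(name, filename="/tmp/fio-testfile", numjobs=1, runtime_s=60):
    """Build an fio argument list for one workload (does not run fio).

    filename, numjobs, runtime_s and the 1g size are assumed defaults.
    """
    opts = WORKLOADS[name]
    return [
        "fio",
        f"--name={name}",
        f"--filename={filename}",
        f"--rw={opts['rw']}",
        f"--bs={opts['bs']}",
        "--size=1g",
        f"--numjobs={numjobs}",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running anything.
    for workload in WORKLOADS:
        print(shlex.join(fio_command(workload)))
```

The commands are only printed, not executed, so the sketch can be reviewed safely; running them against real storage would require pointing `filename` at the device or file system under test.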
Metrics and Statistics
For each of these workloads we present both a quantitative and a qualitative analysis. We present the user with the full distribution of latencies. The key number we focus on is the 95th percentile latency. Storage systems can exhibit highly variable performance, and the 95th percentile effectively measures the predictable performance of the underlying storage system. Why not use the average? The average (or arithmetic mean) is ostensibly easy to understand but can hide important details. A distribution with outlying samples can still have a low average, even though its performance is not predictable.
Figure 1 shows the frequency distribution of response times from two different storage systems. Notice that the average latency for both systems is around 610ms, but Storage System II suffers a few outliers with response times of more than 128 secs and 256 secs. This handful of requests accounts for almost 20% of the total IO time in the system. We’ve seen that such latency outliers can significantly impact performance on computer systems of all kinds. While convention has been to use the average, we’ve seen that the 95th percentile tells a much more accurate story in a single number.
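The difference between the average and the 95th percentile can be seen with a small Python sketch. The latency samples below are made up for illustration and are not taken from Figure 1. The `percentile` helper uses the nearest-rank definition, which is one of several common conventions and not necessarily the one the report card uses.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, chosen so both systems share one mean.
steady = [10] * 100            # every request takes 10 ms
spiky = [5] * 90 + [55] * 10   # most requests are fast, 10% are slow

if __name__ == "__main__":
    for name, samples in (("steady", steady), ("spiky", spiky)):
        print(name,
              "mean:", statistics.mean(samples),
              "p95:", percentile(samples, 95))
```

Both sample sets have the same mean, but the 95th percentile of the second set lands on the slow requests, which is exactly the behavior the average hides.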
The report card produces a letter grade for each test that is run. The letter grades are based on industry standards for the different classes of storage available today. Over the years we have tried to keep the grades in step with changes in the storage offerings available to our customers.
Each test produces a report similar to Figure 1. Each row in the report has 3 parts.
- Latency grade
- Latency histogram
- Load scaling
The latency graph shows two lines: the average latency and the 95th percentile latency. The grade in the latency graph is based on the 95th percentile latency. 95% of requests complete at or below this latency, so it marks the boundary of the slowest 5% of requests. Grades range from “A+” through “D”. If the 95th percentile latency is higher than the range displayed on the chart, the line is displayed in red at the far right of the chart.
The tool does not stop with the average and the 95th percentile latency. It also shows a histogram of response times, which is intended to reveal any caching or tiering present in the storage. A bi-modal or tri-modal distribution in the histogram shows how much mileage we are getting out of each storage tier and the latency we can expect from it. For tiered storage, the grade alone may not mean much, since it might reflect the performance of only the bottom or top tier. In reality, the performance of tiered storage will depend on how much of our workload is served from each storage tier.
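To make the idea of spotting tiers in a histogram concrete, here is a minimal Python sketch. It assumes latencies are recorded in whole microseconds and grouped into power-of-two buckets, and it counts a bucket as a mode if it is a local peak holding at least 5% of the samples. The bucket scheme, the threshold, and the function names are all illustrative assumptions, not a description of how the IO Report Card builds its histogram.

```python
from collections import Counter


def latency_histogram(latencies_us):
    """Count samples per power-of-two bucket.

    Bucket b covers latencies in [2**b, 2**(b+1)) microseconds;
    values below 1 microsecond are placed in bucket 0.
    """
    counts = Counter()
    for value in latencies_us:
        counts[max(int(value), 1).bit_length() - 1] += 1
    return counts


def count_modes(hist, min_share=0.05):
    """Count local peaks that hold at least min_share of all samples."""
    total = sum(hist.values())
    modes = 0
    for bucket, n in hist.items():
        if n / total < min_share:
            continue  # too few samples to call this a tier
        # A plateau of equal neighbors is counted once, at its left edge.
        if n > hist.get(bucket - 1, 0) and n >= hist.get(bucket + 1, 0):
            modes += 1
    return modes


if __name__ == "__main__":
    # Made-up data: 70% of reads hit a fast tier, 30% a slow tier.
    tiered = [200] * 70 + [6000] * 30
    hist = latency_histogram(tiered)
    print(sorted(hist.items()), "modes:", count_modes(hist))
```

With the made-up tiered samples, the two tiers fall into separate buckets and are reported as two modes, while a single-tier sample set yields one.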
The third part of the result shows the latency scaling of the storage. The load scaling metric shows how the storage performs as the load increases. Scaling is computed by comparing the 95th percentile latency of the test at a baseline user load with the latency at a higher user load, as follows:
- Random Reads: Load increases from 16 to 32 users
- Sequential Reads: Load increases from 1 to 8 users
- Writes: Load increases from 4 to 16 users
A value of 0 means the latency stayed the same as the load increased, so throughput increased. A value of +1 means the latency went up in proportion to the load, so throughput stayed the same. A value over +1 means the throughput was worse at the higher load.
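The exact formula behind the load scaling value is not given here, so the following Python sketch uses one assumed definition that matches the described behavior: the relative growth in 95th percentile latency divided by the relative growth in load. The function name and the sample latencies are hypothetical.

```python
def load_scaling(lat_low, lat_high, users_low, users_high):
    """Assumed scaling metric: relative p95 latency growth divided by
    relative load growth.

    0 means latency did not change; 1 means latency grew in proportion
    to the load; above 1 means latency grew faster than the load.
    """
    latency_growth = lat_high / lat_low - 1
    load_growth = users_high / users_low - 1
    return latency_growth / load_growth


if __name__ == "__main__":
    # Random reads go from 16 to 32 users; latencies (ms) are made up.
    print(load_scaling(2.0, 2.0, 16, 32))  # latency unchanged
    print(load_scaling(2.0, 4.0, 16, 32))  # latency doubled with the load
    print(load_scaling(2.0, 6.0, 16, 32))  # latency tripled
```

Under this definition, throughput (users divided by latency) rises when the value is below 1, holds steady at exactly 1, and falls when the value is above 1.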
Sample Report Card
When we added support for AWS, we analyzed the block storage options available in AWS, detailed in my blog about EBS Storage in AWS. Here are a few report cards from our customers to give a sense of what to expect from different storage options available in the market today.
Delphix deploys into a wide variety of environments with varying characteristics. Understanding the capabilities of these environments helps ensure our customers can get the most value out of their Delphix deployment as quickly as possible. The IO Report Card is a tool we use to evaluate the characteristics of the storage that is provisioned to Delphix. With our latest release, Delphix 4.2, we incorporated the IO Report Card into the product as a self-service tool. Having a tool that distills complex performance data into simple-to-understand results has helped enormously in reaching consensus. We encourage our customers to run the tool before every Delphix deployment. We have found this tool invaluable for understanding deficiencies in the storage provisioned to Delphix and for addressing issues before they become problems. As we come across different types of storage, we collect and analyze the data. This helps improve overall performance and our ability to make full use of whatever storage is provisioned to us.