The Metric Data Storage Problem
Every day New Relic engineers face large challenges in the area of scalability, performance and data storage. Maintaining a 24/7 operation for a high volume of customers and keeping that service performant is a daunting task.
As our user base has grown, the volume of data maintained and handled by the system has grown too. Storage space is now measured in terabytes and traffic is measured in the millions (sometimes billions) of requests per hour, all day, every day.
Over the past few years a major problem crept up on us: the metric data storage process had become almost unmanageable. Storage space requirements were increasing daily and the amount of time it took to clean up old metric data was taking longer, too. It got to the point where the process that ran to clean up old account data was taking most of a day.
Dropping tables and deleting files holding that data was approaching the point where it could not get completed within a 24 hour period. Furthermore, those processes were having a negative impact on database performance, which translated to a slower experience for users. Once the ‘purge process’ hit the 24 hour mark, the situation would become unsustainable. Innovation and some calculated risk-taking were required to find a better way.
The Problem: Too Many Tables and a Massive Table Drop Process
Each customer has an account. An account can support multiple agent connections. Each agent sends metric data accumulated every minute up to the ‘collector’, which takes the data, aggregates it by application, and stores it into a ‘metric timeslice table’. These tables were being created every hour for each account.
For active customers writing metric data, the system would create 24 hourly timeslice tables. Depending on a customer’s subscription level, those tables would be rolled up (summated) from per-minute chunks down to per-hour chunks for longer range metric views (e.g., monthly or weekly). After a period of time, we would drop the per-minute tables and eventually even the per-hour tables.
The Solution: 24 Hour Metric Tables Mean 1/24 the Tables
In March of 2012, we upgraded our database hosts to use very large SSD disks. Using these disks made it possible to start writing metric ‘per-minute’ tables in 24 hour chunks instead of one hour chunks. This meant 1/24 the tables and faster access to metric data because the New Relic web application didn’t need to query multiple tables when rendering charts and tables showing customer metric data.
Given the number of customers we provide service to, it was not possible to make this change in ‘one fell swoop’. It had to be done carefully, starting with a few accounts and then to larger groups of accounts progressively. Developing a clear plan for making this migration and verifying quality along the way was a big part of this project.
The migration started in early November, 2012 and took almost a month to complete.
Performance and Storage Benefits
The first thing we noticed after switching to 24 hour tables was an increase in free disk space. As clients migrated to the new table structure, the old tables started to drop out. One disadvantage of using so many tables is that MySQL InnoDB has some minimum table size requirements (~9MB per table for the metric timeslice tables). Having 24 tables per day, per customer was turning MySQL into a massive disk space hog even for mostly idle customers.
But luckily, we drink our own champagne and use New Relic to monitor New Relic! You can see in this chart that by November 24 the system had saved almost 30% of total disk space across the data cluster — almost 30TB of space.
Besides better disk space utilization, another important benefit, was related to how long it took the system to remove old metric storage data. Prior to the upgrade, the system would spend over 16 hours per day dropping metric timeslice tables. This process was taking longer each day due to New Relic customer growth.
After most of the old one hour tables were gone, the process went down to about 90 minutes. The “table drop’ process has a major impact on system database performance. You can see a correlation between the disk space curve and general database query performance:
Here’s the chart with database response time isolated:
Database query performance in the New Relic UI has improved overall (almost two times on average) because the system uses fewer tables to fetch data and the tables have an improved index structure. New Relic database performance is also much better when the UI and the collector are not having to compete during the account purge process.
Prior to implementing this enhancement, we spent some time modeling and testing the MySQL write characteristics in custom simulation tools to ensure it would work. Even with analysis and simulation, predicting exact database performance at such large scales can be a bit of a black art. It can be difficult to predict exactly what will happen in the real world.
New Relic engineers including QA, support staff and development are constantly working to improve the quality of our service. The timeslice table migration from hourly to daily was an example of a creative and somewhat daring change in architecture that has yielded a significant improvement in overall performance.