Optimizing Data Queries for Time Series Applications
Optimizing Data Queries for Time Series Applications
Paying attention to indexing, query scope, retention policies, and cardinality may not solve all of your problems, but the more you know about your data, the better you'll be able to craft queries.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Now that we understand what time series data is and why we want to store it in a time series database, we've reached a new challenge. Like any application, we want to ensure our database queries are smart and performant, so let's talk about how we can avoid some common pitfalls.
Indexing, the oft-recommended and rarely understood solution to all attempts at optimization, is applicable to most databases. Whether the time series database you're using is built on Cassandra or MySQL or its own unique architecture, indexing affects your queries. Essentially, an index is a data structure that stores the values from a specific column, meaning that when we search by an indexed field, we have a handy shortcut to the values. When we search by unindexed fields, we have to discover the full path to the value, with no shortcuts or magic tricks. Searching unindexed fields is like having to watch Frodo walk through Middle Earth unedited: it takes a long time.
While indexing is not unique to time series databases, we have to remember that the index is a data structure that becomes oversized if we have too many indexed columns or fields. An index structure that is too large ends up eating memory and slowing down processes, negating its advantages. The time series problem here is that there's no convention around which pieces should be indexed, so we need to be aware of our schema at all times.
When a query gets me down, I usually hop into the command line. I'm happy there. When I was first discovering time series databases, I did just that. I skipped into my InfluxDB command line tool and typed:
SELECT * FROM 'cpu'
And my life flashed before my eyes. Memories of small batches of user data brought tears to my eyes. My terminal turned into the kind of screen shown by a "hacker" in a crime TV show.
One of the distinctive qualities of time series data is that it is more valuable in higher volume — we store millions of points. Running a query using
* (all) can potentially lock up your database while it retrieves points.
There are a few options to limit your query while also improving it.
- Use a time range. Many time series application queries aggregate data from a window, so use that to your advantage.
- Add a sub-query. This will limit the scope of your query by adding parameters, and ensure you only get relevant results.
The key to scoping your queries is to filter them: be as specific as possible to avoid data overload in your application, your terminal, and your mind.
In the world of time series data, data points age like the bagged salad in my crisper drawer: I might keep it longer than I should, but eventually, I'm going to need to throw it away. The high number of points makes it difficult to store time series data indefinitely, and even if disk space allows for an immense amount of data, the queries then have to run through a huge dataset.
Let's say you're ignoring some of my previous advice, and you need to run a query without a time window or a sub-query. You can control the amount of data just by setting up processes to delete expired data. This is another piece whose logistics depend on which database you're using, but it's a common time series problem, so solutions abound on the internet for your database of choice. Delete expired data and save yourself some... time.
Even if our query is perfect, high cardinality will slow us down. The number of unique values in a column or series determines cardinality. High cardinality means a high number of unique values. Cardinality tends to increase when we want to query across more and more combinations of attributes, which then leads to time the database spends: finding the appropriate values in a series, performing any necessary functions (i.e. sum the values) on those values, repeating for every relevant, unique series, and then combining them according to the query requirements. As the index and cardinality grow, so does the overhead in running a query.
In a columnar database, we can improve performance by ensuring we have fewer series with more points rather than more series with a fewer number of points. Compression techniques in time series run more efficiently on long runs of values, so if we want to get the most out of our database, we need to follow its rules.
In time series databases built on relational databases, cardinality affects the index more than anything else, so we need to keep an eye on the size of the index so it doesn't suck up our resources.
You got through some heavy stuff here. Remember to take deep breaths and go to a happy place to process all of the information.
Your time series application deserves excellence in its level of efficiency and performance — and you can make it happen. Paying attention to indexing, query scope, retention policies, and cardinality may not solve all of your problems, but the more you know about your data, the better you'll be able to craft queries. We're one step closer to being time series masters.
Published at DZone with permission of Katy Farmer , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.