Navigating the Divide: Distinctions Between Time Series Data and Relational Data
To anyone unfamiliar, time series data looks similar to relational data, but it is very different in nature; understanding both is key.
I have coded many applications, both client and web, over my career, and I understand the importance of building a well-developed application from the ground up. Therefore, I found it important to write a piece based on my understanding of time series data versus relational data as someone with a unique understanding of both.
The main objective of relational data is to maintain an accurate representation of the current state of the world with respect to its objects and the relationships between them. This means that a relational database must maintain an accurate representation of that current state by allowing users to change the objects, which changes the relationships between them.
A user is a typical example of relational data. A user has important information such as name and email address that needs to be kept up to date, and also, the user object has relationships with other objects in the system, such as the content that they have created.
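To make that concrete, here is a minimal sketch of relational state being kept up to date in place, using Python's built-in sqlite3; the table and column names are illustrative, not from the article.

```python
import sqlite3

# Minimal relational sketch: users and the content they created,
# linked by a foreign key. Names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        body TEXT);
    INSERT INTO users VALUES (1, 'Ada', 'ada@example.com');
    INSERT INTO posts VALUES (1, 1, 'Hello');
""")

# The current state is mutated in place -- e.g., an email change:
conn.execute("UPDATE users SET email = 'ada@new.example.com' WHERE id = 1")
email = conn.execute("SELECT email FROM users WHERE id = 1").fetchone()[0]
```

The point is that the database holds one current, mutable state of the world, with relationships (the foreign key) between objects.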
Time Series Data
Time series data is very different in nature. Time series is a stream of discrete observations. While these observations may look like relational data and could fit into a relational database, the data points are unrelated to one another. Each individual observation is a point in time and doesn’t particularly relate to the other observations, which are also their own points in time. Time series data isn’t transactional data – it’s rarely if ever, edited or updated. Summarization and analysis are the main objectives of collecting time series data.
Time series data includes events (a user was created), metrics (a sensor reading at a specific time), and logs (the system encountered an unexpected state, and the developer wrote out some debugging messages).
Using a Relational Database for Time Series Leads to Tears
It’s tempting to stuff time series data into the familiar Postgres or MySQL database, but that’s a bad idea for many reasons.
Relational Databases Don’t Scale
The inability of relational databases to scale alongside time series data won't surface during testing, prototyping, or the early stages of production. During the build and early growth stages, you'll see exponential increases in time series data as your user base grows. For example, your first user might require storage for ten times as much time series data as relational data; a second user, 100 times as much; a third, 1,000 times as much.
A relational database will support this growth for some time. One day, however, your successful, well-sized production application hits a workload where the volume of stored data, the number of users, or the rate of incoming observations reaches a scalability tipping point.
Time-to-become-readable (TTBR) is the latency between when data is written and when it becomes readable. TTBR is how you'll know your database isn't keeping up with ingest speed, and by the time you notice it rising, it's already a problem. You need fast TTBR with time series because you want to act on anomalies as soon as they appear, whether through automation or through notifications that prompt manual action.
When a relational database's storage volume fills up, the database slows down during backups, ingest times suffer greatly, and TTBR can increase to a minute or more, which deters any immediate action. In contrast, a time series database can ingest in milliseconds, allowing for immediate action should an anomaly appear.
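As a rough illustration of how TTBR could be measured, here is a hedged Python sketch; `write_point` and `query_point` are hypothetical stand-ins for a real database client's write and read calls, not part of any specific API.

```python
import time

def measure_ttbr(write_point, query_point, poll_interval=0.01, timeout=5.0):
    """Measure time-to-become-readable: the delay between a write
    and the moment the same point becomes visible to reads.

    write_point -- callable that performs the write
    query_point -- callable returning True once the point is readable
    """
    written_at = time.monotonic()
    write_point()
    while time.monotonic() - written_at < timeout:
        if query_point():
            return time.monotonic() - written_at
        time.sleep(poll_interval)
    raise TimeoutError("point never became readable")

# Usage with an in-memory stand-in for a real database:
store = []
ttbr = measure_ttbr(lambda: store.append("obs"), lambda: len(store) > 0)
```

Against a real database, the same loop would write a uniquely tagged point and poll a query for it; a rising TTBR is the early warning sign described above.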
There is, of course, a problem worse than slow ingestion: under a high enough ingest load, a relational database can fall over entirely.
In our internal benchmarking, InfluxDB ingests orders of magnitude more data per second, using significantly less CPU and memory than other databases, even those that claim to be tuned for time series.
The shape of time series data can also change rapidly, typically by gaining a dimension. Say an application is tracking the health of URLs, and after monitoring those URLs for some time, the developer decides to start tracking the size of the response object as well.
Relational databases require an explicit schema, so adding a new data dimension requires a data migration for every schema change. At a minimum, the existing production tables need a migration to add the new column. Every migration carries a risk of data corruption or loss in a labor-intensive process, even with migration tools such as Alembic or Rake. These risky migrations stem from the general inflexibility of any database that requires an explicit schema.
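To illustrate the kind of migration involved, here is a small sketch using Python's built-in sqlite3; the `url_checks` table and `response_size` column are hypothetical, echoing the URL-monitoring example above.

```python
import sqlite3

# Simulate the migration a relational store needs when a new
# dimension appears. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_checks (time INTEGER, url TEXT, status INTEGER)")
conn.execute("INSERT INTO url_checks VALUES (1, 'https://example.com', 200)")

# Adding the new dimension means altering the schema in place;
# in production, every existing row is affected by the change:
conn.execute("ALTER TABLE url_checks ADD COLUMN response_size INTEGER")

columns = [row[1] for row in conn.execute("PRAGMA table_info(url_checks)")]
```

Tools like Alembic wrap this `ALTER TABLE` step in versioned migration scripts, but the underlying risk, a schema change applied to live production tables, remains.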
InfluxDB is "schema on write," meaning you can add new dimensions and fields simply by including them in new writes. No changes to your development or production databases are necessary.
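For illustration, here is a hedged sketch of what schema on write looks like with InfluxDB's line protocol, built by hand in Python (simplified: no character escaping); the measurement, tag, and field names are hypothetical.

```python
def line_protocol(measurement, tags, fields, ts):
    """Build a simplified InfluxDB line-protocol string:
    measurement,tag=value field=value timestamp (no escaping)."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts}"

# Earlier writes tracked only status and response time...
old = line_protocol("url_check", {"url": "example.com"},
                    {"status": "200i", "response_time": "0.13"}, 1)

# ...newer writes simply carry the new field; no migration needed.
new = line_protocol("url_check", {"url": "example.com"},
                    {"status": "200i", "response_time": "0.13",
                     "response_size": "5120i"}, 2)
```

The old points remain valid as written; the new dimension exists only from the moment new writes start including it.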
A Time Series Database Built on Apache Arrow Creates an Interesting Solution
Scaling and schema aren’t the only challenges of using relational databases for time series. In many cases, your code needs to pull over a lot of observations to run interesting calculations that relational databases don’t support. These days, if you aren’t working with Apache Arrow, you’re working with large datasets the hard way. Arrow is built specifically to move large amounts of columnar data and to let tools operate on that data efficiently.
Even retrieving large amounts of data and converting it to Arrow format in your own client code isn’t workable, because the data arrives without the compression Arrow provides both across the wire and in memory. If your database doesn’t return data in Arrow format right from the source, your production application struggles to keep enough memory free to work with large datasets, and transferring that many poorly compressed bytes across the wire increases latency between the database and your code.
Then there is the actual programming model. Upstream libraries in the Apache Arrow project allow users to query large amounts of data, efficiently bring it to their clients, and operate on the data. Once you use an Arrow library and something like Pandas to operate on large data sets, you won’t go back to looping through cursors to transform your data or hand it off to other libraries.
Querying Time Series With SQL
The incorporation of the Apache ecosystem also created an opportunity to build SQL support into InfluxDB. The database uses DataFusion as its query engine, and DataFusion uses SQL as the query language, meaning anyone who knows SQL can query time series.
We also created three time-series-specific functions in DataFusion so anyone within the Apache Arrow community can benefit from or contribute to them.
- date_bin() – A function that creates rows that are time windows of data, each with an aggregate.
- selector_first(), selector_last() – Functions that return the first or last row of a table meeting specific criteria.
- time_bucket_gapfill() – A function that returns windowed data, filling any windows that lack data.
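As a hedged sketch of how one of these might look in practice, assuming a hypothetical measurement `cpu` with `time` and `usage` columns (consult the InfluxDB and DataFusion documentation for exact signatures):

```sql
-- Average usage in ten-minute windows over the last hour
SELECT
  date_bin(INTERVAL '10 minutes', time) AS window,
  avg(usage) AS avg_usage
FROM cpu
WHERE time >= now() - INTERVAL '1 hour'
GROUP BY window
ORDER BY window;
```

Anyone who writes SQL can read this query; the only time-series-specific piece is `date_bin()`, which assigns each row to its window.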
The Bottom Line
Relational data and time series data are very different data types, each with its own storage requirements, functionality, and workloads. On the surface, relational and time series data look similar, making this a sneaky issue for the unfamiliar. That’s why it’s important to be aware of these differences early in the process. If you wait until issues creep up in production, they become much more challenging to solve.
Time series data needs a time series database, like InfluxDB, that accounts for low TTBR at high ingestion rates, the flexibility of schema-on-write data collection, and advanced data analysis. Native SQL support in InfluxDB makes time series workloads more accessible to SQL users.
InfluxDB is purpose-built for time series data and its unique workloads. Adding this technology to your stack will help you mitigate or avoid entirely the pitfalls described above.
Opinions expressed by DZone contributors are their own.