Temporal Data Processing

Table of Contents

Introduction SQL Specification and Database Support Simple Working Example How it works Query: A specific point in time Query: Find valid rows within a time range Query: All points in time Query: Helpful Syntax Adding History: New Table Add History: Existing Table Add History: Excluding/Including Columns Removing History Replication Limitations

Section 1

Introduction

Dealing with data that changes over time while keeping track of historical data for reporting or other purposes is supported by the SQL specification. This refcard is designed to provide information and tools for handling data that varies over time in relational databases

Section 2

SQL Specification and Database Support

The SQL specification defines various features to define and query across temporal tables. Temporal tables are built around a PERIOD for which a row is valid, defined by two generated columns in the table. The two generated columns, defined as ROW START and ROW END keep track of the time the fields in that row remain the same. If a change occurs, the database will automatically update the column values for the row and generate a new one. For more information, see the walk through in the Simple Working Example section.

Database Support

Support for temporal tables is limited. Currently, Microsoft SQL Server, Oracle, and MariaDB support temporal tables. PostgreSQL has some temporal support via a plugin, but the syntax does not follow the SQL specification. MariaDB is open-source and easily accessible, so it will be used for demonstrating examples in the rest of the refcard. When features are described that are specific only to MariaDB are used, they will be called out as such.

Section 3

Simple Working Example

In order to reason about these features, we’ll use the following example: We want to track the level of proficiency of users of our app. Users will have a proficiency level of NOVICE, COMPETENT, PROFICIENT, EXPERT, or MASTER.

A succinct set of queries that sets up the table, inserts data, and queries data is below. This will be described and elaborated on in the rest of the refcard. Note that I’ll be simplifying the model for readability and will ignore certain performance and modeling enhancements such as breaking level out into its own table and referencing it within the user table.

​x
CREATE TABLE user (
   id int,
   level VARCHAR(20),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING;
​
/* Some time later, a new user joins */
INSERT INTO user (id, status) VALUES (1, 'NOVICE');
​
/* Time passes and we escalate the user’s proficiency level to COMPETENT */
UPDATE USER set status='COMPETENT' WHERE id=1;
​
/* View user 1’s current status */
SELECT * FROM user where id=1;
​
/* View user 1’s status as of a specific point in time */
SELECT * FROM user FOR SYSTEM_TIME AS OF TIMESTAMP'2018-09-25 16:24:00.000000' where id=1;

Section 4

How it works

When you have a table using system versioning, the database keeps track of changes that you make over time using the two additional columns marked by ROW START and ROW END. Changes you make to a particular row are stored as separate rows in in the table with the ROW START column identifying the timestamp of the change. The value of ROW END will either be sometime far in the future, signifying this is the current valid row, or sometime in the past, signifying that other updates have superseded the row. New updates will automatically adjust the ROW END value of the previous row so no overlaps occur.

Step by Step Example

After creating our table (see Simple Working Example), we want to insert some data:

INSERT INTO user (id, status) VALUES (1, 'NOVICE');

The underlying data in this table now looks like:

|id|level|start_timestamp|end_timestamp|
|1|NOVICE|2018-09-25 16:46:41.217007|2038-01-19 03:14:07.999999|

If we update user 1, the end_timestamp field of our existing row with the NOVICE level will be updated to reflect it became invalid the moment our new update occurred.

UPDATE USER set status='COMPETENT' WHERE id=1;
​
|id|level|start_timestamp|end_timestamp|
|1|NOVICE|2018-09-25 16:46:41.217007|2018-09-25 16:46:51.236727|
|1|COMPETENT|2018-09-25 16:46:51.236727|2038-01-19 03:14:07.999999|

Deleting a user will simply set the END ROW value of the current row to the timestamp of the delete query.

Section 5

Query: A specific point in time

Querying for data at a specific point in time can happen in two ways.

If you want to get the current version of data, i.e., if the specific point in time is now, simply write the query you would expect to write with no additional syntax.

SELECT * FROM user WHERE id=1;

If you want to query for a particular point in time in the past, add FOR SYSTEM_TIME AS OF TIMESTAMP'<timestamp string here>' just after the table name.

SELECT * FROM user FOR SYSTEM_TIME AS OF TIMESTAMP'2018-09-25 16:24:00.000000' WHERE id=1;

Section 6

Query: Find valid rows within a time range

To find all valid rows within a time range for a specific query, you can use BETWEEN...AND or FROM...TO syntax.

BETWEEN...AND will show all rows valid between the time range specified and is inclusive of the boundaries.

SELECT * FROM user FOR SYSTEM_TIME BETWEEN TIMESTAMP'2018-09-25 16:24:00.000000' AND TIMESTAMP'2018-09-25 16:46:51.236727';

FROM...TO will show all rows valid between the time range specified and is inclusive of the FROM timestamp but doesn’t include the TO timestamp.

SELECT * FROM user FOR SYSTEM_TIME FROM TIMESTAMP'2018-09-25 16:24:00.000000' TO TIMESTAMP'2018-09-25 16:46:51.236727';

Section 7

Query: All points in time

To find all of the versions of the rows you’re interested in, use the FOR SYSTEM_TIME ALL syntax.

/* Retrieve all historical data for all rows in the table */
SELECT * FROM user FOR SYSTEM_TIME ALL;
​
/* Retrieve all historical data for rows in the table that meet our filter */
SELECT * FROM user FOR SYSTEM_TIME ALL WHERE id=1;

Section 8

Query: Helpful Syntax

There are numerous date and time types and functions available in the SQL spec to help reason about time. Below are a few that cover many of the cases needed when querying temporal data.

TIMESTAMP function

We’ve used a literal TIMESTAMP in previous examples, which requires a full datetime expression. Using the TIMESTAMP function, we can specify a date or datetime, adding some flexibility to how specific we want to define our queries.

/* Get all valid user rows in the month of August */
SELECT * FROM user FOR SYSTEM_TIME FROM TIMESTAMP('2018-08-01') TO TIMESTAMP('2018-09-01');

NOW function

Use NOW() to get the current datetime. Useful when paired with INTERVAL (described below).

/* Get all valid rows from the beginning of September to the time this query is executed */
SELECT * FROM user FOR SYSTEM_TIME BETWEEN TIMESTAMP('2018-09-01') AND NOW();

INTERVAL keyword

The INTERVAL keyword can be used to add to or subtract from temporal types. To use an interval, specify the quantity and unit of the interval, such as INTERVAL 1 MONTH.

/* Valid values as of 30 minutes ago */
SELECT * FROM user FOR SYSTEM_TIME AS OF (NOW() - INTERVAL 30 MINUTE);

Some interval units include: MICROSECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, and YEAR.

Composite units are also supported, such as: INTERVAL '1:30' HOUR_MINUTE

Section 9

Adding History: New Table

To add history at the time of table definition, you need to name two special columns ROW_START and ROW_END, identify them as versioning columns with PERIOD FOR SYSTEM_TIME, and specify that the table should be temporal using WITH SYSTEM VERSIONING. See the previous example again:

CREATE TABLE user (
   id int,
   level VARCHAR(20),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING;

In the above example, start_timestamp and end_timestamp are arbitrarily named, and you can change their names as you see fit as long as you reference them appropriately in the PERIOD. The type of the two rows is flexible, but you usually want to record the date and time of changes at the highest level of precision possible.

Short Form

Some databases, such as MariaDB, support a more succinct form of defining versioning. In this form, the only additional detail that needs to be added is WITH SYSTEM VERSIONING. The ROW START and ROW END columns will be generated as invisible columns and the PERIOD will be defined with those columns.

CREATE TABLE user (
   id int,
   level VARCHAR(20),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING;

The columns won’t return in a SELECT * statement, but instead must be explicitly referenced as ROW_START and ROW_END.

Section 10

Add History: Existing Table

System versioning can be added to an existing table using the ALTER TABLE syntax, after which future changes will be tracked.

Imagine the following, pre-existing table:

CREATE TABLE user (
   id int,
   level VARCHAR(20),
   PRIMARY KEY (id)
);

In order to use system versioning, we need to add the ROW START and ROW END columns, a period, and specify the table should be versioned.

ALTER TABLE user
   ADD COLUMN start_ts TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   ADD COLUMN end_ts TIMESTAMP (6) GENERATED ALWAYS AS ROW END,
   ADD PERIOD FOR SYSTEM_TIME(start_ts, end_ts),
   ADD SYSTEM VERSIONING;

Section 11

Add History: Excluding/Including Columns

You may care about changes to most columns in a table, but are not interested in tracking the changes to some of them. In this case, you can exclude columns using WITHOUT SYSTEM VERSIONING when defining them. Note this is only available in MariaDB.

CREATE TABLE user (
   id int,
   level VARCHAR(20),
   favorite_color VARCHAR(20) WITHOUT SYSTEM VERSIONING,
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING;

When changing the favorite_color value, the system will not create a new row and update timestamps. Instead, it will change the favorite_color value of the current row in place.

You may also have many columns in a table that don’t need versioning but want to version on a small subset of columns. Instead of having to mark the majority of your columns using WITHOUT SYSTEM VERSIONING, you can use WITH SYSTEM VERSIONING on a specific column. This automatically makes the table versioned and excludes other columns from being versioned. This feature is only available in MariaDB.

CREATE TABLE user (
   id int,
   level VARCHAR(20) WITH SYSTEM VERSIONING,
   nickname VARCHAR(20),
   date_of_birth DATE,
   PRIMARY KEY (id)
);

#Storing History Separately Storing historical records alongside current records can impact performance due to the resulting overhead in table scans and index lookups. In order to reduce the impact on queries for current data, add a separate partition for historical rows. There must be exactly one current partition and at least one historical partition.

CREATE TABLE user (
   id int,
   status VARCHAR(20),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME (
   PARTITION user_historical HISTORY,
   PARTITION user_current CURRENT
);

When using multiple historical partitions, it is possible to define when to roll to the next partition using either a row limit or a time interval:

/* Rolling over by row count */
CREATE TABLE user (
   id int,
   status VARCHAR(20),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME LIMIT 500000 (
   PARTITION user_historical_1 HISTORY,
   PARTITION user_historical_2 HISTORY,
   PARTITION user_current CURRENT
);
​
/* Rolling over by interval
CREATE TABLE user (
   id int,
   status VARCHAR(20),
   start_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW START,
   end_timestamp TIMESTAMP(6) GENERATED ALWAYS AS ROW END,
   PERIOD FOR SYSTEM_TIME(start_timestamp, end_timestamp),
   PRIMARY KEY (id)
) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME INTERVAL 1 MONTH (
   PARTITION user_historical_1 HISTORY,
   PARTITION user_historical_2 HISTORY,
   PARTITION user_current CURRENT
);

Section 12

Removing History

When removing history, you may want to remove system versioning entirely, remove existing history, or remove history earlier than a specific point in time.

Removing system versioning

To completely remove system versioning, drop the ROW START and ROW END columns and drop system versioning:

ALTER TABLE user
   DROP COLUMN start_timestamp,
   DROP COLUMN end_timestamp,
   DROP SYSTEM VERSIONING;

For MariaDB, you must first enable the following setting:

SET @@system_versioning_alter_history = 1;

For SQL Server, you need the CONTROL permission.

Remove all existing history

To remove history, but keep system versioning on and track changes moving forward:

DELETE HISTORY FROM user;

Remove history before a certain point in time

To remove history before a certain point in time, but keep recent history:

/* MariaDB Only */
DELETE HISTORY FROM user BEFORE SYSTEM_TIME '2018-09-25 00:00:00.000000';

Section 13

Replication

In a primary-secondary configuration, it’s possible to replicate non-temporal tables on the primary to temporal equivalents on the secondary. In MariaDB, this only works when using statement-based replication.

Starting from an empty non-temporal table

Configure the non-temporal table on the primary
Configure the temporal table on the secondary
Configure replication as usual

Starting from a pre-existing non-temporal table

Back up the non-temporal table on the leader and restore on the secondary
Alter the table to include additional columns, period, and system versioning
Configure replication as usual

Section 14

Limitations

mysqldump versus MariaDB backup

Although many tools for MySQL can be easily used on MariaDB, mysqldump will not correctly back up temporal tables due to the lack of support in MySQL. To get around this limitation, use mariabackup instead.

Temporal tables record transaction time

The temporal columns are automatically generated when data is added, updated, and removed. It is not possible to modify the columns directly. This means that temporal tables record transaction time, not valid time. Valid time typically refers to the time that a particular fact is valid in the real world. There may be some delay between when a user is recognized as an expert and the database is updated to reflect that. This is not a limitation per-se, but something to keep in mind while thinking about latency between a real-world event and the resulting modification of data in the database.