Pushing Database Versioning to Its Limits
Pushing Database Versioning to Its Limits
Push database versioning to its limits by means of a novel sliding snapshot algorithm and efficient time travel queries.
Join the DZone community and get the full member experience.Join For Free
Built by the engineers behind Netezza and the technology behind Amazon Redshift, AnzoGraph™ is a native, Massively Parallel Processing (MPP) distributed Graph OLAP (GOLAP) database that executes queries more than 100x faster than other vendors.
As most, current database systems still simply store current state or past states within one big relational table. We investigated what the performance drivers are and how to improve on the current state-of-the-art. We implemented an open-source storage system called Sirix(.io) from scratch, which stores small sized snapshots as well as supports sophisticated time-travel queries while competing with the efficiency of non-temporal database systems.
What Is a Temporal Database System?
It is a term used to describe that a system is capable of retrieving past states of your data. Typically, a temporal database stores both valid time, how long a fact is true in the real world, as well as transaction time when the data actually is committed to the database.
Questions such as, "Give me last month’s history of the Dollar-Pound Euro exchange rate. What was the customer's address on July 12th in 2015 as it was recorded in the day? Did they move or did we correct an error? Did we have errors in the database, which were corrected later on?"
Let’s turn our focus to the question of why historical data hasn’t been retained in the past and how new storage advances in recent years made it possible to build sophisticated solutions to help answer these questions without the hurdle state-of-the-art systems bring.
Advantages and Disadvantages of Flash Drives as for instance SSDs
As Marc Kramis points out in his paper “Growing Persistent Trees into the 21st Century”:
The switch to flash drives keenly motivates to shift from the “current state’’ paradigm towards remembering the evolutionary steps leading to this state.
The main insight is that flash drives as for instance SSDs, which are common nowadays have zero seek time while not being able to do in-place modifications of the data. Flash drives are organized into pages and blocks, whereas blocks Due to their characteristics they are able to read data on a fine-granular page-level, but can only erase data at the coarser block-level. Blocks first have to be erased, before they can be updated. Thus, updated data first is written to another place. A garbage collector marks the data, which has been rewritten to the new place as erased, such that new data can be stored in the future. Furthermore, index-structures are updated to locate the data in the new place.
Furthermore, Marc points out that small modifications, because of clustering requirements due to slow random reads of traditionally mechanical disk head seek times, usually involve writing not only the modified data but also all other records in the modified page as well as a number of pages with unmodified data. This clearly is an undesired effect.
How We Built an Open Source Storage System Based on These Observations From Scratch
Sirix stores per revision and per page-deltas.
Due to zero seek time of flash drives, we do not have to cluster data. Sirix only ever clusters data during transaction commits. It is based on append-only storage. Data is never modified in-place.
Instead, it is copied and appended to a file in a post-order traversal of the internal tree-structure in batches once a transaction commits.
We borrowed ideas from the filesystem ZFS as for instance checksums stored in parent database-pages/page-fragments, which forms a self-validating merkle-tree as well as our internal tree-structure of databases-pages.
In stark contrast to other copy-on-write (COW) approaches, however, we do not simply copy the whole record-page, which is a waste of storage space. Depending on the used versioning algorithm, we only copy a number of records from the page (everytime the changed records itself).
Versioning Algorithms for Storing and Retrieving Record-Level Snapshots
For most database systems, we store a fixed number of records, which is the actual data per database-page (currently 512 records at most). The records themselves are of variable sizes. Overlong records, which exceed a predefined length in bytes, are stored in additional overflow pages and only referenced in the record pages.
We implemented a number of versioning strategies best known from backup systems for copy-on-write operations of record pages. Namely, we either copy:
- the full record-pages, which is any record in the page (full)
- only the changed records in a record page regarding the former version (incremental)
- only the changed records in a record page since a full-page dump (differential)
It is well known that each of these versioning strategies has its advantages and drawbacks. Simply storing the whole page (full) is very efficient for reading operations. However, write performance in comparison to all other approaches is the worst, as we simply copy all unchanged records in addition to all changed records.
Incremental versioning is the other extreme and write-performance is best, as it stores the optimum (only changed records), but on the other hand, reconstructing a page needs intermittent full snapshots of pages such that the performance doesn’t deteriorate with each new revision of the page as the number of increments increases with each new version.
Differential versioning tries to balance reads and writes a bit better but is still not optimal. Each time records in a page are modified, a new page is written with all changed records since a past full dump of the page. This means that only every two revisions of the page-fragment have to be read to reconstruct a record-page. However, write-performance also deteriorates with each new revision of the page.
Incremental versioning in regards to write performance, due to the requirement of intermittent full dumps of the page, results in write-peaks. Differential versioning also suffers from a similar problem. Without an intermittent full dump, a lot of data would have to be duplicated on each new write.
Marc Kramis came up with the idea of a novel sliding snapshot algorithm, which balances read/write-performance to circumvent any write-peaks.
The algorithm makes use of sliding windows. First, any changed record must be stored. Second, any record that is older than a predefined length N of the window and that has not been changed during these N-revisions. Only these N-revisions at max have to be read. The fetching of the page-fragments could be done in parallel or we simply stop once the full-page has been reconstructed starting with the most recent revision.
Once we made sure our storage system scaled linearly for fetching old-revisions as well as the most recent revision and logarithmic for fetching and storing single records as well as whole revisions we focused our attention to upper layers.
DOM Alike API
We then invested a lot of work to implement a persistent DOM-interface (for instance to store XML-documents and in the future JSON-documents natively).
Our records are stored with stable identifiers, which never change, regardless of the fact if the records are updated or not and where they physically reside. Markers are inserted for deleted records. The encoding is simply first-child-, left-sibling-, right-sibling-, parent- and node-ID in order to store a kind of DOM representation of currently XML/XDM-nodes.
Versioned, Typed, User-Defined Indexes
We then shifted our focus again to implement versioned user-defined, index-structures.
During each transaction commit a snapshot not only of the stored data but also of the indexes is generated. The indexes currently are based on AVL-trees/AVL-nodes stored in record-pages in different subtrees of our internal ZFS alike tree-structure.
A path summary of all paths in the resource is kept up-to-date at all times.
To enable users to make the most of our temporal database system and to actually easily answer the aforementioned questions one would likely be able to get answered by a temporal database system we extended a query-compiler called Brackit.
Users are now able to open specific revisions, navigate in the DOM-alike tree-structure to select nodes in a specific revision and then navigate in time. Through novel time-based axis it is for instance easily possible to analyse how the selected or a sequence of records/the nodes looks like in the next revision, the previous revision, the first or the last revision, the past- or future-revisions, all-revisions…
Furthermore we are able to query a range of revisions either based on given timestamps or the IDs of the revisions.
But what if we want to import several revisions of preexisting documents or compare any revision stored in our system independent of the versioning algorithm we chose?
We first implemented a diff-algorithm called fast-matching-simple-edit script (FMSE) to support the import of different versions of a document and commit several revision in our storage system. The algorithm does not rely on node-identifiers. It matches based on similarities of subtrees through first calculating a Longest Common Subsequence (LCF) on the leaf nodes of the tree-structured document to import (currently XML, in the future also JSON). Then it tries to match way bottom up. In the next step, edit-operations are applied to transform the document from one version to another version. The algorithm obviously uses heuristics at a form of the tree-to-tree correction problem and thus NP-hard. It works best and produces a minimal edit-script if the leaf nodes are very distinctive.
ID-based algorithm optionally making use of stored hashes
In order to compute diffs between any revision of Sirix-resources and regardless of the record-page level versioning, we developed an algorithm, which makes use of our stable record-identifiers (which is a based on a sequence generator and never reassigns IDs (from deleted records for instance). If we store hashes of the nodes during insertion, the diff algorithm is able to skip whole-subtrees if the node-identifiers as well as the hashes match.
Non-Blocking, Asynchronous RESTful API
We recently built on top of our XQuery and DOM-API layers a higher level API to communicate with a Sirix server based on Vert.x, Kotlin/Coroutines, and Keycloak for authentication. The implementation and usage examples thereof already were the subject of another article.
A Visual Analytics Approach for Comparing Tree-Structures
As shown in a few screenshots, we once developed a Visual Analytics approach to compare tree-structures stored in Sirix. However, it’s a bit dated and needs to be ported to the web.
What We Are Working On
Next, we’ll investigate how to best store JSON-documents, which simply boils down to the question of how fine-granular we want our records to be (for instance object record nodes, array-index nodes…)
However, we would be more than happy to discuss future directions and ideas. Any help is greatly appreciated.
Published at DZone with permission of Johannes Lichtenberger . See the original article here.
Opinions expressed by DZone contributors are their own.