This article by Choo Yun-cheol appears via Esen Sagynov.
Today I would like to talk about Spanner, a NewSQL distributed relational database by Google. It can distribute and store data in data centers across the world, and it provides consistency as strong as that of a traditional RDBMS while storing more data than a single data center can hold.
In this article I will briefly explain when the NewSQL trend began, then introduce Spanner, its features and architecture, how it distributes and rebalances data, how it actually stores the data, and finally how it provides data consistency. You will also learn about Google's TrueTime API, which is at the core of the Spanner distributed relational database.
NoSQL and NewSQL
NewSQL is on the rise. A wholly different database architecture, differentiated from NoSQL, is beginning to emerge. There are many reasons why NoSQL products have been popular. As there are a variety of NoSQL products, and they have been developed to serve different purposes, it is not easy to list their common features. However, as you can see in Hadoop or Cassandra, one of the main advantages of NoSQL is its horizontal scalability. As these NoSQL products don't provide Strong Consistency, they cannot be used where high-level data consistency is required.
NewSQL scales as well as NoSQL, and at the same time it guarantees ACID like an RDBMS running on a single node. The term NewSQL was first used in 2011 by Matthew Aslett at the 451 Group, an IT research firm. Figure 1 below shows the classification of NewSQL that was made by the 451 Group (it does not include Spanner, as it was drawn up in 2011).
Figure 1: Classification of RDBMS, NoSQL and NewSQL Made by the 451 Group. (Source: http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/)
Of course, HBase also provides transactions in a limited form (transaction for a single row). However, not all business requirements can be met with such a limited transaction rule.
nBase, which is developed by the Storage System Dev. Team here at NHN, is also a NewSQL database. Currently, NAVER Mail, probably the most popular email service provider in Korea, uses nBase. nBase has supported Strong Consistency since version 1.3.
What is Spanner?
Debuted at the OSDI Conference in 2012, Spanner is a NewSQL database created by Google. It is a distributed relational database that can distribute and store data in Google's BigTable storage system in multiple data centers. Spanner is ACID-compliant (so, of course, it supports transactions) and supports SQL. Currently, F1, Google's advertisement platform, uses Spanner. Gmail and Google Search will also use it soon. F1 keeps 5 data replicas in the U.S. so that the service does not stop even in the event of a natural disaster such as an earthquake, or when one or two data centers fail.
Spanner provides the scalability that enables you to store a few trillion database rows in millions of nodes distributed to hundreds of data centers. When you read data, Spanner connects you to the data center that is geographically closest to you, and when you write data, it distributes and stores it to multiple data centers. If the data center you try to access has a failure, of course, you can read the data from another data center that has a replica of the data.
The Spanner client automatically performs a failover between replicas. When the number of servers storing data changes or equipment fails, Spanner automatically redistributes data by transferring it among the servers.
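As a rough illustration of the read path described above, the following Python sketch picks the closest replica that is still reachable. The replica names, the distances and the `pick_replica` helper are hypothetical; they are not part of any Spanner client API.

```python
def pick_replica(replicas, healthy):
    """Return the name of the closest healthy replica.

    replicas: list of (name, distance_km) pairs (hypothetical values).
    healthy:  set of replica names that currently respond.
    """
    for name, _distance in sorted(replicas, key=lambda r: r[1]):
        if name in healthy:
            return name  # closest replica that is still reachable
    raise RuntimeError("no healthy replica available")


replicas = [("us-east", 300), ("us-west", 4000), ("us-central", 1500)]
print(pick_replica(replicas, healthy={"us-east", "us-west", "us-central"}))
print(pick_replica(replicas, healthy={"us-west", "us-central"}))
```

The first call selects `us-east`; once `us-east` is removed from the healthy set, the second call falls back to `us-central`, the next closest replica.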
The following information on Spanner was obtained from Google's Spanner Paper. In addition, the figures below are also excerpted from the paper.
The Data Model of Spanner as a Semi-relational Data Model
Spanner's data model is semi-relational. It is called semi-relational because it has some features that differentiate it from normal relational databases.
- In Spanner, every table must have a primary key made up of one or more columns.
- Spanner is a multi-version database that uses a version when storing data in a column. It has evolved from the key-value store that maintains a version like BigTable (and like HBase, too, which was also influenced by BigTable).
- Data is stored in semi-relational tables that have a schema. Each piece of data has a version, which is given in the form of the time stamp at which it was committed. Applications that use Spanner can also read past versions of the data.
- Spanner supports transactions for general use and supports SQL as well.
- Spanner provides Strong Consistency. It can read and write data consistently and provides globally consistent read operations for a specific time stamp. This functionality enables you to carry out 'consistent backup', 'MapReduce operation' and 'atomic schema change', even with ongoing transactions. This is possible because Spanner issues a serialized commit time stamp to distributed transactions by using TrueTime API.
Spanner Server Configuration
A Spanner deployment is called a universe. A universe consists of multiple zones. A zone is a unit that can be operated with physical independence. A data center may have one or more zones. If you want to store data separately in different server groups, you should create two or more zones in a single data center. You can also add or remove a zone while the system is running.
Figure 2: Spanner Server Configuration.
Figure 2 above shows the configuration of servers in a universe. One zone consists of one zonemaster and hundreds to thousands of spanservers. The zonemaster allocates data to spanservers, while the spanservers actually store and serve data. The location proxy is called by the client and reports which spanserver stores the target data.
The universe master provides status and debugging information for all zones. The placement driver automatically transfers data between zones; it communicates with the spanservers periodically to determine whether any data needs to be moved because of a change in replication settings or for load balancing.
Spanserver Software Stack
Figure 3: Spanserver Software Stack.
Each spanserver manages 100 to 1000 instances of a data structure called a tablet. A tablet is similar in concept to a tablet in BigTable. It can store multiple mappings of the following form:
(key:string, timestamp:int64) → string
The difference between the tablet of Spanner and the tablet of BigTable is that Spanner stores a time stamp together with data. This means Spanner is not a simple key-value store but has the characteristics of a multi-version database.
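To make the multi-version mapping more concrete, here is a minimal in-memory sketch in Python. It only illustrates the (key, timestamp) → value idea; the `VersionedTablet` class and its methods are hypothetical names and say nothing about how Spanner actually implements a tablet.

```python
import bisect
from collections import defaultdict


class VersionedTablet:
    """Toy stand-in for a tablet: maps (key, timestamp) -> value."""

    def __init__(self):
        # Per key: a sorted list of timestamps and a parallel list of values.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key: str, timestamp: int, value: str) -> None:
        """Store a new version of `key` at `timestamp`."""
        idx = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def read(self, key: str, timestamp: int):
        """Return the newest value written at or before `timestamp`, or None."""
        idx = bisect.bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][idx - 1] if idx > 0 else None


tablet = VersionedTablet()
tablet.write("user/1/name", 10, "Alice")
tablet.write("user/1/name", 20, "Alicia")
print(tablet.read("user/1/name", 15))  # version written at time stamp 10
print(tablet.read("user/1/name", 25))  # version written at time stamp 20
```

Reading at time stamp 15 returns the value committed at 10, which is what lets an application read data of past versions.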
The state of a tablet is stored in the Colossus distributed file system (the successor of Google File System) as a set of B-tree-like files and a write-ahead log (WAL).
Spanner uses Paxos state machines to support data replication among spanservers.
Paxos is a family of protocols created for reliable operation in distributed environments. In a distributed environment, a failure can occur at any time, and events from distributed nodes are not guaranteed to arrive in the order in which they occurred. Paxos is used to resolve this type of reliability issue. One of the main questions handled by Paxos is which node is the leader, which matters for consistency during data replication.
A spanserver has a transaction manager to support distributed transactions. The transaction manager is not involved in a transaction performed within a single Paxos group, but when a transaction spans multiple Paxos groups, one of the participant leaders is selected as the coordinator leader and coordinates a two-phase commit.
A directory is a set of contiguous keys that share the same prefix (you can think of it as a bucket). The Spanner paper notes that "bucket" would be a more appropriate term, and that the term "directory" is kept for historical reasons.
A directory is a unit of data allocation. All the data in a directory have identical replication settings, and the transfer of data between Paxos groups is also conducted, as shown in Figure 4 below, in the unit of a directory.
Figure 4: Transfer of a Directory between Paxos Groups.
Spanner moves a directory to reduce the load on a Paxos group, to place directories that are frequently accessed together in the same group, or to bring a directory into a group that is geographically close to the clients that access it. A directory can be moved even while client operations are in progress.
Transfer of directories between Paxos groups is conducted in the background. The same transfer mechanism is used to add or remove replicas in a Paxos group. To avoid blocking ongoing reads and writes while a large amount of data is moved, a directory transfer is not performed as a single transaction. Instead, the transfer first registers that it has started and then moves the data in the background; once all but a small remainder has been moved, a transaction is used to move that remainder.
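The two-stage shape of a directory transfer can be sketched roughly as follows. This is a toy model under loud assumptions: plain Python dicts stand in for the source and target Paxos groups, the `move_directory` helper and its `nominal` parameter are hypothetical, and concurrent writes to the source are ignored entirely.

```python
def move_directory(source: dict, target: dict, nominal: int = 2) -> int:
    """Move all entries from `source` to `target`.

    Returns the number of entries copied in the background stage.
    """
    keys = sorted(source)
    split = max(len(keys) - nominal, 0)
    bulk, tail = keys[:split], keys[split:]

    # Stage 1: background copy, one key at a time. In this toy model the
    # source keeps serving reads while the copy is in progress.
    for key in bulk:
        target[key] = source[key]

    # Stage 2: stand-in for the final transaction, which moves the small
    # remainder and switches ownership in one step.
    for key in tail:
        target[key] = source[key]
    source.clear()
    return len(bulk)


group_a = {"dir1/a": 1, "dir1/b": 2, "dir1/c": 3, "dir1/d": 4, "dir1/e": 5}
group_b = {}
copied_in_background = move_directory(group_a, group_b, nominal=2)
print(copied_in_background, len(group_a), len(group_b))
```

With five entries and `nominal=2`, three entries are copied in the background and only the last two are handled by the final step, which keeps the blocking part of the move short.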
A directory is the minimum unit of geographical replica placement. The administrator can specify the number and types of replicas and their geographical placement for each application. For example, you can configure settings to store 3 replicas of the data of User A in Europe, and store 5 replicas of the data of User B in North America.
If a directory grows too large, it can be split into multiple fragments. In this case, the fragment, rather than the directory, becomes the unit of transfer and allocation among groups.
Spanner's data model features semi-relational tables with a schema, a query language that extends SQL, and general-purpose transactions.
An application can create one or more databases in a universe, and one database can have an unlimited number of tables. A table has rows and columns like an RDBMS table. But unlike in an RDBMS table, each value carries version information.
Figure 5: An Example of Spanner Schema.
Figure 5 above shows an example of a Spanner schema. In Spanner's schema definition language, you can express the hierarchical relationship among tables using the INTERLEAVE IN declaration. The top-level table in the hierarchy is a directory table. Each row of a directory table, together with all the rows of its sub-tables whose keys begin with that row's key, arranged in dictionary order, makes up a directory. The ON DELETE CASCADE clause specifies that when a row of a directory table is deleted, the related rows of its sub-tables are deleted together with it.
In the example, the Users table is specified as a directory table, and data is divided and stored into different directories according to the value of the uid column, which is the primary key. Because the client specifies the hierarchical relationship among multiple tables, the database can achieve better performance when data is divided and distributed.
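The interleaved layout can be illustrated with a small Python sketch. It assumes, purely for illustration, that a Users row is keyed by `(uid,)` and a row of a sub-table (here called Albums) is keyed by `(uid, aid)`; the sample rows and the `directories` helper are hypothetical. Sorting the keys in dictionary order places each user's sub-table rows directly after the user row, and grouping by `uid` yields one directory per user.

```python
from itertools import groupby

# Hypothetical rows: Users rows are keyed by (uid,), Albums rows by (uid, aid).
rows = {
    (2, 1): "Albums(2,1)",
    (1,): "Users(1)",
    (1, 2): "Albums(1,2)",
    (2,): "Users(2)",
    (1, 1): "Albums(1,1)",
}


def directories(rows):
    """Group rows into directories by the leading uid of their key."""
    ordered = sorted(rows.items())  # dictionary (lexicographic) key order
    return {
        uid: [name for _key, name in group]
        for uid, group in groupby(ordered, key=lambda item: item[0][0])
    }


for uid, members in directories(rows).items():
    print(uid, members)
```

Directory 1 contains `Users(1)`, `Albums(1,1)` and `Albums(1,2)` in that order, and directory 2 contains `Users(2)` and `Albums(2,1)`, so all rows that belong to one user can be placed and moved together.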
TrueTime is an API that provides time-related information, which consists of the following methods:
- TT.now() returns the current time in the form of a TTinterval: [earliest, latest], which takes the inaccuracy of the clock into account. The returned TTinterval is guaranteed to contain the absolute time at which TT.now() was called.
- TT.after(t) returns true if the specified time t has definitely passed, and returns false if not.
- TT.before(t), in contrast, returns true if the specified time t has definitely not arrived yet, and returns false if not.
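The semantics of the three methods can be sketched in a few lines of Python. This is only a toy model: the fixed `epsilon` bound, the injectable `clock` parameter and the class names are assumptions made for illustration, whereas the real TrueTime derives its uncertainty from the time masters.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus an assumed fixed uncertainty bound."""

    def __init__(self, epsilon: float = 0.007, clock=time.time):
        self.epsilon = epsilon  # assumed worst-case clock error, in seconds
        self._clock = clock     # injectable so the sketch can be tested

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived yet."""
        return self.now().latest < t


tt = TrueTime(epsilon=0.5, clock=lambda: 100.0)
print(tt.now())          # interval [99.5, 100.5]
print(tt.after(99.0))    # True: 99.0 is below the earliest possible time
print(tt.after(99.75))   # False: 99.75 may not have passed yet
```

Note that `after(t)` and `before(t)` can both be false at the same moment; that is exactly the case in which `t` lies inside the uncertainty interval.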
TrueTime gets time information from GPS and atomic clocks. It uses two different sources because either one could fail and be unable to provide time information. Time obtained through GPS can be lost because of antenna or reception problems or radio wave interference, while time obtained from an atomic clock may drift because of frequency errors.
TrueTime is implemented by multiple time master machines in each data center. Most masters have a GPS receiver equipped with an antenna. The other masters, which have no GPS receiver, are called Armageddon masters and are equipped with an atomic clock. All masters periodically compare their time information with each other, and in this process each master checks its own clock for errors. To reduce the risk from errors in any single master, the timeslave daemon that runs on each machine gets information from multiple time masters, identifies any master that provides incorrect information, and adjusts its own clock.
Because the local clock drifts between synchronizations, the inaccuracy of the clock grows from one synchronization until the next. The time inaccuracy in Spanner therefore follows a sawtooth pattern, moving between 1 ms and 7 ms within each synchronization period. As the synchronization period is 30 seconds and the time error grows by 200 μs per second, the sawtooth component ranges from 0 to 6 ms; the remaining 1 ms comes from communication latency with the time master.
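The arithmetic behind these numbers can be checked with a short Python snippet. The constants are the ones quoted above; the `epsilon_ms` helper name is just an illustration.

```python
DRIFT_US_PER_S = 200      # clock error growth, in microseconds per second
SYNC_INTERVAL_S = 30      # time between synchronizations, in seconds
MASTER_LATENCY_MS = 1.0   # communication latency with the time master


def epsilon_ms(seconds_since_sync: float) -> float:
    """Time inaccuracy in milliseconds, given seconds since the last sync."""
    drift_ms = DRIFT_US_PER_S * seconds_since_sync / 1000
    return MASTER_LATENCY_MS + drift_ms


print(epsilon_ms(0))                 # 1.0 ms right after a synchronization
print(epsilon_ms(SYNC_INTERVAL_S))   # 7.0 ms just before the next one
```

Right after a synchronization only the 1 ms latency term remains, and 30 seconds later the drift term has grown to 200 μs × 30 = 6 ms, which gives the 7 ms peak.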
Spanner provides three types of operations: read/write transactions, read transactions and snapshot reads. A standalone write is performed as a read/write transaction, while a standalone read that is not a snapshot read is performed as a read transaction.
The Process of Read/Write Transaction
Writes executed in a read/write transaction are buffered on the client until commit. As a result, reads in the transaction do not see the effects of the transaction's own writes.
Reads in a read/write transaction use the wound-wait method to avoid deadlocks. The client acquires a read lock from the leader replica of the appropriate group and reads the latest data. To prevent a timeout while the transaction is open, the client sends keepalive messages to the participant leaders. When the client has completed all reads and finished buffering its writes, the two-phase commit begins. The client selects a coordinator group and sends a commit message to all participant leaders. The commit message contains the identity of the coordinator and the writes that were buffered.
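The wound-wait rule mentioned above can be expressed as a small decision function. The sketch below shows the general textbook technique, assuming each transaction carries a unique start time stamp and that a smaller time stamp means an older transaction; the `resolve_conflict` helper is a hypothetical name and is not taken from the Spanner paper.

```python
def resolve_conflict(requester_ts: int, holder_ts: int) -> str:
    """Decide what happens when a requester hits a lock held by another transaction.

    Smaller start time stamp = older transaction = higher priority.
    """
    if requester_ts < holder_ts:
        # The older requester "wounds" (aborts) the younger lock holder.
        return "wound holder"
    # A younger requester simply waits for the older holder to finish.
    return "wait"


print(resolve_conflict(requester_ts=5, holder_ts=9))  # wound holder
print(resolve_conflict(requester_ts=9, holder_ts=5))  # wait
```

Because a transaction only ever waits for an older one, a cycle of waiting transactions cannot form, which is why the scheme avoids deadlocks.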
Each leader other than the coordinator acquires write locks, chooses a prepare time stamp larger than any time stamp it has assigned to previous transactions, logs a prepare record through Paxos, and then sends the prepare time stamp to the coordinator.
The coordinator leader also acquires write locks, but skips the prepare phase. It receives the time stamps from all the other participant leaders and then selects a time stamp for the transaction. The commit time stamp must be equal to or larger than every time stamp received from the participant leaders, larger than the value of TT.now().latest at the time the coordinator received the commit message, and larger than any time stamp the leader has assigned to previous transactions. After that, the coordinator leader logs the commit record through Paxos.
Before applying the commit record to the coordinator replica, the coordinator leader waits until TT.after(commit time stamp) becomes true, to ensure that the time specified by the time stamp has passed. After that, the coordinator sends the commit time stamp to the client and all the participant leaders. The participant leaders that received the commit time stamp log the result of the transaction through Paxos. All participants apply the same time stamp and then release the locks.
Figure 6: The Process of the Two-Phase Commit of a Read/Write Transaction.
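The coordinator's time stamp selection and the wait before the commit is applied can be sketched as follows. The sketch assumes integer time stamps (for example, microseconds) so that "larger than" can be written as "+ 1"; the function names and the `tt_after` callable are hypothetical stand-ins, not Spanner's actual interfaces.

```python
import time


def choose_commit_timestamp(prepare_timestamps, now_latest, last_assigned):
    """Smallest integer time stamp that satisfies the three constraints.

    - at least every prepare time stamp received from participant leaders
    - strictly larger than TT.now().latest when the commit message arrived
    - strictly larger than any time stamp assigned to earlier transactions
    """
    return max(
        max(prepare_timestamps, default=0),
        now_latest + 1,
        last_assigned + 1,
    )


def commit_wait(tt_after, commit_ts, poll_interval=0.001):
    """Block until tt_after(commit_ts) reports that commit_ts has passed."""
    while not tt_after(commit_ts):
        time.sleep(poll_interval)


ts = choose_commit_timestamp([105, 110], now_latest=108, last_assigned=100)
print(ts)  # 110: the largest prepare time stamp dominates here
```

In this example the prepare time stamp 110 is the binding constraint; had the leader already assigned time stamp 120 to an earlier transaction, the result would be 121 instead.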
The Process of a Read Transaction
A read transaction runs without locking once its read time stamp has been determined, so writes that arrive while the read is in progress are not blocked. A read transaction is executed in two phases: first a read time stamp is determined, and then the read is executed against the snapshot at that time stamp.
To execute a read transaction, you need a scope expression that summarizes the keys to be read by the transaction. If the scope can be served within a single Paxos group, the client sends the read transaction to that group's leader. The Paxos leader determines a time stamp for the read transaction and executes the read. For a read within a single group, the read time stamp is the value of LastTS(), which is the time stamp of the last committed write in the Paxos group.
If the scope must be served across multiple Paxos groups, the client uses the value of TT.now().latest as the read time stamp. The client waits until TT.after() becomes true for that time stamp, which confirms that the time stamp has passed, and then sends the reads in the transaction to all replicas.
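The choice of read time stamp described above comes down to a single branch. In the following Python sketch, the group identifiers, the `last_ts_by_group` mapping (standing in for each group's LastTS() value) and the `choose_read_timestamp` helper are all hypothetical.

```python
def choose_read_timestamp(groups_in_scope, last_ts_by_group, now_latest):
    """Pick the read time stamp for a read transaction.

    groups_in_scope:  Paxos group ids touched by the scope expression.
    last_ts_by_group: stand-in for each group's LastTS() value.
    now_latest:       stand-in for TT.now().latest at the time of the call.
    """
    if len(groups_in_scope) == 1:
        (group,) = groups_in_scope
        return last_ts_by_group[group]  # last committed write in that group
    return now_latest                   # multi-group read


last_ts = {"g1": 90, "g2": 95}
print(choose_read_timestamp(["g1"], last_ts, now_latest=120))        # 90
print(choose_read_timestamp(["g1", "g2"], last_ts, now_latest=120))  # 120
```

The single-group read can use the older time stamp 90 and return immediately, while the multi-group read uses 120 and may have to wait until that time has passed.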
Schema Change Transaction
Spanner also supports atomic schema changes by using TrueTime. As a schema change may involve millions of groups, it is practically impossible to perform it as a normal transaction. BigTable supports atomic schema changes within one data center, but all operations are blocked while the change is in progress.
Spanner, however, can execute a schema change without blocking by using a special transaction. First, a future point in time is explicitly selected and registered as the time stamp of the schema change. This allows the schema change to be performed without affecting ongoing tasks.
Reads and writes that depend on the schema are synchronized with the registered schema change time stamp. If the time stamp of a task is earlier than the schema change time stamp, the task proceeds ahead of the change. If it is later, the task is blocked until the schema change has been applied.
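The synchronization rule for schema changes can be sketched as a simple gate. The `schema_gate` function and its `change_applied` flag are assumptions introduced for illustration; the text above only states that earlier operations proceed and later ones are blocked.

```python
def schema_gate(op_timestamp: int, schema_change_ts: int, change_applied: bool) -> str:
    """Decide whether an operation may run relative to a registered schema change."""
    if op_timestamp < schema_change_ts:
        return "proceed"  # runs before the change, against the old schema
    if change_applied:
        return "proceed"  # assumed: once the change is applied, later operations run
    return "block"        # must wait behind the schema change


print(schema_gate(op_timestamp=100, schema_change_ts=200, change_applied=False))  # proceed
print(schema_gate(op_timestamp=250, schema_change_ts=200, change_applied=False))  # block
```

Because the schema change time stamp lies in the future when it is registered, operations that are already running fall into the first branch and are not disturbed.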
Spanner has blended and developed the ideas of two different research communities. From the database community, Spanner adopted a familiar, easy-to-use semi-relational interface, transactions and an SQL-based query language. From the systems community, it adopted scalability, automatic sharding, fault tolerance, consistent data replication and wide-area distribution. Thanks to 5 years of development effort, Spanner has gained the critical database functionalities that had not been possible in BigTable in globally distributed environments.
Another key functionality of Spanner is TrueTime. By making the inaccuracy of the clock explicit in the time API, TrueTime makes it possible to build functionality that relies on accurate time synchronization in a distributed system.
By Choo Yun-cheol, Senior Software Engineer at Storage System Dev. Team, NHN Corporation.