Curator's Note: The post below was originally published in September, 2012.
Today I read this recent paper by Google:
Spanner: Google's Globally-Distributed Database
As a globally-distributed database, Spanner provides several interesting features.
- First, the replication conﬁgurations for data can be dynamically controlled at a ﬁne grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance).
- Second, Spanner has two features that are difﬁcult to implement in a distributed database: it provides externally consistent  reads and writes, and globally-consistent reads across the database at a time-stamp. These features enable Spanner to support consistent backups, consistent MapReduce executions , and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.
The very interesting part is that it provides semi-relational tables (which some other NoSQL systems also do), SQL-like query language, and synchronous replications – you can see that many applications within Google use non-optimal data stores just to have the synchronous replication.
Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general- purpose transactions. The move towards supporting these features was driven by many factors.
- The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore . At least 300 applications within Google use Megastore (despite its relatively low per- formance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across data- centers.)
- The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel  as an interactive data- analysis tool.
- Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator  was in part built to address this failing. Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.
Essentially, they accomplish by still having several versions of each record (like BigTable, but with timestamps assigned by the server) and guaranteeing timestamps to be within an error margin by using multiple time sources like GPS and atomic clocks. Then there are Paxos groups, of course, and two-phase commits for the transactions.
I found this paper very interesting as I've been thinking of some challenges of implementing these features (like across DC replication and backup) on NoSQL data stores. Quite an interesting paper and accomplishment by this Google team.