What Is Distributed SQL?
In this article, find out what distributed SQL is and look at database architecture, business benefits, and more.
Join the DZone community and get the full member experience.Join For Free
SQL has been the de-facto language for relational databases (aka RDBMS) for almost four decades. Relational databases are, therefore, also known as SQL databases. However, the original SQL databases like Oracle, PostgreSQL, and MySQL are monolithic from an architectural standpoint. They are unable to distribute data and queries across multiple instances automatically. NewSQL databases emerged to make SQL scalable. However, they also introduced their own painful compromises.
You may also like: Working With PostgreSQL Data Types in a Distributed SQL Database
After the introduction of Docker containers and Kubernetes orchestration to create flexible, composable infrastructure starting in 2015, microservices-based applications have been on the rise. Cloud-native principles of built-in scaling, resilience, and geo-distribution are at the center of this architectural shift. Time was ripe for the introduction of a new class of databases called “Distributed SQL”. The defining characteristic of a distributed SQL database is that the entire database cluster (irrespective of the number of nodes in it) looks to applications as a single logical SQL database.
Distributed SQL databases have a three-layer architecture.
1. SQL API
As is evident by the name, a distributed SQL database must have a SQL API for applications to model relational data and also perform queries involving those relations. Typical data modeling constructs that are unique to SQL databases are indexes, foreign key constraints, JOIN queries and multi-row ACID transactions.
2. Distributed Query Execution
Queries should be automatically distributed across multiple nodes of the cluster so that no single node becomes a bottleneck for query processing. Incoming queries should be accepted by any node in the cluster which should then request other nodes to process their portion of their query in such a way that processing latency is minimized, which includes the amount of data transferred between the nodes over the network. The original node accepting the request should then send the aggregated results back to the client application.
3. Distributed Data Storage
Data including indexes should be automatically distributed (aka sharded) across multiple nodes of the cluster so that no single node becomes a bottleneck for ensuring high performance and high availability. Additionally, the database cluster should support strongly consistent replication and multi-row (aka distributed) ACID transactions in order to ensure the single logical database concept.
Strongly Consistent Replication
Supporting a powerful SQL API layer inherently requires the underlying storage layer to be built on strongly consistent replication across the nodes of the database cluster. This means writes to the database would be synchronously committed at multiple nodes in order to guarantee availability during failures. Reads should either serve the last committed write or an error. This property is commonly known as Linearizability. As per the famous CAP Theorem, distributed SQL databases are classified as Consistent and Partition-tolerant (CP).
Distributed ACID Transactions
The database storage layer should also support distributed ACID transactions where transaction coordination across multiple rows located on multiple nodes is required. Usually, this requires the use of a 2 Phase Commit (2PC) protocol. Isolation levels, which stand for the I in ACID, signify how strict the database is with respect to concurrent data access. Distributed SQL databases are expected to support Serializability as the most strict isolation level as well as additional weaker isolation levels such as Snapshot.
The above architecture leads to four key benefits.
1. Developer Agility With SQL and Transactions
Even though NoSQL databases such as Amazon DynamoDB, MongoDB, and FaunaDB are starting to make some operations transactional, application developers continue to keep SQL databases close to their hearts.
One of the reasons for this affinity is the inherent power of SQL as a data modeling language that effortlessly models relational and multi-row operations. For example, SQL goes way beyond traditional key-value NoSQL by allowing multi-row transactions both explicitly (using the BEGIN and END TRANSACTION syntax) and implicitly (using secondary indexes, foreign keys and JOIN queries).
Additionally, developers love the ease with which they can leverage SQL to model (and store) data only once and then change queries by simply changing JOINs as and when business needs change.
2. Ultra Resilience With Native Failover/Repair
The use of techniques such as per-shard distributed consensus replication in distributed SQL databases ensure that each shard (and not simply each instance) remains highly available in the presence of failures.
Infrastructure failures always affect only a subset of data (only those shards whose leaders get partitioned away) and never the entire cluster. And, given the ability of the remaining shard replicas to automatically elect a new leader in seconds, the cluster repairs itself thereby exhibiting self-healing characteristics when subjected to failures.
The application remains transparent to these cluster config changes and continues to work normally without outages or slowdowns.
3. Scale On-Demand With Horizontal Write Scalability
“How Data Sharding Works in a Distributed SQL Database” shows how automatic data sharding is usually implemented in a distributed SQL database. The shards remain automatically balanced across all available nodes as new nodes are added or existing nodes are removed.
Microservices needing write scalability for transactional applications can now rely on the database directly as opposed to adding new infrastructure such as an in-memory cache (that offloads read requests from the database so that it can be preserved for handling write requests) or a NoSQL database (that scales writes but forsakes ACID guarantees).
4. Low User Latency With Geographic Data Distribution
As highlighted in “9 Techniques to Build Cloud-Native, Geo-Distributed SQL Apps with Low Latency”, distributed SQL databases can offer a wide array of techniques to build geo-distributed applications that not only help in tolerating region failures automatically but also lower latency for end-users by bringing data closer to their local region.
Published at DZone with permission of Sid Choudhury. See the original article here.
Opinions expressed by DZone contributors are their own.