Refcard #334

Getting Started With Distributed SQL

In recent years, NoSQL distributed databases have become common, as they are built from the ground up to be distributed. Yet they force difficult design trade-offs, such as sacrificing consistency, data integrity, and ease of querying in exchange for the availability and scale their applications need. This Refcard serves as a reference to the key characteristics of distributed SQL databases and provides information on the benefits of these databases, as well as insights into query design and execution.

Published: Jul. 23, 2020

Written by Domenic Ravita, Field CTO, MemSQL

Brought to you by MemSQL

Table of Contents

Introduction

The Need for Distributed SQL

Sharding Middleware

Distributed SQL Architecture

Comparing Architectures of Distributed SQL Databases

Additional Characteristics of Distributed SQL Databases

Query Execution Architecture

Distributed DDL

Distributed DML

Distributed Joins

Conclusion

Section 1

Introduction

Modern applications are built as cloud-native, distributed systems. Applications supporting transactional processing are built as stateless, microservice-based distributed systems to achieve scale, speed, resilience to failure, and elasticity. For applications supporting analytical processing, the driving design need is handling growing dataset sizes while simultaneously serving higher concurrency.

To scale the data tier for both types of applications, it’s not uncommon to find sharding middleware managing single-node database instances. But this approach has its limitations and management overhead challenges, which have led to distributed databases gaining in popularity across the workload spectrum. 

In recent years, NoSQL distributed databases have become common, as they are built from the ground up to be distributed. Yet they force difficult design trade-offs, such as sacrificing consistency, data integrity, and ease of querying in exchange for the availability and scale their applications need.

Using this approach often means giving up relational SQL and implementing complex logic, such as joins, in the application. Application-side joins tend to be error-prone and inefficient compared to traditional RDBMS systems, which provide much better data independence, meaning that query logic in applications is less tightly coupled to the physical structure of the data. These trade-offs characterize NoSQL distributed databases, which were developed to address the scale problems of traditional single-node database systems and to take advantage of horizontal scaling. Distributed SQL databases, by contrast, offer the benefits of scale-out while also providing consistency and an ANSI-compliant SQL interface.

This Refcard serves as a reference to the key characteristics of distributed SQL databases and provides information on the benefits of these databases, as well as insights into query design and execution.



Section 2

The Need for Distributed SQL

Businesses are building new applications in the cloud, and moving legacy applications to it, at an increasing pace. The drivers are faster development cycles, the ability to scale on demand, and pay-as-you-go pricing, which avoids large up-front capital outlays and the extensive justification they require.

Simultaneously, large monolithic applications are being refactored into, or replaced by, distributed microservice-based applications, which offer the benefits of development independence and shortened development cycles. Distributed SQL databases offer an advantage for these modern, cloud-native applications. They provide better performance at a lower cost than a solution built with a traditional database like PostgreSQL or MySQL plus sharding middleware to achieve scale. Also, distributed SQL databases don’t force users to give up ACID-compliant transactions and joins in the database to achieve availability and scalability. While that trade-off may be acceptable for some use cases and applications, it certainly is not for a broad set of applications like those in financial services.



Section 3

Sharding Middleware

Sharding middleware supports the distribution of a single-server database, such as MySQL, across multiple independent servers. The user chooses a key, called the shard key, which determines the server that each record goes to. Each server acts as a separate database, but the sharding middleware logically combines them, so they appear from the outside as a single, distributed database.
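
To make that routing concrete, below is a minimal sketch of the logic the middleware (or the application itself) must perform, assuming a hypothetical users table sharded by user_id across four MySQL servers; the table, the shard count, and the use of CRC32 as the hash are illustrative assumptions:

SQL
-- Hypothetical manual sharding: a users table is split by user_id across
-- four MySQL servers (shards 0 through 3).

-- Step 1: hash the shard key to pick a shard. CRC32 is a standard MySQL
-- function used here as a stand-in; real middleware usually hashes in
-- application code before choosing a connection.
SELECT CRC32('42') % 4 AS shard_id;

-- Step 2: only after routing the connection to the server that owns that
-- shard can the actual query run there.
SELECT name, email FROM users WHERE user_id = 42;

Every query path in the application has to repeat this routing step, which is part of what the downsides below describe.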

The advantage of using sharding middleware, versus going to a NoSQL solution, is that SQL support is maintained. However, the downsides of using this middleware work directly against the hoped-for benefits. Downsides are likely to include:

  • Shard key choice. Choosing the shard key, or partition key in some systems, correctly is vital. If this is done incorrectly, or if needs change, the entire database must be re-sharded, which takes time, is likely to involve downtime, and may come at some risk to the data stored in the database. 
  • Slower performance. The middleware is an extra layer, so both updates and reads will be slowed to some degree. Joins, in particular, become quite complex. As the join logic must be written in the application, it’s often less efficient and more error-prone than what distributed databases provide.
  • Operational complexity. The sharding middleware introduces more moving parts, which increases the cost of management. It also impacts routine operations such as security patches, backups, and recovery. 
  • Limited scalability. The problems above impose practical limits on the degree of scalability that’s achievable with sharding middleware. 
  • Cost. The sharding middleware may have a license fee of its own, and additional servers, operations personnel, and operations steps are required. Whatever degree of redundancy is needed, it is likely to be difficult to achieve without additional spending and staffing. All of this increases cost. Moreover, developers often resort to writing distributed query processing logic into application software, diminishing their productivity.


Section 4

Distributed SQL Architecture

Distributed SQL databases include automatic but configurable sharding, or partitioning, in the database software itself. This reduces or eliminates the problems that sharding middleware brings with it. Such software is often referred to as NewSQL, and it includes databases such as Google Spanner, Google BigQuery, VoltDB, and MemSQL (MongoDB tries to achieve most of the same goals using a NoSQL architecture).

Some database software is targeted more at transactional workloads and some at analytical workloads, but the ideal for most of these products is to combine both kinds of workloads into one. Distributed SQL databases accomplish this through an architecture made up of three layers: (i) SQL API, (ii) distributed query execution, and (iii) distributed storage.



Section 5

Comparing Architectures of Distributed SQL Databases

Shared Everything

Characteristics: Not distributed SQL, but included for comparison. These are single-node database instances with local CPU, local memory, and local disk storage.

Examples: MySQL, PostgreSQL

Shared Memory

Characteristics: Compute nodes access a common memory address space via a high-speed network.

Examples: High-end computing in scientific simulations; not common in business practice.

Shared Storage

Characteristics: Distributed SQL database where compute nodes are independent of durable storage. Compute nodes have local memory and a buffer pool for ephemeral data. This design pays the penalty of not having data locality, and updates require messaging between compute nodes to notify them of the changed state.

Examples: Oracle Exadata, Snowflake, Google BigQuery

Shared Nothing

Characteristics: Distributed SQL database where each node has its own local CPUs, memory, and local storage. This design offers the best performance and efficiency due to data locality, moving the least amount of data across the network.

Examples: MemSQL, VoltDB



Section 6

Additional Characteristics of Distributed SQL Databases

Types of Nodes: Uniform or role-based. In distributed SQL systems with uniform nodes, every node is identical. The downside of this approach is that each node must communicate with many nodes in the cluster to obtain sufficient metadata for cluster operation. In systems with role-based nodes, each node performs one role out of two or more, and one benefit of this approach is that metadata management is isolated to the nodes in that role.

Distributed Storage: To leverage storage across independent storage devices, a database is partitioned, or sharded, across multiple nodes. The DBMS executes query fragments on each partition and then combines the results to produce a single answer. Applications and users may have no knowledge of where data is physically located or how tables are partitioned or replicated. A partitioning scheme should be chosen that maximizes single-node transactions, avoiding the need to coordinate the behavior of concurrent transactions running on other nodes (see the sketch after this table).

Distributed Queries: Generally, there are two approaches for where query execution is performed: (1) move the query computation to the data, or (2) move the data to the query computation. More modern distributed SQL databases leverage both approaches, choosing the method that provides the greatest efficiency for the query type.

Replication: Data is automatically copied to multiple nodes to increase availability, protecting against node failures. The number of copies depends on the distributed SQL implementation. There are two approaches: (1) master-replica and (2) multi-master.
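
As a minimal sketch of how partitioning is typically declared, the statement below uses MemSQL-style syntax to create a database with an explicit partition count; the database name and partition count are illustrative assumptions, and other distributed SQL databases expose this setting differently:

SQL
-- MemSQL-style: spread the database across 8 partitions. The cluster decides
-- which leaf node stores each partition, and copies of partitions for
-- availability are maintained by the database itself, not the application.
CREATE DATABASE orders_demo PARTITIONS 8;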



Section 7

Query Execution Architecture

Modern query execution engines are capable of handling queries against fast-moving operational data with high performance and low latency. The reference information below covers the Data Definition Language (DDL) and Data Manipulation Language (DML) affecting query design.

In the most common distributed query execution architecture, nodes serve in one of two roles: master/aggregator nodes or leaf nodes. Aggregators can be thought of as load balancers or network proxies through which SQL clients interact with the cluster. The only data aggregators store is metadata about the machines in the cluster and the partitioning of the data. The leaves function as storage and compute nodes.

As a user, you interact with an aggregator as if it were “the” database, running queries and updating data as normal via SQL commands. Under the hood, the aggregator queries the leaves, aggregates intermediate results (hence the name), and sends final results back to the client. All of the communication between aggregators and leaves for query execution is also implemented as SQL statements.
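
Because the aggregator presents itself as an ordinary SQL endpoint, a client session looks the same as it would against a single-node database. A minimal sketch, where the orders table and its columns are hypothetical:

SQL
-- Issued by a SQL client against an aggregator node. The aggregator forwards
-- the work to the leaves that own the relevant partitions, merges their
-- partial results (here, partial counts per status), and returns final rows.
SELECT status, COUNT(*) AS order_count
FROM orders
GROUP BY status;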



Section 8

Distributed DDL

Traditionally, a schema designer must consider how to lay out columns, types, and indexes in a table. Many of these considerations still apply to a distributed system, with a few new concepts.

Every distributed table has exactly one shard key, or shard index. This functions like a normal table index and can contain any number of columns. This key also determines which partition a given row belongs to.

When you run an INSERT query, the aggregator computes the hash value of the values in the column or columns that make up the shard key, does a modulo operation to get a partition number, and directs the INSERT query to the appropriate partition on a leaf machine.
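
As a minimal sketch, the MemSQL-style DDL below declares a shard key explicitly; the table, its columns, and the values are illustrative assumptions, and other distributed SQL databases use different clauses for the same purpose:

SQL
-- MemSQL-style DDL: the SHARD KEY determines which partition each row lands in.
CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    status VARCHAR(20),
    total DECIMAL(10,2),
    SHARD KEY (order_id)
);

-- On INSERT, the aggregator hashes the shard key and takes the result modulo
-- the number of partitions to pick the target partition on a leaf:
--   partition = HASH(order_id) % number_of_partitions
INSERT INTO orders VALUES (1001, 42, 'shipped', 99.95);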



Section 9

Distributed DML

How a table is partitioned affects the performance of some kinds of SELECT queries. In this section, we will look at common query patterns and how they are executed through the distributed system. You can use the EXPLAIN command, or graphical EXPLAIN in MemSQL Studio, to examine a query’s aggregator-level and leaf-level query plans.

Note: The following example illustrates how query patterns are executed through the distributed system; the actual syntax for any particular distributed SQL database may vary.

Let’s assume the following schema:
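
As a hypothetical sketch, this continues the orders table from the DDL example above and adds an equally hypothetical customers table; the syntax is MemSQL-style, and all names are illustrative assumptions:

SQL
-- Hypothetical schema, continuing the orders table sharded on order_id.
CREATE TABLE customers (
    customer_id BIGINT NOT NULL,
    name VARCHAR(100),
    city VARCHAR(50),
    SHARD KEY (customer_id)
);

-- Filters on the shard key, so the aggregator can route it to a single partition:
EXPLAIN SELECT * FROM orders WHERE order_id = 1001;

-- No shard-key filter, so the query fans out to every partition and the
-- aggregator merges the partial counts:
EXPLAIN SELECT status, COUNT(*) AS order_count FROM orders GROUP BY status;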


Section 10

Distributed Joins

Reference Joins. Distributed SQL architecture is fundamentally designed to efficiently execute any join query with a single sharded table and as many reference tables as you’d like. Since reference tables are fully replicated on every machine in the cluster, leaves can join the shards they own against their local copies of reference tables.

Note: The following example shows join queries against a single sharded table and a reference table; the actual syntax for any particular distributed SQL database may vary.

These queries leverage reference joins:
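
A hypothetical sketch in MemSQL-style syntax follows, where order_statuses is a reference table replicated to every node and orders remains the sharded table from the earlier examples; the tables and columns are illustrative assumptions:

SQL
-- MemSQL-style reference table: fully replicated to every node in the cluster.
CREATE REFERENCE TABLE order_statuses (
    status VARCHAR(20) NOT NULL,
    description VARCHAR(100),
    PRIMARY KEY (status)
);

-- Each leaf joins the order partitions it owns against its local copy of
-- order_statuses, so no rows move across the network to perform the join.
SELECT o.order_id, s.description
FROM orders o
JOIN order_statuses s ON o.status = s.status
WHERE o.customer_id = 42;

SELECT s.description, COUNT(*) AS order_count
FROM orders o
JOIN order_statuses s ON o.status = s.status
GROUP BY s.description;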



Section 11

Conclusion

Distributed SQL databases have numerous advantages over single-process SQL databases that are scaled through sharding middleware and over NoSQL databases, which lack the expressive power and computational efficiency of relational SQL: 

  • Horizontal Scaling with Ease
  • Elasticity
  • Fault-Tolerance
  • Partitioning/Sharding
  • Data Resilience and High Availability

This is a preview of the Getting Started With Distributed SQL Refcard; the full Refcard is available as a downloadable PDF.
