Database Sharding Explained in Plain English
Sharding may not make sense in all cases — but for most, if the data model fits into a clean sharding model, then there can be a lot of gains.
Sharding is one of those database topics that most developers have a distant understanding of, but the details aren't always perfectly clear unless you've implemented sharding yourself. In building the Citus database (our extension to Postgres that shards the underlying database), we've followed a lot of the same principles you'd follow if you were manually sharding Postgres yourself. The main difference, of course, is that with Citus, we've done the heavy lifting to shard Postgres and make it easy to adopt, whereas if you were to shard at the application layer, there's a good bit of work needed to re-architect your application.
I've found myself explaining how sharding works to many people over the past year and realized it would be useful (and maybe even interesting) to break it down in plain English.
Defining Your Partition Key (AKA Shard Key/Distribution Key)
Sharding, at its core, is splitting your data into smaller chunks that live in distinct buckets. A bucket could be a table, a Postgres schema, or a different physical database. Then, as you need to continue scaling, you can move your shards to new physical nodes, improving performance.
The one step that happens in every sharding implementation is deciding what you're going to shard or partition your data on. There are a number of trade-offs among the various keys, and what is right really depends on your application. Once you determine what your sharding key will be, you need to ensure it is in place throughout your application. You can do this a few ways; an easy one is materializing that sharding key on all your models. By materializing or de-normalizing it, you allow your application to make fewer queries to determine how to route the data.
When a request comes in, if you're doing your sharding at the application layer, your application will need to determine how to route the request. The same principle applies to Citus; you just don't have to do extra work to determine how to route the request. Rather, Citus figures that out for you. If your shard key is on the query itself, Citus sees the shard key and routes the query accordingly.
Bringing this a little more to life, here's an example query that is sent to Citus and how Citus transforms it:
SELECT * FROM visits WHERE occurred_at >= now() - '1 week'::interval AND customer_id = 2
The above gets transformed with the Citus router executor to:
SELECT * FROM visits_007 WHERE occurred_at >= now() - '1 week'::interval AND customer_id = 2
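If you were doing that routing at the application layer yourself, it might look something like the minimal Python sketch below. Everything here is hypothetical for illustration: the visits_000 through visits_007 naming scheme, the function names, and the modulo placement (which, as the next section explains, is not how you'd want to place data in practice).

NUM_SHARDS = 8

def shard_for_customer(customer_id: int) -> int:
    # Toy placement: modulo on the raw id. (As discussed below, hashing
    # the key first distributes data more evenly.)
    return customer_id % NUM_SHARDS

def route_query(customer_id: int) -> str:
    shard = shard_for_customer(customer_id)
    # Rewrite the logical table name to the physical shard table.
    # A real application would also use parameterized queries here.
    return (
        f"SELECT * FROM visits_{shard:03d} "
        "WHERE occurred_at >= now() - '1 week'::interval "
        f"AND customer_id = {customer_id}"
    )

print(route_query(2))  # routes to visits_002 under this toy scheme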
So, there you have it. You're done, right? You know all you need to about sharding! ...well, not quite yet.
Shard Key != Shard Number
A common misconception about sharding is that once you define your shard key, the raw value of that key is what your metadata tables use to determine routing. Let's look at a real-world example. Say you're building a SaaS application. You start in year 1 and get 20 customers; in year 2, you get 100 customers; and in year 3, you get 500 customers. Imagine your SaaS application is a CRM system and you decide to shard by customer because the data for each of your customers needs to be kept separate from the others. Because your early-adopter customers will have been using your application over a longer time period, customers from years 1 and 2 are likely to have significantly more data than the new customers who onboarded in year 3. One option to define how values map to shards would be the following:
shard 1: customer ids 1-100
shard 2: customer ids 101-200
shard 3: customer ids 201-300
shard 4: customer ids 301-400
shard 5: customer ids 401-500
shard 6: customer ids 501-
Because older customers have had more time to accumulate data, you now have an uneven skew of data. With customers with IDs 1-100 (those with the most data) all living on the same shard, you can see why this may not be ideal.
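To make the problem concrete, here's what that range-based placement might look like as a lookup. This is an illustrative Python sketch; the RANGES table and function name are hypothetical.

# Range-based placement matching the shard list above. Each entry is
# (min customer id, max customer id, shard); the last range is open-ended.
RANGES = [
    (1, 100, 1),
    (101, 200, 2),
    (201, 300, 3),
    (301, 400, 4),
    (401, 500, 5),
]

def shard_for_customer(customer_id: int) -> int:
    for low, high, shard in RANGES:
        if low <= customer_id <= high:
            return shard
    return 6  # customer ids 501 and up

print(shard_for_customer(42))   # 1 -- the oldest, largest tenants all pile onto shard 1
print(shard_for_customer(450))  # 5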
Sharding by Customer Still Makes Sense... Say What?
Sharding by customer is super common — especially for multi-tenant applications — and has lots of benefits for performance and scale. But the example above highlights how you can end up with a very uneven distribution of data across shards. The solution lies in the implementation of how you shard. As I like to say, it's just an "implementation detail." In this case, at least, that's more true than not.
The solution is simple: you hash each customer_id as it comes in, and you keep a table that maps which hash values are stored in which shards. Let's say you have eight shards and you start hashing the customer_ids. The hash range of integers within Postgres runs from -2147483648 (-2^31) to 2147483647 (2^31-1). If we were to divide that hash range evenly, we would have shards with the following values:
 shardid | min hash value | max hash value
---------+----------------+----------------
       1 |    -2147483648 |    -1610612737
       2 |    -1610612736 |    -1073741825
       3 |    -1073741824 |     -536870913
       4 |     -536870912 |             -1
       5 |              0 |      536870911
       6 |      536870912 |     1073741823
       7 |     1073741824 |     1610612735
       8 |     1610612736 |     2147483647
As customer_ids come in, we evaluate their hash value. For customer 1, the hash comes out to -1905060026; for customer 2, it comes out to 1134484726; for customer 3, it comes out to -28094569. So, as we can see, our early customers from year 1 already start to get evenly distributed among the shards.
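The lookup itself is straightforward once the hash ranges are fixed. Here's an illustrative Python sketch; note that crc32 is not the hash function Postgres or Citus actually uses, so the concrete hash values (and the shard each customer lands on) won't match the numbers quoted above.

import zlib

NUM_SHARDS = 8
HASH_MIN = -2**31  # Postgres hash values are signed 32-bit integers

def hash_key(customer_id: int) -> int:
    # Illustrative hash only: crc32 folded into the signed 32-bit range.
    # Postgres/Citus use their own hash functions, so real values differ.
    unsigned = zlib.crc32(str(customer_id).encode())
    return unsigned - 2**32 if unsigned > 2**31 - 1 else unsigned

def shard_for_hash(hash_value: int) -> int:
    # Divide the signed 32-bit range evenly into NUM_SHARDS buckets,
    # mirroring the min/max table above; shard ids run 1..NUM_SHARDS.
    width = 2**32 // NUM_SHARDS
    return (hash_value - HASH_MIN) // width + 1

for customer_id in (1, 2, 3):
    h = hash_key(customer_id)
    print(customer_id, h, shard_for_hash(h))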
Shards Are Not Nodes
As we briefly mentioned at the very beginning, shards are some distinct grouping of data. Too often, a shard is assumed to be a physical instance. In practice, there is a lot of leverage in starting with a higher number of shards than underlying instances. In the case of Citus, shards are Postgres tables, and we run nodes that contain at least one but often many more shards. By placing multiple shards on a single node, you're able to scale easily by moving a shard between nodes without having to break apart the data within the shard.
Let's go back to our example above with eight shards. Assuming we have two nodes, the shards might map something like the following:
 shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
       1 |    -2147483648 |    -1610612737 |      1
       2 |    -1610612736 |    -1073741825 |      2
       3 |    -1073741824 |     -536870913 |      1
       4 |     -536870912 |             -1 |      2
       5 |              0 |      536870911 |      1
       6 |      536870912 |     1073741823 |      2
       7 |     1073741824 |     1610612735 |      1
       8 |     1610612736 |     2147483647 |      2
If we wanted to scale out our setup, we could easily move half of our shards from one node to another without having to change our mapping:
 shardid | min hash value | max hash value | nodeid
---------+----------------+----------------+--------
       1 |    -2147483648 |    -1610612737 |      1
       2 |    -1610612736 |    -1073741825 |      2
       3 |    -1073741824 |     -536870913 |      3
       4 |     -536870912 |             -1 |      4
       5 |              0 |      536870911 |      1
       6 |      536870912 |     1073741823 |      2
       7 |     1073741824 |     1610612735 |      3
       8 |     1610612736 |     2147483647 |      4
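Viewed as data, the shard placement is just a mapping from shard id to node id, and scaling out only rewrites that mapping. The following Python sketch (hypothetical names, metadata only) shows why the move is cheap: the hash ranges never change, so no rows are re-hashed.

# Shard placement is a mapping from shard id to node id; the hash ranges
# stay fixed, so moving a shard never re-hashes any rows.
shard_to_node = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2}

def move_shard(shard_id: int, new_node: int) -> None:
    # A real system would also copy the shard's table to the new node;
    # here we only update the metadata.
    shard_to_node[shard_id] = new_node

# Scale out from two nodes to four, as in the second table above.
for shard_id, node in ((3, 3), (4, 4), (7, 3), (8, 4)):
    move_shard(shard_id, node)

print(shard_to_node)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 1, 6: 2, 7: 3, 8: 4}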
Customer ID vs. Entity ID: The Same Approach Applies
Whether you're sharding by a granular UUID or by something higher in your model hierarchy, like customer ID, the approach of hashing your shard key before you leverage it remains the same. Hashing your partition key and keeping a mapping of how things route is key to a scalable sharding approach. Within Citus, we follow the same standard principles that many others have followed for years when hand-rolling their own sharding; you just don't have to put in all that effort now. Under the covers, we automatically evaluate the hash value as it comes in and, within the catalog tables, quickly determine how to route the query to the appropriate table or shard without you needing to rewrite any queries.
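For instance, the same illustrative lookup from earlier works unchanged whether the key is an integer customer ID or a UUID, because everything is folded into the same hash range first. Again, this is a hypothetical sketch using crc32 rather than the database's real hash function.

import uuid
import zlib

def hash_key(key) -> int:
    # Works the same for ints, UUIDs, or strings: everything is folded
    # into the signed 32-bit hash range. Illustrative only; real systems
    # use their database's own hash function.
    unsigned = zlib.crc32(str(key).encode())
    return unsigned - 2**32 if unsigned > 2**31 - 1 else unsigned

def shard_for_hash(hash_value: int, num_shards: int = 8) -> int:
    width = 2**32 // num_shards
    return (hash_value + 2**31) // width + 1

print(shard_for_hash(hash_key(42)))            # integer customer id
print(shard_for_hash(hash_key(uuid.uuid4())))  # UUID entity id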
Not Sure If Sharding Is Right for You?
Sharding may not make sense in all cases, but for most, if the data model fits into a clean sharding model, then there can be a lot of gains. You can get performance boosts even on smaller datasets, and more importantly, you can rest easier at night, not having to worry about running into a ceiling as you grow.