To Shard or Not to Shard
How does sharding work, is it used, and if so, when?
The IT industry is one big word generator; even better, it is a meaning generator. There is a word or acronym for every single piece of functionality, yet one word can describe several different functionalities, and one functionality can be described by two or more different words. I recently found an old article about the differences between sharding and partitioning that describes sharding as a form of partitioning. It is not my aim to judge that, but I would like to write a few things about when to use sharding.
How Does It Work?
Sharding is simply the division of a big set of data into many small packs. Databases are the classic example: you split your old big database into many small databases, each located on a separate machine. You then rely on the application layer to decide to which "shard" a query should be sent (this is a small difference from partitioning, where that decision is made at the database level).
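The application-level routing decision can be as simple as hashing a record's key. The sketch below is a minimal illustration, assuming hypothetical shard names; a real system would map these to actual database connections.

```python
import hashlib

# Hypothetical shard identifiers; in practice these would be connection strings.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def shard_for(key: str) -> str:
    """Deterministically pick a shard from a record's key.

    The same key always hashes to the same shard, so reads can find
    the data that writes stored there.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Note that this naive modulo scheme reshuffles almost every key when the number of shards changes; schemes such as consistent hashing exist to soften that cost.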
When Do You Use It?
Always! That was a joke; please do not do that. Sharding can be useful in many cases, but sometimes it is better to avoid it. Let's focus on two scenarios.
In the first scenario, consider a social website with an image upload option where users can upload a big set of photos at once. The classic master-slave database architecture will fail sooner or later, because it is designed to handle a huge number of select queries: all insert statements go through the master server, so it becomes a performance bottleneck. With a sharded architecture, each insert can be routed to a separate database, so users will not suffer from uploads that take ages. The application simply has to decide which database the insert should go to, and that's it.
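One common way to route such inserts is to shard by user ID, so every photo a user uploads lands on the same shard and can later be listed with a single-shard query. This is a toy sketch of that idea (the class and in-memory lists are hypothetical stand-ins for real database connections):

```python
def shard_for_user(user_id: int, n_shards: int) -> int:
    """All of a user's rows go to the same shard."""
    return user_id % n_shards

class ShardedPhotoStore:
    """Toy stand-in for a pool of shard databases."""

    def __init__(self, n_shards: int = 4):
        # Each inner list plays the role of one shard's photo table.
        self.shards = [[] for _ in range(n_shards)]

    def insert(self, user_id: int, photo: str) -> None:
        idx = shard_for_user(user_id, len(self.shards))
        self.shards[idx].append((user_id, photo))

    def photos_of(self, user_id: int):
        """A single-shard read: only one 'database' is queried."""
        idx = shard_for_user(user_id, len(self.shards))
        return [p for (u, p) in self.shards[idx] if u == user_id]
```

The design choice here is locality: writes spread across shards by user, while per-user reads never have to fan out.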
The next scenario describes a search problem. Take a company with an enormous set of historical data, all stored in one database, and an analyst who wants to analyze it. The analyst has created quite an advanced algorithm to find interesting data, has even translated it into some complicated PL/SQL procedures, has started the procedure, and can now wait until next year. Such situations happen quite often. Even with the best partitioning, a rocket scientist, and an advanced SINGLE database, you simply cannot do this. You could invest millions of dollars in a brand new, shiny, high-end database cluster, but why do that when you can use your department's PCs as grid infrastructure? Put a small set of data on each PC and run the search there. This scenario shows the power of a grid-based sharding infrastructure for searching: at the application level, the query is sent either to all nodes or just to a subgroup, and each node searches its own small database and returns its results. In a jiffy you have your answer, and you have saved your money. The power plant engineers are the only ones who noticed that you were calculating something.
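The scatter-gather pattern described above can be sketched in a few lines. Here each shard is simulated by an in-memory list and queried concurrently; in a real grid each worker would run the search against its own node's database:

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(rows, predicate):
    """The work a single node does against its small local dataset."""
    return [row for row in rows if predicate(row)]

def scatter_gather(shards, predicate):
    """Fan the query out to every shard, then merge the partial results.

    Each shard is searched in parallel, so the wall-clock time is
    roughly that of the slowest shard, not the sum of all of them.
    """
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(lambda rows: search_shard(rows, predicate), shards)
        return [row for part in parts for row in part]
```

Sending the query to "just a subgroup" of nodes is then only a matter of passing a filtered list of shards to `scatter_gather`.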
Is It Used?
Yes. The first example is exactly how Flickr does it. For the second scenario, Google can serve as an example: they are not sharding across relational databases, but the way they divide data is very similar to sharding from an architectural point of view. MapReduce operations on cloud/grid/you-name-it infrastructure are one of the best patterns for solving the "big search" scenario.
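To make the MapReduce connection concrete, here is a minimal word-count sketch: the map phase runs independently on each data chunk (as it would on each node), and the reduce phase merges the partial results. This is an illustrative toy, not Google's implementation.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk: str) -> Counter:
    """Run on each node against its local chunk of data."""
    return Counter(chunk.split())

def reduce_phase(partials) -> Counter:
    """Merge the per-node partial counts into one result."""
    return reduce(lambda acc, part: acc + part, partials, Counter())

# Hypothetical data chunks, one per node.
chunks = ["big data big search", "search big"]
totals = reduce_phase(map_phase(chunk) for chunk in chunks)
```

The same split mirrors the sharded-search scenario: map is the per-shard query, reduce is the application-level merge.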
Different companies face different problems, but sooner or later, every company will need to work on big datasets or will need more compute power. Introducing scalable infrastructure elements is a smart move, and database sharding on grid infrastructure can be a first step in the right direction.
Opinions expressed by DZone contributors are their own.