DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • A Comprehensive Guide to Database Sharding: Building Scalable Systems
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 2]
  • The Bill You Didn't See Coming
  • How Online Databases Replicate Public Records: A Look at Data Aggregation

Trending

  • Pragmatica Aether: Let Java Be Java
  • Why Stable RAG Answers Can Still Hide Unstable Evidence
  • GenAI Implementation Isn't Magic — It’s a Lifecycle
  • Alternative Structured Concurrency
  1. DZone
  2. Data Engineering
  3. Databases
  4. The Difference Between TokuMX Partitioning and Sharding

The Difference Between TokuMX Partitioning and Sharding

By 
Zardosht Kasheff user avatar
Zardosht Kasheff
·
Oct. 11, 22 · Interview
Likes (1)
Comment
Save
Tweet
Share
6.3K Views

Join the DZone community and get the full member experience.

Join For Free

In my last post, I described a new feature in TokuMX 1.5—partitioned collections—that’s aimed at making it easier and faster to work with time series data. Feedback from that post made me realize that some users may not immediately understand the differences between partitioning a collection and sharding a collection. In this post, I hope to clear that up.

On the surface, partitioning a collection and sharding a collection seem similar. Both actions take a collection and break it into smaller pieces for some performance benefit. Also, the terms are sometimes used interchangeably when discussing other technologies. But for TokuMX, the two features are very different in purpose and implementation. In describing each feature’s purpose and implementation, I hope to clarify the differences between the two features.

Let’s address sharding first.

The purpose of sharding is to to distribute a collection across several machines (i.e. “scale-out”) so that writes and queries on the collection will be distributed. The main idea is that for big data, a single machine can only do so much. No matter how powerful your one machine is, that machine will still be limited by some resource, be it IOPS, CPU, or disk space. So, to get better performance for a collection, one can use sharding to distribute the collection across several machines, and thereby improve performance by increasing the amount of hardware.

To perform these tasks, a sharded collection ought to have a relatively even distribution across shards. Therefore, it should have the following properties:

  • User’s writes ought to be distributed amongst machines (or shards). After all, if all writes are targeted at a single shard, then they are not distributed and we are not scaling
  • To keep data distribution relatively even, background process migrate data between shards if a shard is found to have too much or too little data

Because of these properties, each shard contains a random subset of the collection’s data.

Now let’s address partitioning.

The purpose of partitioning is to break the collection into smaller collections so that large chunks of data may be removed very efficiently. A typical example is keeping a rolling period of 6 months of log data for a website. Another example is keeping the last 14 days of oplog data, as we do via partitioning in TokuMX 1.4. In such examples, typically only one partition (the latest one) is getting new data. Periodically, but infrequently, we drop the oldest partition to reclaim space. For the log data example, once a month we may drop a month’s worth of data. For the oplog, once a day we drop a day’s worth of data.

To perform these tasks, we are not concerned with load distribution, as nearly all writes are typically going to the last partition. We are not spreading partitions across machines. With partitioning, each partition holds a continuous range of the data (e.g. all data from the month of February), whereas with sharding, each shard holds small random chunks of data from across the key space.

With all this being said, there are still similarities when thinking of schema design with a partitioned collection and a sharded collection. As I touched on in my last post, designing a partition key has similarities to designing a shard key as far as queries are concerned. Queries on a sharded collection perform better if they target single shards. Similarly, queries on a partitioned collection perform better if they target a single partition. Queries that don’t can be thought of as “scatter/gather” for both sharded and partitioned collections.

Hopefully this illuminates the difference between a partitioned collection and a sharded collection.

TokuMX Data (computing) Database Shard (database architecture)

Published at DZone with permission of Zardosht Kasheff. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • A Comprehensive Guide to Database Sharding: Building Scalable Systems
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 2]
  • The Bill You Didn't See Coming
  • How Online Databases Replicate Public Records: A Look at Data Aggregation

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook