This post was originally written by Raj Bains
In fast-growing startups and applications, the application servers are easy to scale, but the operational database often hits scaling limits as data volume and velocity grow. The problem is especially acute in the cloud, where the scaling model is horizontal scale-out on commodity hardware. Legacy databases such as MySQL, Microsoft SQL Server, and Oracle scale mainly by scaling up: you buy a bigger server and move your database over. When sudden success comes in Silicon Valley and your data needs soar, you don’t want to close the doors on your customers. Instead, you typically move to ever-bigger servers, buying a few more weeks each time, until there is no bigger server in the cloud to go to.
There are two main ways to distribute an operational database's data across multiple nodes (servers). The first is sharding, the legacy approach used extensively with MySQL by companies such as Facebook and also found in some newer databases such as MongoDB. The second is horizontal slicing, as used by other newer databases such as Cassandra and ClustrixDB. The pain and advantages of either approach usually become apparent only months after the decision is made. Let’s look at some implications of this choice.
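To make the distinction concrete, here is a rough Python sketch. It assumes that sharding means the application routes each request by a shard-key range it has to know about, and that slicing means the database hashes each row key into one of many small slices spread over the nodes. The host names, key ranges, slice count, and function names are all hypothetical and are not taken from MySQL, MongoDB, Cassandra, or ClustrixDB.

```python
import hashlib
from bisect import bisect_right

# --- Sharding (illustrative): the application owns the routing table. ---
# Hypothetical shard map: exclusive upper bounds of user_id ranges.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_HOSTS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"]


def shard_for(user_id: int) -> str:
    """Return the shard host for a user_id using range-based routing."""
    if user_id < 0:
        raise KeyError(f"invalid user_id {user_id}")
    index = bisect_right(SHARD_UPPER_BOUNDS, user_id)
    if index >= len(SHARD_HOSTS):
        # No shard covers this range yet: someone has to add one and
        # update the map, which is the operational cost of sharding.
        raise KeyError(f"no shard configured for user_id {user_id}")
    return SHARD_HOSTS[index]


# --- Slicing (illustrative): the database hashes rows into slices. ---
NUM_SLICES = 12  # assumed to be larger than the node count
NODES = ["node-a", "node-b", "node-c"]


def slice_for(row_key: str) -> int:
    """Hash a row key to a slice id in the range [0, NUM_SLICES)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLICES


def node_for_slice(slice_id: int, nodes: list) -> str:
    """Place slices on nodes round-robin; the application never sees this."""
    return nodes[slice_id % len(nodes)]


if __name__ == "__main__":
    print(shard_for(1_500_000))
    slice_id = slice_for("user:1500000")
    print(slice_id, node_for_slice(slice_id, NODES))
```

In the first half, the routing logic lives in application code and has to change whenever shards are added or split. In the second half, placement is a function of the row key and the current node list, so it can stay hidden behind the database.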
Overview of the Approaches
When first setting up a high-availability configuration, both approaches seem very tractable. The high-availability configuration has no single point of failure.
If a node fails, the database stays available and the failed node can be replaced. Note that replacing a MySQL server requires dumping the entire database, loading it onto the new node, and then setting up replication. On ClustrixDB, you run a single command and you’re done.
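The two approaches already diverge significantly when it comes to real-time analytics. A single-server database such as MySQL or Microsoft SQL Server is limited to the cores of one machine, and MySQL typically runs each query on a single core. ClustrixDB, by contrast, uses massively parallel processing (MPP) to spread an analytic query across the cores and nodes of the cluster.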
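The general idea behind MPP-style analytics can be sketched as scatter-gather: each slice of the data computes a partial aggregate, and the partials are then combined. The Python below is only an illustration of that structure under assumed names (`partial_sum_count`, `parallel_average`). It is not how ClustrixDB is implemented, and because CPython threads share one interpreter lock, it shows the shape of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(rows):
    """Compute a partial aggregate (sum, count) over one slice of rows."""
    total = 0.0
    count = 0
    for amount in rows:
        total += amount
        count += 1
    return total, count


def parallel_average(slices, max_workers=4):
    """Scatter the aggregate across slices, then gather and combine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_count, slices))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("cannot average zero rows")
    return total / count


if __name__ == "__main__":
    # Three hypothetical slices of an "amount" column.
    slices = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    print(parallel_average(slices))
```

An average is a convenient example because it decomposes cleanly: each slice returns only a sum and a count, so very little data has to move between workers before the final combine step.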