Cassandra vs. MongoDB: Which Is Better for Big Data?

In this post we will compare Apache Cassandra vs. MongoDB. Both systems are being used for storing big data but they do it very differently.

If you are on the fence as to which database sink you want for your big data pipeline, then hopefully this post will give you a good idea of what Cassandra and MongoDB can do for you.

Cassandra vs. MongoDB Introduction

We will compare Apache Cassandra vs. MongoDB to see which one fills your need.

Both solutions store data for you, but they do it in very different ways. Cassandra stores data using something very similar to database tables, and MongoDB stores data using "documents."

We will start by showing the similarities between both.

Apache Cassandra

Obviously, both systems store data for you so that's our first similarity.

Both systems store data in a distributed manner.

Cassandra distributes data using the PRIMARY KEY, or more precisely its first part, the partition key. Each partition key value maps to a single partition. With a primary key made up of only that one column, a partition can hold only one row of data.

That is not very useful if you have very large amounts of data consisting of thousands or millions of rows. To accommodate this, Cassandra tables can have a CLUSTERING KEY. Together with the partition key it uniquely identifies a row, which gives Cassandra the ability to store multiple rows per partition.

Take, for example, this diagram of a Cassandra table representing purchases at a retail store.

This shows that each store location has an ID, and this ID becomes the PRIMARY KEY. But remember that on its own this gives us only one row per store, because of the way Cassandra distributes data: each PRIMARY KEY value is assigned its own partition.

Cassandra takes that primary key value, hashes it, and uses the hash to decide which node stores the data. This is how Cassandra can look up values so quickly.
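
The idea of hashing a key to pick a node can be sketched in a few lines of Python. This is a simplified illustration and not Cassandra's actual algorithm: Cassandra uses a partitioner (Murmur3 by default) and a token ring, whereas the sketch uses MD5 and a plain modulo. The node names and the `node_for` helper are made up for the example.

```python
import hashlib

# Hypothetical three-node cluster; the names are illustrative only.
NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a key to a node by hashing it (simplified stand-in for a token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]

# The same key always hashes to the same node, so a lookup by key
# can go straight to the node that owns it.
print(node_for("store-42"))
print(node_for("store-42") == node_for("store-42"))
```

Because the node is computed from the key itself, no central index has to be consulted to find the data.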

To get more than one row, you need a CLUSTERING KEY that contains unique values. Rows with different clustering key values but the same store ID are stored in the same partition, and therefore on the same node.
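
One way to picture this is a dictionary of partitions, each holding rows keyed by a clustering value. The sketch below is plain Python rather than Cassandra code, and the column names (`store_id`, `purchase_id`, `item`, `cost`) are assumptions made up for the retail example.

```python
from collections import defaultdict

# partition key -> {clustering key -> row}; all names are illustrative.
partitions = defaultdict(dict)

def insert(store_id, purchase_id, item, cost):
    """Write a row; the same (store_id, purchase_id) pair overwrites the old row."""
    partitions[store_id][purchase_id] = {"item": item, "cost": cost}

def rows_for(store_id):
    """Return a partition's rows ordered by clustering key."""
    part = partitions.get(store_id, {})
    return [(key, part[key]) for key in sorted(part)]

insert(1, 1001, "toothpaste", 4.99)
insert(1, 1002, "soda", 0.99)
insert(2, 1001, "soda", 0.99)
insert(1, 1002, "soda", 1.29)  # same keys: replaces the earlier row

print(rows_for(1))
```

Store 1 ends up with two rows in one partition. The last insert replaces a row instead of adding a third, which is why the clustering values need to be unique within a partition.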

This makes data modeling a little more involved than you are probably used to. You will find (as I did the hard way) that you can't just create tables like you would in a normal SQL system and be able to query the data the same way.

Cassandra is the option I chose as the data sink for my logging pipeline, which consists of Kafka, Cassandra, and a Python application I wrote.

Next, we will see how MongoDB compares.

MongoDB

MongoDB is a NoSQL database system. I have heard it called No SQL, non-SQL and non-relational SQL, but essentially what it means here is that the data is stored as key/value pairs rather than in relational tables.

MongoDB stores data in JSON-like objects that are called documents.

Here are two sample documents for our retail example above.

{
  "item": "toothpaste",
  "cost": 4.99
},
{
  "item": "soda",
  "cost": 0.99
}

Each row is represented by a document.

A group of documents is called a collection, which is roughly the equivalent of a table.

The really cool part about MongoDB is that there is no set schema. Documents in the same collection do not have to share the same structure.
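
As a rough illustration, a collection can be modeled in plain Python as a list of dictionaries whose fields differ from one document to the next. This is not MongoDB code; the `coupon` field, the "gift card" document, and the `total_cost` helper are invented for the example.

```python
# A "collection" modeled as a list of dicts; fields beyond item/cost are made up.
purchases = [
    {"item": "toothpaste", "cost": 4.99},
    {"item": "soda", "cost": 0.99, "coupon": "SUMMER"},  # extra field
    {"item": "gift card"},                               # no cost field
]

def total_cost(docs):
    """Sum the cost field, treating documents without one as zero."""
    return sum(doc.get("cost", 0) for doc in docs)

print(round(total_cost(purchases), 2))
```

The flexibility comes at a price: code that reads the collection has to cope with missing or extra fields, as `doc.get("cost", 0)` does here.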

Interacting with a MongoDB database from your favorite language is a breeze because most languages support JSON. Each document read from MongoDB arrives in your program as a JSON-like value, such as a dictionary in Python.
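
To show how naturally a JSON document maps onto a native data structure, here is a small sketch using only Python's standard `json` module. It does not talk to MongoDB; a real driver would hand you the document directly, and the new cost value is made up for the example.

```python
import json

# One of the sample documents, as JSON text.
raw = '{"item": "toothpaste", "cost": 4.99}'

doc = json.loads(raw)    # JSON text -> Python dict
doc["cost"] = 3.99       # work with it like any other dict
print(doc["item"])
print(json.dumps(doc))   # dict -> JSON text
```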

This makes it super easy to get started with MongoDB.

Just like Cassandra, MongoDB is also a distributed storage system.

MongoDB distributes the documents among the different nodes in the cluster using a SHARD KEY, which is very similar to the PRIMARY KEY of Cassandra outlined above. It uses the SHARD KEY to know which node to store the data on.

The performance of a MongoDB cluster greatly depends on the shard key you select. You can read more about this in the MongoDB sharding documentation.
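
A small simulation can show why the choice matters. The sketch below is not MongoDB's actual routing: it simply hashes a field with MD5 and takes a modulo. The shard count, the field names (`purchase_id`, `country`), and the generated documents are all assumptions made up for the example.

```python
import hashlib
from collections import Counter

SHARDS = 4  # hypothetical cluster size

def shard_for(value) -> int:
    """Pick a shard by hashing the shard key value (simplified)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS

def distribution(docs, shard_key):
    """Count how many documents land on each shard for a given key."""
    return Counter(shard_for(doc[shard_key]) for doc in docs)

# 1000 made-up documents: unique purchase_id, but only two countries.
docs = [{"purchase_id": i, "country": "US" if i % 10 else "CA"}
        for i in range(1000)]

print(distribution(docs, "purchase_id"))  # many distinct values
print(distribution(docs, "country"))      # only two distinct values
```

With `purchase_id` the documents spread across all the shards, while with `country` at least 900 of the 1000 documents pile up on a single shard, leaving the others mostly idle.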

Conclusion

As you can see, both systems can store your big data in a distributed manner but they do it very differently.

Hopefully, this post helped to clear up which one is better for your situation.

Topics:
big data, cassandra, mongodb

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
