Over a million developers have joined DZone.

Distributed MapReduce with Sharded MongoDB and SpringData

DZone's Guide to

Distributed MapReduce with Sharded MongoDB and SpringData

· Database Zone
Free Resource

Download the Guide to Open Source Database Selection: MySQL vs. MariaDB and see how the side-by-side comparison of must-have features will ease the journey. Brought to you in partnership with MariaDB.

In MongoDB you have several options when it comes to aggregating data. There is the included JavaScript MapReduce Framework, the new Aggregation Framework in v 2.2+ and also the Hadoop Connector for really heavy lifting.

When using the integrated MapReduce Framework you have to be aware of the caveat that MongoDB’s JS Interpreter is single-threaded. This means that regardless of how many cores your server has only one gets utilized. That’s a bit of a bummer because vertical scaling might not decrease execution times significantly. So to really bring down execution times and make use of MapReduce’s parallel computing abilities you have to scale out and shard. This brings mainly two advantages. For one you have now more “threads” and also the dataset that each of these has to deal with is getting smaller in relation of how many shards you have.

The latest of stable version of MongoDB today is 2.0.6. which was initially used for our queries. Using this version queries that we successfully issued against a single MongoDB instance failed on the sharded setup.
It seems that we were hitting an issue similar or equal to https://jira.mongodb.org/browse/SERVER-5536.

As the issue states that it’s fixed in 2.1.2 we switched to the latest nightly build (2.1.2-pre) which worked fine.

Unfortunately after switching to 2.1.2 we were confrontend with https://jira.springsource.org/browse/DATAMONGO-378

The pragmatic albeit not beautiful was a hack that reimplements the following Interfaces: MongoOperations and ApplicationContextAware. It basically works around the type cast to Integer:

((Number) counts.get(“input”)).intValue(),
((Number) counts.get(“emit”)).intValue(),
((Number) counts.get(“output”)).intValue()

All of this resulted in the following performance improvements:

Performance Chart

These numbers show how much parallelism can actually result in a massive performance increase. MongoDB brings excellent out-of-the-box capabilites to simplify sharding and replication.

Interested in reducing database costs by moving from Oracle Enterprise to open source subscription?  Read the total cost of ownership (TCO) analysis. Brought to you in partnership with MariaDB.


Published at DZone with permission of Comsysto Gmbh, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}