Over a million developers have joined DZone.

Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup

DZone's Guide to

Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Originally Written by Matt Kalan

Hadoop is a software technology designed for storing and processing large volumes of data distributed across a cluster of commodity servers and commodity storage. Hadoop was initially inspired by papers published by Google outlining its approach to handling large volumes of data as it indexed the Web. Many organizations are now harnessing the power of Hadoop and MongoDB together to create complete big data applications: MongoDB powers the online, real time operational application, while Hadoop consumes data from MongoDB and blends its with data from other operational systems to fuel sophisticated analytics and machine learning.

This is part one of a three-part series on MongoDB and Hadoop:

  1. Introduction & Setup of Hadoop and MongoDB
  2. Hive Example
  3. Spark Example & Key Takeaways

Introduction & Setup of Hadoop and MongoDB

There are many, many data management technologies available today, and that makes it hard to discern hype from reality. Working at MongoDB Inc., I know the immense value of MongoDB as a great real-time operational database for applications; however for analytics and batch operations, I wanted to understand more clearly the options available and when to use some of the other great options like Spark.

I started with a simple example of taking 1 minute time series intervals of stock prices with the opening (first) price, high (max), low (min), and closing (last) price of each time interval and turning them into 5 minute intervals (called OHLC bars). The 1-minute data is stored in MongoDB and is then processed in Hive or Spark via the MongoDB Hadoop Connector, which allows MongoDB to be an input or output to/from Hadoop and Spark.

One might imagine a more typical example is that you record this market data in MongoDB for real-time purposes but then potentially run offline analytical models in another environment. Of course the models would be way more complicated – this is just as a Hello World level example. I chose OHLC bars just because that was the data I found easily.


Use case: aggregating 1 minute intervals of stock prices into 5 minute intervals
Input:: 1 minute stock prices intervals in a MongoDB database
Simple Analysis: performed in:
 - Hive
 - Spark
Output: 5 minute stock prices intervals in Hadoop

Steps to Set Up the Environment

  • Set up Hadoop environment – Hadoop is fairly involved to set up but fortunately Cloudera makes VMs available with their distribution already installed, including both Hive and Spark. I downloaded Virtualbox (open source VM manager) onto my Mac laptop to run the Cloudera VMs from here.
  • Go through tutorials - I went through the tutorials included in the VM which were pretty helpful; sometimes I felt like I was just cutting and pasting and not knowing what was happening though, especially with Spark. One thing that is obvious from the tutorials is that the learning curve for using “Hadoop” includes learning many products in the ecosystem (Sqoop, Avro, Hive, Flume, Spark, etc.). If I were only doing this simple thing, there is no way I would use Hadoop for it with that learning curve but of course some problems justify the effort.
  • Download sample data – I Googled for some sample pricing data and found these 1 minute bars
  • Install MongoDB on the VM – it is really easy with yum on CentOS from this page
  • Start MongoDB – a default configuration file is installed by yum so you can just run this to start on localhost and the default port 27017
     mongod -f /etc/mongod.conf
  • Load sample data – mongoimport allows you to load CSV files directly as a flat document in MongoDB. The command is simply this:
     mongoimport equities-msft-minute-bars-2009.csv -type csv -headerline -d marketdata -c minibars
  • Install MongoDB Hadoop Connector – I ran through the steps at the link below to build for Cloudera 5 (CDH5). One note is that by default my “git clone” put the mongo-hadoop files in /etc/profile.d but your repository might be set up differently. Also one addition to the install steps is to set the path to mongoimport in build.gradle based on where you installed MongoDB. I used yum and the path to mongo tools was /usr/bin/. Install steps are here

For the following examples, here is what a document looks like in the MongoDB collection (via the Mongo shell). You start the Mongo shell simply with the command “mongo” from the /bin directory of the MongoDB installation.

 > use marketdata 
> db.minbars.findOne()
    "_id" : ObjectId("54c00d1816526bc59d84b97c"),
    "Symbol" : "MSFT",
    "Timestamp" : "2009-08-24 09:30",
    "Day" : 24,
    "Open" : 24.41,
    "High" : 24.42,
    "Low" : 24.31,
    "Close" : 24.31,
    "Volume" : 683713

Posts #2 and #3 in this blog series show examples of Hive and Spark using this setup above.

  1. Introduction & Setup of Hadoop and MongoDB
  2. Hive Example
  3. Spark Example & Key Takeaways

To learn more, watch our video on MongoDB and Hadoop. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights.


Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}