Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup

By Francesca Krihely · Feb. 25, 2015

Originally Written by Matt Kalan


Hadoop is a software technology designed for storing and processing large volumes of data distributed across a cluster of commodity servers and commodity storage. Hadoop was initially inspired by papers published by Google outlining its approach to handling large volumes of data as it indexed the Web. Many organizations are now harnessing the power of Hadoop and MongoDB together to create complete big data applications: MongoDB powers the online, real-time operational application, while Hadoop consumes data from MongoDB and blends it with data from other operational systems to fuel sophisticated analytics and machine learning.

This is part one of a three-part series on MongoDB and Hadoop:

  1. Introduction & Setup of Hadoop and MongoDB
  2. Hive Example
  3. Spark Example & Key Takeaways

Introduction & Setup of Hadoop and MongoDB

There are many, many data management technologies available today, and that makes it hard to discern hype from reality. Working at MongoDB Inc., I know the immense value of MongoDB as a great real-time operational database for applications; however, for analytics and batch operations, I wanted to understand more clearly which options are available and when to use other great tools like Spark.

I started with a simple example: taking 1-minute time series intervals of stock prices, with the opening (first) price, high (max), low (min), and closing (last) price of each interval, and turning them into 5-minute intervals (called OHLC bars). The 1-minute data is stored in MongoDB and is then processed in Hive or Spark via the MongoDB Hadoop Connector, which allows MongoDB to be an input to or output from Hadoop and Spark.

A more typical scenario is that you record this market data in MongoDB for real-time purposes and then run offline analytical models on it in another environment. Of course the models would be far more complicated – this is just a Hello World-level example. I chose OHLC bars simply because that was the data I found easily.
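
To make the rollup concrete before Hadoop enters the picture, here is a minimal sketch of the same 1-minute to 5-minute aggregation done entirely in the Mongo shell, assuming the data has been loaded into marketdata.minbars as described in the setup steps below. The fiveminutebars output collection is just a name made up for this illustration; the Hive and Spark examples in parts two and three do the equivalent work through the MongoDB Hadoop Connector.

    // Minimal illustration only – roll the 1-minute bars in marketdata.minbars
    // up into 5-minute OHLC bars directly in the Mongo shell. "fiveminutebars"
    // is a made-up output collection; the real work happens in Hive/Spark later.
    use marketdata
    var bars = {};   // key = symbol + 5-minute bucket start
    db.minbars.find().sort({ Symbol: 1, Timestamp: 1 }).forEach(function (doc) {
        // Timestamp is a string like "2009-08-24 09:30"; truncate the minute
        // down to a 5-minute boundary to form the bucket key
        var minute = parseInt(doc.Timestamp.substr(14, 2), 10);
        var bucket = doc.Timestamp.substr(0, 14) + ("0" + (minute - minute % 5)).slice(-2);
        var key = doc.Symbol + "|" + bucket;
        if (!bars[key]) {
            bars[key] = { Symbol: doc.Symbol, Timestamp: bucket, Open: doc.Open,
                          High: doc.High, Low: doc.Low, Close: doc.Close,
                          Volume: doc.Volume };                  // open = first price seen
        } else {
            var b = bars[key];
            b.High = Math.max(b.High, doc.High);                 // high = max over the bucket
            b.Low = Math.min(b.Low, doc.Low);                    // low = min over the bucket
            b.Close = doc.Close;                                 // close = last price seen
            b.Volume += doc.Volume;
        }
    });
    for (var k in bars) { db.fiveminutebars.insert(bars[k]); }  // write out the 5-minute bars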

Summary

Use case: aggregating 1-minute intervals of stock prices into 5-minute intervals
Input: 1-minute stock price intervals in a MongoDB database
Simple analysis: performed in:
 - Hive
 - Spark
Output: 5-minute stock price intervals in Hadoop

Steps to Set Up the Environment

  • Set up Hadoop environment – Hadoop is fairly involved to set up, but fortunately Cloudera makes VMs available with their distribution already installed, including both Hive and Spark. I downloaded VirtualBox (an open-source VM manager) onto my Mac laptop to run the Cloudera VMs from here.
  • Go through tutorials – I went through the tutorials included in the VM, which were pretty helpful, though at times I felt I was just cutting and pasting without really understanding what was happening, especially with Spark. One thing that is obvious from the tutorials is that the learning curve for using “Hadoop” includes learning many products in the ecosystem (Sqoop, Avro, Hive, Flume, Spark, etc.). If I were only doing something this simple, there is no way I would take on that learning curve just to use Hadoop, but of course some problems justify the effort.
  • Download sample data – I Googled for some sample pricing data and found these 1-minute bars.
  • Install MongoDB on the VM – this is easy to do with yum on CentOS, following this page (a representative set of commands is sketched after this list).
  • Start MongoDB – a default configuration file is installed by yum, so you can just run this to start mongod on localhost and the default port 27017:
     mongod -f /etc/mongod.conf
    
  • Load sample data – mongoimport allows you to load CSV files directly as flat documents in MongoDB. The command is simply this:
     mongoimport equities-msft-minute-bars-2009.csv --type csv --headerline -d marketdata -c minbars
    
  • Install MongoDB Hadoop Connector – I ran through the steps at the link below to build for Cloudera 5 (CDH5). One note: by default my “git clone” put the mongo-hadoop files in /etc/profile.d, but your repository might be set up differently. One addition to the install steps is to set the path to mongoimport in build.gradle based on where you installed MongoDB; since I installed with yum, the path to the mongo tools was /usr/bin/. The install steps are here.
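
As mentioned in the MongoDB install step above, here is roughly what the yum-based install and a post-import sanity check look like. The repository definition below is an assumption modeled on the MongoDB 3.0-era yum repo; copy the exact definition and version from the MongoDB installation page linked above.

    # /etc/yum.repos.d/mongodb-org-3.0.repo  (contents are an assumption;
    # use the definition from the MongoDB installation docs for CentOS)
    [mongodb-org-3.0]
    name=MongoDB Repository
    baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.0/x86_64/
    gpgcheck=0
    enabled=1

    # Install the server and tools (mongod, mongo, mongoimport, ...)
    sudo yum install -y mongodb-org

    # After starting mongod and loading the CSV (steps above), confirm the
    # 1-minute bars actually landed in marketdata.minbars
    mongo marketdata --eval "print(db.minbars.count())"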

For the following examples, here is what a document looks like in the MongoDB collection (via the Mongo shell). You start the Mongo shell simply with the command “mongo” from the /bin directory of the MongoDB installation.

 > use marketdata 
> db.minbars.findOne()
{
    "_id" : ObjectId("54c00d1816526bc59d84b97c"),
    "Symbol" : "MSFT",
    "Timestamp" : "2009-08-24 09:30",
    "Day" : 24,
    "Open" : 24.41,
    "High" : 24.42,
    "Low" : 24.31,
    "Close" : 24.31,
    "Volume" : 683713
}

Posts #2 and #3 in this blog series show examples of Hive and Spark using the setup described above.

  1. Introduction & Setup of Hadoop and MongoDB
  2. Hive Example
  3. Spark Example & Key Takeaways

To learn more, watch our video on MongoDB and Hadoop. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights.

WATCH MONGODB & HADOOP


Published at DZone with permission of Francesca Krihely, DZone MVB. See the original article here.
