
Improve Your Data Ingestion With Spark

Apache Spark is a highly performant big data solution. Learn how to take advantage of its speed when ingesting data.

By Abdelghani Tassi · Dec. 04, 18 · Tutorial

Recently, my company faced a serious challenge: loading 10 million rows of CSV-formatted geographic data into MongoDB in real time.

We first tried a simple Python script that loaded the CSV files into memory and sent the data to MongoDB. Processing 10 million rows this way took 26 minutes!

Twenty-six minutes to process a dataset is unacceptable for a real-time pipeline, so we decided to take a different approach.

Using Hadoop/Spark for Data Ingestion

We decided to use a Hadoop cluster to store and replicate the raw data, converted from CSV to Parquet.

Why Parquet?

Parquet is a columnar file format that provides efficient storage: columnar layouts compress better, and efficient encoding algorithms are built in. We mostly query these large files with Athena, and BigQuery also supports the Parquet format, so Parquet gives us better control over both performance and cost.
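The CSV-to-Parquet conversion itself is a one-time Spark job. Here is a minimal sketch, assuming an existing SparkSession named spark; the file paths and header option are illustrative:

```scala
// Read the raw CSV, letting Spark infer column types from the data.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("examples/src/main/resources/data.csv")

// Rewrite it as Parquet; the columnar layout compresses much better.
raw.write.mode("overwrite").parquet("examples/src/main/resources/data.parquet")
```

For production data it is usually better to declare an explicit schema instead of using inferSchema, which requires an extra pass over the input.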

Mapping Data With Apache Spark 

Apache Spark is one of the most powerful solutions for distributed data processing, especially when it comes to real-time data analytics.

Reading Parquet files with Spark is very simple and fast:

val df = spark.read.parquet("examples/src/main/resources/data.parquet")
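Once loaded, the DataFrame can be reshaped before it is written out. A small sketch of the kind of mapping step we apply; the column names here are hypothetical:

```scala
import org.apache.spark.sql.functions._

// Rename columns and add a derived column before writing to MongoDB.
val mapped = df
  .withColumnRenamed("lat", "latitude")
  .withColumnRenamed("lng", "longitude")
  .withColumn("ingestedAt", current_timestamp())
```

Because these transformations are lazy, Spark only executes them when the data is actually written, which lets it optimize the whole pipeline at once.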

Storing Data in MongoDB

MongoDB provides a connector for Apache Spark that exposes MongoDB as a Spark data source, so MongoDB collections can be used with Spark SQL, DataFrames, and the rest of Spark's libraries.

Here's how to spin up a connector configuration via SparkSession:

val spark = SparkSession.builder()
      .master("local")
      .appName("MongoSparkConnectorIntro")
      .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
      .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
      .getOrCreate()
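With that session in place, the connector can also read a collection back as a DataFrame. A sketch, assuming version 2.x of the connector, where MongoSpark.load reads the collection named in spark.mongodb.input.uri:

```scala
import com.mongodb.spark.MongoSpark

// Load the configured input collection as a DataFrame.
val people = MongoSpark.load(spark)
people.printSchema()
```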

Writing a DataFrame to MongoDB

Writing a DataFrame to MongoDB is very simple and uses the same syntax as writing a CSV or Parquet file:

people.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .option("database", "people")
  .option("collection", "contacts")
  .save()

Comparing Performance: Who Wins?

[Chart: processing time comparison, Python script vs. Spark]

There was no doubt Spark would win, but not by this much: the difference in performance is huge!

What is even more interesting is that the Spark solution is scalable: by adding more machines to the cluster and tuning its configuration, we can get even more impressive results.
