Improve Your Data Ingestion With Spark

DZone 's Guide to

Improve Your Data Ingestion With Spark

Apache Spark is a highly performant big data solution. Learn how to take advantage of its speed when ingesting data.

· Big Data Zone ·
Free Resource

Recently, my company faced the serious challenge of loading a 10 million rows of CSV-formatted geographic data to MongoDB in real-time.

We first tried to make a simple Python script to load CSV files in memory and send data to MongoDB. Processing 10 million rows this way took 26 minutes!

26 minutes for processing a dataset in real-time is unacceptable so we decided to proceed differently.

Using Hadoop/Spark for Data Ingestion

Wa decided to use a Hadoop cluster for raw data (parquet instead of CSV) storage and duplication. 

Why Parquet?

Parquet is a columnar file format and provides efficient storage. Better compression for columnar and encoding algorithms are in place. Mostly we are using the large files in Athena. BigQuery also supports the Parquet file format. So we can have better control over performance and cost.


Mapping Data With Apache Spark 

Apache Spark is one of the most powerful solutions for distributed data processing, especially when it comes to real-time data analytics.

Reading Parquet files with Spark is very simple and fast:

val df = spark.read.parquet("examples/src/main/resources/data.parquet")

Storing Data in MongoDB

MongoDB provides a connector for Apache Spark that exposes all of Spark's libraries.

Here's how to spin up a connector configuration via SparkSession:

val spark = SparkSession.builder()
      .config("spark.mongodb.input.uri", "mongodb://")
      .config("spark.mongodb.output.uri", "mongodb://")

Writing a Dataframe to MongoDB

Writing a dataframe to MongoDB is very simple and it uses the same syntax as writing any CSV or parquet file.

"people").option("collection", "contacts").save()

Comparing Performances - Who wins ?

Image title

No doubt about it, Spark would win, but not like this. The difference in terms of performance is huge!

And what is more interesting is that the Spark solution is scalable, which means that by adding more machines to our cluster and having an optimal cluster configuration we can get some impressive results.

apache spark tutorial, big data, data ingestion, mongodb

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}