Improve Your Data Ingestion With Spark
Apache Spark is a highly performant big data solution. Learn how to take advantage of its speed when ingesting data.
Recently, my company faced the serious challenge of loading 10 million rows of CSV-formatted geographic data into MongoDB in real time.
We first tried a simple Python script that loaded the CSV files into memory and sent the data to MongoDB. Processing 10 million rows this way took 26 minutes!
Twenty-six minutes to process a dataset in real time is unacceptable, so we decided to proceed differently.
Using Hadoop/Spark for Data Ingestion
We decided to use a Hadoop cluster to store and replicate the raw data, as Parquet files instead of CSV.
Parquet is a columnar file format that provides efficient storage: columnar layouts compress better, and efficient encoding algorithms are built in. We mostly query these large files with Athena, and BigQuery also supports the Parquet format, so we get better control over both performance and cost.
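To give a concrete picture of this step, here is a minimal sketch of converting a raw CSV export to Parquet with Spark. The paths, the header option, and the schema inference are assumptions for illustration, not our exact pipeline.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvToParquet")
  .getOrCreate()

// Read the raw CSV export (hypothetical path and options).
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///raw/geo_data.csv")

// Persist it as Parquet for efficient columnar storage on the cluster.
raw.write
  .mode("overwrite")
  .parquet("hdfs:///warehouse/geo_data.parquet")
```

Inferring the schema is convenient for a one-off conversion; for a recurring job, declaring an explicit schema avoids an extra pass over the data.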
Mapping Data With Apache Spark
Apache Spark is one of the most powerful solutions for distributed data processing, especially when it comes to real-time data analytics.
Reading Parquet files with Spark is very simple and fast:
val df = spark.read.parquet("examples/src/main/resources/data.parquet")
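Once the Parquet data is loaded, the mapping itself is ordinary DataFrame code. As a hedged sketch, suppose the dataset has `lat` and `lon` columns (assumed names) that we want to fold into a single location field before writing to MongoDB:

```scala
import org.apache.spark.sql.functions._

// Combine the assumed lat/lon columns into one nested "location" struct.
val mapped = df
  .withColumn("location",
    struct(col("lon").as("longitude"), col("lat").as("latitude")))
  .drop("lat", "lon")
```

Because transformations like this are lazy, Spark fuses the mapping with the read and the eventual write, so no intermediate copy of the 10 million rows is materialized.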
Storing Data in MongoDB
MongoDB provides a connector for Apache Spark that exposes all of Spark's libraries, so MongoDB collections can be read and written as DataFrames and Datasets.
Here's how to spin up a connector configuration via SparkSession:
val spark = SparkSession.builder()
  .master("local")
  .appName("MongoSparkConnectorIntro")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
  .getOrCreate()
Writing a Dataframe to MongoDB
Writing a DataFrame to MongoDB is very simple and uses the same syntax as writing any CSV or Parquet file.
people.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .mode("append")
  .option("database", "people")
  .option("collection", "contacts")
  .save()
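Reading works symmetrically, which is handy for spot-checking what landed in the collection. A minimal sketch, assuming the same connector and the `spark.mongodb.input.uri` configured above:

```scala
// Read the collection back as a DataFrame via the connector's input URI.
val contacts = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "people")
  .option("collection", "contacts")
  .load()

// Inspect the inferred schema and row count as a sanity check.
contacts.printSchema()
println(contacts.count())
```

The connector samples documents to infer a schema on read, so a quick `printSchema()` is a cheap way to confirm the mapping step produced the structure you expected.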
Comparing Performance: Who Wins?
No doubt about it, Spark wins, but the margin is what makes it interesting: the difference in performance is huge.
What is even more interesting is that the Spark solution is scalable: by adding more machines to the cluster and tuning its configuration, we can push these results further still.