
Big Data Analytics: Is Spark DataFrame Always a Good Choice?


It's not wise to absentmindedly reach for third-party tools, even if they're extremely well known and seemingly the right choice. Don't take open source for granted.


In the company I'm working for, some services are implemented in R. But in some cases, especially when calculating 4,000+ variables based on personal credit data from China Central Bank, R performs really poorly. That's why I ended up rewriting the R code in Java to speed things up.

At the very beginning, Spark DataFrame naturally came to mind, since it is inspired by R's data frame. But I doubted that Spark DataFrame would be fast enough, because it is built on top of RDDs, which are designed specifically for distributed environments, whereas all I wanted was a standalone DataFrame with the key functions. This led me to develop a DataFrame in Java independently and compare it against Spark DataFrame in terms of column manipulation and aggregation. Below are the results:

  1. Spark DataFrame:

    • Average time spent on adding new columns: 10246680.102 ns (about 10.2 ms)

    • Average time spent on aggregation: 8559803.424 ns (about 8.6 ms)

  2. Independently developed DataFrame:

    • Average time spent on adding new columns: 1084220.562 ns (about 1.1 ms)

    • Average time spent on aggregation: 47053.53 ns (about 0.05 ms)
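
To make the two timed operations concrete, here is a minimal sketch of what a standalone, column-oriented DataFrame and its timing harness could look like in Java. It is not the implementation measured above: the class name `MiniFrame`, its methods, the column names, the row count, and the number of runs are all hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical minimal in-memory frame: each column is a plain double[].
final class MiniFrame {
    private final Map<String, double[]> columns = new LinkedHashMap<>();
    private final int rows;

    MiniFrame(int rows) {
        this.rows = rows;
    }

    // Stores (or overwrites) a column; every column must have `rows` values.
    MiniFrame put(String name, double[] values) {
        if (values.length != rows) {
            throw new IllegalArgumentException("expected " + rows + " values for " + name);
        }
        columns.put(name, values);
        return this;
    }

    double[] column(String name) {
        double[] values = columns.get(name);
        if (values == null) {
            throw new IllegalArgumentException("no such column: " + name);
        }
        return values;
    }

    // "Adding a new column": computed element-wise from two existing columns.
    MiniFrame withColumn(String name, String left, String right, DoubleBinaryOperator op) {
        double[] a = column(left);
        double[] b = column(right);
        double[] out = new double[rows];
        for (int i = 0; i < rows; i++) {
            out[i] = op.applyAsDouble(a[i], b[i]);
        }
        return put(name, out);
    }

    // "Aggregation": a simple sum over one column.
    double sum(String name) {
        double total = 0.0;
        for (double v : column(name)) {
            total += v;
        }
        return total;
    }

    // Average wall-clock nanoseconds per call, after an equally long warm-up
    // so the JIT compiler has a chance to optimize the task first.
    static double averageNanos(int runs, Runnable task) {
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) runs;
    }

    public static void main(String[] args) {
        int n = 10_000; // assumed row count; the article does not state one
        double[] balance = new double[n];
        double[] limit = new double[n];
        for (int i = 0; i < n; i++) {
            balance[i] = i;
            limit[i] = 2.0 * i + 1.0; // never zero, so the division is safe
        }
        MiniFrame df = new MiniFrame(n).put("balance", balance).put("limit", limit);

        double addNs = averageNanos(1_000,
                () -> df.withColumn("utilization", "balance", "limit", (x, y) -> x / y));

        double[] sink = new double[1]; // keeps the sum from being optimized away
        double aggNs = averageNanos(1_000, () -> sink[0] += df.sum("utilization"));

        System.out.printf("add column: %.1f ns, aggregation: %.1f ns%n", addNs, aggNs);
    }
}
```

The timings this prints depend on the machine, the JVM, and the data size, so they are not comparable to the figures above. When timing the Spark side in the same way, keep in mind that DataFrame transformations are lazy, so a measurement only means something if it includes an action that forces the computation.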

As you can see, the independently developed DataFrame performs far better than Spark DataFrame in this standalone comparison: roughly 9 times faster at adding columns and about 180 times faster at aggregation. This tells us that it isn't wise to absentmindedly reach for third-party tools, even if they are extremely well known and seemingly the only right choice. Don't take it for granted that open source is always good.


Topics:
big data, dataframe, big data analytics, spark dataframe, database performance

Opinions expressed by DZone contributors are their own.
