
Big Data Analytics: Is Spark DataFrame Always a Good Choice?


It's not wise to reach for a third-party tool by default, even if it's extremely well known and seemingly the right one to use. Don't take it for granted that open source is always the best choice.

· Big Data Zone ·

At the company I work for, some services are implemented in R. In some cases, especially when calculating 4,000+ variables from personal credit data provided by China's central bank, R performs poorly. That is why I ended up rewriting the R code in Java to speed things up.

At first, Spark DataFrame naturally came to mind, since it stems from R's data frame. But I doubted that it would be fast enough: it is built on top of RDDs, which are designed specifically for distributed environments, whereas all I wanted was a standalone DataFrame with the key functions. This led me to compare Spark DataFrame with a DataFrame I developed independently in Java, measuring column manipulation and aggregation. Below are the results:

  1. Spark DataFrame:

    • Average time spent on adding new columns: 1.0246680102E7 ns (about 10 ms)

    • Average time spent on aggregation: 8559803.424 ns (about 8.6 ms)

  2. Independently developed DataFrame:

    • Average time spent on adding new columns: 1084220.562 ns (about 1.1 ms)

    • Average time spent on aggregation: 47053.53 ns (about 0.05 ms)

As you can see, the independently developed DataFrame performs far better than Spark DataFrame in this standalone setting: roughly 9 times faster when adding new columns and roughly 180 times faster on aggregation. This tells us that it isn't wise to reach for a third-party tool by default, even if it's extremely well known and seemingly the only right tool to use. Don't take it for granted that open source is always good.
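To give a feel for what a standalone DataFrame with "key functions" can look like, here is a minimal sketch in Java. It is not the implementation measured above: the class name `MiniFrame`, its methods, the column names, and the row count are all assumptions made for illustration. It stores each column as a plain `double[]` and times the two operations compared in this article with `System.nanoTime()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Hypothetical in-memory, column-oriented frame; an illustration only. */
final class MiniFrame {
    private final int rows;
    private final Map<String, double[]> columns = new LinkedHashMap<>();

    MiniFrame(int rows) {
        this.rows = rows;
    }

    /** Adds (or replaces) a column whose values are computed row by row. */
    MiniFrame withColumn(String name, IntToDoubleFunction valueAtRow) {
        double[] values = new double[rows];
        for (int i = 0; i < rows; i++) {
            values[i] = valueAtRow.applyAsDouble(i);
        }
        columns.put(name, values);
        return this;
    }

    /** Returns the backing array of a column, or fails if it does not exist. */
    double[] column(String name) {
        double[] values = columns.get(name);
        if (values == null) {
            throw new IllegalArgumentException("No such column: " + name);
        }
        return values;
    }

    /** Aggregation: sum of one column. */
    double sum(String name) {
        double total = 0.0;
        for (double v : column(name)) {
            total += v;
        }
        return total;
    }

    /** Aggregation: arithmetic mean of one column. */
    double mean(String name) {
        return rows == 0 ? Double.NaN : sum(name) / rows;
    }

    /** Average wall-clock nanoseconds per run, measured after some warm-up runs. */
    static double averageNanos(int warmups, int runs, Runnable task) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / (double) runs;
    }

    public static void main(String[] args) {
        // Assumed workload: 10,000 rows and one derived column.
        MiniFrame frame = new MiniFrame(10_000).withColumn("balance", i -> i * 1.5);
        double[] balance = frame.column("balance");

        double addNs = averageNanos(100, 1_000,
                () -> frame.withColumn("doubled", i -> balance[i] * 2));
        double aggNs = averageNanos(100, 1_000, () -> frame.sum("doubled"));

        System.out.printf("add column: %.1f ns, aggregation: %.1f ns%n", addNs, aggNs);
    }
}
```

A hand-rolled timing loop like this is only a rough guide, because the JIT compiler can optimize away work whose result is never used. For numbers you intend to rely on, a benchmark harness such as JMH is the safer choice.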


Topics:
big data, dataframe, big data analytics, spark dataframe, database performance

Opinions expressed by DZone contributors are their own.
