
Some Lessons of Spark and Memory Issues on EMR


As Spark heavily utilizes cluster RAM to maximize speed, it's important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.


In the last few days, we went through several performance issues with Spark as our data grew dramatically. The easiest workaround might have been increasing the instance sizes. However, as scaling up is not a scalable strategy, we looked for alternate ways to get back on track when one of our Spark/Scala-based pipelines started to crash.

Some Details About Our Process

We run a Scala (2.1)-based job on a Spark 2.2.0/EMR 5.9.0 cluster with 64 r3.xlarge nodes. The job analyzes several data sources, each of a few hundred GB (and growing), using the DataFrame API, and outputs data to S3 in ORC format.

Analyzing the logs of the crashed cluster resulted in the following error:

WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal):
ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.
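The error makes sense once you account for how YARN sizes containers: in Spark 2.x, when spark.yarn.executor.memoryOverhead is unset, YARN reserves a default off-heap overhead of 10% of the executor memory, with a 384 MB floor, on top of the heap. A minimal Scala sketch of that default (the 9 GB executor figure is an assumption for illustration, not our actual setting):

```scala
// Default off-heap overhead YARN adds per executor container in Spark 2.x:
// max(10% of spark.executor.memory, 384 MB).
def defaultMemoryOverheadMb(executorMemoryMb: Int): Int =
  math.max((executorMemoryMb * 0.10).toInt, 384)

// A hypothetical 9 GB executor heap gets ~921 MB of overhead,
// so the YARN container limit lands around 10 GB -- and any off-heap
// usage beyond that overhead gets the container killed.
val overheadMb = defaultMemoryOverheadMb(9216)
```

Anything the JVM uses beyond the heap (netty buffers, ORC/compression native allocations, etc.) must fit into that overhead, which is why the suggested fix is to raise it.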

How Did We Recover?

Setting spark.yarn.executor.memoryOverhead to 2,500 (the maximum on r3.xlarge, the instance type we used) did not make a major change.

spark-submit --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=2500 ...

We raised the bar by disabling the virtual and physical memory checks and increasing the virtual to physical memory ratio to 4.

[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "classification": "yarn-site",
    "properties": {
      "yarn.nodemanager.vmem-pmem-ratio": "4",
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
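The effect of the ratio setting is simple arithmetic: YARN kills a container whose virtual memory exceeds its physical allocation times yarn.nodemanager.vmem-pmem-ratio (unless the vmem check is disabled entirely). A sketch, using the 10.4 GB physical allocation from the log above:

```scala
// YARN's virtual memory ceiling per container:
// physical allocation * yarn.nodemanager.vmem-pmem-ratio.
// Exceeding it gets the container killed when vmem-check-enabled is true.
def vmemLimitGb(pmemAllocationGb: Double, vmemPmemRatio: Double): Double =
  pmemAllocationGb * vmemPmemRatio

val limit = vmemLimitGb(10.4, 4.0) // 41.6 GB of virtual memory allowed
```

With both checks disabled, YARN stops killing containers on these limits altogether, which buys headroom at the cost of trusting the OS to handle memory pressure.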

However, this only worked until we hit the next limit (Spark tasks were probably killed when they tried to overuse physical memory), with the following error:

ExecutorLostFailure (executor exited caused by one of the running tasks) Reason: Container 
marked as failed: container_ on host:. Exit status: -100. Diagnostics: Container released on a 
*lost* node

This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.
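The intuition: for a fixed input size, per-task memory pressure scales inversely with the partition count, so doubling the partitions roughly halves each partition's footprint. A rough sizing sketch (the 300 GB input figure is an assumption for illustration):

```scala
// Rough per-partition size for an evenly distributed DataFrame,
// assuming the input splits uniformly across partitions.
def partitionSizeMb(totalInputGb: Double, numPartitions: Int): Double =
  totalInputGb * 1024 / numPartitions

val before = partitionSizeMb(300, 1024) // 300.0 MB per partition
val after  = partitionSizeMb(300, 2048) // 150.0 MB per partition
// In the job itself, the change is just: df.repartition(2048)
```

The trade-off is more, smaller tasks: scheduling overhead grows slightly, but each task now fits comfortably inside its container.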

Right now, we run at full steam ahead. When we hit the next limit, it may be worth an update.

As Spark heavily utilizes cluster RAM as an effective way to maximize speed, it is highly important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.

