DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Persistent Memory for AI Agents Using LangChain's Deep Agents
  • The Serverless Illusion: When “Pay for What You Use” Becomes Expensive
  • Memory Optimization and Utilization in Java 25 LTS: Practical Best Practices
  • Stateful AI: Streaming Long-Term Agent Memory With Amazon Kinesis

Trending

  • AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work
  • The Hidden Latency of Autoscaling
  • Why Good Models Fail After Deployment
  • Bringing Intelligence Closer to the Source: Why Real-Time Processing is the Heart of Edge AI

Some Lessons of Spark and Memory Issues on EMR

As Spark heavily utilizes cluster RAM to maximize speed, it's important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.

By 
Moshe Kaplan user avatar
Moshe Kaplan
·
Mar. 06, 18 · Opinion
Likes (3)
Comment
Save
Tweet
Share
24.0K Views

Join the DZone community and get the full member experience.

Join For Free

In the last few days, we went through several performance issues with Spark as data grew dramatically. The easiest go-around might be increasing the instance sizes. However, as scaling up is not a scalable strategy, we were looking for alternate ways to back to track, as one of our Spark/Scala-based pipelines started to crash.

Some Details About Our Process

We run a Scala (2.1)-based job on a Spark 2.2.0/EMR 5.9.0 cluster with 64 r3.xlarge nodes. The job analyzes several data sources, each of a few hundred GB (and growing), using the DataFrame API and output data to S3 using ORC format.

Analyzing the logs of the crashed cluster resulted in the following error:

WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal):
ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

How Did We Recover?

Setting the spark.yarn.executor.memoryOverhead to 2,500 (the maximum on the instance type we used r3.xlarge) did not make a major change.

spark-submit --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=2500 ...

We raised the bar by disabling the virtual and physical memory checks and increasing the virtual to physical memory ratio to 4.

[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "classification": "yarn-site",
    "properties": {
      " yarn.nodemanager.vmem-pmem-ratio": "4",
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]

However, this made the magic till hitting the next limit (probably spark tasks were killed when they trying to abuse the physical memory) with the following error:

ExecutorLostFailure (executor exited caused by one of the running tasks) Reason: Container 
marked as failed: container_ on host:. Exit status: -100. Diagnostics: Container released on a 
*lost* node

This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048). That reduced the needed memory per partition.

Right now, we run at full steam ahead. When we hit the next limit, it may worth an update.

As Spark heavily utilizes cluster RAM as an effective way to maximize speed, it is highly important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.

Memory (storage engine)

Published at DZone with permission of Moshe Kaplan. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Persistent Memory for AI Agents Using LangChain's Deep Agents
  • The Serverless Illusion: When “Pay for What You Use” Becomes Expensive
  • Memory Optimization and Utilization in Java 25 LTS: Practical Best Practices
  • Stateful AI: Streaming Long-Term Agent Memory With Amazon Kinesis

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook