Some Lessons of Spark and Memory Issues on EMR

As Spark heavily utilizes cluster RAM to maximize speed, it's important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.

By Moshe Kaplan · Mar. 06, 2018 · Opinion

In the last few days, we went through several performance issues with Spark as our data grew dramatically. The easiest workaround might be increasing the instance sizes; however, as scaling up is not a scalable strategy, we looked for other ways to get back on track once one of our Spark/Scala-based pipelines started to crash.

Some Details About Our Process

We run a Scala (2.11)-based job on a Spark 2.2.0/EMR 5.9.0 cluster with 64 r3.xlarge nodes. The job analyzes several data sources, each a few hundred GB in size (and growing), using the DataFrame API, and writes the results to S3 in ORC format.
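For context, the pipeline's shape is roughly the following minimal sketch; the bucket names, paths, and column names are hypothetical placeholders, not our actual job:

// Minimal sketch of the pipeline's shape. Bucket names, paths, and
// column names are hypothetical placeholders, not the actual job.
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()

    // Each source is a few hundred GB and growing.
    val events = spark.read.orc("s3://our-bucket/input/events/")
    val users  = spark.read.orc("s3://our-bucket/input/users/")

    // The analysis itself (a join/aggregate stands in for the real logic).
    val result = events.join(users, "userId").groupBy("userId").count()

    // Output to S3 in ORC format.
    result.write.orc("s3://our-bucket/output/")

    spark.stop()
  }
}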

Analyzing the logs of the crashed cluster turned up the following error:

WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal):
ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory 
used. Consider boosting spark.yarn.executor.memoryOverhead.

How Did We Recover?

The error itself points at spark.yarn.executor.memoryOverhead: the 10.4 GB cap is the YARN container size, i.e., the executor memory plus this overhead (which defaults to 10% of the executor memory, with a 384 MB floor). However, setting spark.yarn.executor.memoryOverhead to 2,500 (the maximum for the r3.xlarge instances we used) did not make a major difference:

spark-submit --deploy-mode cluster --conf spark.yarn.executor.memoryOverhead=2500 ...
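If you'd rather bake the override into the cluster than pass it per job, the same setting can go into EMR's spark-defaults classification; a sketch:

[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.yarn.executor.memoryOverhead": "2500"
    }
  }
]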

We raised the bar by disabling YARN's virtual and physical memory checks and increasing the virtual-to-physical memory ratio to 4:

[
  {
    "classification": "spark",
    "properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "classification": "yarn-site",
    "properties": {
      "yarn.nodemanager.vmem-pmem-ratio": "4",
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
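For reference, one way to apply such a classification list is at cluster-creation time via the EMR CLI; this is only a sketch, with spark-memory.json as a hypothetical file name and the remaining required arguments elided:

aws emr create-cluster \
  --release-label emr-5.9.0 \
  --applications Name=Spark \
  --instance-type r3.xlarge \
  --instance-count 64 \
  --configurations file://./spark-memory.json \
  ...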

However, this only worked its magic until we hit the next limit (Spark tasks were probably killed when they tried to overrun physical memory), with the following error:

ExecutorLostFailure (executor exited caused by one of the running tasks) Reason: Container 
marked as failed: container_ on host:. Exit status: -100. Diagnostics: Container released on a 
*lost* node

This one was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048), which reduced the memory needed per partition.
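In code, this is a small change on whichever DataFrame feeds the memory-heavy stage; a minimal sketch (df and the exact placement are placeholders for our pipeline's specifics):

// Spread the same data across more, smaller partitions so each task
// holds less in memory. df stands for the DataFrame feeding the heavy
// stage; 2048 is the value that matched our data volume.
val repartitioned = df.repartition(2048)

// Alternatively, raise the default parallelism for all shuffles in the job:
spark.conf.set("spark.sql.shuffle.partitions", "2048")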

Right now, we are running full steam ahead. When we hit the next limit, it may be worth another update.

As Spark heavily utilizes cluster RAM as an effective way to maximize speed, it is highly important to monitor it and verify that your cluster settings and partitioning strategy meet your growing data needs.


Published at DZone with permission of Moshe Kaplan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

