
When Spark Jobs On Cassandra Start Hanging

How to keep your Spark jobs from hanging when an Apache Cassandra node goes down.



This was painfully obvious once I saw the executor logs and the database schema, but it had me befuddled at first, even though the change in behavior with a single node down should have given it away.

The code was simple: read a bunch of information, do some minor transformations, and flush to Cassandra. Nothing crazy. But during the user's fault-tolerance testing, the job would seemingly hang indefinitely whenever a node was down.
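To make the shape of the job concrete, here is a minimal sketch of that read-transform-write pattern with the Spark Cassandra connector. The keyspace, table, column names, and contact point are made up for illustration; this is not the user's actual code.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object FlushToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("flush-to-cassandra")
      .set("spark.cassandra.connection.host", "10.0.0.1") // assumed contact point

    val sc = new SparkContext(conf)

    sc.cassandraTable("ks", "events_raw")            // read a bunch of information
      .map(row => (row.getString("id"),              // minor transformation:
                   row.getString("payload").trim))   //   tidy up the payload
      .saveToCassandra("ks", "events_clean",         // flush back to Cassandra
                       SomeColumns("id", "payload"))

    sc.stop()
  }
}
```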

If One Node Takes Down Your App, Do You Have Any Replicas?

That was it. In fact, that's always it: if something mysteriously "just stops," you usually have a small cluster and no replicas (RF 1). Now, one may ask why anyone would ever run Cassandra with a single replica, and while I concede that is a very fair question, it was the case here.

For example, if I have RF 1 and three nodes, a row I write only goes to one of those nodes. If that node dies, how would I retrieve the data? Wouldn't the read just fail? OK, wait a minute, why is the job not failing?
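For illustration, this is the kind of schema that gets you into that spot; the keyspace name and contact point are hypothetical. With replication_factor set to 1, every row lives on exactly one node.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val connector = CassandraConnector(new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.1")) // assumed contact point

connector.withSessionDo { session =>
  // RF 1: a single copy of each row, so one dead node means unreadable data
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS ks
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)
}
```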

It’s the Defaults!

This is a bit of misdirection: the other nodes were timing out (probably a slow IO layer). If we read the connector docs, we find query.retry.count, which gives us 10 retries by default, and a default read.timeout_ms of 120000 (which, confusingly, is also the write timeout). That is 1,200,000 milliseconds, or 20 minutes, to fail a single task that keeps timing out. If Spark retries that task 4 times (the default), it could take 80 minutes to fail the job, assuming all the writes time out.
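As a sanity check on that arithmetic, here are the defaults spelled out; spark.task.maxFailures is Spark's own default of 4 task attempts.

```scala
// Back-of-the-envelope math on the defaults described above.
val readTimeoutMs    = 120000L // spark.cassandra.read.timeout_ms default: 2 minutes
val connectorRetries = 10      // spark.cassandra.query.retry.count default
val sparkTaskRetries = 4       // spark.task.maxFailures default

val perTaskMs   = readTimeoutMs * connectorRetries // 1,200,000 ms = 20 minutes per task
val worstCaseMs = perTaskMs * sparkTaskRetries     // 4,800,000 ms = 80 minutes to fail the job
println(s"worst case to fail the job: ${worstCaseMs / 60000} minutes")
```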

The Fix

Short term

  • Don’t let any nodes fall over.
  • Drop read.timeout_ms down to 10 seconds; this will at least let you fail fast (see the sketch after this list).
  • Drop output.batch.size.bytes down. The default is 1 kb (originally I messed this up and wrote 1 mb; thanks to Guda Uma Shanker for the keen eye, as 1 mb would be a horrible default). Halve it until you stop having issues.
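Here is a sketch of those two short-term settings on the connector; the values are the interesting part, the rest is boilerplate.

```scala
import org.apache.spark.SparkConf

// Short-term mitigation: fail fast and write smaller batches.
val conf = new SparkConf()
  .set("spark.cassandra.read.timeout_ms", "10000")       // 10 s instead of the 120 s default
  .set("spark.cassandra.output.batch.size.bytes", "512") // half the 1 kb default; keep halving if issues persist
```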

Long term

  • Use a proper RF of 3 (see the sketch after this list).
  • I still think the default timeout of 120 seconds is way too high. Try 30 seconds instead.
  • Get better IO; local SSDs will usually get you where you need to go.
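Sketched as code, the long-term configuration changes look something like this. The keyspace name and contact point are hypothetical, and after raising RF you still need to run nodetool repair so the new replicas actually receive the existing data.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.1") // assumed contact point
  .set("spark.cassandra.read.timeout_ms", "30000")    // 30 s instead of the 120 s default

CassandraConnector(conf).withSessionDo { session =>
  // Bump the keyspace to three replicas so a single node failure no longer
  // makes rows unreadable; follow up with `nodetool repair` on each node.
  session.execute(
    """ALTER KEYSPACE ks
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""".stripMargin)
}
```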


