
Talend and Apache Spark: Debugging and Logging Best Practices


Learn how to find out what went wrong with your Spark job, and where you should look for more information that will help you resolve your issue.


So far, our journey with Apache Spark and Talend has been a fun and exciting one. The first three posts in this series provided an overview of how Talend works with Apache Spark, the similarities between Talend and Spark Submit, the configuration options available for Spark jobs in Talend, and how to tune Spark jobs for performance. If you haven't already read them, you should do so before getting started here. Start with: "Talend & Apache Spark: A Technical Primer"; "Talend vs. Spark Submit Configuration: What's the Difference?"; "Apache Spark and Talend: Performance and Tuning."

To finish this series, we're going to talk about logging and debugging. When starting your journey with Talend and Apache Spark, you may have run into an error like the one below in your console log:

    "org.apache.spark.SparkContext - Error initializing SparkContext. 
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master"

How do you find out what caused this? Where should you be looking for more information? Did your Spark job even run? In the following sections, I will go over how you can find out what happened with your Spark job, and where you should look for more information that will help you resolve your issue.

Resource Manager Web UI

When you get an error message like the one above, you should always check the Resource Manager Web UI page first to locate your application, and see what errors may be reported in there.
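If you prefer the command line, the same information can be pulled with the YARN CLI, assuming you have shell access to a cluster node. This is a sketch rather than a Talend-specific procedure; the state filter and application ID below are examples:

```shell
# List applications that have finished (successfully or not) so you can
# locate yours, then pull the detailed report for a specific one.
# The application ID is an example; substitute your own.
yarn application -list -appStates FAILED,KILLED,FINISHED
yarn application -status application_1563062490123_0020
```

The status report includes the final state, diagnostics message, and tracking URL, which often points you to the same details as the Resource Manager Web UI.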


Once you locate your application, you'll see in the bottom-right corner an option to retrieve the container logs: click the "logs" link next to each attempt to get more information about what happened.

In my experience, the Web UI alone never provides all the logging information I need, so it is better to log in to one of your cluster edge nodes and use the YARN command-line tool to grab all the logging information for your containers and write it to a file, like below.
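The exact invocation depends on your distribution and log-aggregation settings, but a typical form of the YARN log-retrieval command looks like this. The application ID is taken from the example logs in this article, and the output file name is just an example:

```shell
# Fetch the aggregated container logs for a YARN application and
# write them to a local file for inspection (IDs/paths are examples).
yarn logs -applicationId application_1563062490123_0020 > application_1563062490123_0020.log
```

Note that `yarn logs` only returns aggregated logs after the application has finished (or if log aggregation is enabled on the cluster); for a still-running job, the Resource Manager UI links are the better option.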

Interpreting the Spark Logs (Spark Driver)

Once you have retrieved the container logs through the command shown above and have the logs from your Studio, you need to interpret them to see where your job may have failed. The first place to start is the Studio logs, which contain the logging information for the Apache Spark driver. The first few lines indicate that the Spark driver has started:

[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 40238.
[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriverActorSystem' on port 34737.

Now, the next part to look for in the Studio log is the information about the Spark Web UI, which is started by the Spark driver:

[INFO ]: org.apache.spark.ui.SparkUI - Started SparkUI at http://<ip_address>:4040

The next step you should see in the logs is the libraries needed by the executors being uploaded to the Spark cache:

[INFO ]: org.apache.spark.SparkContext - Added JAR ../../../cache/lib/1223803203_1491142638/talend-mapred-lib.jar at  spark://<ip_address>:40238/jars/talend-mapred-lib.jar with timestamp 1463407744593

Once all the information needed by the executors has been uploaded to the Spark cache, you will see the request for the Application Master logged in the Studio:

[INFO ]: org.apache.spark.deploy.yarn.Client - Will allocate AM container, with 896 MB memory including 384 MB overhead
[INFO ]: org.apache.spark.deploy.yarn.Client - Submitting application 20 to ResourceManager
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: ACCEPTED)
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: RUNNING)

After this, the executors are registered and processing starts:

[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) 
(hostname2:41992) with ID 2
[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) 
(hostname1:45855) with ID 1
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 3) in 59 ms on hostname1 (1/2)
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on hostname2 (2/2)

At the end of the Spark driver log, you'll see the final stage, which is shutdown and cleanup:

[INFO ]: org.apache.spark.util.ShutdownHookManager - Shutdown hook called
[INFO ]: org.apache.spark.util.ShutdownHookManager - Deleting directory /tmp/spark-5b19fa47-96df-47f5-97f0-cf73375e59e1
[INFO ]: akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.

Depending on the issue, your log may report every one of these stages, or some of them may never be reached.

This exercise should give you a very good indication of where the actual failure happened and whether it was an issue encountered by the Spark driver. As a note, the Spark driver logs shown above are only available through the Studio console log if the job is run in YARN-client mode. If YARN-cluster mode is used, this logging information will not be available in the Studio, and you will have to use the YARN command mentioned earlier to get it.
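The driver-side lifecycle above can also be triaged mechanically. As a small sketch (the function name and stage labels are mine, not Talend's or Spark's; the patterns simply match the log lines shown in this section), a shell function can report which stages a driver log actually reached, so the first missing stage points at where the failure occurred:

```shell
# Report which Spark driver lifecycle stages appear in a log file.
# Usage: check_driver_log <path-to-driver-log>
check_driver_log() {
  log="$1"
  if grep -q "Successfully started service 'sparkDriver'" "$log"; then echo "1. driver started"; fi
  if grep -q "Started SparkUI" "$log"; then echo "2. Spark UI up"; fi
  if grep -q "Added JAR" "$log"; then echo "3. jars uploaded to Spark cache"; fi
  if grep -q "state: RUNNING" "$log"; then echo "4. YARN application running"; fi
  if grep -q "Registered executor" "$log"; then echo "5. executors registered"; fi
  if grep -q "Shutdown hook called" "$log"; then echo "6. clean shutdown"; fi
}
```

For example, a log that only contains the first two stages tells you the job never got as far as submitting anything to YARN.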

Interpreting the Spark Logs (Container Logs)

Now, let's move on to the container logs (assuming your application started running at the cluster level) and start interpreting the information. The first thing you will see in these logs is the Application Master starting:

05/05/18 16:09:13 INFO ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]

Then you will see the connection to the Spark driver being established:

05/05/18 16:09:14 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
05/05/18 16:09:14 INFO ApplicationMaster: Driver now available: <spark_driver>:40238

It will then proceed with requesting resources:

05/05/18 16:09:15 INFO YarnAllocator: Will request 2 executor containers,  
each with 1 cores and 1408 MB memory including 384 MB overhead
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)

Then once it gets the resources it will start launching the containers:

05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000002 for on host hostname1
05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000003 for on host hostname2

It then prints the container classpath and the command used to launch the executor:

CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$YARN_HOME/*<CPS>$YARN_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=40238' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@163.172.14.98:40238 --executor-id 1 --hostname hostname1 --cores 1 --app-id application_1463062490123_0020 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

You'll now see the executors starting up:

05/05/18 16:09:19 INFO Remoting: Starting remoting
05/05/18 16:09:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@tbd-bench-09:38635]

And then the executor communicating back to the Spark Driver:

05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@163.172.14.98:40238
05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Successfully registered with driver

Then it proceeds to retrieve the libraries from the Spark cache and update the classpath:

05/05/18 16:09:21 INFO Utils: Fetching spark://<ip_address>:40238/jars/snappy-java-1.0.4.1.jar to /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/fetchFileTemp7487566384982349855.tmp
05/05/18 16:09:21 INFO Utils: Copying /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/1589831721463407744593_cache to /data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar
05/05/18 16:09:21 INFO Executor: Adding file:/data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar to class loader

Next you will see the Spark executor start running:

05/05/18 16:09:22 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)

Finally, the Spark executor shuts down and cleans up once the processing is done:

05/05/18 16:09:23 INFO Remoting: Remoting shut down
05/05/18 16:09:23 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

As I mentioned in the previous section, your Spark job might go through all of these stages, or only some of them, depending on the issue encountered. Understanding the different steps involved, and being able to interpret the information in the logs, will give you a better understanding of why a Spark job failed.
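The container-side lifecycle can be checked the same way as the driver side. As before, this is only a sketch under my own naming; each pattern matches one of the container log lines shown in this section, and the first missing stage narrows down where things went wrong on the cluster side:

```shell
# Report which container-side lifecycle stages appear in a log file.
# Usage: check_container_log <path-to-container-log>
check_container_log() {
  log="$1"
  if grep -q "Registered signal handlers" "$log"; then echo "1. Application Master started"; fi
  if grep -q "Driver now available" "$log"; then echo "2. Spark driver reachable"; fi
  if grep -q "Launching container" "$log"; then echo "3. executor containers launched"; fi
  if grep -q "Successfully registered with driver" "$log"; then echo "4. executor registered with driver"; fi
  if grep -q "Running task" "$log"; then echo "5. tasks running"; fi
  if grep -q "Remoting shut down" "$log"; then echo "6. executor shut down cleanly"; fi
}
```

A log that stops after stage 1, for instance, suggests the Application Master never managed to reach the driver, which often points at a network or firewall issue between the cluster and the machine running the Studio.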

Spark History Web UI

In previous blogs, I mentioned that, as a best practice, you should always enable Spark event logging in your jobs so that the information in the Spark History Web UI is available even after the job ends.
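Concretely, this means setting the Spark event-log properties for your jobs (in Talend these are passed as advanced Spark configuration). The properties below are standard Spark settings; the HDFS directory is only an example and must match the directory your History Server reads:

```
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark/applicationHistory
spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory
```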

This web interface is the next place you should always check for more information about your job's processing. The Spark History Web UI does a great job of giving a visual representation of a Spark job's processing. When you navigate to it, you will see the following tabs:

First, I suggest starting with the Environment tab to verify all the environment information that was passed to your job:

The next step is to look at the timeline of events in Spark. When you click on the Jobs tab, you will see how your application was executed after the executors were registered:

As you can see in the image above, it shows the addition of all the executors that were requested, the execution of the jobs your application was split into, whether any of them ran in parallel, and whether any of them failed. You can then click on one of those jobs to get further information (example below):

Here you'll see which stages ran in parallel with no dependencies on each other, and which stages depended on others and couldn't start until the earlier ones finished. The Spark History Web UI then lets you look further into those stages and see the different tasks that were executed:

Here you are looking at the different partitions (in this case, two) and how they are distributed among the executors. You'll also see how much of the execution time was spent on shuffling versus actual computation. Furthermore, if your Spark job performs joins, each job in the Web UI shows an execution DAG visualization that lets you easily determine the type of join that was used:

As a final step, make sure to check the Executors tab, which gives an overview of all the executors used by your Spark job: how many tasks each one processed, the amount of data processed, the amount of shuffling, and how much time was spent on tasks and in garbage collection:

All of this information is important, as it will lead you to a better understanding of the root cause of a potential issue and of the corrective action you should take.

Conclusion

This concludes my blog series on Talend with Apache Spark. I hope you enjoyed this journey, and had as much fun reading the blogs as I had while putting all this information together! I would love to hear from you on your experience with Spark and Talend, and if the information within the blogs was useful, so feel free to post your thoughts and comments below.

