
How to Configure ELK Stack for Telemetrics on Apache Spark


While apps generally have their own monitoring tools, having a single solution for gaining insight is a good goal. A lot of teams choose the Elastic Stack to solve this.


In this blog, I want to go over how to set up and deploy a Talend Spark Streaming job into a new Elastic Stack instance. Spark is the engine of choice for near real-time processing, not only for Talend but also for many organizations that need large-scale, lightning-fast data processing. The Elastic Stack is a highly versatile and widely adopted suite of monitoring tools that works perfectly for this scenario.

The ELK Stack is comprised of Elasticsearch for indexing, Logstash for aggregating data, and Kibana for visualization. In the following tutorial, we will install and configure the stack to read Spark Streaming metrics and display them as visualizations. Talend for Real-time Big Data is used to create the Spark Streaming processes. What we are trying to achieve overall is a holistic insight into all our processes as an organization. Holistic insight is important because a single application may have hundreds of loosely coupled and interconnected processes and systems. Having the ability to recognize issues like bottlenecks, data loss, and underutilized resources without needing to switch between multiple pages, software, or charts saves individuals and the organization measurable time and effort.

The way that we integrate Talend and Spark into this single solution is through JMX. JMX is a generic interface, not tightly coupled with the Elastic Stack and not exclusive to Talend or Spark, so it can serve as a bridge from any Java-based application to many monitoring tools. What this gives you is all available metrics for a Spark Streaming job synchronized, indexed, and visualized in near real-time in your analytics or monitoring tool of choice.


Getting Started With Logstash

Unzip logstash-5.3.0.zip to C:\elk\logstash-5.3.0.

Test the install

cd C:\elk\logstash-5.3.0\bin
logstash.bat -e "input { stdin { } } output { stdout {} }"
hello world

If everything is OK, it will print the following with the current date and time:

2017-03-30T17:12:32.876Z MYHOSTNAME hello world

1) Install the JMX Plugin

The JMX plugin is specifically for retrieving data from a running Java program. This does not come with Logstash and is maintained by a third party.

cd C:\elk\logstash-5.3.0\bin
logstash-plugin.bat install logstash-input-jmx

2) Create the Configuration Files


First, edit the logstash configuration file.


Add the following lines to jmx.conf:

input {
  jmx {
    path => "C:/elk/logstash-5.3.0/config/jmx/"
    polling_frequency => 30
    type => "jmx"
    nb_thread => 2
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}


- path: Folder location where configurations are stored. REQUIRED.
- polling_frequency: The interval between two JMX metrics retrievals, in seconds.
- type: Used as the type in Elasticsearch.
- nb_thread: Number of threads used to retrieve metrics.

Next, create a file for the plugin configuration in the folder referenced by the path setting above. Add the following lines to jmx-input-conf:

{
  "host" : "localhost",
  "port" : 54321,
  "queries" : [
    {
      "object_name" : "kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchRequestRateAndTimeMs",
      "attributes" : ["FifteenMinuteRate","FiveMinuteRate","MeanRate","OneMinuteRate","Max","Count","Min","Mean","StdDev"],
      "alias" : "kafkaconsumer_fetchrequest_rate"
    },
    {
      "object_name" : "kafka.producer:type=producer-metrics,client-id=producer-1",
      "attributes" : ["batch-size-avg","incoming-byte-rate","outgoing-byte-rate","io-ratio","request-rate"],
      "alias" : "kafkaproducer_metrics"
    }
  ]
}


- object_name: Can be found by opening JConsole and selecting MBeans.
- attributes: The values under the MBean that you would like Logstash to collect. Each is collected as a separate entry.
- alias: Appears in Elasticsearch as a field value.

A quick note before we continue: the object names will change depending on the Spark version and distribution. Wildcards are possible; for instance, an object_name of "kafka.consumer:type=*" retrieves all of the objects under the kafka.consumer MBean. This configuration may need to be changed for the tutorial after inspecting JConsole during the Talend and Spark section.
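To get a feel for how such a wildcard behaves, here is a rough sketch using plain glob-style matching in Python (this is only an analogy, not the plugin's own matching code; the MBean names are examples from this tutorial):

```python
# Illustrative only: glob-style matching similar in spirit to what a
# wildcard object_name achieves against a set of MBean names.
from fnmatch import fnmatch

mbeans = [
    "kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchRequestRateAndTimeMs",
    "kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec",
    "kafka.producer:type=producer-metrics,client-id=producer-1",
]

pattern = "kafka.consumer:type=*"
matched = [name for name in mbeans if fnmatch(name, pattern)]
print(matched)  # both kafka.consumer MBeans, but not the producer
```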

3) Start Logstash With Configurations

Save the main configuration above as jmx.conf in the config folder, then start Logstash with it:

cd C:\elk\logstash-5.3.0\bin
logstash.bat -f C:\elk\logstash-5.3.0\config\jmx.conf

Getting Started With Elasticsearch

Unzip elasticsearch-5.0.0.zip to C:\elk\elasticsearch-5.0.0:

1) Start

cd C:\elk\elasticsearch-5.0.0\bin
elasticsearch.bat

2) Create an Analyzer

Here we add an Analyzer to the field we are using to search our logs in Kibana. This makes certain strings easier to search: the stop analyzer breaks the field into tokens on characters like periods and slashes, which works well for the metric_path field.
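For intuition, here is a rough simulation of what the stop analyzer does to a metric_path value (this is a sketch, not Elasticsearch's actual implementation): it lowercases the text, splits on non-letter characters, and drops common English stop words.

```python
# Rough simulation of Elasticsearch's "stop" analyzer: lowercase,
# split on non-letter characters, drop English stop words.
# For intuition only -- not the real implementation.
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def stop_analyze(text):
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

print(stop_analyze("kafkaproducer_metrics.incoming-byte-rate"))
# ['kafkaproducer', 'metrics', 'incoming', 'byte', 'rate']
```

This is why a Kibana search such as metric_path:(incoming AND byte AND rate) can find a metric even though the raw field contains hyphens and periods.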

Run the commands using CURL or a REST client. Replace 2017.03.30 with the current date.

curl -s -XPUT http://localhost:9200/logstash-2017.03.30/ -d '{ "mappings": { "jmx": { "properties": { "metric_path": { "type": "text", "analyzer": "stop" } } } } }'

To reset an Elasticsearch index, run an XDELETE on the index and then run the XPUT again.

curl -s -XDELETE http://localhost:9200/logstash-2017.03.30

Wherever you see the date 2017.03.30, replace it with the current date. Logstash automatically appends the date to the Elasticsearch index name; this can be changed in jmx.conf.
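As a small sketch of how that daily index name is derived (the logstash-YYYY.MM.dd pattern is Logstash's default; the Python here is purely illustrative):

```python
# Sketch of Logstash's default daily index naming ("logstash-%{+YYYY.MM.dd}").
from datetime import date

def daily_index(d):
    return f"logstash-{d.strftime('%Y.%m.%d')}"

print(daily_index(date(2017, 3, 30)))  # logstash-2017.03.30
```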

Unlocking the Power of Apache Spark in Talend

This process requires Talend for Real-time Big Data. We must create a Streaming Spark Job to run this scenario.

1) Configure the Spark Metrics

Create the file C:\Talend\Spark\metrics.properties

Add the following line to enable Spark's built-in JMX metrics sink:

*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

2) Create the Jobs

The most important aspect of the job creation is to have a Kafka input and Kafka output. These two components are what the object_name property of the jmx-input-conf file is based on. These jobs do require that Kafka is installed.

The prebuilt Jobs can be downloaded HERE

3) Modify the Settings

Update the Spark Configuration for the job. Add the new items to Advanced properties:


"-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"



Update the Advanced settings for the job. Check the Use specific JVM arguments box to add the following parameters:


4) Run the Jobs

After the jobs have been built and the settings changed, we can run them locally. When the Spark Streaming job is running, you can open JConsole to view the metrics.

From Run or CMD, type jconsole. This will bring up the application window if the Java environment variables are configured correctly.

Connect to a Remote Process with the URL localhost:54321
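JConsole resolves that host:port pair into the standard JMX RMI service URL, which is useful to know if you ever point a different JMX client at the job. The helper below is a hypothetical convenience, but the URL format itself is the standard one:

```python
# Builds the standard JMX RMI service URL used by JConsole and other
# JMX clients for a given host and port.
def jmx_service_url(host, port):
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"

print(jmx_service_url("localhost", 54321))
# service:jmx:rmi:///jndi/rmi://localhost:54321/jmxrmi
```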


Getting Started With Kibana

Unzip kibana-5.0.0-windows-x86.zip to C:\elk\kibana-5.0.0-windows-x86

1) Start

cd C:\elk\kibana-5.0.0-windows-x86\bin
kibana.bat

Go to http://localhost:5601/

When you first start Kibana, you should be prompted to set up an Index Pattern. The pattern is detected automatically from the data already in Elasticsearch, which is why we set up Kibana last: if the other processes are not running and collecting information, there will be no initial Index Pattern to select. Once the index has been set, we can open the Discover page to query the logs.

Create the Visualizations

For each visualization, the METRIC and BUCKET will be the same. Set them to the following:

- METRIC: Sum of metric_value_number
- BUCKET: Date Histogram of @timestamp

In the Search field, we need to enter filters for each chart that is being created. Enter the following filters, one per visualization:

  1. metric_path:(incoming AND byte AND rate)
  2. metric_path:(outgoing AND byte AND rate)
  3. metric_path:OneMinuteRate
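The parenthesized filters above require every listed token to appear in the analyzed metric_path field. A rough Python sketch of that AND semantics over already-tokenized paths (this is only an analogy, not Kibana's query engine, and the tokenized paths are illustrative examples):

```python
# Rough sketch of how a filter like metric_path:(incoming AND byte AND rate)
# selects documents: every required token must appear in the analyzed field.
def matches(tokens, required):
    return set(required).issubset(tokens)

docs = [
    ["kafkaproducer", "metrics", "incoming", "byte", "rate"],
    ["kafkaproducer", "metrics", "outgoing", "byte", "rate"],
    ["kafkaconsumer", "fetchrequest", "rate", "oneminuterate"],
]

incoming = [d for d in docs if matches(d, ["incoming", "byte", "rate"])]
print(len(incoming))  # 1
```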

Save each of the Visualizations so you can add them to a dashboard.

This is an example dashboard that incorporates 4 metrics from our Talend Spark Streaming job:


Looking at a black box wondering what is going on inside and how it's affecting your operations is about as effective as looking at 10 separate dashboards and trying to combine the figures in your head. While each application generally has its own monitoring tools, having a single solution for attaining insight is the goal. A lot of organizations see this need and have chosen the Elastic Stack to solve this challenge.

Metrics for Spark Streaming are available natively but have their own separate interface which is not generally utilized by anyone outside of IT. And while Talend provides monitoring capabilities for its own solution out of the box (including an Elastic Stack), it is much better for an organization to integrate its data management platform metrics into their own monitoring suite.


Example URL for Elasticsearch jmx mapping:

http://localhost:9200/logstash-2017.03.30/_mapping/jmx

Reset Elasticsearch

To remove all of the indexed logs, you need to delete the actual index through the REST API and then re-apply the analyzer mapping. Make sure you stop the Talend Spark process first so that no logs are written while you are doing this.

curl -s -XDELETE http://localhost:9200/logstash-2017.03.30/
curl -s -XPUT http://localhost:9200/logstash-2017.03.30/ -d '{ "mappings": { "jmx": { "properties": { "metric_path": { "type": "text", "analyzer": "stop" } } } } }'

Reset Kafka

To reset Kafka, you must stop the program and remove the log files. The location of these files is set via configuration and can be changed.

rm -rf C:\kafka_2.10-\topic-logs\*
rm -rf C:\kafka_2.10-\zookeeper-logs\*



Published at DZone with permission of

