
Schema and Spark Fun With HDF 3.1


Get tips on installing and setting up Schema Registry and the other HDF 3.1 tools for executing Apache Spark jobs.



I want to easily integrate Apache Spark jobs with my Apache NiFi flows. Fortunately, with the release of HDF 3.1, I can do that via Apache NiFi's ExecuteSparkInteractive processor.

For the first step, let me set up a CentOS 7 cluster with HDF 3.1 by following the well-written Hortonworks installation guide.

With the magic of time-lapse photography, we instantly have a new cluster of goodness.

It is important to note the new NiFi Registry for version control and more. We also get the new Kafka 1.0, an updated SAM, and the ever-important updated Schema Registry.

The star of the show today is Apache NiFi 1.5.

My first step is to add a Controller Service (LivySessionController).

Then, we point the controller at the Apache Livy server. You can find its host in your Ambari UI; it listens on port 8999 by default. For my session, I am writing Python, so I picked pyspark. You can also pick pyspark3 for Python 3 code, spark for Scala, and sparkr for R.
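Under the hood, the controller is simply talking to Livy's REST API. Here is a minimal Python sketch of that conversation; the host name is a placeholder, and everything except the port and the session kinds above is illustrative:

    import requests

    LIVY_URL = "http://your-hdf-node:8999"  # placeholder host, default Livy port

    # "kind" selects the interpreter: pyspark, pyspark3, spark, or sparkr.
    resp = requests.post(
        f"{LIVY_URL}/sessions",
        json={"kind": "pyspark"},
        headers={"X-Requested-By": "demo"},  # needed if Livy's CSRF protection is on
    )
    session = resp.json()
    print(f"session {session['id']} is {session['state']}")  # starts as "starting"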

To execute a Python job, you can pass the code into the ExecuteSparkInteractive processor from a previous processor, or you can put the code inline. I put the code inline.
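For reference, here is the sort of snippet you might paste into the processor's code property. This toy example is mine, not from the original flow; Livy's interactive pyspark sessions pre-bind the SparkContext as sc, so the snippet needs no setup or teardown of its own:

    # Runs inside the Livy pyspark session, where `sc` is already defined.
    data = sc.parallelize(range(1, 1001))
    total = data.sum()
    print("sum of 1..1000 = %d, mean = %.1f" % (total, total / data.count()))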

There are two new features of Schema Registry that I have to mention. The first is version comparison.

You click the COMPARE VERSIONS link, and you get a nice side-by-side comparison UI.

The second is the amazing new Swagger documentation for interactively exploring and testing the Schema Registry APIs.

Not only do you get all the input and output parameters, the full URL, and a curl example; you also get to run the calls live against your server.
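You can hit those same endpoints from a script, too. Here is a quick sketch, assuming HDF's default Schema Registry port of 7788, a placeholder host, and the entities/schemaMetadata response shape from the Schema Registry docs:

    import requests

    REGISTRY = "http://your-hdf-node:7788/api/v1/schemaregistry"

    # List every registered schema along with its type and compatibility setting.
    resp = requests.get(f"{REGISTRY}/schemas")
    for entity in resp.json().get("entities", []):
        meta = entity["schemaMetadata"]
        print(meta["name"], meta["type"], meta["compatibility"])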

I will be adding an article on how to use Apache NiFi to infer schemas from data with InferAvroSchema and automatically publish those new schemas to the Schema Registry via its REST API.
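As a preview, the registration itself is a two-step REST call: create the schema's metadata, then post the Avro text as a version. The host, schema name, and payload values below are all illustrative, with a toy schema standing in for one produced by InferAvroSchema:

    import json

    import requests

    REGISTRY = "http://your-hdf-node:7788/api/v1/schemaregistry"

    # A toy Avro schema standing in for one inferred by NiFi.
    avro_schema = {
        "type": "record",
        "name": "Sensor",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "temperature", "type": "double"},
        ],
    }

    # Step 1: register the schema's metadata.
    requests.post(f"{REGISTRY}/schemas", json={
        "name": "sensor",
        "type": "avro",
        "schemaGroup": "Kafka",
        "compatibility": "BACKWARD",
        "evolve": True,
    })

    # Step 2: add the Avro definition as the first version.
    requests.post(f"{REGISTRY}/schemas/sensor/versions",
                  json={"schemaText": json.dumps(avro_schema),
                        "description": "inferred by NiFi"})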

Part two of this article will focus on the details of using Apache Livy, Apache NiFi, and Apache Spark with the new processor to call jobs.



