Try the Latest Innovations in Apache Spark and Apache Zeppelin with Hortonworks 2.5 Sandbox

The free sandbox brings some major upgrades: two Spark versions (1.6.2 and a 2.0 tech preview), an updated Zeppelin, and new demos and tutorials.

With the release of the Hortonworks 2.5 Sandbox, several exciting new features are available for Apache Spark and Apache Zeppelin.

Apache Spark Updates

One of the most powerful new features of the Hortonworks 2.5 Sandbox is the ability to run two versions of Spark side by side in the same environment: a Generally Available (GA) Spark 1.6.2 and a Tech Preview (TP) of Spark 2.0. To learn how to run the different versions of Spark, check out the A Lap Around Apache Spark tutorial.

NOTE: Zeppelin does not yet support Spark 2.0. This functionality is coming soon.

Also, a new HBase connector has been added that allows you to ingest HBase datasets straight into a Spark DataFrame. To learn more, see the Spark on HBase tutorial.

Apache Zeppelin Updates

HDP 2.5 adds notebook security and multi-user support to Zeppelin. By enabling a Livy REST server and LDAP/AD user authentication, you can now control which users have access to which notebooks, depending on their roles and needs. Livy also makes cluster utilization more efficient by recycling interpreters that have been inactive for 60 minutes.

Given Zeppelin’s general availability, enterprise readiness, flexibility (30+ interpreters), ease of use, and rich development community, it’s a great time to start exploring how Zeppelin notebooks can accelerate data wrangling, analytics, and data science in your business. If you would like to give Zeppelin a try, check out the Learning Spark with Zeppelin tutorial.

A Monte Carlo Simulation with Spark and Zeppelin

If you are beyond the basics with Zeppelin and Spark and want to explore other notebooks for inspiration, check out the Zeppelin Notebook Gallery or ZeppelinHub.

What’s New in Spark 2.0

With Spark 2.0 TP now available, there are several updates that you should be aware of.

API Unification

  • DataFrame is now an alias for a Dataset of Row objects (Dataset[Row] in Scala).
  • SparkSession replaces SQLContext and HiveContext and wraps SparkContext, which remains reachable from the session. In other words, SparkSession is the new entry point to all Spark features.
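To make the unified entry point concrete, here is a minimal PySpark sketch, assuming a Spark 2.0 installation with PySpark available and a local master. The function name `run_demo`, the app name, and the sample rows are illustrative assumptions and do not come from the sandbox tutorials.

```python
def run_demo():
    """Build a SparkSession, run a DataFrame/SQL query, and touch the SparkContext.

    PySpark is imported lazily so this file can still be loaded on a machine
    where Spark is not installed. All names and data here are illustrative.
    """
    from pyspark.sql import SparkSession

    # One builder covers what SQLContext and HiveContext were used for in 1.x.
    # Adding .enableHiveSupport() to the builder turns on Hive features.
    spark = (
        SparkSession.builder
        .appName("sparksession-demo")   # hypothetical app name
        .master("local[*]")             # assumption: run locally
        .getOrCreate()
    )

    # DataFrame and SQL work both go through the same session object.
    people = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
    people.createOrReplaceTempView("people")
    older = spark.sql("SELECT name FROM people WHERE age > 40")
    names = [row["name"] for row in older.collect()]

    # The underlying SparkContext is still available for RDD-level code.
    total = spark.sparkContext.parallelize([1, 2, 3]).sum()

    spark.stop()
    return names, total
```

Calling `run_demo()` where PySpark is installed should return the names of the people older than 40 together with the sum of the small RDD, all through a single `spark` object.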

Structured Streaming

  • You can manipulate stream data via DataFrames and Datasets.
  • Real-time incremental processing. Conceptually, it’s useful to think of a stream as an infinite DataFrame to which new rows are continually appended.
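The "infinite DataFrame" idea can be modeled without Spark at all. The sketch below is plain Python, not the Structured Streaming API: the class `UnboundedWordCount` and its methods are hypothetical names used only to show that updating a running result with each new batch of rows gives the same answer as recomputing over every row seen so far.

```python
from collections import Counter


class UnboundedWordCount:
    """Toy model of an aggregation maintained incrementally over an unbounded table."""

    def __init__(self):
        self.counts = Counter()  # the continuously updated result table
        self.rows_seen = 0       # how many input rows have arrived so far

    def append(self, new_rows):
        """Process only the newly arrived rows and return the updated result."""
        for line in new_rows:
            self.counts.update(line.split())
        self.rows_seen += len(new_rows)
        return dict(self.counts)


def batch_word_count(all_rows):
    """Reference answer: recompute the word counts from scratch over all rows."""
    counts = Counter()
    for line in all_rows:
        counts.update(line.split())
    return dict(counts)


# Two micro-batches arriving over time (illustrative data).
batches = [["spark zeppelin"], ["spark hbase", "zeppelin spark"]]

stream = UnboundedWordCount()
result = {}
for batch in batches:
    result = stream.append(batch)

all_rows = [line for batch in batches for line in batch]
```

After the loop, `result` matches `batch_word_count(all_rows)`, which is the property that lets a streaming query be written as if it were a batch query over a table that keeps growing.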

Performance Improvements

  • Speedup from Tungsten Phase 2 whole-stage code generation.
  • ORC and Parquet file format improvements.

Stay tuned for more blog posts with details on each of these topics.

Get Started in 4 Steps

  1. Download the HDP Sandbox as a VM image (VMware or VirtualBox) or as a Docker image.
  2. Set up and start the VM image.
  3. Try a Sandbox tutorial, check out the list of free tutorials, or jump directly into a Learning Spark with Zeppelin hands-on tutorial.
  4. Need more help? Visit the Hortonworks Community Connection (HCC) and interact directly with the community and our development team.

Next Steps

If you want a more guided introduction, view the Hadoop Summit Crash Courses.

You can also find the latest set of Spark tutorials and Zeppelin tutorials.

Try Hortonworks Cloud

Don’t have the minimum 8 GB of RAM to allocate to the virtual machine? Looking to try the latest in Hive and Spark on AWS? Try the Hortonworks Cloud Technical Preview, which supports ephemeral workloads for Hive and Spark.

Published at DZone with permission of Robert Hryniewicz, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
