The Summer of Big Data Innovation 2018

A data scientist and DZone Zone Leader explores what's going on in big data, machine learning, cloud, blockchain, and more.

It's been a long, hot summer already, and that's been evident all over. Fortunately, that has kept people inside coding! That must be it, because the number of major upgrades and releases is astounding.

First up, and the biggest release of the decade, is Hadoop 3.1, which has been released into production as Hortonworks Data Platform (HDP) 3.0. This modernized version of Hadoop has turned into a cloud beast. You now have dockerization, GPU support, TensorFlow, ultra-fast SQL, erasure coding for those who know that three copies of their data is too much, and a laundry list of upgraded components.
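
To put the erasure coding point in numbers, here is a minimal sketch. It assumes a Reed-Solomon RS(6,3) layout (six data blocks plus three parity blocks per stripe), which to my knowledge is the default HDFS erasure coding policy in Hadoop 3; check the policy configured on your own cluster. The function names and the 100 TB dataset size are hypothetical and are not part of any Hadoop API.

```python
def replicated_footprint(logical_tb: float, replicas: int = 3) -> float:
    """Raw storage needed when every block is stored `replicas` times."""
    return logical_tb * replicas


def erasure_coded_footprint(
    logical_tb: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw storage needed when each stripe holds data_units + parity_units blocks."""
    return logical_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical_tb = 100.0  # hypothetical dataset size, for illustration only
    print(f"3x replication: {replicated_footprint(logical_tb):.0f} TB raw")
    print(f"RS(6,3) erasure coding: {erasure_coded_footprint(logical_tb):.0f} TB raw")
```

Under these assumptions, 100 TB of data needs 300 TB of raw disk with three-way replication but only 150 TB with RS(6,3), and a stripe can still be rebuilt after losing any three of its nine blocks.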

To me, the ability to run dockerized workloads puts your big data platform on par with cloud compute, making it easier, faster, and more developer-friendly to write big data applications. If that were all, it would still be awesome. You can also write Spark applications as Docker containers, as well as TensorFlow and others. My friend Amol wrote an awesome walkthrough of how to use Cloudbreak to spin up and run a dockerized Spark application running financial libraries.

I installed HDP 3.0 on an OpenStack cluster running CentOS 7, and the installation was smooth. I have been running it since, and I can report that it's pretty awesome. There are upgrades to most components, including Hive to 3.1, HBase to 2.0, Zeppelin to 0.8.0, Spark to 2.3.1, Ambari to 2.7.0, and Kafka to 1.0.1.

Speaking of cloud, which everyone is, Hortonworks has made hybrid cloud a real option for enterprises. This includes expanding relationships with existing cloud vendors such as perennial Hortonworks partner Microsoft. The relationship with Google is very interesting and gives companies more choices.

What is really innovative is Cloudbreak 2.7, which lets you run multiple types of workloads, including Spark, Hive, and NiFi, on multiple clouds.

Enterprises can now use an open source tool for deploying to public and/or private clouds using dynamic configuration, automated scaling, and full security with Kerberos. This upgrade really brings the features you need to make this possible today. Did I mention there were blueprints so you can easily repeat and use DevOps for your process? There are a number of useful blueprints included for Data Science with Spark and Zeppelin, EDW Analytics, and EDW ETL. You can now spin up ephemeral clusters, run some analytics, store final results to S3, and then shut down to avoid pricey cloud costs on unused VMs.

Speaking of cloud, Hortonworks has added another open source tool, Data Steward Studio (DSS), to its DataPlane Service for global data management.

DSS lets you curate, discover, and organize your data assets across multiple types and tiers of data in a hybrid environment. You can understand and audit data asset security and govern the proper usage and lineage of data assets. It doesn't matter if your data is on-premises or in one or more clouds. This is awesome. It really elevates your data lake to the most important place for your data. It also shows off something you will notice this year: Hortonworks has upped its interfaces to be very clean, functional, modern, and UX-centric.

You didn't think I could write an article with no mention of Apache NiFi, did you? Well, it's been upgraded to 1.7.1, along with updates to the subprojects MiNiFi and NiFi Registry. My recent IoT article on multiple device ingestion highlights the useful features. The killer one for NiFi Registry is using Git as a persistence engine for versioned flows.

There are also some other interesting things out there, including the best-named project ever, Circus Train from Hotels.com, which replicates Apache Hive tables between clusters.

TensorFlow had another upgrade, to 1.9; it will probably be at 2.0 before I finish this line. It's hard to avoid Keras at this point, given its tight integration with TensorFlow. Keras is also now supported by Apache MXNet. Let's not forget about ONNX, which has even more interesting models to use.

Often overshadowed by the roller coaster ride that cryptocurrencies are experiencing, blockchain is exploding. All of the major cloud vendors are now doing something with blockchain. As I have previously published on DZone, there are a ton of sites that you can use for working with these currencies. To learn more, see my article on BTC.

To wrap up, I included a picture of the HDP 3.0/TensorFlow/GPU-powered race car that Hortonworks had driving around a race track on the floor of DataWorks Summit in San Jose in late June. I was there to give my talk on Open Computer Vision and help out with the Deep Learning Crash Course. Most of the sessions now have their videos posted, so check them out. You can also read and download the slides here.

Have a great summer of upgrades!
