2016 has been an interesting year: Big Data drew broad general interest when various machine learning algorithms tried to predict who would become president (and were a bit off). We had major updates to Apache Spark (2.0!) and Apache NiFi (1.1), and the roster of Apache Big Data projects just keeps expanding.
My Picks for the Best Technologies of 2016
Apache NiFi: From drones to Dr. Seuss to Twitter to relational databases, NiFi has become the Swiss Army knife of ingest, quick transformation, and edge IoT processing, all with almost no coding, a simple UI, and scalable performance. This is a tool that is hard not to love.
Apache Spark: Constantly evolving, with even better APIs, more committers, more libraries, and real production features from Hortonworks, Databricks, IBM, and more. This is my go-to when I want to quickly run a distributed batch or streaming job. The Scala and Python APIs are great. The Java 8 API is okay, but it's hard to choose after embracing the conciseness of Scala.
Apache Beam: An attempt from Google to unify the mass of streaming frameworks that exploded this year, from the unstoppable Spark to Flink, Apex, and Google's own Cloud Dataflow. This is a technology worth trying out if you are a Java developer.
Apache Kafka: Now at 0.10, with improving tooling, Avro schema support, and production support from Confluent, Hortonworks, and more. This is a tool you must have in your enterprise. Kafka just works and ties together Storm, NiFi, Apex, Flink, Spark, and everything else you can find. If your tool doesn't work with Kafka, it's probably not worth looking at yet.
TensorFlow + Keras: Keras adds nice, modular Python access to TensorFlow, plus some standardization. Google keeps adding improvements, documentation, examples, tutorials, and more to its ever-expanding open source Deep Learning library. It's still at a very early version and is not part of the Apache Software Foundation. Do not run TensorFlow for mission-critical jobs yet, but it's fun to experiment with.
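As a taste of that modular access, here is a minimal Keras sketch that stacks a tiny network on top of a TensorFlow backend. The layer sizes and input width are arbitrary illustrations, and it assumes keras and TensorFlow are installed.

```python
# A tiny feed-forward network defined through Keras's modular layer API.
import numpy as np
from keras import Input
from keras.models import Sequential
from keras.layers import Dense

# Stack layers declaratively; TensorFlow does the heavy lifting underneath.
model = Sequential([
    Input(shape=(4,)),               # 4 input features (arbitrary choice)
    Dense(16, activation="relu"),    # one small hidden layer
    Dense(1, activation="sigmoid"),  # binary-classification style output
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# Untrained predictions on dummy data, just to show the shapes line up.
preds = model.predict(np.zeros((3, 4)), verbose=0)
print(preds.shape)  # (3, 1)
```

That declarative layer stacking, rather than wiring TensorFlow ops by hand, is the standardization Keras brings.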
Deeplearning4j (DL4J): Real production Deep Learning with all the features you need, in Java. This is the library you can run on your production Hadoop and Spark clusters or on a dedicated GPU cluster.
Apache Storm: Lots of people are working on compatible alternatives like Heron, but Storm is past 1.0, runs at enterprise scale, has UI tools, and more tooling is coming. This is the enterprise streaming framework option in Java.
HiveMall: This project has been in the works for a while and is now in the Apache Incubator. It is a library that adds easy machine learning to your SQL for Hive and Spark, as well as for Pig. It's a very cool open source project with some very helpful functions. I am waiting to see, but once more people join this project it could be amazing. I am hoping they can port some of the amazing stuff from Apache MADlib. Why not just use MADlib? Lack of support for Apache Hive is the only reason; it works great on PostgreSQL, Greenplum Database, and HAWQ.
Hortonworks Data Platform 2.5 was released in 2016 and really brought Apache Big Data to an exciting level, with the latest and greatest projects working together under easy security and administration. Updates to Apache Hive, Ambari, Phoenix, and more really brought Big Data into mainstream enterprises.
My Picks for the Best Presentations of 2016
My Top Articles of 2016
Apache NiFi 1.x Cheatsheet: The go-to quick guide for using NiFi.
Using HDFS Cheatsheet: The go-to quick guide for using HDFS.
Processing Real-Time Tweets: Fun topic and fun code.
Atari and TensorFlow: Video games and deep learning, perfect holiday reading.
Streaming SQL Ingest: Loading SQL data into Hadoop.
Best Name of the Year: Parsey McParseface.
A lot of awesome things are coming to Big Data, Hadoop, Spark, Machine Learning, Deep Learning, and IoT in 2017. I can hardly wait.
I will have some post-holiday posts on new IoT devices and robots I will be working with.