
This Week in Hadoop and More: HoliBigDataDays


Here is a wrap-up for the year, with interesting items from Hadoop, Big Data, Spark, Deep Learning, and Machine Learning.


Predictions, Trends, and Fun for 2017

A new year is coming, and it's time to look ahead using what has happened in 2016.

This year is just about over, and there are a few things to watch going forward into 2017.

  1. IoT & Drones
  2. Apache NiFi
  3. Streaming
  4. Deep Learning
  5. Integrated Data Platforms


IoT & Drones

I have been doing a little work on drones and IoT.

I recently worked with a drone pilot on a meetup and ingested some Bebop 2 drone data into Apache Phoenix / HBase using Apache NiFi. There are a lot of sensors, photos, and videos that can be captured by various methods in near real-time for very interesting use cases.

Machine data from robots, devices, UAVs, automated trucks, automated ships, and other connected things will dwarf anything humans can produce. Expect an explosion of data that dwarfs social and mobile data by a factor of N. IoT data tends to arrive in more controlled formats and at more regular intervals, with a lot of diffs and summary data. When I worked at Real-Time Energy Monitoring, we had a lot of small numeric offsets and some summary data. Time series databases, like the ones built on top of HBase, make a lot of sense here. Get yourself a Raspberry Pi 3 and start experimenting. Whether you are 5 or 105, this is a great way to start learning computers, sensors, Python, and IoT. Python, Node.js, and Java are really powerful in IoT.
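To make the "small numeric offsets and some summary data" idea concrete, here is a minimal Python sketch. It is an illustration only, not the format any particular system used: the function names (delta_encode, delta_decode, summarize) and the sample wattage readings are all made up for the example. It stores the first reading in full and each later reading as a difference from the previous one, then computes a small summary for the interval.

```python
from statistics import mean


def delta_encode(readings):
    """Keep the first reading as-is, then store each change from the previous one."""
    if not readings:
        return []
    deltas = [readings[0]]
    for prev, cur in zip(readings, readings[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original readings by accumulating the stored changes."""
    readings = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        readings.append(total)
    return readings


def summarize(readings):
    """Summary data for one interval (hypothetical field names)."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }


# Hypothetical power readings in watts, sampled at a regular interval.
watts = [1200, 1203, 1201, 1201, 1210]
encoded = delta_encode(watts)
decoded = delta_decode(encoded)
summary = summarize(watts)
print(encoded)
print(summary)
```

Because regular-interval sensor values usually change slowly, the offsets stay small, which is part of why this kind of data compresses well in a time series store.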

Apache NiFi How-Tos

On top of NiFi, ThinkBigAnalytics has added Kylo, which will be open source soon and adds advanced ingest features, SQL, and UI features. This is a project to watch in 2017! Again, the primary languages you are working with, besides the built-in Expression Language, are Java and Python.


Streaming

More and faster! All the good streaming is happening in open source. Look for improvements in speed, usage, UI, tools, and features, with a major push from all players. Flink, Apex, and Spark are all pushing ahead quickly. Apache Beam is starting to unite all of them in a unified API. For production workloads, Apache Storm is the mature solution with advanced metrics, UI, and true streaming. Look for very interesting additions to this rock-solid framework in 2017. Java and Scala are your strongest choices here. I expect Python and Go will make some headway in 2017.

Heron is interesting, but there are so many frameworks out there. A vendor needs to adopt it. You can't have streaming without Kafka! Though I am betting the convergence of streaming and IoT will give MQTT a bigger role. MQTT is getting a major push from IBM and it just works.

Deep Learning

Deep learning is exploding, from Google's TensorFlow to DeepLearning4J to a number of other libraries, frameworks, and tools! There are some funky languages in there, but a lot of Python and Java. Deep learning is starting to move onto Big Data clusters as Amazon, on-premises data centers, and others add GPUs to standard cloud and Hadoop servers. It's time to get your CNN on, and by CNN I mean Convolutional Neural Network! I am looking for Keras to be a uniting force for good. DeepLearning4J imports Keras models, which can really help you move deep learning into production workloads. Adam is a one-man force for Deep Learning! I highly recommend buying his book on Deep Learning.

Integrated Data Platforms

I hinted at this above. You need the massive, scale-out, petabyte-level storage of Hadoop, possibly extended with AWS S3 data. You need SQL, NoSQL, security, machine learning, real-time, deep learning, pipelines, streaming, structured, unstructured, and semistructured data, logs, JSON, and everything else in one platform. Integrated Data Platforms like HDP+HDF from Hortonworks are what's needed to capture and use the massive streaming flows of data from IoT, sensors, robots, logs, social media, mobile, applications, live feeds, free sources, open data, open city data, fitness devices, et al.

I need to run SQL queries, update dashboards in real time, and deliver reports via Slack and email. I need all the sources, sinks, and processing to be quick, easy, and scalable, and to just work for the variety, velocity, volume, veracity, and variability of data. I need to be able to query and visualize petabytes, zettabytes, and yottabytes of data on the fly without constantly writing complex programs and queries or enduring long development cycles. I need the agility to push out new applications to widely distributed channels, and to allow data scientists to harness the power of massive hybrid clusters and run deep learning and machine learning models as data arrives, accumulates, and is summarized. I can't spend months designing schemas or changing brittle compiled code every time a field is added, a type is changed, or a field is removed. I need to allow for flexibility and unstable feeds. Data can change instantly and constantly as new sources come and go, and as new fields appear and quickly become null. I can't lose a byte of precise data, ever. I need to store everything, possibly in multiple formats, for fast queries, computations, summaries, and transformations.
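As a rough illustration of tolerating fields that appear and disappear, here is a minimal Python sketch. It is not how any particular platform does it; the helper names (merge_schema, to_row) and the sample feed are hypothetical. The idea is that the column list only ever grows as new fields show up, and a record that lacks a known field gets a null (None) in that position instead of breaking the load.

```python
def merge_schema(known_columns, record):
    """Append any field seen for the first time; never drop an existing column."""
    for name in record:
        if name not in known_columns:
            known_columns.append(name)
    return known_columns


def to_row(known_columns, record):
    """Lay a record out in column order, using None for fields it does not have."""
    return [record.get(name) for name in known_columns]


# Hypothetical sensor feed: 'humidity' appears late and 'temp_c' later vanishes.
feed = [
    {"device": "pi-1", "temp_c": 21.5},
    {"device": "pi-1", "temp_c": 21.7, "humidity": 40},
    {"device": "pi-2", "humidity": 42},
]

columns = []
for rec in feed:
    merge_schema(columns, rec)

rows = [to_row(columns, rec) for rec in feed]
print(columns)
for row in rows:
    print(row)
```

Nothing here needs recompiling when a field is added or removed, which is the property the paragraph above is asking for; a real pipeline would also have to deal with type changes, which this sketch ignores.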

I can't wait for flat file extracts from one proprietary system to be loaded by a script that requires constant tweaking for new types, fields, and data sizes. I need to burst and autoscale up and down with spikes, needs, and ephemeral jobs, and to chase high and low spot prices. I need to be able to run on multiple cloud providers and on specialized, GPU-heavy deep learning hardware that I may have in-house, possibly on loan, or at a partner's colocated facility.

All of this data needs to be secure at rest and in motion, because security breaches can be catastrophic. I need to protect the public's data, our shared infrastructure, the community's good, and the enterprise from competitors, all while working with fewer people in faster, cheaper, and more open environments. And I need to do that every second of every day with zero downtime, forever. These algorithms may be running unmanned vehicles in the air, on land, at sea, underground, underwater, in near space, on other planets and moons, and in outer space. All of this data from every environment needs to flow into algorithms that calculate, respond, learn, and grow continuously.

I need to allow for thousands of columns, all of which can appear and disappear at random. I need all my software to be open source, supported by a foundation like Apache, widely distributed via GitHub, and backed by a community like HCC.

Some cool things to read:


big data, machine learning, deep learning, spark, hadoop, hortonworks
