This Week in Hadoop and More: Machine Learning, Deep Learning, and Minimal Viable Big Data
Kafka and MQTT are required: getting locked into technology that can't be installed locally or extended to your 1 GB devices is a serious weakness in your environment.
The table stakes have been raised. A minimum viable enterprise system is now a combination of Big Data and related technologies, and the minimum infrastructure is hybrid, spanning multiple clouds and on-premises devices. At a minimum, you need a standard open-source Hadoop platform like HDP 2.5 or ODPi, with HDFS, YARN, Hive 2 with LLAP, and HBase/Phoenix as your base for petabyte-scale storage.
On top of that, you need to run Spark 2 jobs on YARN for Machine Learning, streaming, graph, and batch workloads. Through Spark or through containers, you also need to run TensorFlow and other Deep Learning packages at scale across your distributed cluster.
You also need to coordinate, ingest, transmit, translate, and store data from thousands of different devices: IIoT, SCADA, mobile, Raspberry Pis, and more. You will need to stream from these sources in an open, Apache-based way that is not tied to the cloud vendor of the day.
Thus, Kafka and MQTT are required. Getting trapped in non-locally installable technology that doesn't extend to your 1 GB devices is a weakness in your environment. The basic flow:
Devices, logs, and distributed servers → MQTT → Cloud X → NiFi → Phoenix, Hive LLAP, and HBase → Kafka → Spark, Spark with Deep Learning packages, and Storm.
The idea is to stream everything in near real-time continuously from all sources. Stream coordination with Apache NiFi is key.
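As a toy illustration of the edge-to-broker handoff above, here is a minimal Python sketch of packaging a device reading for an MQTT topic and routing it by topic prefix, the kind of routing a NiFi flow performs before handing off to Kafka. The topic scheme and sink names are hypothetical; a real deployment would use an MQTT client such as Paho and a Kafka producer rather than in-memory lists.

```python
import json
from datetime import datetime, timezone

def to_mqtt_message(device_id, sensor, value):
    """Package a device reading as a (topic, JSON payload) pair for an MQTT broker."""
    topic = f"devices/{device_id}/{sensor}"  # hypothetical topic scheme
    payload = json.dumps({
        "device_id": device_id,
        "sensor": sensor,
        "value": value,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    return topic, payload

def route(topic, payload, routes):
    """Send a message to a downstream sink by topic prefix, the way a
    NiFi flow routes on attributes before publishing to Kafka."""
    for prefix, sink in routes.items():
        if topic.startswith(prefix):
            sink.append(json.loads(payload))
            return sink
    raise ValueError(f"no route for topic {topic}")

# Stand-ins for the downstream systems (Kafka topic, cold storage).
kafka_ingest, cold_store = [], []
routes = {"devices/rpi": kafka_ingest, "devices/scada": cold_store}

topic, payload = to_mqtt_message("rpi-42", "temperature", 21.5)
route(topic, payload, routes)
```

The point of the sketch is the decoupling: devices only know a topic naming convention, and the routing layer decides which downstream system each reading lands in.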
Must-Read Spark Summit 2017 East Presentations
For fans of R: Microsoft, Hortonworks, H2O, and others have been pushing R forward in Spark.
From Princeton comes a really interesting talk on text processing.
Also check out:
Machine Learning at Scale at Facebook
Check out the Machine Learning at Scale 2017 report with videos of great talks.
TensorFlow Dev Summit
More Introductions to Machine Learning
CoreNLP is well-integrated with Apache Spark as functions. Check it out on GitHub.
ModelDB, from MIT, is a database for Machine Learning models, with clients for Spark ML and scikit-learn.
Want a cheap place to run some Machine Learning, NLP, or Deep Learning on a small, Raspberry Pi-sized cloud box? Read this article for a rundown of Digital Ocean, Linode, and a bunch of other cheap offerings out there.
When running JVM-based tools on a small box like that, options along these lines cap heap and stack usage:

-Xmx2G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M -Duser.timezone=GMT
Deep Learning in the Deep End
Check out this really nice 100+ page presentation on Deep Learning and reinforcement learning.
This introduction to Deep Learning and CNNs is a great primer on convolutional neural networks.
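To make "convolutional" concrete: a convolution slides a small filter across the input and computes a weighted sum at each position. A minimal plain-Python sketch (real CNNs use optimized frameworks like TensorFlow, and deep-learning libraries technically compute cross-correlation, as here):

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution of a matrix by a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the kh x kw patch under the kernel.
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge-detecting kernel applied to an image with a hard
# left/right boundary; nonzero responses mark where the edge is.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[1, -1],
          [1, -1]]
result = conv2d(image, kernel)
```

A CNN learns the kernel weights instead of hand-picking them, and stacks many such filters in layers.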
Check out this nice discussion on using H2O's Deep Water for Deep Learning.
Let's go back — all the way back — to an introduction to Machine Learning.
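If you do go all the way back, the simplest possible model is a straight line fit by ordinary least squares, which takes only a few lines of stdlib Python:

```python
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1, so the fit should recover a=2, b=1
a, b = fit_line(xs, ys)
```

Every fancier technique in this roundup, from Spark ML to Deep Learning, is ultimately doing the same thing: choosing parameters that minimize error against observed data.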
HBase Best Practices
Let's end with some HBase best practices resources.