This Week in Hadoop and More: Machine Learning, Deep Learning, and Minimal Viable Big Data

The table stakes have been raised. Now, your minimal viable enterprise system is a combination of Big Data and related technologies. The minimum infrastructure is now hybrid and spans multiple clouds and on-premises devices. You need, at a minimum, a standard open-source Hadoop platform like HDP 2.5 or ODPi with HDFS, YARN, Hive 2 with LLAP, HBase, and Phoenix as your base for petabyte-scale storage.

On top of that, you need to be running Spark 2 jobs on YARN for various Machine Learning, streaming, graph, and batch jobs. Through Spark or through containers, you need to run TensorFlow and other Deep Learning packages at scale in your massive distributed cluster.

You also need to be able to coordinate, ingest, transmit, translate, and store data from thousands of different devices: IIoT, SCADA, mobile, Raspberry Pis, and more. You will need to stream from various sources in an open, Apache-based way that is not tied to the cloud vendor of the day.

Thus, Kafka and MQTT are required. Getting trapped in technology that cannot be installed locally and doesn't extend to your 1 GB devices is a weakness in your environment. Basically:

Devices, logs, and distributed servers > MQTT > Cloud X > NiFi > Phoenix, Hive LLAP, and HBase > Kafka > Spark, Spark with Deep Learning packages, and Storm.

The idea is to stream everything in near real-time continuously from all sources. Stream coordination with Apache NiFi is key.
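One small detail in a flow like this is that MQTT and Kafka name their topics differently: MQTT uses `/`-separated levels, while Kafka topic names are limited to ASCII letters, digits, `.`, `_`, and `-`, up to 249 characters. The following is a minimal sketch of the translation a bridge would need, using only the Python standard library. The function names, the `iot` prefix, and the sample topics are hypothetical, and in the architecture described here NiFi would typically do this routing rather than hand-written code.

```python
import re

# Kafka topic names may contain only these characters and at most 249 of them.
KAFKA_ILLEGAL = re.compile(r"[^a-zA-Z0-9._-]")
KAFKA_MAX_TOPIC_LEN = 249


def mqtt_to_kafka_topic(mqtt_topic, prefix="iot"):
    """Map an MQTT topic like 'plant1/line3/temp' to a Kafka-legal topic name.

    The 'iot' prefix is an illustrative assumption, not a convention from the article.
    """
    levels = [level for level in mqtt_topic.split("/") if level]
    cleaned = [KAFKA_ILLEGAL.sub("_", level) for level in levels]
    return ".".join([prefix] + cleaned)[:KAFKA_MAX_TOPIC_LEN]


def route(messages):
    """Group (mqtt_topic, payload) pairs by the Kafka topic they would land on."""
    routed = {}
    for mqtt_topic, payload in messages:
        routed.setdefault(mqtt_to_kafka_topic(mqtt_topic), []).append(payload)
    return routed


# Hypothetical sample messages from two devices.
sample = [
    ("plant1/line3/temp", b"21.5"),
    ("pi/room 1/humidity", b"40"),
    ("plant1/line3/temp", b"21.7"),
]
routed = route(sample)
```

Here `routed` holds the two temperature readings under `iot.plant1.line3.temp`, and the space in `room 1` is replaced with an underscore so the name is legal in Kafka.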

Must-Read Spark Summit 2017 East Presentations

For fans of R: Microsoft, Hortonworks, H2O, and others have been pushing R forward in Spark.

From Princeton comes a really interesting talk on text processing.

Also check out:

Machine Learning at Scale at Facebook

Check out the Machine Learning at Scale 2017 report with videos of great talks.  

TensorFlow Dev Summit

There were some great talks at the first TensorFlow Dev Summit. First up is a nice introduction to using TensorFlow. There's also a nice summary of the TensorFlow Summit (it's mostly in English).

More Introductions to Machine Learning

Must-Try Projects

  • Kylo, open-sourced by Think Big Analytics, combines NiFi, Spark, Hive, and Hadoop. Download the Hortonworks + Kylo VirtualBox Sandbox to try this interesting toolkit.

  • Accurate Quantiles Using t-Digests in Spark. For more info, you can check out GitHub, as well as a couple of slides.

  • CoreNLP is well integrated with Apache Spark as functions. Check it out on GitHub.

  • ModelDB is a database for Machine Learning models from MIT with clients for Spark ML and SciKit-Learn. 

  • EasyMapReduce is Spark in Docker for interesting MapReduce jobs; it can be used to scale scientific jobs.
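For readers curious why t-digests suit Spark, the key property is that partial sketches built on separate partitions can be merged into one. Below is a simplified, standard-library-only illustration of that idea: a centroid-merging quantile sketch in the spirit of a t-digest, not the project linked above and not a full t-digest. The class name `TinyDigest`, the buffer size, and the default compression value are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TinyDigest:
    """Simplified centroid-merging quantile sketch (illustrative only)."""
    compression: float = 100.0                      # larger keeps more centroids
    centroids: list = field(default_factory=list)   # sorted (mean, weight) pairs
    buffer: list = field(default_factory=list)      # points not yet merged

    def add(self, x):
        self.buffer.append((float(x), 1.0))
        if len(self.buffer) >= 500:
            self._compress()

    def merge(self, other):
        """Combine two sketches, e.g. partial results from two partitions."""
        self.buffer.extend(other.centroids)
        self.buffer.extend(other.buffer)
        self._compress()
        return self

    def _compress(self):
        points = sorted(self.centroids + self.buffer)
        self.buffer = []
        if not points:
            return
        total = sum(w for _, w in points)
        merged = []
        mean, weight = points[0]
        seen = 0.0                                  # weight already emitted
        for m, w in points[1:]:
            q = (seen + (weight + w) / 2.0) / total
            # Size bound: tiny centroids at the tails, larger ones near the median.
            limit = max(1.0, 4.0 * total * q * (1.0 - q) / self.compression)
            if weight + w <= limit:
                mean = (mean * weight + m * w) / (weight + w)
                weight += w
            else:
                merged.append((mean, weight))
                seen += weight
                mean, weight = m, w
        merged.append((mean, weight))
        self.centroids = merged

    def quantile(self, q):
        """Estimate the q-quantile by interpolating between centroid means."""
        self._compress()
        if not self.centroids:
            raise ValueError("empty digest")
        total = sum(w for _, w in self.centroids)
        target = q * total
        seen = 0.0
        prev_pos = prev_mean = None
        for mean, weight in self.centroids:
            pos = seen + weight / 2.0               # centroid sits mid-weight
            if target <= pos:
                if prev_pos is None:
                    return mean
                frac = (target - prev_pos) / (pos - prev_pos)
                return prev_mean + frac * (mean - prev_mean)
            prev_pos, prev_mean = pos, mean
            seen += weight
        return self.centroids[-1][0]


# Two "partitions" sketched independently, then merged.
random.seed(7)
values = list(range(1, 10001))
random.shuffle(values)
left, right = TinyDigest(), TinyDigest()
for v in values[:5000]:
    left.add(v)
for v in values[5000:]:
    right.add(v)
combined = left.merge(right)
median_estimate = combined.quantile(0.5)
```

The merged sketch stores far fewer points than the raw data while keeping `median_estimate` close to the true median of 5000.5. A production t-digest adds refinements, such as better scale functions and tail handling, that this sketch omits.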

Want a cheap place to run some Machine Learning, NLP, or Deep Learning on a small Raspberry Pi-sized cloud box? Read this article for the rundown of DigitalOcean, Linode, and a bunch of cheap offerings out there.

Spark This

-Xmx2G -XX:+UseConcMarkSweepGC 
-XX:+CMSClassUnloadingEnabled 
-XX:MaxPermSize=2G 
-Xss2M -Duser.timezone=GMT 

Deep Learning in the Deep End

HBase Best Practices

Let's end with some HBase best practices resources.

Topics:
big data, hadoop, machine learning, deep learning
