Deep Learning on Big Data Platforms

Here are my personal recommendations and things that I think you should keep in mind when it comes to Deep Learning on Big Data platforms.

Read on to learn about Deep Learning on Hadoop Big Data platforms.

Big Data Deep Learning Options

You've got tons of options for Deep Learning. 

  • TensorFlow (C++, Python, Java).

  • TensorFlow on Spark.

  • MXNet.

  • Deeplearning4j (DL4J, from Skymind) on the JVM.

  • PyTorch.

  • H2O Deep Water.

  • Keras on top of TensorFlow and DL4J.

  • Apache SINGA.

Recommendations

Here are my personal recommendations and things that I think you should keep in mind.

  • Install the CPU build of your framework on CPU-only YARN nodes and the GPU build on GPU YARN nodes.

  • Do training on GPU YARN Nodes where possible.

  • Apply the trained model on all nodes and trigger scoring with Apache NiFi.

  • Remember that what helps Hadoop and Spark will help TensorFlow.

  • More RAM, more and faster cores, and more nodes all help.

  • Today, run either pure TensorFlow with Keras or TensorFlow on Spark. Later in the year, try YARN 3.0 Containerized TensorFlow.

  • Consider Alluxio for in-memory optimization.

  • Download model zoos.

  • Evaluate other Deep Learning frameworks like MXNet and PyTorch.
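The first recommendation (CPU build on CPU nodes, GPU build on GPU nodes) can be sketched as a small per-node check. This is only an illustration, not a tested deployment recipe: the `nvidia-smi` lookup is an assumed heuristic for spotting a GPU node, and the pip package names `tensorflow` and `tensorflow-gpu` reflect TensorFlow 1.x packaging, so verify them against the release you install.

```python
import shutil


def node_has_gpu(which=shutil.which):
    """Heuristic (assumption): a node is a GPU node if nvidia-smi is on PATH."""
    return which("nvidia-smi") is not None


def tensorflow_package(has_gpu):
    """Return the pip package name for this node type (TensorFlow 1.x naming)."""
    return "tensorflow-gpu" if has_gpu else "tensorflow"


if __name__ == "__main__":
    # Print the install command an admin script could run on this node.
    print("pip install " + tensorflow_package(node_has_gpu()))
```

Running the same script on every YARN node keeps the choice of build tied to the hardware actually present rather than to a hand-maintained host list.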

MXNet

Here are some details about MXNet, which has code on GitHub for running on YARN.

  • Cloud-ready product developed by an experienced team (the people behind XGBoost).

  • Has AWS, Microsoft, NVIDIA, Baidu, and Intel backing.

  • An Apache project that can run distributed on YARN and also runs on Raspberry Pi and other constrained devices.

  • In my early tests, it was faster than Google's TensorFlow, but not Keras. It also has less documentation, fewer examples, and less backing than TensorFlow.

DL4J (Deeplearning4j)

  • DL4J (Deeplearning4j) is built for production workloads. It is Keras-compatible and has the strength of the JVM behind it.

  • It is backed by a very knowledgeable team, has professional support, and is a Hortonworks partner.

  • The team is publishing an awesome book (Deep Learning: A Practitioner's Approach).

TensorFlow on Spark and YARN

TensorFlow on Spark has the strength and testing of Yahoo! Engineering behind it on a big platform. They have the tools, engineering, clusters, and experience to get this right. I will be evaluating this soon. For more info, see here and here.

TensorFlow on Hadoop

HDFS files can be used as a distributed source for input producers during training, so one fast cluster can store these massive datasets and share them among all of your nodes. This requires setting a few environment variables:

JAVA_HOME
HADOOP_HDFS_HOME
LD_LIBRARY_PATH
CLASSPATH

See more info here.
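As a rough sketch of how those four variables might be assembled before launching a training process, the helper below builds an environment dictionary. The function names are hypothetical, the `jre/lib/amd64/server` location of `libjvm.so` assumes a Java 8 layout on 64-bit Linux, and the classpath is assumed to come from `hadoop classpath --glob`; adjust all of these for your own cluster.

```python
import os
import subprocess


def hadoop_classpath(hadoop_home):
    """Ask the Hadoop CLI for its expanded classpath (requires a Hadoop install)."""
    cmd = [hadoop_home + "/bin/hadoop", "classpath", "--glob"]
    return subprocess.check_output(cmd).decode().strip()


def hdfs_env(java_home, hadoop_home, classpath, base_env=None):
    """Return a copy of base_env with the four HDFS-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    env["HADOOP_HDFS_HOME"] = hadoop_home
    # Directory holding libjvm.so; assumed Java 8 layout on 64-bit Linux.
    jvm_lib_dir = java_home + "/jre/lib/amd64/server"
    existing = env.get("LD_LIBRARY_PATH", "")
    # Append to any existing library path instead of overwriting it.
    env["LD_LIBRARY_PATH"] = existing + ":" + jvm_lib_dir if existing else jvm_lib_dir
    env["CLASSPATH"] = classpath
    return env
```

A training script could then be started with something like `subprocess.run(["python", "train.py"], env=hdfs_env(java_home, hadoop_home, hadoop_classpath(hadoop_home)))`, where `train.py` is a placeholder for your own script.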

TensorFlow Serving on YARN

See more info here.

YARN 3 With GPU Support

See more info here.

