
Data Lake 3.0 Part 3 - Distributed TensorFlow Assembly on Apache Hadoop Yarn

In this post on Data Lake 3.0, we'll take a look under the hood and demonstrate a Deep Learning framework (TensorFlow) assembly on Apache Hadoop YARN.

By Vinod Kumar Vavilapalli and Wangda Tan · Mar. 14, 17 · Tutorial


Thank you for reading our Data Lake 3.0 series! In Part 1 of the series, we introduced what Data Lake 3.0 is, and in Part 2, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In this post, we will take a look under the hood and demonstrate a Deep Learning framework (TensorFlow) assembly on Apache Hadoop YARN.

TensorFlow™ is an open source software library for numerical computation using data flow graphs. It is one of the most popular platforms for creating machine learning and deep learning applications.

It’s best to run TensorFlow on machines equipped with GPUs, since it can use CUDA and cuDNN to significantly speed up computation.

We talked about YARN and the Docker container runtime at Hadoop Summit 2016.

The recently published Nvidia Docker wrapper makes it possible to use GPU capabilities in a Docker container.

Part 0. Why TensorFlow Assembly on YARN?

YARN has been used successfully to run all sorts of data applications. These applications can all coexist on a shared infrastructure managed through YARN’s centralized scheduling.

With TensorFlow, one can get started with deep learning without much knowledge about advanced math models and optimization algorithms.

If you have GPU-equipped hardware and you want to run TensorFlow, going through the process of setting up the hardware, installing the bits, and optionally also dealing with faults, scaling the app up and down, etc. becomes cumbersome really fast. Instead, integrating TensorFlow with YARN allows us to seamlessly manage resources across machine learning/deep learning workloads and other YARN workloads like MapReduce, Spark, Hive, etc.

If you have a few GPUs and you want them to be shared by multiple tenants and applications, running TensorFlow and other YARN applications all on the same cluster, this blog post will help you.

Part 1. Background

This blog post uses a couple of YARN features that are under ongoing, active development:

  1. YARN-3611, Support Docker Containers In LinuxContainerExecutor: Better support of Docker container execution in YARN
  2. YARN-4793, Simplified API layer for services and beyond: The new simple-services API layer backed by REST interfaces. This API can be used to create and manage the lifecycle of YARN services in a simple manner. Services here can range from simple single-component apps to complex multi-component assemblies needing orchestration.

Both of the features will most likely be available in one of the Apache Hadoop 3.x releases.

Part 2. Running the TensorFlow Services Assembly on YARN

A common TensorFlow workflow (and this is common to any supervised machine learning platform) looks like this:

  1. The training cluster reads the input dataset and uses learning algorithms to build a data model. For example, see Google’s cat recognition learning.
  2. The serving cluster serves the data model, so that any client can connect to the serving cluster to predict new samples, for example, to ask, “Is this a cat?”
  3. The training cluster can periodically update the data model when new data is available, so clients get prediction results from the up-to-date model.

TensorFlow has also supported distributed training since the 0.8 release, which can significantly shorten training time in many cases. With YARN, we can start a distributed training/serving cluster combo in no time!

Assembly Description File

To orchestrate a multi-component assembly on YARN, we need to prepare a description file so that YARN knows how to launch it (please refer to the video and slides for an introduction to assemblies).

Below, we walk through an example of an assembly description file for the TensorFlow assembly.

In our example, we have 3 components:

  1. Training cluster components
    1. Trainers: Train the model from the training data.
    2. Parameter servers: Manage the shared model parameters for the trainers.
  2. Serving cluster component: Uses tensorflow-serving to serve the data model exported from the training cluster.

So we have 3 components in the assembly description file.

Let’s look at the description of each of the components:

Trainer

First, let’s look at the definition of the trainer:
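As a minimal sketch, assuming the JSON component format of the YARN services API (YARN-4793); the container count, resource sizes, port numbers, flag names, and hostnames below are illustrative assumptions, and only the image and script names come from the notes that follow:

    {
      "name": "worker",
      "number_of_containers": 2,
      "artifact": {
        "id": "hortonworks/tensorflow-yarn:0.2",
        "type": "DOCKER"
      },
      "launch_command": "python example-distributed-trainer.py --job_name=worker --ps_hosts=ps.tf-assembly.hadoop.example.com:2222 --worker_hosts=worker-0.tf-assembly.hadoop.example.com:2222,worker-1.tf-assembly.hadoop.example.com:2222",
      "resource": {
        "cpus": 4,
        "memory": "8192"
      }
    }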

Some notes:

  1. The Docker image is specified to be "hortonworks/tensorflow-yarn:0.2".
  2. The launch command to use is example-distributed-trainer.py.
  3. It specifies the hostnames/ports of the parameter servers and workers (trainers). We configured DNS so that each launched container gets a unique hostname of the form <component_name>.<assembly-name>.<user-name>.<domain> (the sketch after these notes shows how a trainer script typically consumes these values).
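To make the wiring of hostnames, ports, and job names concrete, here is a minimal sketch of what a distributed trainer script might do with those values. This is not the actual example-distributed-trainer.py; it only assumes the tf.train.ClusterSpec / tf.train.Server API that TensorFlow has provided since the 0.8 release, and the flag names mirror the illustrative launch command above.

    import argparse
    import tensorflow as tf

    parser = argparse.ArgumentParser()
    parser.add_argument("--job_name", choices=["ps", "worker"], required=True)  # role of this process
    parser.add_argument("--task_index", type=int, default=0)                    # index within its job
    parser.add_argument("--ps_hosts", required=True)      # e.g. ps.tf-assembly.hadoop.example.com:2222
    parser.add_argument("--worker_hosts", required=True)  # comma-separated worker host:port list
    args = parser.parse_args()

    # Describe the cluster using the per-container hostnames that YARN/DNS assigned.
    cluster = tf.train.ClusterSpec({
        "ps": args.ps_hosts.split(","),
        "worker": args.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

    if args.job_name == "ps":
        # Parameter servers only host the shared model variables.
        server.join()
    else:
        # Workers place variables on the ps tasks and run the training loop.
        with tf.device(tf.train.replica_device_setter(cluster=cluster)):
            pass  # build the model and run training steps here

The same script serves both roles; which one a given container plays is decided entirely by the launch command YARN hands it.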

Parameter Server

The description of the parameter server is very similar to the trainer's; the only difference is that job_name is set to "ps" instead of "worker".
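In terms of the sketch above, that amounts to changing only the launch command, for example (other flags as in the trainer sketch):

    "launch_command": "python example-distributed-trainer.py --job_name=ps ..."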

Serving Server

The serving server's description uses a different Docker image and launch command so that it serves the Inception v3 model for image classification. I will cover more details in the demo video below.
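As another hedged sketch in the same assumed format: the serving image name, port, and model path below are placeholders, while tensorflow_model_server and its --port, --model_name, and --model_base_path flags are the standard TensorFlow Serving binary and options:

    {
      "name": "serving",
      "number_of_containers": 1,
      "artifact": {
        "id": "hortonworks/tensorflow-serving-yarn:0.2",
        "type": "DOCKER"
      },
      "launch_command": "tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/tmp/inception-export",
      "resource": {
        "cpus": 2,
        "memory": "4096"
      }
    }

Clients then send gRPC requests to port 9000 of the serving container to classify images against the exported model.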

Demo

The following demo video shows the launch of the TensorFlow assembly on YARN, running TensorFlow training and serving the Inception v3 model to do image classification.


Part 3. Accessing GPUs in the YARN Assembly

Configurations

Nvidia-docker is a wrapper around /usr/bin/docker that is required for processes inside a Docker container to use GPU capacity. So, first of all, follow the nvidia-docker installation guide to properly install dependencies such as Docker, the Nvidia driver, etc. on all of the nodes.

After nvidia-docker is installed, test it by running a quick GPU check on each of the nodes.
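A typical smoke test (the standard check from the nvidia-docker documentation; the CUDA image tag may differ on your cluster) is:

    nvidia-docker run --rm nvidia/cuda nvidia-smi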

If everything is set up correctly, you should see the familiar nvidia-smi table listing the GPUs available on that node.

Then, configure YARN to use the nvidia-docker binary to launch containers: in container-executor.cfg under $HADOOP_CONF_DIR, point docker.binary at the nvidia-docker executable.
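For example (the path is an assumption; use wherever nvidia-docker is installed on your nodes):

    docker.binary=/usr/bin/nvidia-docker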

Once all of the above steps are done, you can launch Docker containers through YARN the same way as described earlier, and all of these containers can access the GPUs.

The training process gains a lot of speed when it runs with GPU support!

Below are screen recordings of running the same training process without and with a GPU.

Without GPU

With GPU

Published at DZone with permission of Vinod Kumar Vavilapalli, DZone MVB.
