
Deep Learning With TensorFlow, Nvidia, and Apache Mesos (DC/OS) (Part 2)


If you want to be able to deploy your TensorFlow service quickly and manage it easily in production across multiple teams, read on.


In the last post, we demonstrated how GPUs can dramatically reduce the time needed for a TensorFlow job. But what if you want to run TensorFlow in production rather than just on your laptop? You’d want to be able to deploy your TensorFlow service quickly and manage it easily across multiple teams: that’s where DC/OS comes in.

Watch a video of this tutorial here.

In part 2 of this tutorial, we’ll:

  • Install the TensorFlow service without GPUs.
  • Run a neural network example.
  • Install TensorFlow with GPUs.
  • Run the same neural network example.
  • Launch a second TensorFlow instance that consumes the remaining GPUs in parallel.

Run TensorFlow on DC/OS Without GPUs

First, let’s see how easy it is to use TensorFlow on DC/OS, even without GPUs.

Prerequisites

You’ll need a running DC/OS cluster with a public agent node that has eight GPUs (the cluster used throughout this series), plus the DC/OS CLI installed and connected to it.

Deploy the TensorFlow Service

First, let’s get TensorFlow running on your DC/OS cluster.

  1. Go to the Services tab of the DC/OS UI.
  2. Click + to add a service.
  3. Choose Single Container.
  4. Toggle to the JSON Editor and paste the following application definition into the editor.
    {
     "id": "my-tensorflow-no-gpus",
     "cpus": 4,
     "gpus": 0,
     "mem": 2048,
     "disk": 0,
     "instances": 1,
     "container": {
       "type": "MESOS",
       "docker": {
         "image": "tensorflow/tensorflow"
       }
     }
    }
    This application definition specifies no GPUs and the standard TensorFlow Docker image.
  5. Click Review and Run, then Run Service.
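
If you prefer the command line, you can deploy the same service with the DC/OS CLI instead of the UI. Save the application definition above to a file (the filename below is just an example) and add it as a Marathon app:

    dcos marathon app add my-tensorflow-no-gpus.json
    dcos marathon app list

The second command lists your deployed apps so you can confirm the service is running.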

Run a TensorFlow Example

  1. Exec into the TensorFlow container from the DC/OS CLI. This command allows you to execute commands inside the container and stream the output to your local terminal.
    dcos task exec -it my-tensorflow-no-gpus bash
  2. Now, let’s get some examples to run. Install git and then clone the TensorFlow-Examples repository.
    apt-get update; apt-get install -y git
    git clone https://github.com/aymericdamien/TensorFlow-Examples
  3. Run and time the same example you ran locally in the last tutorial, the convolutional network example.
    cd TensorFlow-Examples/examples/3_NeuralNetworks
    time python convolutional_network.py

This took my DC/OS cluster 11 minutes.
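
If the example fails to start, a quick way to confirm that the TensorFlow installation inside the container is healthy is a one-line session test. This is a minimal sketch using the TensorFlow 1.x API that the stock image shipped with at the time of writing:

    python -c "import tensorflow as tf; print(tf.Session().run(tf.constant('Hello, DC/OS')))"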

Run TensorFlow on DC/OS With GPUs

This involves a few steps: deploying a GPU-enabled service, verifying its access to the GPUs, and rerunning the example to compare performance.

Deploy the TensorFlow Service With GPUs

Now that you’ve got TensorFlow examples running on your cluster, let’s see how performance compares when you configure your service to use GPUs.

  1. Go to the Services tab of the DC/OS UI.
  2. Click + to add a service.
  3. Choose Single Container.
  4. Toggle to the JSON Editor and paste the following application definition into the editor.
    {
     "id": "tensorflow-gpus-1",
     "acceptedResourceRoles": ["slave_public"],
     "cpus": 4,
     "gpus": 4,
     "mem": 2048,
     "disk": 0,
     "instances": 1,
     "container": {
       "type": "MESOS",
       "docker": {
         "image": "tensorflow/tensorflow:latest-gpu"
       }
     }
    }
    This application definition is largely the same as the last one, except that here you’re requesting four GPUs and specifying the TensorFlow Docker image built for GPUs.
  5. Click Review and Run, then Run Service.

Verify Access to GPUs

You’ll recall that we created a cluster with a public agent that has eight GPUs but only requested access to four. Let’s verify that the node has eight GPUs and that our service has access to only four of them.

  1. First, use dcos task exec to run a command inside of the container to get the public IP address of the agent node the container is running on.
    dcos task exec tensorflow-gpus-1 curl -s ifconfig.co
  2. Now, use that public IP to SSH into the node and run nvidia-smi to verify the number of GPUs the node has.
    ssh <public-ip>
    nvidia-smi
    You should see eight GPUs installed and running on the machine. The container for your service, however, should only be able to see four of those GPUs.
  3. Run dcos task exec with the bash option to get a shell inside of your service’s container.
    dcos task exec -it tensorflow-gpus-1 bash
  4. Set up environment variables so you can run nvidia-smi from within this shell.
    export LD_LIBRARY_PATH=/usr/local/nvidia/lib64
    export PATH=$PATH:/usr/local/nvidia/bin
  5. Run nvidia-smi to verify that even though the machine has eight GPUs installed, only four of them are visible inside this container.
    nvidia-smi
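
As a cross-check, you can list the devices from Python as well. This short sketch uses the TensorFlow 1.x device_lib helper bundled with the image; it should report four GPUs, not eight:

    python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])"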

Run a TensorFlow Example With GPUs

Now that you’ve installed TensorFlow and verified your access to four GPUs, let’s run the same example as before.

  1. If you exited the tensorflow-gpus-1 container, reenter it and set up the environment variables by following the steps in the last section.
  2. Install git and clone the TensorFlow-Examples repository.
    apt-get update; apt-get install -y git
    git clone https://github.com/aymericdamien/TensorFlow-Examples
  3. Run and time the same example you ran earlier, the convolutional network example.
    cd TensorFlow-Examples/examples/3_NeuralNetworks
    time python convolutional_network.py
  4. Watch the code find the GPUs and execute.

This took my DC/OS cluster about two minutes — about five times faster than before!
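
If you want to see exactly where TensorFlow places each operation, you can enable device placement logging. Here is a minimal sketch (TensorFlow 1.x API, matching the image used in this tutorial) that you can run as a standalone script inside the container:

    import tensorflow as tf

    # Build a trivial graph so there is something to place on a device.
    a = tf.constant([1.0, 2.0, 3.0], name='a')
    b = tf.constant([4.0, 5.0, 6.0], name='b')
    c = a + b

    # log_device_placement=True prints the device chosen for each op.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))

With GPUs available, the placement log should show ops assigned to a device such as device:GPU:0.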

Launch Two TensorFlow Instances

You’ll recall that our cluster has eight GPUs, but we only requested access to four of them. Now, let’s launch a second TensorFlow instance that will consume the remaining four GPUs in parallel with the first.

Running more than one TensorFlow instance in parallel shows that you can have multiple users on the same cluster with isolated access to the GPUs on it.

  1. Add a third service to your DC/OS cluster with the following application definition, which is nearly identical to the previous GPU-enabled definition; only the id differs.
    {
     "id": "tensorflow-gpus-2",
     "acceptedResourceRoles": ["slave_public"],
     "cpus": 4,
     "gpus": 4,
     "mem": 2048,
     "disk": 0,
     "instances": 1,
     "container": {
       "type": "MESOS",
       "docker": {
         "image": "tensorflow/tensorflow:latest-gpu"
       }
     }
    }
  2. Verify that your second TensorFlow instance is running by accessing the Jupyter notebook that runs by default in the TensorFlow Docker image. In the application definition above, the acceptedResourceRoles parameter is set to slave_public, which gives you access to the public IP of the agent where the container is running.
    1. Get the public IP of the agent where the task has been launched.
      dcos task exec tensorflow-gpus-2 curl -s ifconfig.co
    2. Go to the STDERR log of the service to get the Jupyter URL: Services > tensorflow-gpus-2 > task-id > paper icon > ERROR (STDERR). You will see a message similar to the following.
      Copy/paste this URL into your browser when you connect for the first time, to login with a token:
      
      http://localhost:10144/?token=d4f3d8f80eb97299e74b5254d1600c480c3f042d548e51f5
    3. Replace localhost with the public IP you found earlier to see the Jupyter notebook.
    4. Click the Getting Started notebook and run some commands.
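
Once you’re in the notebook, a quick experiment is to pin a small computation to a specific GPU with tf.device. This sketch (again TensorFlow 1.x API) multiplies two constants on the first GPU this container can see:

    import tensorflow as tf

    # Explicitly place the ops on the container's first visible GPU.
    with tf.device('/gpu:0'):
        a = tf.constant([1.0, 2.0], name='a')
        b = tf.constant([3.0, 4.0], name='b')
        c = a * b

    # allow_soft_placement=True falls back to the CPU if no GPU is available.
    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(c))

Because Mesos isolates the GPUs allocated to each container, /gpu:0 here refers to the first of tensorflow-gpus-2’s four GPUs, not necessarily the first GPU on the machine.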

Thanks for playing along at home!

The next post in the series will show you how to use DC/OS to dynamically request cluster resources and launch a distributed TensorFlow job across multiple agents. When that job completes, the resources it had used are automatically released back to the cluster and made available to other jobs. This dramatically increases efficiency in comparison to traditional TensorFlow deployment strategies.


Topics:
big data, deep learning, tensorflow, nvidia, apache mesos, tutorial

Published at DZone with permission of Kevin Klues, DZone MVB. See the original article here.

