
Training AI Models on Kubernetes


IBM has introduced an open source version of its Deep Learning service, and it runs on Kubernetes! Let's find out more!


Early this year, IBM announced Deep Learning as a Service within Watson Studio. The core of this service is available as open source and can be run on Kubernetes clusters. This allows developers and data scientists to train models with confidential data on-premises, for example on the Kubernetes-based IBM Cloud Private.

The open source version of IBM's Deep Learning service is called Fabric for Deep Learning (FfDL). FfDL supports framework-independent training of deep learning models on distributed hardware, using CPUs as well as GPUs. Check out the documentation for a list of supported deep learning frameworks, versions, and processing units.

I have open-sourced samples that show how to train visual recognition models with Watson Studio and deploy them to edge devices via TensorFlow Lite and TensorFlow.js. I extended one of these samples slightly to show how to train the model with Fabric for Deep Learning instead. To do this, I only had to change a manifest file slightly, since Watson Studio expects a different format.

This is the manifest file that describes how to invoke and run the training. For more details on how to run training jobs on Kubernetes, check out the readme of my project and the Fabric for Deep Learning documentation.

name: retrain
description: retrain
version: "1.0"
gpus: 2
cpus: 8
memory: 4Gb
learners: 1

data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: nh-input
    training_results:
      container: nh-output
    connection:
      auth_url: http://169.62.129.231:32551
      user_name: test
      password: test

framework:
  name: tensorflow
  version: "1.5.0-gpu-py3"
  command: python3 retrain.py --bottleneck_dir ${RESULT_DIR}/bottlenecks --image_dir ${DATA_DIR}/images --how_many_training_steps=1000 --architecture mobilenet_0.25_224 --output_labels ${RESULT_DIR}/labels.txt --output_graph ${RESULT_DIR}/graph.pb --model_dir ${DATA_DIR} --learning_rate 0.01 --summaries_dir ${RESULT_DIR}/retrain_logs

evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
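The `${DATA_DIR}` and `${RESULT_DIR}` placeholders in the command above are resolved from environment variables that FfDL sets in the training container, pointing at the mounted input and output buckets. As an illustration only (the concrete mount paths below are hypothetical), the substitution works roughly like this:

```python
from string import Template

# Hypothetical mount points; FfDL injects the real values for the
# training_data and training_results buckets into the learner container.
env = {"DATA_DIR": "/mnt/nh-input", "RESULT_DIR": "/mnt/nh-output"}

command = ("python3 retrain.py --image_dir ${DATA_DIR}/images "
           "--output_graph ${RESULT_DIR}/graph.pb")

# Resolve the placeholders the same way a shell would.
resolved = Template(command).substitute(env)
print(resolved)
```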

To initiate the training, the manifest file and the training Python code need to be uploaded. You can do this either through the web user interface or with a CLI.
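Whichever route you take, the training code is typically packaged as a zip archive before upload. A minimal, self-contained sketch using only the Python standard library (the placeholder script it creates stands in for the real retrain.py referenced by the manifest's command):

```python
import os
import zipfile

# Create a placeholder training script so this sketch runs standalone;
# in practice, zip your actual retrain.py and any helper files.
if not os.path.exists("retrain.py"):
    with open("retrain.py", "w") as f:
        f.write("# training code goes here\n")

# Package the training code into a model zip for upload.
with zipfile.ZipFile("model.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("retrain.py")

print(zipfile.ZipFile("model.zip").namelist())
```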

In my example, I've stored the data on my Kubernetes cluster. Fabric for Deep Learning comes with S3-based object storage, which means you can use the AWS CLI to upload and download data. Alternatively, you can use object storage in the cloud, for example IBM Cloud Object Storage.
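For example, with the endpoint from the manifest's data_stores section, the AWS CLI invocations for uploading training data and fetching results could be assembled like this (a sketch; the bucket names match the manifest above, but local paths are assumptions):

```python
# Build AWS CLI invocations against FfDL's S3-compatible object storage.
# The endpoint and bucket names are taken from the manifest's data_stores section.
endpoint = "http://169.62.129.231:32551"

def s3_cmd(*args):
    """Prefix an s3 subcommand with the custom endpoint."""
    return ["aws", "--endpoint-url", endpoint, "s3", *args]

# Upload the training images and, after the job finishes, fetch the results.
upload = s3_cmd("cp", "--recursive", "images/", "s3://nh-input/images/")
download = s3_cmd("cp", "--recursive", "s3://nh-output/", "results/")

print(" ".join(upload))
print(" ".join(download))
```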

To find out more about Fabric for Deep Learning, check out the project's documentation.



Published at DZone with permission of

