Machine Learning on Kubernetes


A data scientist focused on solving machine learning problems may not have the expertise or time to build the scalable infrastructure required to run large-scale jobs.


With the rise of containers, the problem of orchestration became more relevant. Over the last few years, various projects and companies have tried to address the challenge, but Kubernetes has emerged as the dominant platform for running containers. Today, most companies are running (or are planning to move to) Kubernetes as a platform for various workloads, be it stateless microservices, cron jobs, or stateful workloads such as databases. These represent only a small portion of real-world computing workloads, though. For example, some workloads need specialized hardware such as GPUs. The Kubernetes Resource Management Working Group focuses on exactly this area and works toward aligning projects and technologies so that more diverse kinds of workloads can run on the Kubernetes platform.


When I read Abhishek Tiwari's post on big data workloads on Kubernetes, I was intrigued by the amount of work that has happened. This post explores running machine learning/deep learning workloads on Kubernetes, along with the various projects and companies working in this area. One of the key requirements for running ML workloads on Kubernetes is GPU support. The Kubernetes community has been building support for GPUs since v1.6, and the Kubernetes documentation has the details.
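To make the GPU requirement concrete, here is a minimal sketch of a Pod manifest that asks for a GPU, built as a plain Python dictionary and printed as JSON (which `kubectl apply -f` accepts). It assumes a cluster where the NVIDIA device plugin exposes the `nvidia.com/gpu` resource; early Kubernetes releases exposed GPUs under a different, alpha resource name. The helper name `gpu_pod_manifest`, the Pod name, and the image are hypothetical and only serve the illustration.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that asks the scheduler for NVIDIA GPUs.

    Assumption: the cluster exposes GPUs as the extended resource
    "nvidia.com/gpu" (via the NVIDIA device plugin).
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested under "limits" only.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = gpu_pod_manifest("gpu-training", "example.com/ml/train:latest")
    print(json.dumps(manifest, indent=2))
```

The scheduler will only place such a Pod on a node that advertises enough free GPUs, which is what lets ML workloads share a cluster with ordinary services.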

Why Kubernetes?

Before we dive deeper into various projects and efforts on enabling ML/DL on Kubernetes, let's answer the question: Why Kubernetes? Kubernetes offers some advantages as a platform; for example:

  • A consistent way of packaging an application (container) enables consistency across the pipeline — from your laptop to the production cluster.
  • Kubernetes is an excellent platform to run workloads over multiple commodity hardware nodes while abstracting away the underlying complexity and management of nodes.
  • You can scale based on demand — the application as well as the cluster itself.
  • Kubernetes is already a well-accepted platform to run microservices and there are efforts underway, for example, to run serverless workloads on Kubernetes. It would be great to have a single platform which can abstract the underlying resources and make it easy for the operator to manage the single platform.
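For the "scale based on demand" point, application-level scaling in Kubernetes is commonly handled by the Horizontal Pod Autoscaler, whose documented rule is: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below only reproduces that arithmetic; the function name and the `min_replicas`/`max_replicas` defaults are illustrative assumptions, not values taken from any cluster.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Approximate the Horizontal Pod Autoscaler replica calculation.

    The min/max bounds are hypothetical defaults for this sketch.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target.
    print(desired_replicas(4, 90, 60))
```

Scaling the cluster itself (adding or removing nodes) is a separate mechanism and is not modeled here.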

RAD Analytics

Some of the early efforts to enable intelligent applications on Kubernetes/OpenShift were made by Red Hat in the form of RAD Analytics. The early focus of the project was to enable Spark clusters on Kubernetes, which was done through the Oshinko projects. For example, the oshinko-cli, combined with other projects, can be used to deploy a Spark cluster on Kubernetes. The tutorials on the website include many examples, and you can also track the progress of the various projects there.

Paddle From Baidu

Baidu open sourced its deep learning and machine learning platform, Paddle (PArallel Distributed Deep LEarning), written in Python, in September 2016. It then announced in February 2017 that Paddle can run on Kubernetes. Paddle can be used for image recognition, natural language processing, and recommendation systems. Paddle uses Kubernetes-native constructs such as Jobs to run pieces of training and to finish the job when the training run completes. It also runs trainer pods and scales them as needed so that the workload can be distributed effectively.
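As a rough illustration of the Job construct mentioned above, the sketch below builds a generic `batch/v1` Job manifest that runs a fixed number of trainer pods in parallel and is considered finished once all of them succeed. This is not Paddle's actual manifest: the function name, the Job name, and the container image are hypothetical, and a real distributed training setup would need more configuration than this.

```python
import json


def training_job_manifest(name, image, trainers):
    """Build a Kubernetes Job that runs `trainers` pods in parallel.

    Generic sketch only; names and image are assumptions, not Paddle's.
    """
    if trainers < 1:
        raise ValueError("trainers must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run all trainer pods at once ...
            "parallelism": trainers,
            # ... and mark the Job complete when each has succeeded.
            "completions": trainers,
            "template": {
                "spec": {
                    # Job pod templates must use Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [{"name": "trainer", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    job = training_job_manifest("example-training", "example.com/ml/trainer:latest", 3)
    print(json.dumps(job, indent=2))
```

Because a Job tracks completion rather than keeping pods alive indefinitely, it fits training runs that should end on their own.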

Commercial Options

Some companies provide commercial software, or open-source software with limited capabilities, for running machine learning on Kubernetes. For example, Seldon has an open-source core but requires a commercial license beyond a certain scale. Microsoft has run machine learning on its Kubernetes offering with technology from Litbit, as explained here. RiseML is a startup that provides a machine learning platform that runs on Kubernetes.


Kubeflow

Announced at KubeCon 2017 in Austin, Kubeflow is an open source project from Google that aims to simplify running machine learning jobs on Kubernetes. The initial version supports Jupyter notebooks and TensorFlow jobs, but the eventual goal is to support additional open source tooling used for machine learning on Kubernetes. Kubeflow aims to be a toolkit that simplifies deploying and scaling machine learning using the tools of the user's choice.


A data scientist focused on solving machine learning problems may not have the expertise or time to build the scalable infrastructure required to run large-scale jobs. Similarly, an infrastructure engineer who can build scalable infrastructure may not be a master of machine learning. This theme is visible in all the projects we explored: bridge the gap between the two so that large ML/DL programs can run on scalable infrastructure. Although there is work to be done, this is a great start and something to look forward to in 2018 and beyond!


Published at DZone with permission of

