
Making Machine Learning on Kubernetes Portable and Observable

This step-by-step tutorial shows how to set up Kubeflow, a tool that simplifies setting up a portable machine learning stack, together with Weave Cloud on Google Cloud Platform.


One of the big announcements from KubeCon + CloudNativeCon in early December 2017 was Kubeflow, an open-source project “dedicated to making machine learning (ML) on Kubernetes easy, portable, and scalable.” Kubeflow is aimed at users who want portable stacks, more control over their ML stacks, simpler ML stacks, or the ability to use their Kubernetes deployments on different platforms, on-premises, etc. For instance, one vision for Kubeflow is to have different teams (data scientists, devs, IT, etc.) share or hand off systems without worrying about who still needs to manage the underlying infrastructure. Kubeflow is intended to leverage Kubernetes’ ability to deploy on diverse infrastructure, deploy and manage loosely coupled microservices, and scale based on demand. For people who use a single-cloud, hosted ML service today, Kubeflow may offer an alternative that meets different needs.

Kubeflow and Weave Cloud

Among Weaveworks’ users today, companies such as Qordoba and Seldon already see the value of using Kubernetes for machine learning. Stay tuned for future Weaveworks blog posts that will cover potential use cases around Kubeflow and Weave.

Kubeflow users can leverage Weave Cloud to simplify the observability, deployment, and monitoring of Kubeflow running on Kubernetes clusters. In particular, if Kubeflow users follow the GitOps methodology, they can keep their manifests in a repo as a single source of truth, so the vision of sharing and handing off Kubeflow systems can be operationalized with git push and Weave Cloud.

Weave Cloud’s monitoring capabilities also help with a variety of metrics, including resource management, which can be critical for Kubeflow. Weave Cloud’s UI offers quick interactive views of CPU and memory usage from the high-level overview.

You can also drill down to resource monitoring for processes, containers, pods, and hosts, as well as by service and namespace; the steps below show how.
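Because Weave Cloud’s monitoring is based on Prometheus, views like these correspond to PromQL queries. The following is a minimal sketch, not Weave Cloud’s own implementation, of how such queries could be built in Python. The metric names are the standard cAdvisor container metrics; the `kubeflow` namespace, the `pod_name` label (newer Kubernetes versions use `pod`), and the `http://localhost:9090` base URL are assumptions for illustration only.

```python
from urllib.parse import urlencode


def cpu_usage_query(namespace, group_by="pod_name", window="5m"):
    """Build a PromQL expression for CPU cores used, summed per label value.

    Assumption: the cluster exposes cAdvisor's
    container_cpu_usage_seconds_total metric with `namespace` and
    `pod_name` labels (newer clusters label pods as `pod`).
    """
    return (
        "sum(rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}"}}[{window}])) by ({group_by})'
    )


def memory_usage_query(namespace, group_by="pod_name"):
    """Build a PromQL expression for memory bytes in use, summed per label value."""
    return (
        "sum(container_memory_usage_bytes"
        f'{{namespace="{namespace}"}}) by ({group_by})'
    )


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical namespace and Prometheus address; replace with your own.
    query = cpu_usage_query("kubeflow")
    print(query)
    print(instant_query_url("http://localhost:9090", query))
```

Changing `group_by` to `namespace` gives the per-namespace view instead of the per-pod view.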

Getting Started With Kubeflow and Weave Cloud

  1. Kubeflow on GitHub: Clone the Kubeflow repository. (This gives you the Kubernetes manifests needed to run Kubeflow on a production cluster.)
  2. Google Cloud Platform:
    1. Set up a cluster in Google Cloud Platform.
    2. We created a basic three-node cluster for this demo.
    3. Follow the steps for Kubeflow on Google Kubernetes Engine.
    4. Follow the Quick Start steps (which use ks apply). This recursively installs all of the manifests that appear in components.
    5. Optional: If you want to run code to train TensorFlow convolutional neural network (TF CNN) models, you can also submit the training jobs with kubectl.
  3. Weave Cloud:
    1. Create a free trial account.
    2. Follow the set-up steps for Weave Cloud Deploy, Explore, and Monitor. (You will be deploying agents to send metrics to leverage the Weave Cloud UI.)
    3. Note: The Weave Cloud Deploy set-up process injects a deploy key into the GitHub repo that you created earlier. (You can’t inject the key into Google’s Kubeflow repo, which is why you clone the Kubeflow repo into your own GitHub account.)
    4. Click on the Explore button in Weave Cloud to visualize your new Kubeflow cluster in Google Cloud.
    5. Weave Cloud also offers monitoring based on hosted, multi-tenant, and scalable Prometheus. With Weave Cloud, you can store Prometheus metrics over months for querying using Prometheus’ powerful query language, PromQL.
    6. Weave Cloud’s monitoring capabilities also help with a variety of metrics, including resource management, which can be critical for Kubeflow. To drill down to processes, containers, and hosts, click on the “bar graph” icon.
    7. You can drill down to pods in the same way. Note the “water level” in each pod, which indicates CPU usage (for example, for the tf-job-operator pod).
    8. Weave Cloud’s monitoring also shows resources by service and namespace.
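The Google Cloud Platform part of the steps above (step 2) can be sketched as a short Python script that lists the commands and runs them only on request. This is a rough outline under stated assumptions, not the official Kubeflow procedure: the cluster name `kubeflow-demo`, the zone `us-central1-a`, the ksonnet environment `default`, and the component name `kubeflow-core` are illustrative, and `ks apply` is assumed to run inside a ksonnet app directory that was already initialized by following the Kubeflow Quick Start.

```python
import shlex
import subprocess


def setup_commands(cluster="kubeflow-demo", zone="us-central1-a", nodes=3,
                   ks_env="default", ks_component="kubeflow-core"):
    """Return the command lines for the cluster set-up, in order.

    All default values are placeholders (assumptions), not values
    taken from the tutorial.
    """
    return [
        # Create a basic cluster (the demo used three nodes).
        ["gcloud", "container", "clusters", "create", cluster,
         "--zone", zone, "--num-nodes", str(nodes)],
        # Point kubectl at the new cluster.
        ["gcloud", "container", "clusters", "get-credentials", cluster,
         "--zone", zone],
        # Apply the Kubeflow manifests; must run inside the ksonnet app directory.
        ["ks", "apply", ks_env, "-c", ks_component],
        # Check that the Kubeflow pods are coming up.
        ["kubectl", "get", "pods"],
    ]


def run(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: nothing is created until you pass dry_run=False.
    run(setup_commands())
```

Keeping the dry run as the default lets you review the exact commands, and adjust the names to match your project, before anything is created in your Google Cloud account.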

The Kubeflow announcement reached #1 on Hacker News when the project was unveiled at KubeCon, and there has been a lot of interest since, with ML companies testing it out.

We hope you found this step-by-step guide helpful. Please join our conversations on Twitter @weaveworks and @kubeflow, or share your thoughts on Slack in #weaveworks and #kubeflow.

machine learning, kubernetes, kubeflow, ai, tutorial, weave cloud

Published at DZone with permission of

