Kubeflow Fundamentals Part 5: Getting Started With Notebooks
In this post, we’ll focus on getting a little more familiar with one of the first Kubeflow components that data scientists need familiarity with, Kubeflow Notebooks.
Join the DZone community and get the full member experience.Join For Free
Welcome to the fifth blog post in our “Kubeflow Fundamentals” series specifically designed for folks brand new to the Kubelfow project. The aim of the series is to walk you through a detailed introduction of Kubeflow, a deep-dive into the various components, add-ons, and how they all come together to deliver a complete MLOps platform.
If you missed the previous installments in the “Kubeflow Fundamentals” series, you can find them here:
In this post, we’ll focus on getting a little more familiar with one of the first Kubeflow components data scientists will get familiar with, Kubeflow Notebooks. Ok, let’s dive right in!
What Are Kubeflow Notebooks?
Kubeflow Notebooks provide a way to run web-based development environments inside a Kubernetes cluster. These Notebook servers run as containers inside a Kubernetes Pod.
Which development environment is available inside of Kubeflow (and which packages are installed) is determined by the Docker image used to invoke the Notebook server. Note that users can create notebook containers directly in the Kubeflow cluster, rather than having to configure everything locally on their laptops. Another benefit of this arrangement is that admins can provide standard notebook images for their organization with all the required packages pre-installed.
There are plenty of stories in the data science community of it taking hours, days, even weeks to just get the notebook environment set up correctly with all the tools, libraries and dependencies sorted out.
A final advantage is that access control can be managed by Kubeflow’s role-based access control capabilities. This enables easier and more secure notebook sharing across an organization.
What Development Environments Are Supported?
Kubeflow Notebooks natively supports three types of IDEs:
- Visual Studio Code (code-server)
But technically, any web-based IDE should work!
RStudio is an IDE for R and Python, with a console and syntax-highlighting editor. It supports direct code execution and features tools for plotting, history, debugging, and workspace management. It is currently branded as RStudio Team with both open source and commercial versions.
- RStudio Workbench for data scientists
- RStudio Connect for business users
- RStudio Package Manager for DevOps
To learn more about RStudo, check out: https://www.rstudio.com/
Visual Studio Code
Visual Studio Code is a free, lightweight, and cross-platform code editor from Microsoft. VS code-server allows you to run VS Code on any machine and access it in the browser which is critical for Kubeflow Notebook integration. Its features include IntelliSense, in editor debugging and Git commands built-in. Finally, it supports not only Python but also a large variety of other languages and frameworks.
For more information, check out: https://github.com/cdr/code-server
JupyterLab is an open-source, web-based environment for creating Jupyter notebook documents. The development environment is very useful in interactive data science and scientific computing projects across a variety of programming languages. Jupyter’s JSON documents follow a versioned schema, contain an ordered list of input/output cells, can contain code, text (via Markdown), math (via LaTex), plots, and even rich media.
One thing worth mentioning is that the term “notebook” term can often refer to the Jupyter web application, Jupyter Python web server, or the Jupyter document format, so pay attention to the context in which the term is being used!
For more information, check out: https://jupyter.org/
JupyterLab in Kubeflow
The Kubeflow Notebooks Web App enables you to easily spin up JupyterLab Notebook Servers. These servers can contain multiple Jupyter Notebooks, but it is more common to have a 1:1 mapping between notebook and server. Ok, let’s walk you through the process of setting up a new Notebook Server!
Obviously, the first thing you need to do is to install Kubeflow. You basically have two options, go with a packaged distribution from a vendor or roll/support your own via manifests. (For a complete discussion on installation options, check out: Kubeflow Fundamentals Part 3: Distributions and Installations.
For the purposes of this walk-through, we are going to go with the MiniKF packaged distribution. Why?
- It is the easiest way to get started on AWS, GCP, or locally
- Comes pre-bundled with all the correct and necessary Kubeflow components
- Is pre-configured with the open-source Kale Jupyter extension to make pipelines and AutoML jobs a snap
- Plus the Rok data management add-on that makes snapshotting and restoring notebooks or entire clusters (including data and metadata) a point and click operation
For more info, check out: Install Kubeflow
Create a New Notebook Server
Now that we have Kubeflow installed we are ready to create a Notebook Server.
- Select the appropriate namespace
- Click on “Notebooks”
- Click on “New Server”
- Name your Notebook Server
Note: Notebook Server names must consist of lowercase alphanumeric characters or ‘-‘, start with an alphabetic character, and end with an alphanumeric character.
Configure the Notebook Server (Basics)
Next up, let’s set the required configuration options.
- Select a Docker image for your notebook server
- Two options: custom images or standard images (provided by your admin)
- Specify CPU and RAM
- Specify a “workplace volume” to be mounted as a PVC Volume
- Click “Launch”
Container Images (Optional)
Kubeflow ships with some base VS Code, RStudio, and JupyterLab example container images.
There are also extended images that build off the base images that showcase PyTorch, SciPy, TensorFlow, and more. You can choose these options instead of the image you’ve provided. When you install Kubeflow via MiniKF, you actually get all the images you’ll need to work through any of our four tutorials by default!
Notebook Server from a Snapshot (Optional)
Creating a Notebook Server from a previous snapshot is only possible if you have installed Kubeflow via the MiniKF distribution. Assuming you’ve already made a snapshot of the Notebook, here are the necessary steps:
- Select a namespace
- Click “New Notebook”
- Select the Rok file chooser
- Select the Notebook via a Rok URL
- Name the Notebook
- Click “Launch”
Configure the Notebook Server (Optional)
There are some final optional Notebook Server options worth mentioning which may or may not be relevant based on your use case and environment:
- Data volumes
- Configurations – PodDefault resources
- Enabling shared memory
About the Notebook Pod ServiceAccount
Kubeflow assigns the default-editor Kubernetes ServiceAccount to Notebook pods.
Which in turn is bound to the kubeflow-edit ClusterRole, which has namespace scoped permissions. You can get the full list of RBAC for ClusterRole kubeflow-edit using:
kubectl describe clusterrole kubeflow-edit
Note that this setup means you can run kubectl inside the pod without providing any additional authentication!
How are Kubeflow Notebooks Governed and Developed?
Recall from previous posts that Kubeflow development is broken up into smaller “working groups” that focus on a subset of components and features. “WG Notebooks” is responsible for the user experience around Notebooks and their integrations with Kubeflow.
What’s Next? Part 6 – Working With Jupyter Lab Notebooks
Stay tuned for the next blog in this series where we’ll focus on getting a little more familiar with Jupyter notebooks and how we can leverage them within Kubeflow as part of our machine learning workflow.
Published at DZone with permission of Jimmy Guerrero. See the original article here.
Opinions expressed by DZone contributors are their own.