I work with a lot of data science teams at our enterprise customers, and in the past several months, I've seen increased adoption of machine learning and deep learning frameworks for a wide range of applications.
As with other use cases in big data analytics and data science, these teams want to run their preferred deep learning frameworks and tools in Docker containers on the BlueData EPIC software platform. So part of my job is trying out these cool new tools, making sure they run as they should on our platform, and helping to develop new functionality that might solve any challenges.
One of the most popular open-source frameworks for deep learning and machine learning is TensorFlow. TensorFlow was originally developed by researchers and engineers working at Google to conduct machine learning and deep neural network research. However, it's general enough to be applicable to many other use cases. Other deep learning applications of TensorFlow include image recognition, natural language processing with free text data, and threat detection and monitoring.
"TensorFlow is an open-source software library for machine learning across a range of tasks. It is a system for building and training neural networks to detect and decipher patterns and correlations, analogous to (but not the same as) human learning and reasoning." — Wikipedia
TensorFlow allows for the distribution of computations across a wide variety of heterogeneous systems, including CPUs and GPUs. To accelerate the computation of TensorFlow jobs, several of the data science teams I've worked with use GPUs. However, GPUs are costly, and the resources need to be managed carefully. And this is where we found some challenges that our software platform can help address.
Considerations for Deploying TensorFlow
Here are some of the challenges and considerations when deploying data science applications, and TensorFlow in particular, at large scale in the enterprise:
- How to manage the deployment complexity (for example, between OS, kernel libraries, and TensorFlow versions).
- How to support transient cluster creation for the duration of a job.
- How to isolate resources in use and prevent simultaneous access to them.
- How to manage quotas and allocation for GPU-enabled and CPU resources in a shared, multi-tenant environment.
The BlueData EPIC software platform can address these challenges, providing data science teams with on-demand access to a wide range of Big Data analytics, data science, machine learning, and deep learning tools. Using Docker containers, our Big-Data-as-a-Service software platform can support large-scale distributed data science and deep learning use cases in a flexible, elastic, and secure multi-tenant architecture.
And with the new fall release, BlueData can now support clusters accelerated with GPUs and provide the ability to run TensorFlow for deep learning on GPUs or on Intel architecture CPUs. Using the BlueData EPIC software platform, data scientists can spin up instant TensorFlow clusters for deep learning running on Docker containers. BlueData supports both CPU-based TensorFlow that runs on Intel Xeon hardware with Intel Math Kernel Library (MKL) and GPU-enabled TensorFlow with NVIDIA CUDA libraries, CUDA extensions, and character device mappings for Docker containers.
The BlueData EPIC software platform can provide self-service, elastic, and secure environments for TensorFlow on-premises, in the public cloud, or in a hybrid architecture that combines the two. All of this is available from the same interface, with the same user experience regardless of the underlying infrastructure.
As illustrated in the graphic below, this means that our customers can easily spin up instant TensorFlow clusters with BigDL for deep learning with BlueData just as they do today for other big data analytics, data science, and machine learning environments. And they can specify placement of Docker containers running TensorFlow on infrastructure configured with GPUs or CPUs and in the public cloud or on-premises.
On-Demand TensorFlow Clusters
With BlueData EPIC, users can create TensorFlow clusters on demand with just a few mouse clicks. And with the host tagging introduced in our new fall release, they can create GPU-enabled or CPU-based clusters on the hardware that suits their particular workload (as indicated in the screenshot below).
Once created, the cluster will have one or more nodes of Docker containers deployed with TensorFlow software and the appropriate GPU and/or CPU acceleration libraries. For example, a GPU-enabled TensorFlow cluster would have NVIDIA CUDA and CUDA extensions within the Docker containers, whereas a CPU-based TensorFlow cluster would have Intel MKL packaged within the Docker image along with a Jupyter notebook.
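To make the idea of tag-based placement concrete, here is a minimal sketch, assuming tags are simple key/value pairs attached to each host. The `Host` class, the `eligible_hosts` function, and the `accelerator` and `location` tag keys are hypothetical names chosen for illustration; they are not part of the BlueData EPIC interface.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """A host machine with free-form tags (hypothetical model)."""
    name: str
    tags: dict = field(default_factory=dict)  # e.g. {"accelerator": "gpu"}


def eligible_hosts(hosts, required_tags):
    """Return the hosts whose tags include every required key/value pair."""
    return [
        host for host in hosts
        if all(host.tags.get(key) == value for key, value in required_tags.items())
    ]


# Illustrative inventory; the host names and tag values are made up.
inventory = [
    Host("host-1", {"accelerator": "gpu", "location": "on-premises"}),
    Host("host-2", {"accelerator": "cpu", "location": "on-premises"}),
    Host("host-3", {"accelerator": "gpu", "location": "cloud"}),
]

# A GPU-enabled cluster would only be placed on hosts tagged for GPUs.
gpu_hosts = eligible_hosts(inventory, {"accelerator": "gpu"})
print([host.name for host in gpu_hosts])
```

An empty set of required tags matches every host, so a workload with no hardware preference could be placed anywhere in this model.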
Efficient GPU Resource Management
GPUs and specialized CPUs are generally not identified as a separate resource for Docker containers. BlueData EPIC handles this by managing a shared pool of GPUs across all host machines and allocating the requested number of GPUs to a cluster at cluster creation time. This exclusive allocation guarantees the quality of service for deep learning jobs and prevents multiple processing jobs from trying to access the same resource simultaneously.
For most enterprise organizations today, GPUs are a premium resource and need to be utilized efficiently. When a cluster is not in use or has finished running a job, BlueData EPIC can stop the cluster and assign its GPUs to a different cluster. This allows users to create multiple clusters in different tenant environments and use GPUs only when they need them, without deleting or recreating their clusters. There is also a mechanism to create a transient cluster that exists only for the duration of a job.
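The allocation behavior described above can be pictured with a toy model. This is a minimal sketch, assuming each GPU has a unique identifier and that a stopped cluster keeps its definition while giving up its GPUs. The `GpuPool` class and its method names are hypothetical and do not represent the BlueData EPIC implementation or API.

```python
from dataclasses import dataclass, field


class GpuPoolError(Exception):
    """Raised when a request cannot be satisfied from the pool."""


@dataclass
class GpuPool:
    """Toy shared GPU pool with exclusive assignment (hypothetical model)."""
    free: set = field(default_factory=set)      # GPU ids nobody holds
    held: dict = field(default_factory=dict)    # running cluster -> its GPU ids
    wanted: dict = field(default_factory=dict)  # cluster -> GPUs requested

    def create_cluster(self, name, gpus):
        """Record the request, then try to start the cluster right away."""
        self.wanted[name] = gpus
        self.start_cluster(name)

    def start_cluster(self, name):
        """Give the cluster exclusive GPUs, or raise and leave the pool unchanged."""
        if name in self.held:
            return  # already running with its GPUs
        need = self.wanted[name]
        if need > len(self.free):
            raise GpuPoolError(f"{name} needs {need} GPUs, {len(self.free)} free")
        grant = set(sorted(self.free)[:need])
        self.free -= grant
        self.held[name] = grant

    def stop_cluster(self, name):
        """Return the cluster's GPUs to the pool; its definition is kept."""
        self.free |= self.held.pop(name, set())


# Illustrative usage; the GPU ids and cluster names are made up.
pool = GpuPool(free={"host-1:0", "host-1:1", "host-2:0"})
pool.create_cluster("training", gpus=2)
pool.stop_cluster("training")          # GPUs go back to the shared pool
pool.create_cluster("tuning", gpus=3)  # can now use all three GPUs
```

Because a GPU id is moved out of `free` before it is recorded in `held`, no two running clusters can hold the same GPU in this model, which is the isolation property the section describes.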
Improved User Productivity
Once the TensorFlow cluster is created, the containers can be enabled with AD/LDAP-controlled SSH access and secure Jupyter notebooks.
Sample Jupyter notebooks are included with the TensorFlow cluster by default for immediate validation and testing as shown in the screenshot below.
The samples shown in the screenshot above are from this GitHub repo. These and other tutorials are available for users to get started and be productive immediately with TensorFlow.
This screenshot shows the results, comparing input images and output predictions:
Ability to Update Running TensorFlow Clusters
New libraries and packages are constantly being introduced and the needs of data science teams are constantly changing, so BlueData EPIC provides a mechanism called "action scripts" that allows users to simultaneously update all nodes of a running cluster with new libraries and packages. Users can also submit Python jobs as interactive or batch jobs for long-running processes via the web-based UI or a RESTful API.