Deep Learning on Qubole Using BigDL for Apache Spark (Part 1)
Get started with the distributed deep learning library BigDL on Qubole, which offers Apache Spark as a greatly optimized service.
BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, Qubole makes for a perfect deployment platform.
In this Part 1 of a two-part series, you will learn how to get started with the distributed deep learning library BigDL on Qubole. By the end, you will have BigDL installed on a Spark cluster with a distributed deep learning library readily available for you to use in your deep learning applications running on Qubole.
In Part 2, you will learn how to write a deep learning application on Qubole that uses BigDL to identify handwritten digits (0 to 9) using a LeNet-5 (convolutional neural network) model that you will train and validate using the MNIST database.
Before we get started, here's some introduction and background on the technologies involved.
What Is Deep Learning?
Deep learning is a form of machine learning that uses a model of computing very much inspired by the structure of the brain. It is a kind of machine learning that allows computers to improve with data and achieve great flexibility by learning to represent the world as a nested hierarchy of concepts.
In early talks on deep learning, Andrew Ng described it in the context of traditional artificial neural networks. In his talk titled Deep Learning, Self-Taught Learning, and Unsupervised Feature Learning, he described the idea of deep learning as:
Using brain simulations, hope to:
- Make learning algorithms much better and easier to use.
- Make revolutionary advances in machine learning and AI.
I believe this is our best shot at progress towards real AI.
So, What Is BigDL?
BigDL is a distributed deep learning library created and open sourced by Intel. It was designed from the ground up to run natively on Apache Spark and therefore enables data engineers and scientists to write deep learning applications as standard Spark programs without having to explicitly manage distributed computations.
- Rich deep learning support
- Extremely high performance
- Efficient scaling
- BigDL can efficiently scale out to perform data analytics at “big data scale” by leveraging Apache Spark, along with efficient implementations of synchronous SGD and all-reduce communications on Spark.
For more details on BigDL, click here.
Why BigDL on Qubole?
BigDL runs natively on Apache Spark, and because Qubole offers a greatly enhanced and optimized Spark as a service, it makes for a perfect deployment platform.
Highlights of Apache Spark As a Service Offered on Qubole
Let's look at auto-scaling Spark clusters, heterogeneous Spark clusters on AWS, and optimized split computation for Spark SQL.
- Auto-scaling Spark clusters
- In the open-source version of auto-scaling in Apache Spark, the executors required to complete a task are added in multiples of two. In Qubole, we’ve enhanced the auto-scaling feature to add the required number of executors based on a configurable SLA.
- With Qubole’s auto-scaling, cluster utilization is matched precisely to the workloads, so no compute resources are wasted, which also lowers TCO. Based on our benchmark on performance and cost savings, we estimate that auto-scaling saves a Qubole customer over $300K per year for just one cluster.
- Heterogeneous Spark clusters on AWS
- Qubole supports heterogeneous Spark clusters for both on-demand and spot instances on AWS. This means that the slave nodes in Spark clusters may be of any instance type.
- For on-demand nodes, this is beneficial when AWS does not grant the requested number of primary instance type nodes at the time of the request. For spot nodes, it’s advantageous when either the spot price of the primary slave type is higher than the spot price specified in the cluster configuration or AWS does not grant the requested number of spot nodes at the time of the request.
- Optimized split computation for Spark SQL
- We’ve implemented an optimization for AWS S3 listings that enables split computations to run significantly faster on Spark SQL queries. As a result, we’ve recorded improvements of up to 6X on query execution and up to 81X on AWS S3 listings.
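To make the open-source ramp-up mentioned under auto-scaling a little more concrete, here is a rough sketch of how many request rounds it takes to reach a target executor count. It assumes the exponential pattern described in the Spark dynamic allocation documentation (1 executor in the first round, then 2, 4, 8, and so on), capped at whatever is still needed. The function name `rounds_to_reach` and the target of 20 are hypothetical, and the sketch does not model Qubole's SLA-based behavior.

```shell
#!/bin/sh
# Simplified model (an assumption, not Spark's actual scheduler code):
# each round requests twice as many executors as the previous round,
# capped at the number still missing.
rounds_to_reach() {
  target=$1
  total=0
  step=1
  round=0
  while [ "$total" -lt "$target" ]; do
    remaining=$((target - total))
    if [ "$step" -gt "$remaining" ]; then
      add=$remaining
    else
      add=$step
    fi
    total=$((total + add))
    round=$((round + 1))
    echo "round $round: +$add executors (total $total)"
    step=$((step * 2))
  done
}

rounds_to_reach 20
```

With a target of 20, this model needs five rounds (+1, +2, +4, +8, +5), which is the kind of ramp-up delay an SLA-driven policy aims to avoid.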
To learn more about Qubole, click here.
Getting Started With BigDL on Qubole
After the prerequisites, this can be done in just nine steps.
Have a Qubole account — for a free trial, click here
Build the BigDL JAR. Once you’ve successfully built the BigDL JAR, its name will be in the format bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar
Download the MNIST database of handwritten digits.
Copy/upload the BigDL JAR, the MNIST data files (train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte), and the test images to an S3 bucket that can be accessed from a remote shell script. (These files will need to be downloaded on the cluster via a bootstrap script.)
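One way to handle the upload prerequisite is with the AWS CLI's `aws s3 cp` command. The sketch below only prints the commands so you can review them first. The bucket name, the `bigdl` key prefix, and the assumption that the files sit in the current directory are all placeholders rather than values from this article. The test images are left out because their file names are not given here.

```shell
#!/bin/sh
# Sketch: print the AWS CLI commands that would upload the BigDL JAR and the
# MNIST files to S3. Every value below is a placeholder (an assumption).
S3_BUCKET="${S3_BUCKET:-YOUR_S3_BUCKET}"
S3_PREFIX="${S3_PREFIX:-bigdl}"   # hypothetical key prefix inside the bucket
BIGDL_JAR="${BIGDL_JAR:-bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar}"
MNIST_FILES="train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte"

# Emit one "aws s3 cp" command per file; nothing is executed here.
print_upload_commands() {
  for f in "$BIGDL_JAR" $MNIST_FILES; do
    echo "aws s3 cp '$f' 's3://$S3_BUCKET/$S3_PREFIX/$f'"
  done
}

print_upload_commands
```

After checking the printed commands, you could run them by piping the output to `sh`, provided the AWS CLI is installed and configured with credentials that can write to the bucket.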
If you don’t have a Spark cluster configured for this application, click here for instructions on how to configure one.
On the Clusters page, select/scroll down to the Spark cluster of your choice and click Edit.
On the Edit Cluster Settings page, click the 4. Advanced Configuration tab.
Scroll down to the SPARK SETTINGS section and copy-and-paste the following in Override Spark Configuration:
Note: These parameters are required by BigDL, and setting them here makes them available to the Spark driver and executors across existing nodes as well as any new nodes that are added during auto-scaling in Qubole.
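The exact override lines are not reproduced here, so the following is only a hedged sketch of how such settings can be generated. It relies on two real Spark property families: `spark.executorEnv.[Name]`, which sets an environment variable on executors, and `spark.yarn.appMasterEnv.[Name]`, which sets it on the YARN application master (the driver in cluster mode). The variable names and values are assumptions modeled on the environment script shipped with early BigDL releases, and the space-separated `key value` layout is also an assumption. Check both against your BigDL version and the Qubole UI before pasting anything.

```shell
#!/bin/sh
# ASSUMPTION: these names/values mirror the environment script bundled with
# early BigDL releases; they are not taken from this article.
BIGDL_ENV="DL_ENGINE_TYPE=mklblas MKL_DISABLE_FAST_MM=1 KMP_BLOCKTIME=0 OMP_WAIT_POLICY=passive OMP_NUM_THREADS=1"

# Print one executor property and one application-master property per variable.
print_spark_overrides() {
  for pair in $BIGDL_ENV; do
    name=${pair%%=*}    # text before the first "="
    value=${pair#*=}    # text after the first "="
    echo "spark.executorEnv.$name $value"
    echo "spark.yarn.appMasterEnv.$name $value"
  done
}

print_spark_overrides
```

Generating the lines from a single list keeps the executor and application-master settings in sync, which matters because a variable set on only one side can be hard to diagnose.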
Save the cluster settings and configuration by clicking on Update. At this point, you should be back on the main Cluster page.
Click on the dotted (...) menu all the way to the right and select Edit Node Bootstrap.
Copy-and-paste the following script:
echo "Setting BigDL env variables in usr/lib/zeppelin/conf/zeppelin-env.sh"
In the above script, replace YOUR_S3_BUCKET with the S3 bucket in your AWS account where you uploaded the BigDL JAR and MNIST data files, and also replace bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar with the name of your BigDL JAR.
Here is what's happening in the above bootstrap script:
- Set Python 2.7 as system default
- Create temp directories that are accessed by our application
- Set the same BigDL environment variables that were configured for Spark in the previous step, this time making them available to the Zeppelin driver running on the master node
- Download the test images we will use in our Spark application
- Download the BigDL JAR so it's available for us to import in our Spark application
- Download the MNIST dataset that we will use to train the model in our application
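The steps listed above can be sketched roughly as follows. This is not the article's original script: the working directory, the `bigdl` S3 key prefix, the environment variable names and values, and the `DRY_RUN` switch are all assumptions added for illustration. Setting Python 2.7 as the system default is platform-specific and is left out, as are the test images, whose file names are not given here. By default the sketch only prints what it would do.

```shell
#!/bin/sh
# Hypothetical bootstrap sketch; all names and paths below are assumptions
# unless they appear in the article text.
set -eu

S3_BUCKET="${S3_BUCKET:-YOUR_S3_BUCKET}"
S3_PREFIX="${S3_PREFIX:-bigdl}"        # hypothetical key prefix
BIGDL_JAR="${BIGDL_JAR:-bigdl-[VERSION]-SNAPSHOT-jar-with-dependencies.jar}"
WORK_DIR="${WORK_DIR:-/tmp/bigdl}"     # hypothetical temp directory
ZEPPELIN_ENV="${ZEPPELIN_ENV:-/usr/lib/zeppelin/conf/zeppelin-env.sh}"
DRY_RUN="${DRY_RUN:-1}"                # 1 = print actions, 0 = execute them

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Append BigDL environment exports to the Zeppelin env file.
# ASSUMPTION: variable names/values follow early BigDL environment scripts.
append_zeppelin_env() {
  echo "Setting BigDL env variables in $ZEPPELIN_ENV"
  for pair in DL_ENGINE_TYPE=mklblas MKL_DISABLE_FAST_MM=1 KMP_BLOCKTIME=0 \
              OMP_WAIT_POLICY=passive OMP_NUM_THREADS=1; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "+ append 'export $pair' to $ZEPPELIN_ENV"
    else
      echo "export $pair" >> "$ZEPPELIN_ENV"
    fi
  done
}

# Create the temp directories, then fetch the JAR and MNIST files from S3.
download_artifacts() {
  run mkdir -p "$WORK_DIR/mnist"
  run aws s3 cp "s3://$S3_BUCKET/$S3_PREFIX/$BIGDL_JAR" "$WORK_DIR/$BIGDL_JAR"
  for f in train-images-idx3-ubyte train-labels-idx1-ubyte \
           t10k-images-idx3-ubyte t10k-labels-idx1-ubyte; do
    run aws s3 cp "s3://$S3_BUCKET/$S3_PREFIX/$f" "$WORK_DIR/mnist/$f"
  done
}

append_zeppelin_env
download_artifacts
```

Running it with `DRY_RUN=0` would perform the actions for real, which requires the AWS CLI on the node and write access to the Zeppelin configuration file.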
Click Save to save the bootstrap script.
Click Start to bring up the cluster.
Once the Spark cluster comes up, the BigDL deep learning library will be readily available for you to use in your Spark application running on Qubole.
See you in Part 2!
Published at DZone with permission of Dharmesh (Dash) Desai, DZone MVB. See the original article here.