There are two clear trends in the Big Data ecosystem: the growth of Machine Learning use cases that leverage large distributed data sets, and the growth of Spark’s Machine Learning libraries (often referred to as MLlib) for these use cases. In fact, MLlib is arguably the leading solution for Machine Learning on large distributed data sets.
Intel and Cloudera have collaborated to speed up Spark’s ML algorithms by integrating them with Intel’s Math Kernel Library (Intel® MKL).
Intel MKL is a library of math routines that are hand-optimized specifically for Intel processors. It includes highly optimized routines for Linear Algebra, Fast Fourier Transforms (FFT), Vector Math, and Statistics functions. These mathematical operations are building blocks for Machine Learning and related analytic algorithms, so integration with MKL can deliver a substantial performance boost for machine learning workloads.
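To make the "building blocks" point concrete, here is a minimal sketch in plain Python of one gradient-descent step for logistic regression. It is not Spark's implementation; the function names (dot, gemv, gemv_t, logistic_gradient_step) and the toy data are illustrative assumptions. The point is that nearly all of the arithmetic sits in the matrix-vector products, which are the kind of operation a BLAS library such as MKL accelerates.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors (the BLAS level 1 'dot' operation)."""
    return sum(a * b for a, b in zip(x, y))

def gemv(matrix, vector):
    """Matrix-vector product X v (the BLAS level 2 'gemv' operation)."""
    return [dot(row, vector) for row in matrix]

def gemv_t(matrix, vector):
    """Transposed matrix-vector product X^T v."""
    n_cols = len(matrix[0])
    return [sum(row[j] * v for row, v in zip(matrix, vector)) for j in range(n_cols)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gradient_step(features, labels, weights, learning_rate):
    """One batch gradient-descent step on the mean logistic loss."""
    predictions = [sigmoid(z) for z in gemv(features, weights)]
    residuals = [p - y for p, y in zip(predictions, labels)]
    gradient = [g / len(labels) for g in gemv_t(features, residuals)]
    return [w - learning_rate * g for w, g in zip(weights, gradient)]

# Toy data (illustrative): first column is a bias term,
# and the label is 1 when the second feature is positive.
features = [[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]]
labels = [1.0, 0.0, 1.0, 0.0]
weights = [0.0, 0.0]
for _ in range(100):
    weights = logistic_gradient_step(features, labels, weights, learning_rate=0.5)
print(weights)
```

In a real workload the feature matrix has millions of rows, so the two matrix-vector products dominate the run time of every iteration. That is why swapping in a faster implementation of those products can change the overall training time so much.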
Spark can already take advantage of optimized implementations of these routines through netlib-java, but the optimizations only take effect once a native implementation such as MKL is installed.
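The selection logic can be pictured as a fallback chain. The sketch below is a plain-Python model of that idea, not netlib-java's code. It assumes the loader tries a system-provided native library first, then a reference native build, and finally the pure-JVM F2J translation, which is the order netlib-java's documentation describes. The select_blas function and the sets passed to it are hypothetical names used only for illustration.

```python
# Preference order, from the fastest option to the pure-JVM fallback.
# The class names are the ones netlib-java uses; the loader itself is a model.
PREFERENCE_ORDER = [
    "NativeSystemBLAS",  # system library such as MKL or OpenBLAS
    "NativeRefBLAS",     # reference native build
    "F2jBLAS",           # pure-JVM translation of the Fortran reference code
]

def select_blas(available):
    """Return the first implementation in PREFERENCE_ORDER that is available."""
    for name in PREFERENCE_ORDER:
        if name in available:
            return name
    raise RuntimeError("no BLAS implementation available")

# Without a native library installed, the JVM fallback is used.
print(select_blas({"F2jBLAS"}))
# With a native library exposed as the system BLAS, the native path is chosen.
print(select_blas({"NativeSystemBLAS", "F2jBLAS"}))
```

Under this model, nothing in the calling code changes between the two cases; only the set of libraries present on the machine does.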
We benchmarked the performance of MKL against the default JVM-based execution (referred to as F2JBLAS) and against a popular open source optimized BLAS library called OpenBLAS. For the benchmark, we used an open source suite of performance tests called spark-perf.
We selected seven popular Machine Learning Algorithms for the benchmark:
ALS: Alternating Least Squares for collaborative filtering.
PCA: Principal Component Analysis.
LDA: Latent Dirichlet Allocation.
SVD: Singular Value Decomposition.
Logistic Regression classifier.
RF: Random Forest classifier.
GBT: Gradient Boosted Tree classifier.
For the popular Logistic Regression algorithm (arguably still the most widely used algorithm for building predictive analytics use cases), MKL provides a 9x performance boost vs. F2JBLAS, and a 2.5x performance boost over OpenBLAS.
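The two ratios quoted for Logistic Regression also imply how OpenBLAS compares with F2JBLAS. Here is a minimal worked calculation, assuming both speedups were measured on the same workload and hardware, and defining speedup as baseline run time divided by optimized run time (the variable names are illustrative):

```python
# Speedups reported for Logistic Regression
mkl_vs_f2jblas = 9.0    # MKL vs. the default JVM-based execution
mkl_vs_openblas = 2.5   # MKL vs. OpenBLAS

# speedup(A vs B) = time_B / time_A, so
# time_F2J / time_OpenBLAS = (time_F2J / time_MKL) / (time_OpenBLAS / time_MKL)
openblas_vs_f2jblas = mkl_vs_f2jblas / mkl_vs_openblas
print(f"OpenBLAS vs F2JBLAS: {openblas_vs_f2jblas:.1f}x")

# Share of the F2JBLAS run time that remains with each library
for name, speedup in [("OpenBLAS", openblas_vs_f2jblas), ("MKL", mkl_vs_f2jblas)]:
    print(f"{name}: {100.0 / speedup:.0f}% of the F2JBLAS run time")
```

Under those assumptions, OpenBLAS works out to roughly 3.6x over F2JBLAS, so most of the gain comes from moving off the JVM-based path and MKL adds a further 2.5x on top.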
ALS, which is the most popular algorithm for building recommender systems, demonstrates a 4x performance boost.
As you can see, the performance boost varies across algorithms. It will also vary somewhat across Intel processors. This benchmark was run on Intel® Xeon® E5-2697A processors; more details about the hardware are provided below.
The best part about the performance boost with MKL is that it does not require a modified version of Spark or modifications to your Spark application code, nor does it require procurement of extra or special hardware (after all, most data centers in the world are running Intel processors). It merely requires a few steps to install MKL on your cluster. Moreover, we will soon simplify the installation process by providing MKL as a parcel which can be installed on your entire CDH cluster via two simple clicks in Cloudera Manager.
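Before and after installing, it can be useful to check which native BLAS libraries a node's dynamic linker can see. The following is a rough sketch using Python's standard ctypes.util.find_library. The library names probed (mkl_rt, openblas, blas) are assumptions about a typical Linux install, and finding a library this way does not by itself prove that Spark will load it.

```python
from ctypes.util import find_library

# Short library names to probe (assumed; adjust for your distribution)
CANDIDATES = ["mkl_rt", "openblas", "blas"]

def visible_blas_libraries(candidates=CANDIDATES):
    """Map each candidate name to what the linker resolves it to, or None."""
    return {name: find_library(name) for name in candidates}

for name, resolved in visible_blas_libraries().items():
    status = resolved if resolved else "not found"
    print(f"{name}: {status}")
```

Running this on each worker node gives a quick picture of whether the installation is consistent across the cluster.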
The benefits of these performance gains are clear: improved performance means you can train with larger data sets, explore a larger range of the model hyper-parameter space, and train more models. In many cases, it also removes the need to buy specialized hardware for your Machine Learning workloads.