Securing Apache Spark Shuffle Using Apache Commons Crypto
Securing Apache Spark Shuffle Using Apache Commons Crypto
Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the current approach.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
When running a big data computing job, the data being processed may contain sensitive information that users don’t want anyone else to access. Encrypting that sensitive data is becoming more and more important, especially for enterprise users.
For Apache Spark, which is the emerging standard for big data processing, data is transferred via network and also spilled to disk during shuffles—thus, unencrypted shuffle data will result in an unprotected Spark job. And because shuffle performance is important for the overall performance of Spark applications, it’s crucial that the shuffle encryption backend perform at a level comparable to the unencrypted case.
Spark shuffle encryption should secure two different operations: RPC and local disk storage. For RPC, shuffle files that are transferred across different executors over the network must be encrypted. And for files, the spilled files of the intermediate data, as well as the output files of the shuffle result, must be similarly encrypted. Furthermore, given the importance of performance in Spark shuffle, the cryptographic library used has to be sufficiently high throughput to limit the overhead of encryption and decryption.
For those purposes, in the current upstream, Spark uses 3DES provided by SASL to encrypt RPC data. However, that approach has two main weaknesses: first, there’s no encryption support for disk spills; and second, the JVM’s SASL encryption implementation does not provide the same level of performance as AES-NI.
Apache Commons Crypto is a cryptographic library optimized with AES-NI (Advanced Encryption Standard New Instructions). It provides a Java API for directly interacting with encryption ciphers and Java stream abstractions built on top of the cipher API. Developers can use it to implement high-performance AES encryption/decryption with minimum code and effort. The project was originally created by Intel under the name Chimera, and is now available as a subproject of Apache Commons.
Because Crypto’s high throughput makes it an ideal candidate for this application, Intel’s Spark committers are working to implement it for Spark shuffle encryption. In this post, we will talk about how this feature works and its performance. The work described here is ongoing and tracked in SPARK-5682.
As explained previously, Spark shuffle file encryption is focusing on encrypting spilled files and shuffle output files. Currently, it only supports Spark-on-YARN mode. The following picture shows the basic workflow:
The basic steps can be described as follows:
- When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.
- When shuffle happens, the shuffle writer will first compress the plaintext if compression is enabled. Spark will use the randomly generated Initial Vector (IV) and keys obtained from the credentials to encrypt the plaintext by using
CryptoOutputStreamwill encrypt the shuffle data and write it to the disk as it arrives. The first 16 bytes of the encrypted output file are preserved to store the initial vector.
- For the read path, the first 16 bytes are used to initialize the IV, which is provided to
CryptoInputStreamalong with the user’s credentials. The decrypted data is then provided to Spark’s shuffle mechanism for further processing.
We evaluated the performance of Crypto using a micro benchmark, and the performance of Spark shuffle file encryption by running TeraSort.
Micro Benchmark for Crypto Throughput
We evaluated the throughput of Crypto in CTR and CBC mode, which are the most widely used modes of operation for AES. In the evaluation, we used the micro benchmark available in the Crypto source repository. We ran it using JDK 1.6.0_37 and JDK 1.7.0_60, comparing both AES-NI mode without AES-NI support (using JDK 1.6.0_37) and AES-NI without optimizations (using JDK 1.7.0_60).
The throughput for CTR mode is shown below. For encryption/decryption, higher throughput means better performance. The CPUs used for the benchmarks were Intel Xeon CPU E5-2690 v2 @ 3.00GHz. In the case of AES-NI vs non-AES-NI, we can observe an 18x and 20x performance gain for encryption and decryption. And in the case of AES-NI vs AES-NI without optimizations, 9.7x and 8.5x performance gains were observed.
We also tested Crypto throughput in CBC mode. The data is shown below. Because encryption can’t be done in parallel for CBC mode, it has lower performance gain than CTR mode for encryption. CBC decryption can be parallelized, but there is no such optimization yet available in JDK 6, JDK7, or JDK8.
In CBC mode, we observed 5x and 16x improvements for encryption and decryption when comparing AES-NI with non AES-NI (JDK 1.6.0_37). And 1.25x and 5x improvements can be observed for encryption and decryption when comparing AES-NI with AES-NI without optimizations.
Spark File Encryption Performance
For Spark file encryption, we ran TeraSort to measure performance.
The TeraSort application has three steps: generate random data, sort random data, and validate sorted data. The shuffle mainly happens in the second stage, where a
sortByKey transformation is performed. We ran the TeraSort application using
OpenSSLCipher (optimized with AES-NI),
JCECipher (using JDK 1.6 which doesn’t support AES-NI), and unencrypted. In this process, we ran the TeraSort application with three nodes and four executors in each node. The cumulative time shown below means the sum of execution time in all four executors from all three nodes in our test. The
sortByKey time means the time used by
sortByKey stage; the total time means the execution time for the TeraSort application. And we observed that
JCECipher takes much longer than
OpenSSLCipher, which has AES-NI optimizations.
Hopefully, this post has demonstrated how Apache Commons Crypto will meet enterprise requirements for encryption of Spark shuffle data while minimizing any performance overhead. The work involved should land in the Spark trunk soon and then become available in a future release of CDH.
The author would like to thank Liyun Zhang (Intel) for her initial development work; Dapeng Sun (Intel), Jia Ke (Intel), and Xianda Ke (Intel) for developing Apache Commons Crypto; and Haifeng Chen (Intel) and Marcelo Vanzin (Cloudera) for their excellent reviews and guidance.
Published at DZone with permission of Cheng Xu , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.