Securing Apache Spark Shuffle Using Apache Commons Crypto

Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the current approach.

By Cheng Xu · Aug. 11, 16 · Opinion

When running a big data computing job, the data being processed may contain sensitive information that users don’t want anyone else to access. Encrypting that sensitive data is becoming more and more important, especially for enterprise users.

For Apache Spark, which is the emerging standard for big data processing, data is transferred over the network and also spilled to disk during shuffles, so unencrypted shuffle data leaves a Spark job unprotected. And because shuffle performance is important for the overall performance of Spark applications, it’s crucial that the shuffle encryption backend perform at a level comparable to the unencrypted case.

Spark shuffle encryption should secure two different operations: RPC and local disk storage. For RPC, shuffle files that are transferred between executors over the network must be encrypted. For local disk storage, both the spilled files of intermediate data and the output files of the shuffle result must likewise be encrypted. Furthermore, given the importance of performance in Spark shuffle, the cryptographic library used must have high enough throughput to limit the overhead of encryption and decryption.

For those purposes, in the current upstream, Spark uses 3DES provided by SASL to encrypt RPC data. However, that approach has two main weaknesses: first, there’s no encryption support for disk spills; and second, the JVM’s SASL encryption implementation does not provide the same level of performance as AES-NI.

Apache Commons Crypto is a cryptographic library optimized with AES-NI (Advanced Encryption Standard New Instructions). It provides a Java API for directly interacting with encryption ciphers and Java stream abstractions built on top of the cipher API. Developers can use it to implement high-performance AES encryption/decryption with minimum code and effort. The project was originally created by Intel under the name Chimera, and is now available as a subproject of Apache Commons.

Because Crypto’s high throughput makes it an ideal candidate for this application, Intel’s Spark committers are working to implement it for Spark shuffle encryption. In this post, we will talk about how this feature works and its performance. The work described here is ongoing and tracked in SPARK-5682.

Detailed Workflow

As explained previously, Spark shuffle file encryption focuses on encrypting spilled files and shuffle output files. Currently, it supports only Spark-on-YARN mode. The following picture shows the basic workflow:

[Figure shuffle-encrypt-f1: basic workflow of Spark shuffle file encryption]

The basic steps can be described as follows:

  1. When a Spark job starts, it will generate encryption keys and store them in the current user’s credentials, which are shared with all executors.
  2. When a shuffle happens, the shuffle writer first compresses the plaintext if compression is enabled. Spark then uses a randomly generated initialization vector (IV) and the keys obtained from the credentials to encrypt the data with CryptoOutputStream from Crypto.
  3. CryptoOutputStream encrypts the shuffle data and writes it to disk as it arrives. The first 16 bytes of the encrypted output file are reserved to store the initialization vector.
  4. On the read path, the first 16 bytes are used to initialize the IV, which is provided to CryptoInputStream along with the keys from the user’s credentials. The decrypted data is then handed to Spark’s shuffle mechanism for further processing.
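
The steps above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses the JDK's javax.crypto streams as a stand-in for Crypto's CryptoOutputStream and CryptoInputStream, it assumes AES in CTR mode with a 128-bit key, and the class and method names (ShuffleEncryptionSketch, wrapForWrite, wrapForRead) are hypothetical rather than taken from Spark or Crypto.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch: JDK cipher streams stand in for Crypto's streams.
class ShuffleEncryptionSketch {
    static final int IV_LENGTH = 16; // AES block size in bytes

    // Step 1: generate a key once per job (Spark would keep it in the user's credentials).
    static SecretKey generateKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128); // assumed key size
        return generator.generateKey();
    }

    // Steps 2-3: store a fresh random IV in the first 16 bytes, then stream ciphertext after it.
    static OutputStream wrapForWrite(OutputStream out, SecretKey key)
            throws GeneralSecurityException, IOException {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv);
        out.write(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
    }

    // Step 4: read the IV back from the first 16 bytes and decrypt the rest.
    static InputStream wrapForRead(InputStream in, SecretKey key)
            throws GeneralSecurityException, IOException {
        byte[] iv = new byte[IV_LENGTH];
        new DataInputStream(in).readFully(iv);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherInputStream(in, cipher);
    }

    // Encrypts a whole buffer into the "IV followed by ciphertext" layout.
    static byte[] encrypt(byte[] plain, SecretKey key)
            throws GeneralSecurityException, IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try (OutputStream enc = wrapForWrite(file, key)) {
            enc.write(plain);
        }
        return file.toByteArray();
    }

    // Decrypts a buffer produced by encrypt().
    static byte[] decrypt(byte[] file, SecretKey key)
            throws GeneralSecurityException, IOException {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream dec = wrapForRead(new ByteArrayInputStream(file), key)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = dec.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        }
        return plain.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = generateKey();
        byte[] data = "intermediate shuffle data".getBytes(StandardCharsets.UTF_8);
        byte[] file = encrypt(data, key);
        System.out.println("round trip ok: " + Arrays.equals(data, decrypt(file, key)));
    }
}
```

Storing the IV in the file itself means the reader needs only the shared key to decrypt, which matches the layout described in steps 3 and 4.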

Performance Evaluation

Crypto uses Intel’s AES-NI technology for high-performance encryption. Java has supported AES-NI since 1.7.0_45, but it lacks optimizations that will only be available in Oracle JDK 9.

We evaluated the performance of Crypto using a micro benchmark, and the performance of Spark shuffle file encryption by running TeraSort.

Micro Benchmark for Crypto Throughput

We evaluated the throughput of Crypto in CTR and CBC mode, which are the most widely used modes of operation for AES. In the evaluation, we used the micro benchmark available in the Crypto source repository. We ran it using JDK 1.6.0_37 and JDK 1.7.0_60, comparing Crypto’s AES-NI mode against both a baseline without AES-NI support (using JDK 1.6.0_37) and AES-NI without optimizations (using JDK 1.7.0_60).
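
The micro benchmark from the Crypto repository is not reproduced here. As a rough illustration of how cipher throughput can be measured, the following sketch times the JDK's own AES implementation. The class name CipherThroughputSketch, the buffer size, and the round counts are arbitrary assumptions, and the numbers it prints are not comparable to the results reported in this post.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;

// Hypothetical sketch of a throughput measurement; not the Crypto micro benchmark.
class CipherThroughputSketch {

    // Encrypts `rounds` buffers of `bufferSize` bytes and returns throughput in MB/s.
    // bufferSize should be a multiple of 16 so that NoPadding modes emit every byte.
    static double measureMBps(String transformation, int bufferSize, int rounds)
            throws GeneralSecurityException {
        byte[] key = new byte[16]; // all-zero key: acceptable for timing only, never for real data
        byte[] iv = new byte[16];
        Cipher cipher = Cipher.getInstance(transformation);
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));

        byte[] input = new byte[bufferSize];
        byte[] output = new byte[cipher.getOutputSize(bufferSize)];

        // Warm-up so the JIT compiler has a chance to optimize the hot loop.
        for (int i = 0; i < rounds / 10 + 1; i++) {
            cipher.update(input, 0, input.length, output, 0);
        }

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            cipher.update(input, 0, input.length, output, 0);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);

        double megabytes = (double) bufferSize * rounds / 1_000_000.0;
        return megabytes / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) throws Exception {
        int bufferSize = 1 << 20; // 1 MiB per round
        int rounds = 200;
        System.out.printf("CTR: %.1f MB/s%n", measureMBps("AES/CTR/NoPadding", bufferSize, rounds));
        System.out.printf("CBC: %.1f MB/s%n", measureMBps("AES/CBC/NoPadding", bufferSize, rounds));
    }
}
```

A dedicated benchmarking harness would give more reliable numbers than a hand-rolled timing loop like this one.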

The throughput for CTR mode is shown below. For encryption/decryption, higher throughput means better performance. The CPUs used for the benchmarks were Intel Xeon CPU E5-2690 v2 @ 3.00GHz. In the case of AES-NI vs non-AES-NI, we observed 18x and 20x performance gains for encryption and decryption, respectively. And in the case of AES-NI vs AES-NI without optimizations, we observed 9.7x and 8.5x performance gains, respectively.

[Figure shuffle-encrypt-f3: Crypto throughput in CTR mode]

We also tested Crypto throughput in CBC mode. The data is shown below. Because encryption can’t be done in parallel in CBC mode, the performance gain for encryption is lower than in CTR mode. CBC decryption can be parallelized, but no such optimization is available yet in JDK 6, JDK 7, or JDK 8.

In CBC mode, we observed 5x and 16x improvements for encryption and decryption, respectively, when comparing AES-NI with non-AES-NI (JDK 1.6.0_37). When comparing AES-NI with AES-NI without optimizations, we observed 1.25x and 5x improvements for encryption and decryption, respectively.
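
To illustrate why CTR lends itself to parallel processing, the sketch below encrypts a buffer in two independent chunks and produces the same ciphertext as a single sequential pass. Each CTR keystream block depends only on the key and the counter value, whereas CBC encryption of a block needs the previous ciphertext block. The example uses the JDK's javax.crypto rather than Crypto, relies on the JDK's AES/CTR implementation incrementing the full 16-byte counter as a big-endian integer, and uses hypothetical names (CtrParallelSketch, counterAt, encryptInTwoChunks).

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.math.BigInteger;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical sketch: CTR chunks can be encrypted independently of each other.
class CtrParallelSketch {
    static final int BLOCK = 16; // AES block size in bytes

    // Returns the counter value used for the block at `blockIndex`, i.e. iv + blockIndex mod 2^128.
    static byte[] counterAt(byte[] iv, long blockIndex) {
        byte[] raw = new BigInteger(1, iv).add(BigInteger.valueOf(blockIndex)).toByteArray();
        byte[] counter = new byte[BLOCK];
        int copy = Math.min(raw.length, BLOCK);
        System.arraycopy(raw, raw.length - copy, counter, BLOCK - copy, copy);
        return counter;
    }

    // Encrypts `data` with AES-CTR starting from the given counter block.
    static byte[] ctr(byte[] key, byte[] counter, byte[] data) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(counter));
        return cipher.doFinal(data);
    }

    // Encrypts two chunks independently (they could run on different threads) and joins them.
    static byte[] encryptInTwoChunks(byte[] key, byte[] iv, byte[] data)
            throws GeneralSecurityException {
        int split = (data.length / 2 / BLOCK) * BLOCK; // the boundary must fall on a block boundary
        byte[] first = ctr(key, iv, Arrays.copyOfRange(data, 0, split));
        byte[] second = ctr(key, counterAt(iv, split / BLOCK),
                Arrays.copyOfRange(data, split, data.length));
        byte[] joined = Arrays.copyOf(first, data.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // demo-only key and IV
        byte[] iv = new byte[16];
        byte[] data = new byte[100];
        Arrays.fill(data, (byte) 7);
        boolean same = Arrays.equals(ctr(key, iv, data), encryptInTwoChunks(key, iv, data));
        System.out.println("chunked output matches sequential output: " + same);
    }
}
```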

[Figure shuffle-encrypt-f2: Crypto throughput in CBC mode]

Spark File Encryption Performance

For Spark file encryption, we ran TeraSort to measure performance.

The TeraSort application has three stages: generate random data, sort random data, and validate sorted data. The shuffle mainly happens in the second stage, where a sortByKey transformation is performed. We ran TeraSort in three configurations: OpenSSLCipher (optimized with AES-NI), JCECipher (using JDK 1.6, which doesn’t support AES-NI), and unencrypted. The test cluster had three nodes with four executors on each node. In the results below, the cumulative time is the sum of the execution times of all four executors on all three nodes, the sortByKey time is the time spent in the sortByKey stage, and the total time is the execution time of the whole TeraSort application. We observed that JCECipher takes much longer than OpenSSLCipher, which has AES-NI optimizations.

[Figure shuffle-encrypt-f4: TeraSort execution times with OpenSSLCipher, JCECipher, and no encryption]

Conclusion

Hopefully, this post has demonstrated how Apache Commons Crypto will meet enterprise requirements for encryption of Spark shuffle data while minimizing any performance overhead. The work involved should land in the Spark trunk soon and then become available in a future release of CDH.

Acknowledgements

The author would like to thank Liyun Zhang (Intel) for her initial development work; Dapeng Sun (Intel), Jia Ke (Intel), and Xianda Ke (Intel) for developing Apache Commons Crypto; and Haifeng Chen (Intel) and Marcelo Vanzin (Cloudera) for their excellent reviews and guidance.


Published at DZone with permission of Cheng Xu, DZone MVB. See the original article here.
