Apache Ranger and AWS EMR Automated Installation and Integration Series (1): Solutions Overview

Automate the installation and integration of Amazon EMR and Apache Ranger with this four-part series, which presents solutions for different technology stacks.

By Laurence Geng · Updated Dec. 22, 22 · Tutorial

System security usually includes two core topics: authentication and authorization. One solves the problem of “Who is s/he?” and the other solves the problem of “Does s/he have permission to perform an operation?” In the big data area, Apache Ranger is one of the most popular choices for authorization. It supports all mainstream big data components, including HDFS, Hive, HBase, and so on. As Amazon EMR rolls out native Ranger plugin features, users can manage the authorization of EMRFS (S3), Spark, Hive, and Trino all together. For authentication, an organization usually has its own centralized authentication infrastructure, such as Windows AD or OpenLDAP. However, for most big data components, Kerberos is the only supported authentication mechanism, so users usually need to integrate Windows AD/OpenLDAP with Kerberos to unify authentication.

We will focus on how to implement automated installation and integration for Amazon EMR and Apache Ranger. This series is composed of four articles. Each article introduces a complete solution for a different technology stack.

1. Solutions Overview

Installing Apache Ranger and integrating it with Amazon EMR involves three main tasks:

  1. Install and integrate an authentication provider.
  2. Set up the Ranger server and its plugins on the EMR cluster.
  3. Configure all related components if Kerberos is enabled.

For authentication providers, Windows AD and OpenLDAP are the most widely used. Their installation and integration are very different, so they count as two separate jobs.

For Ranger installation, there are two options. The first is “open-source Ranger server + EMR-native Ranger plugins.” In this series, we will refer to it as the “EMR-native” Ranger solution. The second is “open-source Ranger server + open-source Ranger plugins.” In this series, we will refer to it as the “open-source” Ranger solution. Installing the two solutions counts as two separate jobs.

Enabling Kerberos brings a lot of changes to the above jobs, so installing with Kerberos enabled and with Kerberos disabled also count as two separate jobs.

In summary, the three factors above combine into eight possible scenarios (technology stacks), numbered 1 through 8.
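The count of eight comes from three independent two-way choices (2 × 2 × 2). The following is a minimal Python sketch that enumerates the combinations. The labels and the ordering are illustrative assumptions and do not correspond to the scenario numbers used in this series:

```python
from itertools import product

# The three independent choices described above (labels are illustrative).
AUTH_PROVIDERS = ["OpenLDAP", "Windows AD"]
RANGER_SOLUTIONS = ["EMR-native", "open-source"]
KERBEROS = ["enabled", "disabled"]


def enumerate_stacks():
    """Return every (auth provider, Ranger solution, Kerberos) combination."""
    return list(product(AUTH_PROVIDERS, RANGER_SOLUTIONS, KERBEROS))


if __name__ == "__main__":
    stacks = enumerate_stacks()
    for auth, ranger, kerberos in stacks:
        print(f"{auth} + {ranger} Ranger, Kerberos {kerberos}")
    print(len(stacks), "combinations")
```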

The four articles in this series cover the first four scenarios. The following table maps each scenario to its solution:

Scenario | Solution
1 | Apache Ranger and AWS EMR Automated Installation and Integration Series (2): OpenLDAP + EMR-Native Ranger
2 | Apache Ranger and AWS EMR Automated Installation and Integration Series (3): Windows AD + EMR-Native Ranger
3 | Apache Ranger and AWS EMR Automated Installation and Integration Series (4): OpenLDAP + Open-Source Ranger
4 | Apache Ranger and AWS EMR Automated Installation and Integration Series (5): Windows AD + Open-Source Ranger

As of this writing, EMR does not yet support scenarios 5 and 6. Since disabling Kerberos on an EMR cluster is not a recommended practice, the AWS service team is working on a solution to meet these needs. Scenarios 7 and 8 are rarely chosen, so we won't discuss them.

Note: At the time of writing, the Trino plugin is NOT available yet, so this solution does NOT support the Trino plugin at present.

2. Why Is Installing Ranger So Difficult?

Whether or not you have done it successfully before, installing and integrating Windows AD/OpenLDAP + Ranger + EMR is a very hard job. It is complicated, error-prone, and time-consuming for the following reasons:

  1. It requires operators to have sufficient knowledge of Windows AD, OpenLDAP, Kerberos, and SSL/TLS, which are not core skills for big data engineers. Learning them takes a lot of time.
  2. The architecture of Ranger is complex. It includes two server-side components (Ranger Admin and Ranger UserSync), two storage components (MySQL and Solr), and a variety of plugins. The plugins must also be installed on the cluster nodes, so a complete manual installation is a heavy job.
  3. It is not a self-contained job. It usually needs to integrate with an existing Windows AD/OpenLDAP server or an EMR cluster, and many external factors can cause the installation to fail, e.g., network issues, incorrect environment-specific configurations, and so on.
  4. The EMR-native Ranger solution strongly depends on Kerberos and SSL/TLS. This significantly increases the complexity of integration.
  5. There is no out-of-the-box distribution package for Ranger. Installation has to start from compiling the source code, which is a challenge for non-Java engineers.

The combination of these factors makes this job very difficult.

3. Introduction to Automated Installer

As calls to simplify the Ranger user experience have grown louder, since 2020 I have taken the initiative to build an automated installer that improves the user experience for Ranger on EMR. Here is the repository address of the automated installer:

Project Name | Repository Address
Ranger EMR CLI Installer | https://github.com/bluishglc/ranger-emr-cli-installer

The installer supports all four scenarios (No. 1, 2, 3, and 4). In other words, it supports both Windows AD and OpenLDAP, and it works in all AWS regions (including the Chinese regions). For Scenarios 3/4, it can install on an existing cluster and supports both multi-master and single-master clusters. At each step, the installer checks connectivity first and then decides whether to proceed. This helps identify network issues or service failures, e.g., when Ranger or OpenLDAP is not up. Finally, real-world installation is a trial-and-error process: users often need to try different parameter values to find the ones that work in their environment. The installer lets users rerun an all-in-one installation at any time without side effects, and users can also run it step by step for debugging. The following is a summary of key features:

[Figure: Key Features Summary]
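To illustrate the "check connectivity first, then decide whether to proceed" behavior described above, here is a minimal Python sketch. It is not the installer's actual implementation. The function names (`is_reachable`, `run_steps`) and the step structure are hypothetical, introduced only for illustration:

```python
import socket
from typing import Callable, List, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# A step is (name, endpoints it depends on, action to run). Hypothetical structure.
Step = Tuple[str, List[Tuple[str, int]], Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first step with an unreachable dependency.

    Returns the names of the steps that actually ran.
    """
    completed = []
    for name, endpoints, action in steps:
        unreachable = [(h, p) for h, p in endpoints if not is_reachable(h, p)]
        if unreachable:
            print(f"Stopping before '{name}'; unreachable: {unreachable}")
            break
        action()
        completed.append(name)
    return completed
```

In such a design, a step that configures a plugin would list the Ranger or OpenLDAP endpoint it depends on, so a service that is down is reported before any change is attempted. Keeping each action safe to repeat is what would make a full rerun harmless.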

We know there is an existing solution on the AWS blog: “Implementing Authorization and Auditing using Apache Ranger on Amazon EMR.” However, this installer differs greatly from that solution, from design to features. That solution supports only two scenarios (No. 2 and 4) and works only in the us-east-1 region. For Scenarios 3/4, it cannot support existing clusters or multi-master clusters, and so on. This series of articles is based entirely on this installer and guides users through the installation for the first four scenarios.

Because EMR and Ranger each have multiple versions, compatibility between versions deserves attention. Generally, Ranger 1 works with Hadoop 2, and Ranger 2 works with Hadoop 3. This installer was developed against Ranger 2.1.0 and 2.2.0, so it supports only EMR 6.X. We fully tested the four solutions against Ranger 2.1.0, and all passed. We partially tested against Ranger 2.2.0, which also works, but there may be bugs that have not been found yet. The following is the Ranger and EMR version compatibility matrix:

[Figure: Ranger and EMR version compatibility matrix]
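The coarse rule stated in the text (EMR 6.X only, with Ranger 2.1.0 or 2.2.0) can be expressed as a small pre-flight check. This is a minimal Python sketch, not part of the installer. The function names are hypothetical, and it does not capture any finer-grained constraints that the compatibility matrix may contain:

```python
# Ranger versions the installer was developed against, per the text above.
SUPPORTED_RANGER_VERSIONS = {"2.1.0", "2.2.0"}


def emr_major(release_label: str) -> int:
    """Extract the major version from a label such as 'emr-6.4.0' or '6.4.0'."""
    version = release_label.lower()
    if version.startswith("emr-"):
        version = version[len("emr-"):]
    return int(version.split(".")[0])


def is_supported(emr_release: str, ranger_version: str) -> bool:
    """Apply the coarse rule: EMR 6.X together with Ranger 2.1.0 or 2.2.0."""
    return emr_major(emr_release) == 6 and ranger_version in SUPPORTED_RANGER_VERSIONS


if __name__ == "__main__":
    for emr, ranger in [("emr-6.4.0", "2.1.0"), ("emr-5.33.0", "2.1.0")]:
        status = "supported" if is_supported(emr, ranger) else "not supported"
        print(f"{emr} + Ranger {ranger}: {status}")
```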

In the next article, we will start introducing the solutions one by one.
