The first line of security we recommend is authenticating users in Apache Hadoop. As with most, if not all, RDBMSes, each user is given a username and a password to validate their identity. This is a requirement to access any data managed by those systems. The goal is the same in Apache Hadoop. Because the Hadoop stack has no authentication component of its own, a Kerberos Key Distribution Center (KDC) is used as the mechanism to identify users.
There are two implementations of a Kerberos KDC that are supported on a CDH cluster: an MIT KDC installation, and/or integration with the Kerberos KDC built into Microsoft Active Directory (AD). Generally, the latter is recommended to our enterprise customers, and this blog will focus on a direct integration of CDH and the Active Directory KDC. This integration is favored because of other tools that will be used to communicate with Active Directory.
Active Directory is mainly known for its Domain Services (AD DS), an identity management service that authenticates users and groups. However, there are other powerful services within AD, such as AD CS (Certificate Services) and AD DNS.
On May 6, 2016, my colleague Ben Spivey wrote a blog on securing a cluster on Amazon AWS. He covered the AD DS and AD CS services in detail. For more details, Ben’s blog is a good place to start. This blog will spend more time on the AD DNS service.
Active Directory Domain Name System
Deploying a CDH cluster requires both forward and reverse name resolution for internal IP addresses. When deploying a cluster on-premises, this is usually done by your system administrator. When you deploy a cluster on Amazon AWS, this is automatically configured when you launch an EC2 instance.
A forward DNS lookup resolves a Fully Qualified Domain Name (FQDN) to an IP address, and a reverse DNS lookup does the opposite, resolving an IP address to an FQDN. Currently, Microsoft Azure does not provide reverse DNS lookup for internal private IP addresses. This will be covered later.
There are many options for DNS when deploying on Azure. You can install the supported BIND package for your Linux OS, use an existing Active Directory DNS service, and so on. This blog will cover AD DNS in more detail.
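To make the reverse direction concrete: a reverse lookup queries a PTR record whose name is the IP address with its octets reversed under in-addr.arpa. The sketch below only builds that name; the function name ptr_name and the address 10.0.0.4 are illustrative assumptions, not values from this deployment.

```shell
#!/bin/sh
# Build the in-addr.arpa name that a reverse (PTR) lookup queries
# for an IPv4 address given as $1.
ptr_name() {
    # Split the dotted quad on "." and print the octets in reverse order.
    echo "$1" | awk -F. '{ printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1 }'
}

# 10.0.0.4 is a hypothetical private address.
ptr_name 10.0.0.4
```

The reverse lookup zone that hosts such a name has to exist on the DNS server before a PTR record can be created in it.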
If not already configured, ensure your AD administrator has properly configured a reverse DNS zone in the DNS Manager as seen below.
The important section in the figure above is the red box in the “Reverse Lookup Zones.” This illustrates the zone configured to host all the DNS objects for a particular subnet.
This is a view of the “Forward Lookup Zones” for the CLOUDERA.MORANTUS.COM domain.
Below is also a view of my OU tree, which shows zero entries.
Azure Virtual Machine
I provisioned a VM in Azure with all the default DNS settings, and we will join it to our AD DS and DNS services.
As you can see, the hostname -f command displays a very long FQDN for my VM, and hostname -i gives the IP address associated with the VM. Next, I did a forward DNS lookup using the host <FQDN> command, which resolved to the IP address. Then, I did a reverse DNS lookup using host <IP address>. As shown in the red box above, it did not locate a reverse entry for that IP address. A reverse lookup is a requirement for a CDH deployment. We’ll revisit this later.
In order to configure our RHEL 6.7 VM to communicate with Active Directory, we need to configure a tool called Samba. Samba is a Linux-based utility that enables the integration of Linux systems with AD.
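The checks described above can be scripted roughly as follows. This is a minimal sketch: the function names has_ptr and check_vm_dns are made up for illustration, and it assumes the host utility (from bind-utils) is installed and that hostname -i returns a single address.

```shell
#!/bin/sh
# Succeed if `host` output read from stdin contains a PTR record.
has_ptr() {
    grep -q 'domain name pointer'
}

# Run the forward and reverse checks for the local VM.
check_vm_dns() {
    fqdn=$(hostname -f)
    ip=$(hostname -i)
    echo "FQDN: $fqdn  IP: $ip"
    host "$fqdn"                      # forward lookup: FQDN -> IP
    if host "$ip" | has_ptr; then     # reverse lookup: IP -> FQDN
        echo "reverse lookup OK"
    else
        echo "no reverse entry for $ip"
    fi
}
```

Calling check_vm_dns on the VM prints the forward result and reports whether a reverse entry exists.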
First, join the VM to AD with Samba. Ensure the DNS servers property for your Virtual Network in the
Next, install packages needed to integrate with AD.
sudo yum install -y samba-common krb5-workstation openldap-clients
After that, configure the VM to point to the AD DNS server.
The nameserver is the IP address for the AD server. This can also be accomplished by running “service network restart” on the VM.
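As a rough illustration of that DNS configuration, the sketch below prints resolv.conf content. The address 10.0.0.4 is a placeholder for the AD server, the search domain is taken from the CLOUDERA.MORANTUS.COM domain shown earlier, and resolv_conf is a hypothetical helper name.

```shell
#!/bin/sh
# Print resolv.conf content that points a VM at an AD DNS server.
# $1 = DNS search domain, $2 = IP address of the AD DNS server
resolv_conf() {
    printf 'search %s\nnameserver %s\n' "$1" "$2"
}

# 10.0.0.4 is a placeholder; substitute the real address of the AD server.
resolv_conf cloudera.morantus.com 10.0.0.4
```

One way to apply it is to pipe the output through sudo tee /etc/resolv.conf. Keep in mind that a DHCP client may rewrite this file when the network restarts.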
Once that's done, configure Samba to join the AD domain and verify the entry in AD. This must be executed as a privileged user. In this case “jmorantus” is an admin account in Active Directory.
Note: You can ignore the failed DNS update error shown above. We need to create a Kerberos keytab with a privileged account to update/create DNS objects in AD. This step will be executed later.
As you can see above, we succeeded in joining our VM to the AD domain, and an AD object was created in the servers OU.
Configure the Kerberos krb5.conf file to generate a keytab file for updating DNS in AD. Then, update/create the forward and reverse DNS entries.
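For readers without the screenshots, the Samba side of the join might look something like the sketch below. It prints a minimal [global] section for /etc/samba/smb.conf. The short domain name CLOUDERA is an assumption, and smb_conf is a hypothetical helper; your AD administrator can confirm the real values.

```shell
#!/bin/sh
# Print a minimal [global] section for joining an AD domain.
# $1 = NetBIOS (short) domain name, $2 = Kerberos realm
smb_conf() {
    cat <<EOF
[global]
   workgroup = $1
   realm = $2
   security = ads
EOF
}

# CLOUDERA as the short domain name is an assumption.
smb_conf CLOUDERA CLOUDERA.MORANTUS.COM
```

With that content in /etc/samba/smb.conf, the join itself is typically run as sudo net ads join -U jmorantus, and sudo net ads testjoin can confirm that the machine account is valid.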
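A krb5.conf for this realm could be sketched as below. The KDC host name ad.cloudera.morantus.com is hypothetical, krb5_conf is an illustrative helper, and the settings are a common minimal set rather than the exact file used in this deployment.

```shell
#!/bin/sh
# Print a minimal krb5.conf.
# $1 = Kerberos realm, $2 = DNS domain, $3 = KDC (AD server) host name
krb5_conf() {
    realm=$1
    domain=$2
    kdc=$3
    cat <<EOF
[libdefaults]
    default_realm = $realm
    dns_lookup_kdc = true
    dns_lookup_realm = false

[realms]
    $realm = {
        kdc = $kdc
        admin_server = $kdc
    }

[domain_realm]
    .$domain = $realm
    $domain = $realm
EOF
}

# ad.cloudera.morantus.com is a hypothetical AD server name.
krb5_conf CLOUDERA.MORANTUS.COM cloudera.morantus.com ad.cloudera.morantus.com
```

After the output is written to /etc/krb5.conf, running kinit jmorantus followed by klist is a quick way to confirm that the VM can obtain a ticket from the AD KDC.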
Here's a view of the Forward DNS entry added to AD DNS service.
And here's a view of the reverse DNS entry added to the AD DNS service.
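One common way to create entries like these from Linux is nsupdate with Kerberos (GSS-TSIG) authentication. The sketch below only generates the nsupdate commands. The host name vm1, the addresses, the TTL, and the helper names are all assumptions for illustration, and this may not be the exact procedure used for the screenshots above.

```shell
#!/bin/sh
# Reverse-lookup name for the IPv4 address in $1.
ptr_name() {
    echo "$1" | awk -F. '{ printf "%s.%s.%s.%s.in-addr.arpa", $4, $3, $2, $1 }'
}

# Print nsupdate commands that replace the A and PTR records of one host.
# $1 = DNS server, $2 = host FQDN, $3 = host IPv4 address, $4 = TTL in seconds
dns_update_batch() {
    server=$1
    fqdn=$2
    ip=$3
    ttl=$4
    ptr=$(ptr_name "$ip")
    # The forward and reverse records live in different zones,
    # so each update is sent separately.
    cat <<EOF
server $server
update delete $fqdn. A
update add $fqdn. $ttl A $ip
send
update delete $ptr. PTR
update add $ptr. $ttl PTR $fqdn.
send
EOF
}

# All values below are placeholders.
dns_update_batch 10.0.0.4 vm1.cloudera.morantus.com 10.0.0.5 3600
```

With a valid Kerberos ticket for an account allowed to update DNS (for example, one obtained with kinit -kt and the keytab created earlier), the output can be piped into nsupdate -g to apply the changes.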
Note: it’s worth mentioning that Active Directory will age DNS entries that it considers “inactive”. An additional process should be implemented to keep these entries “alive” in AD.
The System Security Services Daemon (SSSD) caches user and group information locally on a Linux system. This integration is also necessary to configure authorization with Apache Sentry for data access.
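An SSSD configuration for an AD domain could look roughly like the sketch below. It assumes the sssd package is installed and that the VM has already joined the domain. The option values are common choices rather than the settings used in this deployment, and sssd_conf is a hypothetical helper.

```shell
#!/bin/sh
# Print a minimal sssd.conf that uses the AD provider.
# $1 = AD DNS domain, $2 = Kerberos realm
sssd_conf() {
    domain=$1
    realm=$2
    cat <<EOF
[sssd]
config_file_version = 2
services = nss, pam
domains = $domain

[domain/$domain]
id_provider = ad
access_provider = ad
ad_domain = $domain
krb5_realm = $realm
cache_credentials = True
ldap_id_mapping = True
use_fully_qualified_names = False
fallback_homedir = /home/%u
default_shell = /bin/bash
EOF
}

sssd_conf cloudera.morantus.com CLOUDERA.MORANTUS.COM
```

SSSD refuses to start unless /etc/sssd/sssd.conf is owned by root with mode 0600, so after writing the file, run sudo chmod 600 /etc/sssd/sssd.conf before sudo service sssd start.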
Now that SSSD is fully configured, we’ll verify we can read user information from AD.
Here you can see that with SSSD stopped, the VM does not know the user “scm-cloudera.” With SSSD running, the user information is pulled from AD. If you are looking for a commercial option, Cloudera also recommends Centrify.
You should now be able to configure a VM on Azure, join an AD domain, and create DNS entries in the AD DNS server. These steps will work for any other cloud provider and for on-premises deployments. In Part 2 of this series, we’ll cover creating a Kerberized cluster with Cloudera Director on Azure.
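The comparison above can be reproduced with getent, which resolves users through the same name service switch that SSSD plugs into. This is a minimal sketch; the function names user_known and compare_sssd are illustrative.

```shell
#!/bin/sh
# Succeed if the user in $1 can be resolved through the name service switch.
user_known() {
    getent passwd "$1" >/dev/null
}

# Compare lookups for user $1 with SSSD stopped and running (requires root).
compare_sssd() {
    service sssd stop
    user_known "$1" && echo "$1 known with SSSD stopped" \
                    || echo "$1 unknown with SSSD stopped"
    service sssd start
    user_known "$1" && echo "$1 known with SSSD running" \
                    || echo "$1 unknown with SSSD running"
}
```

Running compare_sssd scm-cloudera as root should report the user as unknown while SSSD is stopped and known once it is running, provided the account exists in AD.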