
Secure Apache Solr Collections and Access Them Programmatically


This article is a tutorial on securing your Solr data in a Hadoop cluster, with detailed instructions and hands-on examples.


Learn how to secure your Solr data in a policy-based, fine-grained way.

Data security is more important than ever before. At the same time, risk is increasing due to the relentlessly growing number of device endpoints, the continual emergence of new types of threats, and the commercialization of cybercrime. And with Apache Hadoop already instrumental for supporting the growth of data volumes that fuel mission-critical enterprise workloads, the necessity to master available security mechanisms is of vital importance to organizations participating in that paradigm shift.

Fortunately, the Hadoop ecosystem has responded to this need in the past couple of years by spawning new functionality for end-to-end encryption, strong authentication, and other aspects of platform security. For example, Apache Sentry provides fine-grained, role-based authorization capabilities used in a number of Hadoop components, including Apache Hive, Apache Impala (incubating), and Cloudera Search (an integration of Apache Solr with the Hadoop ecosystem). Sentry is also able to dynamically synchronize the HDFS permissions of data stored within Hive and Impala by using ACLs that derive from Hive GRANTs.

In this post, you’ll learn how to secure Solr data by controlling read/write access via Sentry (backed up by the strong authentication capabilities of Kerberos) and access it programmatically from Java applications and Apache Flume. This operation applies to many industry use cases where Solr is the backing data layer in multi-tenant, Java-based web applications associated with frequent updates that happen in the background.

Preparation

Our example assumes that:

  • Solr is running in a Cloudera-powered enterprise data hub, with Kerberos and Sentry also deployed.
  • A web app needs to access a Solr collection programmatically using Java.
  • The Solr collection is updated in real-time via Flume and a MorphlineSolrSink.

Sentry authorizations for Hive and Impala can be stored either in a dedicated database or in a file in HDFS (the policy provider is pluggable). In the example below, we'll configure role-based access control via the file-based policy provider.

Create the Solr Collection

First, we’ll generate a collection configuration set called poems:

solrctl instancedir --generate poems

We assume that your Solr client configuration already contains the settings solrctl needs to locate Apache ZooKeeper and the Solr nodes. If that is not the case, you may have to pass their locations to solrctl explicitly, for example:

solrctl --zk zookeeper-host1:2181,zookeeper-host2:2181,zookeeper-host3:2181/solr --solr http://your.datanode.net:8983/solr

Edit poems/conf/schema.xml to reduce the number of fields per document. (A simple id field and a text field will suffice.) Also, confirm that copy-field directives are removed from the sample schema.
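For illustration, a minimal sketch of the relevant field definitions in schema.xml; the type names assume the defaults shipped with the generated sample config, so adjust them to your template:

```xml
<!-- Illustrative sketch: minimal field set for the poems collection.
     Field type names (string, text_general) assume the stock sample schema. -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true" />
<uniqueKey>id</uniqueKey>
```

Both fields are stored so that queries (and the /get handler used later) can return the document contents.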

Be sure to use the secured solrconfig.xml:

cp poems/conf/solrconfig.xml poems/conf/solrconfig.xml.original
cp poems/conf/solrconfig.xml.secure poems/conf/solrconfig.xml

Push the configuration data into Apache ZooKeeper:

solrctl instancedir --create poems poems

Create the collection:

solrctl collection --create poems

Secure the Poems Collection Using Sentry

The policy shown below establishes four Sentry roles based on the admin, operators, users, and techusers groups.

  • Administrators are entitled to all actions.
  • Operators are granted update and query privileges.
  • Users are granted query privileges.
  • Tech users are granted update privileges.
[groups]
cloudera_hadoop_admin = admin_role
cloudera_hadoop_operators = both_role
cloudera_hadoop_users = query_role
cloudera_hadoop_techusers = update_role
 
[roles]
admin_role = collection = *->action=*
both_role = collection = poems->action=Update, collection = poems->action=Query
query_role = collection = poems->action=Query
update_role = collection = poems->action=Update

Add the content of the listing to a file called sentry-provider.ini, adjusting the group names to match the corresponding groups in your cluster.

Put sentry-provider.ini into HDFS:

hdfs dfs -mkdir -p /user/solr/sentry
hdfs dfs -put sentry-provider.ini /user/solr/sentry
hdfs dfs -chown -R solr /user/solr




Enable Sentry policy-file usage in the Solr service in Cloudera Manager:

Solr → Configuration → Service Wide → Policy File Based Sentry → Enable Sentry Authorization = True

Restart Solr (only needed once for enabling Sentry integration):

Solr → Actions → Restart

Add Data to the Collection via curl

Use curl to add content:

kinit
curl --negotiate -u : -s \
http://your.datanode.net:8983/solr/poems/update?commit=true -H "Content-Type: text/xml" --data-binary \
'<add><doc><field name="id">1</field><field name="text">Mary had a little lamb, the fleece was white as snow.</field></doc><doc><field name="id">2</field><field name="text">The quick brown fox jumps over the lazy dog.</field></doc></add>'

Use curl to perform an initial query and verify Solr’s function:

curl --negotiate -u : -s \
http://your.datanode.net:8983/solr/poems/get?id=1

Accessing the Collection via Java

Next, we’ll make sure that the web app can access the collection whenever needed.

Add the following code to a Java file called SecureSolrJQuery.java:

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

import java.net.MalformedURLException;
class SecureSolrJQuery {

   public static void main(String[] args)
          throws MalformedURLException, SolrServerException {

             String queryParameter = args.length == 1 ? args[0] : "*";

             String urlString = "http://your.datanode.net:8983/solr/poems";
             SolrServer solr = new HttpSolrServer(urlString);

             SolrQuery query = new SolrQuery();
             query.set("q", "text:"+queryParameter);

             QueryResponse response = solr.query(query);
             SolrDocumentList results = response.getResults();
             for (int i = 0; i < results.size(); ++i) {
                     System.out.println(results.get(i));
             }
   }

}

Create a JAAS config (jaas-cache.conf) to use the Kerberos ticket cache (that is, your existing ticket from kinit):

Client {
             com.sun.security.auth.module.Krb5LoginModule required
             useTicketCache=true
             debug=false;
      };

Later, you’ll see how to achieve the same goal with a keytab to make authentication happen non-interactively.

Using the Code

Compile the Java class:

CP=`find /opt/cloudera/parcels/CDH/lib/solr/ | grep "\.jar" | tr '\n' ':'`
CP=$CP:`hadoop classpath`
javac -cp $CP SecureSolrJQuery.java

Create a shell script called query-solrj-jaas.sh to run the query code:

CP=`find /opt/cloudera/parcels/CDH/lib/solr/ | grep "\.jar" | tr '\n' ':'`
CP=$CP:`hadoop classpath`
java -Djava.security.auth.login.config=`pwd`/jaas-cache.conf -cp $CP SecureSolrJQuery $1

kinit as a user who is a member of cloudera_hadoop_admin (or any other group with query privileges) and run the code:

kinit
 
./query-solrj-jaas.sh
15/03/25 16:00:57 INFO impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
15/03/25 16:00:57 INFO impl.HttpClientUtil: Setting up SPNego auth with config: /home//solr_test/jaas-cache
SolrDocument{id=1, text=Mary had a little lamb, it’s fleece was white as snow., _version_=1496618383939993600}
SolrDocument{id=2, text=The quick brown fox jumps over the lazy dog., _version_=1496618383970402304}

To verify that Sentry denies access as intended, change sentry-provider.ini so that no group is mapped to a role with privileges on the collection. Performing kinit as a user who is not in a group mapped to an appropriate role has the same effect.

Policy:

[groups]
nogroup = admin_role 
[roles]
admin_role = collection = *->action=*





Effect:

./query-solrj-jaas.sh
15/03/25 16:03:32 INFO impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
15/03/25 16:03:33 INFO impl.HttpClientUtil: Setting up SPNego auth with config: /home/a.jkunig/solr_test/jaas-cache
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.sentry.binding.solr.authz.SentrySolrAuthorizationException: User bob does not have privileges for poems
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:556)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:221)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:216)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.cloudera.fts.solr.query.SecureSolrJQuery.main(SecureSolrJQuery.java:36)

Accessing the Collection via Flume

Add two more sample records to a file called data.txt, in exactly the following format, which we will use in the Morphline:

3|Mary had additional lambs, their fleeces were like the first.
4|The quick brown fox still jumps over the lazy dog.
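As a quick, purely illustrative check of how the morphline's readCSV command will split these records (separator "|", columns id and text), you can preview the parse with awk; the filenames and field labels here are just for demonstration:

```shell
# Illustrative preview (not part of the pipeline): split each record on "|"
# the way the morphline readCSV command will, into id and text fields.
printf '3|Mary had additional lambs, their fleeces were like the first.\n4|The quick brown fox still jumps over the lazy dog.\n' > data.txt

awk -F'|' '{ printf "id=%s text=%s\n", $1, $2 }' data.txt
# → id=3 text=Mary had additional lambs, their fleeces were like the first.
# → id=4 text=The quick brown fox still jumps over the lazy dog.
```

Each input line becomes one Solr document, with the first column mapped to the id field and the remainder to text.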

Create a morphline.conf file to transform the text data into Solr documents:



SOLR_LOCATOR : {
  collection : poems
  zkHost : "your.datanode.net:2181/solr"
}

morphlines : [
  {
    id : morphline1

    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readCSV {
          separator : "|"
          columns : [id, text]
          charset : UTF-8
        }
      }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]


Prepare a Keytab for Flume

Create a technical user (e.g. tech.hadoop), create a principal for this user, and extract a keytab for that principal. The exact method depends on whether you use MIT Kerberos or Microsoft Active Directory.

Give the user appropriate permissions to update the collection. For example, the user could be a member of our cloudera_hadoop_techusers group.
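As a sketch of the principal and keytab setup on an MIT Kerberos KDC (Active Directory environments use different tooling), the commands could look like the following; the realm name and paths are placeholders to adapt to your environment:

```shell
# Illustrative only: create the principal and export its keytab on an MIT KDC.
# YOURREALM and the keytab path are placeholders.
kadmin.local -q "addprinc -randkey tech.hadoop@YOURREALM"
kadmin.local -q "xst -k /home/tech.hadoop/tech.hadoop.keytab tech.hadoop@YOURREALM"

# Verify the keytab contents:
klist -kt /home/tech.hadoop/tech.hadoop.keytab
```

Note that xst re-randomizes the principal's key by default, so export the keytab before relying on password-based logins for this principal.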

Next, create a local JAAS config file (jaas-kt.conf) that uses the keytab of the tech user:


Client {
   com.sun.security.auth.module.Krb5LoginModule   required
   useKeyTab=true
   useTicketCache=false
   keyTab="/home/tech.hadoop/tech.hadoop.keytab"
   principal="tech.hadoop@YOURREALM";
  };

Configure and Start Flume

Create a Flume configuration file (flume.conf) that pushes the data:


# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
 
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
 
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = /home/path/to/morphline.conf
 
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start the agent using the JAAS config:

flume-ng agent -n a1 -f ./flume.conf -Xmx1G -Djava.security.auth.login.config=/home/tech.hadoop/jaas-kt.conf

Ingest the data file:

cat ./data.txt | nc localhost 44444

Configure Flume in Cloudera Manager

Cloudera Manager automatically generates a JAAS configuration for Flume that is used by Java client code such as that generated by the Morphline sink. There are three options to get the desired behavior when Cloudera Manager manages the execution of the Flume agent:

  • Cloudera Manager creates keytabs for a principal other than flume: We configure the Flume service with a principal of our choosing, such as our tech.hadoop principal from above, by changing the “Kerberos Principal” setting under Flume → Configuration → Security. Cloudera Manager will then create a keytab for tech.hadoop/yourhost@YOURREALM, where yourhost is the host running the Flume agent, and use this principal globally as the Hadoop service principal for Flume. Authentication requests against a “Sentry-fied” Solr service will map tech.hadoop/yourhost@YOURREALM to the tech.hadoop user / cloudera_hadoop_techusers group, which is eligible to access the collection. The Flume agent will still run as the flume user. (Note: While this is a quick configuration change, it may not be desirable to change the Flume principal globally.)
  • Use Cloudera Manager’s Flume principal for Sentry authorization: This option does not change anything in the default service configuration of Flume in Cloudera Manager, which means that Flume will access Solr with the flume/yourhost@YOURREALM principal (where yourhost is the host running the Flume agent). This option requires that the Linux user flume is a member of the cloudera_hadoop_techusers group (or any other group that has the appropriate privileges as per our sentry-provider.ini), so that the Sentry-fied Solr server permits flume to access the collection. (Again, depending on your needs, it may or may not be desirable to do that.)
  • Cloudera Manager uses a user-defined JAAS configuration to run the Flume agent: We place the jaas-kt.conf, which we previously generated, as well as the keytab tech.hadoop.keytab, in /etc/flume-ng/conf/. The location of the files is in fact arbitrary, but we need to make sure they can be accessed by the flume user:
    chown flume:flume /etc/flume-ng/conf/tech.hadoop.keytab /etc/flume-ng/conf/jaas-kt.conf

    In Cloudera Manager, we then use the “Flume Service Environment Advanced Configuration Snippet (Safety Valve)” under Flume → Configuration → Security  to supply custom Java options to the Flume agent:
    FLUME_AGENT_JAVA_OPTS="-Xms262144000 -Xmx262144000 -XX:OnOutOfMemoryError={AGENT_COMMON_DIR}/killparent.sh -Dflume.monitoring.type=HTTP -Dflume.monitoring.port=41414 -Djava.security.auth.login.config=/etc/flume-ng/conf/jaas-kt.conf"

The options above were copied from the standard options derived by Cloudera Manager, with the exception of the -Djava.security.auth.login.config flag.

Conclusion

At this point, you should have a good understanding of how to use Sentry to manage access control and enforce authorization for queries to Solr from Java-based applications and Flume, using Kerberos for strong authentication.


Topics:
hadoop, big data, kerberos, solr, lucene, search

Published at DZone with permission of Jan Kunigk, DZone MVB. See the original article here.
