Security Layer for NutchServer
How basic authentication and SSL support were added to the open source Apache Nutch project.
I worked on the Security Layer for NutchServer as my GSoC 2016 project, and it is now complete. In this blog post, I’ll explain how it works and how to use it. If you haven’t already, I suggest first reading my previous post about my GSoC 2016 acceptance: http://furkankamaci.com/gsoc-2016-acceptance-for-apache-nutch/
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:
Nutch 1.x: A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are great for batch processing.
Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area; storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.
Nutch 2.x had a REST API, but it had no security layer. I’ve implemented Basic Authentication, Digest Authentication, and SSL support as authentication mechanisms, and added fine-grained authorization support to NutchServer.
To enable security on your NutchServer API, follow these steps:
- Enable security in nutch-site.xml by setting the restapi.auth property to BASIC, DIGEST, or SSL. The default is NONE, which provides no security.
- Set the restapi.auth.users property if you selected BASIC or DIGEST as the authentication type. Delimit the username, password, and role with the pipe character (|), and separate users with a comma (,), e.g. admin|admin|admin,user|user|user. The default is admin|admin|admin,user|user|user.
- Set the restapi.auth.ssl.storepath, restapi.auth.ssl.storepass, and restapi.auth.ssl.keypass properties if you selected SSL as the authentication mode in the restapi.auth property.
To connect to the NutchServer API from your client code, follow these steps:
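To make the restapi.auth.users format concrete, here is a minimal sketch of how such a value splits into users. The RestApiUsers class and its parse method are hypothetical names made up for illustration; this is not the parser NutchServer itself uses, only plain Java that follows the username|password|role rule described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
final class RestApiUsers {

    // Simple holder for one username|password|role entry.
    static final class User {
        final String name;
        final String password;
        final String role;

        User(String name, String password, String role) {
            this.name = name;
            this.password = password;
            this.role = role;
        }
    }

    // Splits a value such as "admin|admin|admin,user|user|user" into users.
    static List<User> parse(String value) {
        List<User> users = new ArrayList<>();
        for (String entry : value.split(",")) {
            // The pipe is a regex metacharacter, so it has to be escaped.
            String[] parts = entry.trim().split("\\|");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "Expected username|password|role but got: " + entry);
            }
            users.add(new User(parts[0], parts[1], parts[2]));
        }
        return users;
    }

    public static void main(String[] args) {
        for (User u : parse("admin|admin|admin,user|user|user")) {
            System.out.println(u.name + " has role " + u.role);
        }
    }
}
```

Because the default value ships with well-known credentials, it is worth replacing it with your own users before exposing the API.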
1. Basic Authentication
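// Requires org.restlet.resource.ClientResource, org.restlet.resource.ResourceException
// and org.restlet.data.ChallengeScheme from the Restlet client library.
// For Basic Authentication, challengeScheme is ChallengeScheme.HTTP_BASIC.
ClientResource resource = new ClientResource(protocol + "://" + domain + ":" + port + path);
resource.setChallengeResponse(challengeScheme, username, password);
try {
    resource.get();
} catch (ResourceException rex) {
    // Handle the failure here, e.g. a 401 response for wrong credentials
}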
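For readers curious about what goes over the wire, the following is a small sketch of how an HTTP Basic Authorization header value is built in general. It uses only the JDK; the BasicAuthHeader class is a hypothetical name for illustration and is not taken from Nutch or Restlet, which build this header for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; the Restlet client does this internally.
final class BasicAuthHeader {

    // Builds the value of the "Authorization" header for HTTP Basic authentication:
    // the word "Basic" followed by base64("username:password").
    static String headerValue(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Prints: Basic YWRtaW46YWRtaW4=
        System.out.println(headerValue("admin", "admin"));
    }
}
```

Base64 is an encoding, not encryption, so anyone who can read the traffic can recover the password from this header.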
2. Digest Authentication
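Start with the same code as in step 1. The first request typically fails with a 401 response, because the client needs data from the server's challenge to complete its own response. Then add the following after it: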
// Use server's data to complete the challengeResponse object
ChallengeRequest digestChallengeRequest = retrieveDigestChallengeRequest(resource);
ChallengeResponse challengeResponse = new ChallengeResponse(digestChallengeRequest, resource.getResponse(),
        username, password.toCharArray());
resource.setChallengeResponse(challengeResponse);
try {
    // Retry the request, this time with the completed digest response
    resource.get();
} catch (ResourceException rex) {
    // Handle the failure here
}
...
// Returns the server's HTTP Digest challenge, or null if the server did not send one
private ChallengeRequest retrieveDigestChallengeRequest(ClientResource resource) {
    ChallengeRequest digestChallengeRequest = null;
    for (ChallengeRequest cr : resource.getChallengeRequests()) {
        if (ChallengeScheme.HTTP_DIGEST.equals(cr.getScheme())) {
            digestChallengeRequest = cr;
            break;
        }
    }
    return digestChallengeRequest;
}
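To see why the client needs the server's data first, here is a rough sketch of the digest computation defined in RFC 2617 for the qop=auth case. It is plain JDK code, not Restlet's or Nutch's implementation, and the DigestResponseSketch class and its method names are hypothetical. The nonce and realm arguments are the values a server sends in its challenge, which is why the first request cannot carry a valid response.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of the RFC 2617 digest calculation (qop=auth).
final class DigestResponseSketch {

    // Returns the lowercase hex MD5 of the given string.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // realm and nonce come from the server's challenge; nc and cnonce are chosen by the client.
    static String response(String username, String realm, String password,
                           String method, String uri, String nonce,
                           String nc, String cnonce, String qop) {
        String ha1 = md5Hex(username + ":" + realm + ":" + password);
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + nc + ":" + cnonce + ":" + qop + ":" + ha2);
    }

    public static void main(String[] args) {
        // Example values taken from RFC 2617, section 3.5.
        System.out.println(response("Mufasa", "testrealm@host.com", "Circle Of Life",
                "GET", "/dir/index.html", "dcd98b7102dd2f0e8b11d0f600bfb0c093",
                "00000001", "0a4f113b", "auth"));
    }
}
```

Unlike Basic Authentication, the password itself never travels over the connection; only the hash derived from it and the server's nonce does.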
3. SSL
Follow the same procedure as for Basic Authentication, but do not forget to add the server's SSL certificate to your client's trust store.
NutchServer exposes much of its functionality over its REST API. Implementing authentication and authorization lets users communicate with it securely.
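One common way to point a Java client at a trust store is through the standard JSSE system properties, sketched below. This assumes your HTTP client connector falls back to the JVM's default trust store settings, which may not hold for every setup. The TrustStoreSetup class, the file path, and the password are hypothetical placeholders.

```java
// Hypothetical helper for illustration only; path and password are placeholders.
final class TrustStoreSetup {

    // Sets the standard JSSE properties that tell the JVM which trust store to use.
    // Call this before the first HTTPS connection is opened, because the default
    // SSL context is created once and then reused.
    static void useTrustStore(String path, String password) {
        System.setProperty("javax.net.ssl.trustStore", path);
        System.setProperty("javax.net.ssl.trustStorePassword", password);
    }

    public static void main(String[] args) {
        useTrustStore("/path/to/client-truststore.jks", "changeit");
        System.out.println("Trust store: " + System.getProperty("javax.net.ssl.trustStore"));
    }
}
```

The same properties can also be passed on the command line with -Djavax.net.ssl.trustStore=... and -Djavax.net.ssl.trustStorePassword=..., which keeps the password out of the source code.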
Published at DZone with permission of Furkan Kamaci, DZone MVB. See the original article here.