Over a million developers have joined DZone.

Book Review: Web Crawling and Data Mining with Apache Nutch

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

In our space, we found that some of the most current healthcare related information is found on the internet.  We harvest that information as input to our healthcare masterfile. Our crawlers run against hundreds of websites. We have a fairly large web harvester, which is what drove me to explore Nutch with Cassandra: Crawling the web with Cassandra.

When Web Crawling and Data Mining with Apache Nutch came out, I was eager to have a read. The first quarter of the book is largely introductory.  It walks you through the basics of operating Nutch and the layers in the design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with SOLR).

For me, the book got a bit more interesting when it covered the Nutch Plugin architecture.  HINT: Take a look at the overall architecture diagram on Page 34 before you start reading!

The book then covers deployment and scaling.   A fair amount of time is spent on SOLR deployment and scaling (via sharding), which in and of itself may be valuable if you are a SOLR shop.   (not so much if you are Elastic Search (ES) fans -- in fact, it was one of the reasons why we moved to ES ;)

About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop.  This includes detailed instructions on Hadoop installation and configuration. This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.

Overall, this is a solid book, especially if you are new to the space and need detailed, line by line instructions to get up and running.  To kick it up a notch, it would have been nice to have a smattering of few use cases and real-world examples, but given the book is only about a hundred pages, it does a good job of balancing utility with color commentary.

The book is available from PACKT here:

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Brian O' Neill, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}