
Book Review: Web Crawling and Data Mining with Apache Nutch


In our space, some of the most current healthcare-related information lives on the internet. We harvest that information as input to our healthcare masterfile. Our crawlers run against hundreds of websites, and the size of that web harvester is what drove me to explore Nutch with Cassandra: Crawling the web with Cassandra.

When Web Crawling and Data Mining with Apache Nutch came out, I was eager to have a read. The first quarter of the book is largely introductory. It walks you through the basics of operating Nutch and the layers in the design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with Solr).
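For readers new to the space, here is a toy sketch of how those layers fit together in a crawl loop. It is not Nutch's API: the function names, the in-memory `FAKE_WEB` dictionary standing in for real HTTP fetches, the in-link-count scoring, and the plain dictionary standing in for a Solr index are all assumptions made for illustration.

```python
import re

# Hypothetical in-memory "web" (URL -> HTML); stands in for real HTTP fetches.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}

LINK_RE = re.compile(r'href="([^"]+)"')


def inject(crawl_db, seeds):
    """Inject: add seed URLs to the crawl database as unfetched."""
    for url in seeds:
        crawl_db.setdefault(url, {"fetched": False, "score": 1.0})


def generate(crawl_db, top_n):
    """Generate: pick up to top_n unfetched URLs, highest score first."""
    pending = [u for u, meta in crawl_db.items() if not meta["fetched"]]
    pending.sort(key=lambda u: (-crawl_db[u]["score"], u))
    return pending[:top_n]


def fetch(url):
    """Fetch: return page content (empty string if the URL is unknown)."""
    return FAKE_WEB.get(url, "")


def parse(html):
    """Parse: extract outlinks from the page."""
    return LINK_RE.findall(html)


def crawl(seeds, rounds, top_n=10):
    crawl_db = {}  # URL -> fetch status and score
    index = {}     # assumed stand-in for a search index such as Solr
    inject(crawl_db, seeds)
    for _ in range(rounds):
        batch = generate(crawl_db, top_n)
        if not batch:
            break
        for url in batch:
            html = fetch(url)
            crawl_db[url]["fetched"] = True
            for link in parse(html):
                # Scoring (assumed): one point per in-link discovered.
                entry = crawl_db.setdefault(link, {"fetched": False, "score": 0.0})
                entry["score"] += 1.0
            index[url] = html  # Indexing: store the fetched page
    return crawl_db, index


crawl_db, index = crawl(["http://a.example/"], rounds=3)
print(sorted(index))
```

Each round only fetches what the generate step selected, so newly discovered links wait until the next round. That batch-oriented shape is the point of the sketch; everything else is simplified.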

For me, the book got a bit more interesting when it covered the Nutch Plugin architecture.  HINT: Take a look at the overall architecture diagram on Page 34 before you start reading!

The book then covers deployment and scaling. A fair amount of time is spent on Solr deployment and scaling (via sharding), which in and of itself may be valuable if you are a Solr shop. (Not so much if you are an Elasticsearch (ES) fan; in fact, it was one of the reasons why we moved to ES.)

About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop.  This includes detailed instructions on Hadoop installation and configuration. This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.

Overall, this is a solid book, especially if you are new to the space and need detailed, line-by-line instructions to get up and running. To kick it up a notch, it would have been nice to have a smattering of use cases and real-world examples, but given that the book is only about a hundred pages, it does a good job of balancing utility with color commentary.

The book is available from Packt.


Published at DZone with permission of Brian O'Neill, DZone MVB. See the original article here.


