When Web Crawling and Data Mining with Apache Nutch came out, I was eager to give it a read. The first quarter of the book is largely introductory. It walks you through the basics of operating Nutch and the phases of its design: Injecting, Generating, Fetching, Parsing, Scoring and Indexing (with SOLR).
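For readers who have not used Nutch before, those phases map roughly onto the 1.x command-line crawl cycle. This is a sketch from my own recollection, not from the book; directory names (urls/, crawl/) and the SOLR URL are placeholders:

```shell
# One iteration of the Nutch 1.x crawl cycle (illustrative paths).
bin/nutch inject crawl/crawldb urls              # Inject: seed URLs into the CrawlDB
bin/nutch generate crawl/crawldb crawl/segments  # Generate: select URLs due for fetching
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT                         # Fetch: download the pages
bin/nutch parse $SEGMENT                         # Parse: extract text and outlinks
bin/nutch updatedb crawl/crawldb $SEGMENT        # Update the CrawlDB with fetch results
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb $SEGMENT  # Index into SOLR
```

Scoring happens inside this loop via plugins (e.g. OPIC) rather than as a separate command, which is part of why the plugin architecture chapter matters.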
For me, the book got a bit more interesting when it covered the Nutch Plugin architecture. HINT: Take a look at the overall architecture diagram on Page 34 before you start reading!
The book then covers deployment and scaling. A fair amount of time is spent on SOLR deployment and scaling (via sharding), which in and of itself may be valuable if you are a SOLR shop (not so much if you are an Elasticsearch (ES) fan -- in fact, it was one of the reasons we moved to ES ;)
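For context on what sharding looks like in classic (pre-SolrCloud) SOLR of the book's vintage: the index is split across instances, and a query fans out to all shards via the shards parameter. The hostnames below are hypothetical:

```shell
# Illustrative distributed query against two hypothetical SOLR instances,
# each holding part of the index; SOLR merges the per-shard results.
curl 'http://host1:8983/solr/select?q=nutch&shards=host1:8983/solr,host2:8983/solr'
```

Keeping the shard list consistent on every query (and rebalancing documents by hand) is exactly the kind of operational overhead that pushed shops like ours toward alternatives.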
About midway through the book, the real fun starts when the author covers how to run Nutch with/on Hadoop. This includes detailed instructions on Hadoop installation and configuration. This is followed by a chapter on persistence mechanisms, which uses Gora to abstract away the actual storage.
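The appeal of the Gora approach is that the storage backend is a configuration choice rather than a code change. As a hedged illustration (not an excerpt from the book), selecting HBase as the backing store in gora.properties looks like this:

```properties
# gora.properties -- choose the default datastore; swapping this line
# (plus the matching mapping file) switches the storage backend.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```

Nutch then reads and writes its crawl data through Gora's generic API, oblivious to what sits underneath.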
Overall, this is a solid book, especially if you are new to the space and need detailed, line-by-line instructions to get up and running. To kick it up a notch, it would have been nice to have a smattering of use cases and real-world examples, but given the book is only about a hundred pages, it does a good job of balancing utility with color commentary.
The book is available from PACKT here: