How to Clone Wikipedia and Index it with Solr

It took six weeks, but Fred Zimmerman, a blogger for Nimblebooks.com, has just completed a very cool use case for Solr indexing: he cloned all of Wikipedia and then indexed it with Solr. His write-up starts with the prerequisites:

1.  "Hardware. I found out the hard way that 32-bit Ubuntu machines with 613 MB RAM (Amazon’s [EC2] “micro” instances) were not big enough—they created time out errors that disappeared when I upgraded to 1.7GB / single cores. You will also need at least 200 GB disk space, 300 is a safe figure."

2.  "Software.  You will need MediaWiki 1.17 or greater, several extensions (listed in this good page by Metachronistic), either mwimport or http://www.mediawiki.org/wiki/Manual:MWDumper, MySQL, and Apache Solr 3.4. Install the necessary MediaWiki extensions now, it will make it easier later on [to] verify that your database import was successful."

3. "Data.  Get the latest Wikipedia dump from http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia.  You probably want the pages-articles file which is ~ 8 GB compressed and ~ 33 GB uncompressed."

4....    --Nimblebooks.com
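
To give a feel for the data in step 3, here is a minimal Python sketch that streams a pages-articles XML dump and groups the pages into JSON batches. It is not the method from the quoted post, which imports the dump into MySQL with mwimport or MWDumper. The function names `iter_pages` and `solr_batches`, the field names `id`, `title`, and `text`, and the batch size are all assumptions made for illustration, and whether your Solr version and schema accept this JSON shape is something to check against the Solr documentation.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element without loading the whole dump.

    The field names (id, title, text) are hypothetical; match them to
    whatever your own Solr schema defines.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        doc = {"id": None, "title": None, "text": ""}
        for node in elem.iter():  # document order: title, page id, revision...
            name = local_name(node.tag)
            if name == "title":
                doc["title"] = node.text
            elif name == "id" and doc["id"] is None:
                doc["id"] = node.text  # first <id> belongs to the page itself
            elif name == "text":
                doc["text"] = node.text or ""
        yield doc
        root.clear()  # drop finished pages so memory use stays flat


def solr_batches(docs, batch_size=500):
    """Group documents into JSON arrays, one string per batch."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield json.dumps(batch)
```

Because `iter_pages` accepts any binary file-like object, a compressed dump can be opened with `bz2.open(path, "rb")` and passed in directly, which avoids writing the ~33 GB uncompressed file to disk. Sending each batch to Solr is deliberately left out, since the endpoint and request format depend on your installation.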

I think this is a great real-world tutorial that could help any developer get familiar with Solr, the open source search platform, or just sharpen their skills. A good read.

Source: http://www.nimblebooks.com/wordpress/2011/10/how-to-clone-wikipedia-and-index-it-with-solr/