The so-called “rich documents”, i.e. files such as PDF, DOC, RTF, and so on (binary files), have always required some additional work on the developer's side, at least to get the contents of the file and prepare them in a format understood by the search engine, in this case Solr. To minimize this work, I decided to look at Apache Tika and the integration of this library with Solr.
First, a few words about what Apache Tika offers. Apache Tika is a framework designed to extract information from “rich documents” such as PDF files, Microsoft Office files, and RTF, but not only those. With Apache Tika we can also extract information from compressed documents, HTML files, images (e.g. JPG, PNG, GIF), audio files (e.g. MP3, MIDI, WAV), and compiled Java bytecode files. In addition, Apache Tika can detect the type of the file being processed, which further simplifies working with such documents. It is worth mentioning that the framework is built on libraries such as PDFBox, Apache POI, and NekoHTML, which indirectly helps ensure good extraction results.
Sample index structure
I will skip how to run content extraction manually in Apache Tika and focus on the integration of this framework with Solr, and on how trivial it is. Assume that we are interested in the ID, title, and contents of the documents we have to index. We therefore create a simple schema.xml file describing the index structure, which could look like this:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="tytul" type="text" indexed="true" stored="true"/>
<field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/>
To the solrconfig.xml file we add the following entry which defines a handler that will handle the indexing of documents using Apache Tika:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
All update requests sent to the /update/extract address will be handled by Apache Tika. Of course, remember to send the commit command to the update handler after sending the documents to the extraction handler. Otherwise, your documents won't be visible. In the standard Solr deployment you should send the commit command to the handler located under /update.
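The request flow described above can be sketched in Python as follows. This is only a minimal sketch under several assumptions: the Solr base URL (http://localhost:8983/solr), the helper names build_extract_url, index_file, and commit, and the fmap.content mapping are illustrative and do not come from the configuration above. The literal.id and fmap.* request parameters are standard ExtractingRequestHandler parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, field_map=None):
    """Build the /update/extract URL for a single document.

    literal.id sets the document ID; each fmap.<source>=<target> pair
    renames a field produced by Tika (hypothetical mapping, adjust to
    your schema).
    """
    params = [("literal.id", doc_id)]
    for source, target in (field_map or {}).items():
        params.append((f"fmap.{source}", target))
    return f"{base_url}/update/extract?{urlencode(params)}"


def index_file(base_url, path, doc_id, field_map=None):
    """Send the raw file as the POST body so Tika can extract its contents."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(base_url, doc_id, field_map),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urlopen(request) as response:
        return response.status


def commit(base_url):
    """Send the commit command to the standard /update handler."""
    request = Request(
        f"{base_url}/update",
        data=b"<commit/>",
        headers={"Content-Type": "text/xml"},
    )
    with urlopen(request) as response:
        return response.status
```

With a running Solr instance, calling index_file("http://localhost:8983/solr", "report.pdf", "doc1", {"content": "zawartosc"}) followed by commit("http://localhost:8983/solr") would index the file and make it visible; the file name and ID here are placeholders.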
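In the configuration we told the extraction handler to map the Last-Modified attribute to the last_modified field and to add the ignored_ prefix to every field that is not defined in the schema, so that such fields can be ignored (typically through a matching ignored_* dynamic field).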
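To make the effect of these two settings more concrete, here is a small Python model of the field-name handling. It is only an illustration and not Solr code: the function map_field_names and the sample metadata are hypothetical, and the schema is assumed to also define a last_modified field, which the sample schema above does not list. It follows the documented order of operations, where fmap rules are applied first and uprefix is then applied to names still unknown to the schema.

```python
def map_field_names(extracted, schema_fields, fmap=None, uprefix=None):
    """Model how fmap.* and uprefix rename fields extracted by Tika.

    extracted:     dict of field name -> value as produced by extraction
    schema_fields: set of field names defined in schema.xml
    fmap:          dict of source name -> target name (fmap.<source>=<target>)
    uprefix:       prefix added to names that are not in the schema
    """
    fmap = fmap or {}
    result = {}
    for name, value in extracted.items():
        # Step 1: apply the explicit mapping rule, if there is one.
        name = fmap.get(name, name)
        # Step 2: prefix any name the schema still does not know.
        if uprefix and name not in schema_fields:
            name = uprefix + name
        result[name] = value
    return result


# Assumed schema: the three sample fields plus last_modified.
SCHEMA_FIELDS = {"id", "tytul", "zawartosc", "last_modified"}

mapped = map_field_names(
    {"Last-Modified": "2010-01-01T00:00:00Z", "Author": "Jane"},
    SCHEMA_FIELDS,
    fmap={"Last-Modified": "last_modified"},
    uprefix="ignored_",
)
print(mapped)
```

In this model, Last-Modified ends up in last_modified, while Author, which the schema does not define, becomes ignored_Author.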
If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:
<requestDispatcher handleSelect="true">
All parameters accepted by the ExtractingRequestHandler are described at: http://wiki.apache.org/solr/ExtractingRequestHandler.