DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations

Trending

  • Real-Time Presence Platform System Design
  • File Upload Security and Malware Protection
  • Tomorrow’s Cloud Today: Unpacking the Future of Cloud Computing
  • Enriching Kafka Applications With Contextual Data
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Deployment
  4. Solr and Tika integration (part 1 – basics)

Solr and Tika integration (part 1 – basics)

Rafał Kuć user avatar by
Rafał Kuć
·
Jun. 16, 11 · News
Like (0)
Save
Tweet
Share
17.20K Views

Join the DZone community and get the full member experience.

Join For Free

Indexing the so-called “rich documents”, ie files like pdf, doc, rtf, and so on (or binary files) always required some additional work on the developer side, at least to get the contents of the file and prepare it in a format understood by the search engines, in this case for Solr. To minimize this job I decided to look at the Apache Tika and integration of this library with Solr.

Introduction

First, a few words about the opportunities that we have when we choose Apache Tika. Apache Tika is a framework designed to extract information from the so-called “rich documents”-documents such as PDF files, the files in Microsoft Office format, rtf, but not only. Using Apache Tika we can also extract information from compressed documents, HTML files, images (eg jpg, png, gif), audio files (eg mp3, midi, wave), and compiled Java bytecode files. In addition, Apache Tika can detect the type of file being processed, which further simplifies the work with such documents. It is worth mentioning that the described framework is based on libraries such as PDFBox, Apache POI, or Neko HTML which indirectly guarantee very good results of extracted data.

Sample index structure

I’ll skip how to manually start the extraction of the contents of the documents in the Apache Tika and I will focus on the integration of this framework with Solr and how trivial it is. Assume that we are interested in the ID, title and contents of the documents we have to index. Thus we create a simple schema.xml file describing the index structure, which could look like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="tytul" type="text" indexed="true" stored="true"/>
<field name="zawartosc" type="text" indexed="true" stored="false" multiValued="true"/>


Configuration

To the solrconfig.xml file we add the following entry which defines a handler that will handle the indexing of documents using Apache Tika:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
   <lst name="defaults">
      <str name="fmap.Last-Modified">last_modified</str>
      <bool name="uprefix">ignored_</bool>
    </lst>
</requestHandler>

All update requests sent to the /update/extract address will be handled by Apache Tika. Of course, remember to send the commit command to the update handler after sending the documents to the handler using Apache Tika. Otherwise, your documents won’t be visible. In the standard Solr deployment you shoudl send the commit command to the handler located under /update.

In the configuration we told the extraction handler to assign the Last-Modified attribute to the last_modified field and to ignore the fields that do are not specified.

Additional notes

If you are going to index large binary files, remember to change the size limits. To do that, change the following values in the solrconfig.xml file:<requestDispatcher handleSelect=”true”> <requestParsers enableRemoteStreaming=”false” multipartUploadLimitInKB=”10240″ />

fmap.Last-Modified

The end

All parameters defining the ExtractingRequestHandler can be found at: http://wiki.apache.org/solr/ExtractingRequestHandler.

Apache Tika Integration

Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Real-Time Presence Platform System Design
  • File Upload Security and Malware Protection
  • Tomorrow’s Cloud Today: Unpacking the Future of Cloud Computing
  • Enriching Kafka Applications With Contextual Data

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: