Rich Documents Processing: On the Search or Application Side

by Rafał Kuć · Jun. 18, 12 · Java Zone · Interview

When indexing so-called “rich documents”, we should sometimes think about where we want those documents to be processed: should we send them to Apache Solr (or another search engine, such as Elasticsearch) and forget about them, or should we run Apache Tika ourselves and send the extracted content, along with other information, for indexing?

Options

As I wrote a few lines above, we have two options. The first is to send the binaries to the search engine and, in Solr's case, use ExtractingRequestHandler (information about integrating Solr with Apache Tika can be found here) so that it does all the work for us. The second is to use the same (well, almost the same) functionality to parse the binary documents and get their contents before sending them to Solr. Of course, there is a third option, not possible in most cases: get the documents you want to index in a format that Solr understands :)

Processing on the search server side

The simplest approach is to process your “rich documents” on the search server side. Let's assume it's Apache Solr. We configure ExtractingRequestHandler the way we want it to work and forget about everything else. But it's not always the right approach. Imagine a situation where your indexing server is almost 100% utilized. If you added another source of load, you would probably suffer from performance problems. In such cases you will probably want to do it the other way.
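To make the server-side variant concrete, here is a minimal sketch of what the application has to do: nothing more than post the raw bytes to Solr. It assumes the handler is registered at /update/extract (the path used in Solr's stock example configuration) and uses the handler's literal.<field> and commit parameters. The core URL, the document id, and the class and method names are made up for illustration, and only the JDK's own HTTP client types are used.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch: hand a binary file to Solr and let ExtractingRequestHandler do the parsing.
// Assumption: the handler is mounted at /update/extract; core URL and id are hypothetical.
class ServerSideExtraction {

    // literal.id sets the "id" field of the resulting document;
    // commit=true asks Solr to commit right after this request.
    static URI extractUri(String solrCoreUrl, String docId) {
        String encodedId = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrCoreUrl + "/update/extract?literal.id=" + encodedId + "&commit=true");
    }

    // The request body is the untouched binary; Solr (through Tika) extracts the text.
    static HttpRequest extractRequest(String solrCoreUrl, String docId,
                                      byte[] fileBytes, String mimeType) {
        return HttpRequest.newBuilder(extractUri(solrCoreUrl, docId))
                .header("Content-Type", mimeType)
                .POST(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // stands in for the bytes of a real PDF
        HttpRequest request = extractRequest(
                "http://localhost:8983/solr/docs", "report 1", placeholder, "application/pdf");
        System.out.println(request.method() + " " + request.uri());
        // To really send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The application stays trivial here, which is exactly the appeal of this option. The price is that every parse runs on the Solr machine.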

Processing outside of the search server

If the number of rich documents is huge, or your indexing server is almost completely utilized, then it may be a good idea to process your binary files before sending them to the indexing server. Using Apache Tika, for example, we can build (quite easily) a good and reliable solution that processes rich documents in our application. Of course, such an approach requires a bit of knowledge of Java (or whatever other language you use for content extraction). It can save us from a situation where our indexing server is overloaded and, because of the amount of data, we can't do anything about it.
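A rough sketch of the application-side variant might look like the following. To keep it runnable with the JDK alone, the extraction step sits behind a small hypothetical interface (TextExtractor), and the demo uses a stand-in that simply reads a text file. With Apache Tika on the classpath, the Tika facade's parseToString method would be the natural implementation, as noted in the comment. The field name "content", the core URL, and all class and method names are assumptions for illustration. The document is sent to Solr's regular /update handler as JSON.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: extract the text in the application and send Solr only plain fields.
class ClientSideExtraction {

    // Hypothetical extension point. With Tika available, an implementation could be:
    //   file -> new org.apache.tika.Tika().parseToString(file)
    interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    // Minimal JSON string escaping, enough for extracted text.
    static String jsonString(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.append('"').toString();
    }

    // Builds a one-document JSON update; "content" is an assumed schema field.
    static String toSolrJson(String id, String content) {
        return "[{\"id\":" + jsonString(id) + ",\"content\":" + jsonString(content) + "}]";
    }

    // Plain JSON update request; no parsing work is left for the Solr machine.
    static HttpRequest updateRequest(String solrCoreUrl, String json) {
        return HttpRequest.newBuilder(URI.create(solrCoreUrl + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        TextExtractor extractor = Files::readString; // stand-in for a Tika-backed extractor
        Path file = Files.createTempFile("rich-doc", ".txt");
        Files.writeString(file, "Quarterly report\nAll numbers are up.");
        String json = toSolrJson("doc-1", extractor.extract(file));
        System.out.println(json);
        System.out.println(updateRequest("http://localhost:8983/solr/docs", json).uri());
        Files.delete(file);
    }
}
```

Because the extraction runs in the application, it can be moved to separate machines, throttled, or retried without touching the search server, which is the point of this option.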

A few words at the end

Once every few weeks we will publish posts that don't cover a particular Apache Solr feature but instead discuss a general search problem or describe the architecture of a system with search as one of its parts. We hope such posts will let us, and you, look at search topics from a wider perspective than the Apache Solr point of view alone.

Published at DZone with permission of Rafał Kuć, DZone MVB. See the original article here.
