
Understanding Information Retrieval by Using Apache Lucene and Tika - Part 3



Lesson 4: Highlight fragments where searched phrase was found

A word or phrase search has two goals: to find the documents that contain it, and to identify the fragment within each document where the match occurred. Two helper classes describe the document hits:

  • org.apache.lucene.search.TopDocs - holds references to the top documents returned by a search. TopDocs members:
    • totalHits - number of documents that matched the search
    • scoreDocs - array of ScoreDoc instances containing the results
    • getMaxScore() - returns the best score of all matches
  • org.apache.lucene.search.ScoreDoc - represents a single search result. ScoreDoc fields:
    • doc - document id
    • score - document score
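
To make the relationship between the two classes concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: TopHitsSketch, Hit, TopHits and top() are hypothetical names invented for this illustration, and they only mimic the shape described in the list above (a total count, an array of per-document results, and a best score).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Lucene classes.
class TopHitsSketch {

    // Same shape as ScoreDoc: a document id plus the score of that document.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Same shape as TopDocs: the total match count plus the kept hits, best first.
    static final class TopHits {
        final int totalHits;
        final Hit[] scoreDocs;

        TopHits(int totalHits, Hit[] scoreDocs) {
            this.totalHits = totalHits;
            this.scoreDocs = scoreDocs;
        }

        // Best score among the kept hits, or NaN when no hits were kept.
        float getMaxScore() {
            return scoreDocs.length == 0 ? Float.NaN : scoreDocs[0].score;
        }
    }

    // Keeps the n highest-scoring matches while remembering how many matched overall.
    static TopHits top(List<Hit> allMatches, int n) {
        List<Hit> sorted = new ArrayList<>(allMatches);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int kept = Math.min(Math.max(n, 0), sorted.size());
        return new TopHits(sorted.size(), sorted.subList(0, kept).toArray(new Hit[0]));
    }

    public static void main(String[] args) {
        TopHits result = top(Arrays.asList(
                new Hit(7, 0.2f), new Hit(3, 0.9f), new Hit(5, 0.5f)), 2);
        System.out.println("matches: " + result.totalHits);
        for (Hit hit : result.scoreDocs) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

The point of the sketch is that the total number of matches can be larger than the number of results actually returned, which is why the two values are kept separately.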

The Highlighter class is used to retrieve the fragment where the phrase or word was found. The highlight package contains classes that provide "keyword in context" features, typically used to highlight search terms in the text of results pages. The Highlighter class is the central component: with the help of a Fragmenter, it extracts the most interesting sections of a piece of text and highlights them.
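
The "keyword in context" idea can be sketched in a few lines of plain Java. This is not how Lucene implements it: KwicSketch, bestFragment() and escape() are hypothetical names, the match is a simple case-insensitive substring search rather than a scored query, and the fragment is a fixed-size window around the first match. The <B> tags are the ones SimpleHTMLFormatter uses when constructed without arguments.

```java
// Hypothetical illustration of "keyword in context"; NOT Lucene's Highlighter.
class KwicSketch {

    // Returns the first occurrence of term with up to `context` characters on
    // each side, HTML-escaped, with the match wrapped in <B>...</B>.
    // Returns an empty string when the term is empty or does not occur.
    static String bestFragment(String text, String term, int context) {
        if (term.isEmpty()) {
            return "";
        }
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            // Case-insensitive comparison at position i.
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }
        int matchEnd = hit + term.length();
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), matchEnd + context);
        return escape(text.substring(start, hit))
                + "<B>" + escape(text.substring(hit, matchEnd)) + "</B>"
                + escape(text.substring(matchEnd, end));
    }

    // Minimal HTML escaping so the fragment can be embedded in a results page.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(bestFragment("Apache Lucene is a search library", "lucene", 4));
    }
}
```

In Lucene the three responsibilities in this sketch are split across separate classes: a scorer decides which terms count as matches, a Fragmenter decides where fragments begin and end, and a formatter and encoder produce the markup.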

Listing 4.1 Obtaining document hits of the searched phrase

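import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLEncoder;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public static String processResults(IndexReader reader, 
        IndexSearcher searcher, Analyzer analyzer, Query query, 
        TopScoreDocCollector collector) throws Exception {
        StringBuffer answer = new StringBuffer();

        // Wrap matched terms in HTML tags and HTML-encode the surrounding text.
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, 
                new QueryScorer(query));
        highlighter.setEncoder(new SimpleHTMLEncoder());
        // The hits are read from the collector, so the caller must already
        // have run searcher.search(query, collector) for it to hold results.
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // obtaining document hits of the search phrase
        searcher.search(query, ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH);

        answer.append("Number of documents where it was found : ")
                .append(hits.length);
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            Document doc = reader.document(docId);

            answer.append("Document where phrase or part of it was found and its scoring : ");
            answer.append(doc.get("file")).append("  (").append(hits[i].score)
                    .append(")");
            // Pass the Lucene document id, not the loop index.
            answer.append(searchInContent(highlighter, reader, doc, docId, 
                    ISearchConstants.FIELD_ABSTRACT_TEXT, analyzer));
            answer.append(searchInContent(highlighter, reader, doc, docId, 
                    ISearchConstants.FIELD_TEXT, analyzer));
        }
        return answer.toString();
}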

Analyzers return a TokenStream, and the Highlighter uses that TokenStream to locate the fragments to highlight.

Furthermore, the Highlighter object created above can be used to retrieve the fragment where the phrase or word was found:

Listing 4.2 Highlighting fragments (abstract is analyzed first, then the regular indexed text)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.TextFragment;
import org.apache.lucene.search.highlight.TokenSources;

private static String searchInContent(Highlighter highlighter, 
          IndexReader indexReader, Document doc, Integer docId, 
          String contentOption, Analyzer analyzer) throws Exception {
        StringBuffer answer = new StringBuffer();
        String text = doc.get(contentOption);
        boolean isAbstract = contentOption
                .equals(ISearchConstants.FIELD_ABSTRACT_TEXT);
        TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, 
                docId, contentOption, analyzer);

        try {
            // Try to get the best matching fragment(s).
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, 
                    text, false, ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH);
            for (int j = 0; j < frag.length; j++) {
                answer.append(SearchUtils.processText(frag[j], isAbstract, j + 1));
            }
        } catch (InvalidTokenOffsetsException ex) {
            // Highlighting failed; fall back to a fragment built from the raw text.
            TextFragment frag = new TextFragment(text, 0, 10);
            answer.append(SearchUtils.processText(frag, isAbstract, 1));
        }
        return answer.toString();
}

Personal note: when indexing, try to avoid documents that contain images; the Highlighter class cannot retrieve fragments from those documents, and an InvalidTokenOffsetsException will be thrown.
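
As far as I can tell, this exception is raised when a token's start or end offset points past the end of the text handed to the Highlighter, that is, when the token stream and the stored text do not describe the same content. The sketch below illustrates that condition in plain Java. OffsetSketch, Token, tokenize() and offsetsFit() are hypothetical names, and the tokenizer is a simple letter-or-digit splitter, not a Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of token offsets; NOT Lucene's TokenStream.
class OffsetSketch {

    // A term plus the character range [start, end) it occupies in the source text.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Splits on anything that is not a letter or digit, lowercases each term,
    // and records where the term was found.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    // True when every token lies inside the given text; a highlighter can only
    // cut fragments out of the text if this holds.
    static boolean offsetsFit(List<Token> tokens, String text) {
        for (Token t : tokens) {
            if (t.start > text.length() || t.end > text.length()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("Lucene and Tika");
        System.out.println(offsetsFit(tokens, "Lucene and Tika"));
        System.out.println(offsetsFit(tokens, "Lucene"));
    }
}
```

If the text extracted at indexing time differs from the text stored for display, the second situation in main() is what the highlighter runs into, which is why the listing above catches the exception and falls back to an unhighlighted fragment.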





