
Understanding Information Retrieval by Using Apache Lucene and Tika - Part 3


Lesson 4: Highlighting the fragments where the searched phrase was found

A word or phrase search has two goals: to discover which documents contain it, and to identify which fragment of each document matched. Two helper classes expose the document hits:

  • org.apache.lucene.search.TopDocs - contains references to the top documents returned by a search. TopDocs members:
    • totalHits - the number of documents that matched the search
    • scoreDocs - an array of ScoreDoc instances containing the results
    • getMaxScore() - returns the best score of all matches
  • org.apache.lucene.search.ScoreDoc - represents a single search result. ScoreDoc fields:
    • doc - the document id
    • score - the document score
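
The relationship between these two classes can be sketched in plain Java. The snippet below is only an illustration: Hit, TopHits, and top() are hypothetical stand-ins written for this example, not the Lucene classes, and they assume nothing beyond the members listed above. It shows that totalHits counts every match, while scoreDocs holds only the best-scoring ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the Lucene classes.
class HitListSketch {

    // Hypothetical counterpart of ScoreDoc: a document id and its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Hypothetical counterpart of TopDocs: total match count plus the best hits.
    static final class TopHits {
        final int totalHits;
        final Hit[] scoreDocs;

        TopHits(int totalHits, Hit[] scoreDocs) {
            this.totalHits = totalHits;
            this.scoreDocs = scoreDocs;
        }

        // Best score among the kept hits, or NaN when there are none.
        float getMaxScore() {
            return scoreDocs.length == 0 ? Float.NaN : scoreDocs[0].score;
        }
    }

    // Keeps the n best-scoring matches, sorted by descending score.
    static TopHits top(List<Hit> allMatches, int n) {
        List<Hit> sorted = new ArrayList<>(allMatches);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<Hit> best = sorted.subList(0, Math.min(n, sorted.size()));
        return new TopHits(allMatches.size(), best.toArray(new Hit[0]));
    }

    public static void main(String[] args) {
        TopHits result = top(Arrays.asList(
                new Hit(7, 0.4f), new Hit(2, 1.5f), new Hit(9, 0.9f)), 2);
        System.out.println("matches: " + result.totalHits
                + ", kept: " + result.scoreDocs.length);
        for (Hit hit : result.scoreDocs) {
            System.out.println("doc " + hit.doc + " scored " + hit.score);
        }
    }
}
```

With three matches and a limit of two, the sketch reports three total hits but keeps only documents 2 and 9, in that order.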

To retrieve the fragment where the phrase or word was found, the Highlighter class is used. The highlight package contains classes that provide "keyword in context" features, typically used to highlight search terms in the text of results pages. The Highlighter class is the central component: with the help of a Fragmenter, it extracts the most interesting sections of a piece of text and highlights the matching terms in them.
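
The "keyword in context" idea can be shown without Lucene. The following is a minimal sketch under stated assumptions, not the Highlighter's actual algorithm: bestFragment() is a hypothetical helper that treats each sentence as a fragment, scores a fragment by how often the term occurs in it, and wraps the matches in <B> tags (the default tags of SimpleHTMLFormatter). HTML encoding of the surrounding text is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration of "keyword in context"; not the Lucene Highlighter.
class KeywordInContextSketch {

    // Returns the sentence with the most occurrences of term, with each
    // occurrence wrapped in <B>...</B>, or "" when the term never occurs.
    static String bestFragment(String text, String term) {
        Pattern termPattern =
                Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        String best = null;
        int bestCount = 0;
        // "Fragmenter" step: split after sentence-ending punctuation.
        for (String fragment : text.split("(?<=[.!?])\\s+")) {
            Matcher matcher = termPattern.matcher(fragment);
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            // "Scorer" step: keep the first fragment with the highest count.
            if (count > bestCount) {
                bestCount = count;
                best = fragment;
            }
        }
        if (best == null) {
            return "";
        }
        // "Formatter" step: $0 stands for the matched text itself.
        return termPattern.matcher(best).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        String text = "Lucene indexes text. Tika extracts text from files.";
        System.out.println(bestFragment(text, "tika"));
    }
}
```

The three commented steps loosely correspond to the roles of the Fragmenter, the scorer, and the formatter that the Highlighter combines.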

Listing 4.1 Obtaining document hits of the searched phrase

public static String processResults(IndexReader reader, 
        IndexSearcher searcher, Analyzer analyzer, Query query, 
        TopScoreDocCollector collector) throws Exception {
        StringBuffer answer = new StringBuffer();

        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, 
                new QueryScorer(query));
        highlighter.setEncoder(new SimpleHTMLEncoder());
        // Obtain the document hits of the searched phrase. This assumes the caller
        // already filled the collector with searcher.search(query, collector).
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        answer.append("Number of documents where it was found : ")
                .append(collector.getTotalHits()).append("\n");
        answer.append("\n----------------------------------------------------\n");
        for (int i = 0; i < hits.length; i++) {
            Document doc = reader.document(hits[i].doc);

            answer.append("Document where phrase or part of it was found and its scoring : ");
            answer.append(doc.get("file")).append("  (").append(hits[i].score)
                    .append(")").append("\n");
            answer.append("\n-----------------------------------------------\n");
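            // Pass the Lucene document id (hits[i].doc), not the loop index,
            // so that the token stream is read from the matching document.
            answer.append(searchInContent(highlighter, reader, doc, hits[i].doc, 
                    ISearchConstants.FIELD_ABSTRACT_TEXT, analyzer));
            answer.append(searchInContent(highlighter, reader, doc, hits[i].doc, 
                    ISearchConstants.FIELD_TEXT, analyzer));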
            answer.append("\n------------------------------------------------\n");
        }

        return answer.toString();
    }

An Analyzer produces a TokenStream, and the Highlighter uses that TokenStream to locate and highlight the matching fragments.
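
Highlighting depends on each token carrying its position in the original text. The sketch below illustrates that idea in plain Java; Token, tokenize(), and highlight() are hypothetical names invented for this example and only approximate what an analyzer's token stream provides. Each token keeps a normalized term plus start and end offsets, and the offsets are what allow the original, unnormalized text to be marked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of tokens with offsets; not a Lucene TokenStream.
class OffsetTokenSketch {

    // A normalized term plus its [start, end) position in the original text.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Emits lowercased runs of letters and digits, remembering their offsets.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            tokens.add(new Token(term, start, i));
        }
        return tokens;
    }

    // Wraps every token equal to the term in <B>...</B>, using the offsets
    // to copy the original spelling and the text between matches.
    static String highlight(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token token : tokenize(text)) {
            if (token.term.equals(wanted)) {
                out.append(text, last, token.start)
                   .append("<B>").append(text, token.start, token.end).append("</B>");
                last = token.end;
            }
        }
        return out.append(text, last, text.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Apache Tika parses PDF; tika is handy.", "tika"));
    }
}
```

Because matching is done on the normalized term while the output is copied by offset, both "Tika" and "tika" are highlighted with their original capitalization.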


The Highlighter object created above is then used to retrieve the fragments where the phrase or word was found:

Listing 4.2 Highlighting fragments (abstract is analyzed first, then the regular indexed text)

private static String searchInContent(Highlighter highlighter, 
          IndexReader indexReader, Document doc, Integer docId, 
          String contentOption, Analyzer analyzer) throws Exception {
        StringBuffer answer = new StringBuffer();
        String text = doc.get(contentOption);
        TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, 
                docId, contentOption, analyzer);
       //try to get the best matching fragment(s) 

        try {

            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, 
                    text, false, ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH);
            for (int j = 0; j < frag.length; j++) {
                answer.append(SearchUtils.processText(frag[j], 
                        contentOption.equals(ISearchConstants.FIELD_ABSTRACT_TEXT), 
                        j + 1));
            }

        } catch (InvalidTokenOffsetsException ex) {
            ex.printStackTrace();
            TextFragment frag  = new TextFragment(text, 0, 10);
            answer.append(SearchUtils.processText(frag, 
                    contentOption.equals(ISearchConstants.FIELD_ABSTRACT_TEXT), 1));
        }

        return answer.toString();
    }

Personal Note: When indexing, try to avoid documents that contain images; the Highlighter class cannot perform fragment retrieval on those documents, and an InvalidTokenOffsetsException is thrown.
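
InvalidTokenOffsetsException signals that the tokens of a token stream are incompatible with the text they are applied to. As a rough illustration of that kind of mismatch, the plain-Java sketch below checks whether a set of token offsets fits inside a given text. The offsetsFit() helper is hypothetical and written only for this example; it is not part of Lucene and does not reproduce the Highlighter's internal checks.

```java
// Plain-Java illustration of an offset sanity check; not part of Lucene.
class OffsetCheckSketch {

    // Returns true when every {start, end} pair lies inside the text.
    static boolean offsetsFit(String text, int[][] offsets) {
        for (int[] offset : offsets) {
            int start = offset[0];
            int end = offset[1];
            if (start < 0 || end < start || end > text.length()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Offsets computed for a longer text than the one being highlighted.
        int[][] offsets = { {0, 6}, {7, 11}, {40, 46} };
        String storedText = "Apache Tika";
        System.out.println("offsets fit: " + offsetsFit(storedText, offsets));
    }
}
```

A check of this kind could be used to skip highlighting for a document instead of relying on the exception, assuming the token offsets are available before the Highlighter is called.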
