Understanding Information Retrieval by Using Apache Lucene and Tika - Part 3

Lesson 4: Highlight fragments where the searched phrase was found

A word or phrase search has two goals: to discover which documents contain it, and to identify the fragment of each document where it was matched. Two helper classes expose the document hits:

  • org.apache.lucene.search.TopDocs - contains references to the top documents returned by a search. TopDocs members:
    • totalHits - number of documents that matched the search
    • scoreDocs - array of ScoreDoc instances containing the results
    • getMaxScore() - returns the best score of all matches
  • org.apache.lucene.search.ScoreDoc - represents a single search result. ScoreDoc fields:
    • doc - document id
    • score - document score
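To make the relationship between these members concrete, here is a minimal plain-Java sketch. The Hit and TopHits classes and the collect method are hypothetical stand-ins that only mirror the shape of ScoreDoc and TopDocs; they are not Lucene classes, and real scoring is considerably more involved. The point is that totalHits counts every match, while scoreDocs holds only the best-scoring ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-ins that mirror the shape of ScoreDoc and TopDocs.
// They are NOT Lucene classes; they only illustrate how the members relate.
final class TopHitsSketch {

    /** Mirrors ScoreDoc: a document id plus its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Mirrors TopDocs: total match count plus only the best-scoring hits. */
    static final class TopHits {
        final int totalHits;   // every match, including those not returned
        final Hit[] scoreDocs; // at most maxResults hits, best score first

        TopHits(int totalHits, Hit[] scoreDocs) {
            this.totalHits = totalHits;
            this.scoreDocs = scoreDocs;
        }

        /** Best score among the returned hits, or NaN when there are none. */
        float getMaxScore() {
            return scoreDocs.length == 0 ? Float.NaN : scoreDocs[0].score;
        }
    }

    /** Keeps the maxResults best-scoring hits, sorted by descending score. */
    static TopHits collect(List<Hit> allMatches, int maxResults) {
        List<Hit> sorted = new ArrayList<>(allMatches);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        int n = Math.min(maxResults, sorted.size());
        Hit[] best = sorted.subList(0, n).toArray(new Hit[0]);
        return new TopHits(allMatches.size(), best);
    }

    public static void main(String[] args) {
        List<Hit> matches = Arrays.asList(
                new Hit(7, 0.42f), new Hit(3, 1.35f), new Hit(9, 0.88f));
        TopHits top = collect(matches, 2);
        System.out.println("Total matches: " + top.totalHits);
        for (Hit hit : top.scoreDocs) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

With three matches and a limit of two, the sketch reports three total hits but returns only the two highest-scoring documents, which is why a results page can show "N documents found" while listing fewer entries.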

To retrieve the fragment where the phrase or word was found, the Highlighter class is used. The highlight package (org.apache.lucene.search.highlight) contains classes that provide "keyword in context" features, typically used to highlight search terms in the text of results pages. The Highlighter class is the central component: it extracts the most interesting sections of a piece of text and highlights them, with the help of a Fragmenter.
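The "keyword in context" idea can be illustrated without Lucene. The sketch below is a deliberately simplified stand-in, not Lucene's implementation: it finds the first case-insensitive occurrence of a term, cuts a window of surrounding characters, and wraps the match in <B> tags (the tags SimpleHTMLFormatter uses by default). The class name KwicSketch and its methods are assumptions made for illustration only.

```java
// Simplified keyword-in-context illustration in plain Java.
// This is NOT how Lucene's Highlighter is implemented; it only shows the idea.
final class KwicSketch {

    /** Case-insensitive search that never changes string lengths. */
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns a fragment around the first occurrence of term, keeping up to
     * `context` characters on each side and wrapping the match in <B> tags.
     * Returns null when the term is empty or does not occur in the text.
     */
    static String bestFragment(String text, String term, int context) {
        if (term.isEmpty()) {
            return null;
        }
        int start = indexOfIgnoreCase(text, term);
        if (start < 0) {
            return null;
        }
        int end = start + term.length();
        int from = Math.max(0, start - context);
        int to = Math.min(text.length(), end + context);
        return (from > 0 ? "..." : "")
                + text.substring(from, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end, to)
                + (to < text.length() ? "..." : "");
    }

    public static void main(String[] args) {
        String text = "Apache Lucene is a search library";
        System.out.println(bestFragment(text, "lucene", 7));
    }
}
```

The real Highlighter works on analyzed tokens rather than raw substrings, scores candidate fragments against the query, and can return several fragments per document; the sketch covers only the cut-and-mark step.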

Listing 4.1 Obtaining document hits of the searched phrase

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;
import org.apache.lucene.search.highlight.*;

public static String processResults(IndexReader reader, 
        IndexSearcher searcher, Analyzer analyzer, Query query, 
        TopScoreDocCollector collector) throws Exception {
        StringBuffer answer = new StringBuffer();

        // formatter marks the matched terms; encoder escapes the remaining text as HTML
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
        Highlighter highlighter = new Highlighter(htmlFormatter, 
                new QueryScorer(query));
        highlighter.setEncoder(new SimpleHTMLEncoder());
        // hits gathered by the collector (assumes it was already used in a search)
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        //obtaining document hits of the search phrase
        TopDocs topDocs = searcher.search(query, 
                ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH);

        answer.append("Number of documents where it was found : ")
                .append(topDocs.totalHits);
        for (int i = 0; i < hits.length; i++) {
            Document doc = reader.document(hits[i].doc);

            answer.append("Document where phrase or part of it was found and its scoring : ");
            answer.append(doc.get("file")).append("  (").append(hits[i].score)
                    .append(")");
            // pass the Lucene document id (hits[i].doc), not the loop index
            answer.append(searchInContent(highlighter, reader, doc, hits[i].doc, 
                    ISearchConstants.FIELD_ABSTRACT_TEXT, analyzer));
            answer.append(searchInContent(highlighter, reader, doc, hits[i].doc, 
                    ISearchConstants.FIELD_TEXT, analyzer));
        }

        return answer.toString();
}

An Analyzer produces a TokenStream for a given field; that TokenStream is what the Highlighter consumes to locate and highlight the matching fragments.
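What makes a token stream useful for highlighting is that each token carries the character offsets of the term in the original text. The plain-Java sketch below illustrates this; the Token class and the tokenize and offsetsFit methods are hypothetical and much simpler than a real Lucene Analyzer. It also shows why offsets must agree with the text being highlighted: a token whose offsets point past the end of the text cannot be marked, which is the kind of mismatch Lucene's Highlighter reports as an InvalidTokenOffsetsException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of what a token stream carries: terms plus offsets.
// The Token class and both methods are hypothetical, not Lucene APIs.
final class OffsetTokenizerSketch {

    static final class Token {
        final String term;
        final int startOffset; // index of the first character in the source text
        final int endOffset;   // index one past the last character

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on characters that are not letters or digits; lowercases terms. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // skip separators
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // consume one term
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(term, start, i));
            }
        }
        return tokens;
    }

    /** A token can only be highlighted if its offsets fit inside the text. */
    static boolean offsetsFit(Token token, String text) {
        return token.startOffset >= 0
                && token.startOffset <= token.endOffset
                && token.endOffset <= text.length();
    }

    public static void main(String[] args) {
        String text = "Lucene, and Tika!";
        for (Token t : tokenize(text)) {
            System.out.println(t.term + " [" + t.startOffset + ", " + t.endOffset + ")");
        }
    }
}
```

Because the offsets index into the original string, the highlighter can wrap the original characters (with their original casing and punctuation) even though the terms themselves were normalized during analysis.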

The Highlighter object created above is then used to retrieve the fragments where the phrase or word was found:

Listing 4.2 Highlighting fragments (abstract is analyzed first, then the regular indexed text)

private static String searchInContent(Highlighter highlighter, 
          IndexReader indexReader, Document doc, Integer docId, 
          String contentOption, Analyzer analyzer) throws Exception {
        StringBuffer answer = new StringBuffer();
        String text = doc.get(contentOption);
        boolean isAbstract = contentOption
                .equals(ISearchConstants.FIELD_ABSTRACT_TEXT);
        // uses stored term vectors when available, otherwise re-analyzes the stored field
        TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, 
                docId, contentOption, analyzer);
       //try to get the best matching fragment(s) 

        try {
            TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, 
                    text, false, ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH);
            for (int j = 0; j < frag.length; j++) {
                answer.append(SearchUtils.processText(frag[j], isAbstract, j+1));
            }
        } catch (InvalidTokenOffsetsException ex) {
            // token offsets do not match the text: fall back to a single placeholder fragment
            TextFragment frag  = new TextFragment(text, 0, 10);
            answer.append(SearchUtils.processText(frag, isAbstract, 1));
        }

        return answer.toString();
}

Personal note: when indexing, try to avoid documents that contain images; the Highlighter class cannot retrieve fragments from those documents, and an InvalidTokenOffsetsException will be thrown.


