Understanding Information Retrieval by Using Apache Lucene and Tika - Part 2

Lesson 2: Automate text extraction and indexing from any file type

Each Lucene index consists of one or more segments:

  • A segment is a standalone index for a subset of documents.
  • All segments are searched.
  • A segment is created whenever IndexWriter flushes adds/deletes.
  • Periodically, IndexWriter will merge a set of segments into a single segment (the sketch after this list shows how the current number of segments can be inspected).
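To make the segment structure visible, here is a small sketch (not part of the downloadable project; the "index" path is only a placeholder) that opens an existing index and counts its segments, since in Lucene 4.x each segment is exposed as one leaf reader:

public static void printSegmentCount() throws IOException {
    // hypothetical path to an index such as the one produced in Listing 2.1
    Directory dir = FSDirectory.open(new File("index"));
    DirectoryReader reader = DirectoryReader.open(dir);
    // in Lucene 4.x every segment corresponds to one AtomicReaderContext leaf
    System.out.println("Segments in index: " + reader.leaves().size());
    reader.close();
    dir.close();
}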

Using the handler object obtained in Lesson 1, we will proceed to indexing the file content of each document in our data directory. Below is a snippet showing how a directory of documents can be handled using Lucene 4.8:

Listing 2.1 Processing material from a directory

public static String indexDirectory(String indexDirectory,
        String dataDirectory, Analyzer analyzer) throws IOException {
    StringBuffer sb = new StringBuffer();
    File docs = new File(dataDirectory);
    File indexDir = new File(indexDirectory);

    // the Directory abstraction represents the location of the index on disk
    Directory directory = FSDirectory.open(indexDir);

    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_48, analyzer);
    IndexWriter writer = new IndexWriter(directory, conf);
    writer.deleteAll(); // start from an empty index on every run
    for (File file : docs.listFiles()) {
        if (!file.isDirectory()) {
            DocumentWithAbstract abDoc = indexFile(analyzer, file);
            writer.addDocument(abDoc.getDoc());
//            sb.append("Abstract of document is : <br />");
//            sb.append(abDoc.getAbstractOfWords());
        }
    }
    writer.commit();
    writer.deleteUnusedFiles();
    sb.append("Indexes for ").append(writer.maxDoc()).append(" documents were written");
    writer.close(); // release the write lock on the index directory
    return sb.toString();
}

The directory content is read using org.apache.lucene.store.FSDirectory; the Directory object represents the location of an index and is given to the IndexWriter for processing. org.apache.lucene.index.IndexWriter is the central component that allows you to create a new index, open an existing one, and add, remove, or update documents in an index.
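As a brief illustration of the add/remove/update operations mentioned above, here is a hedged sketch in the style of Listing 2.1 (the "path" field name and the file names are placeholders, not part of the project):

public static void updateAndDelete(IndexWriter writer) throws IOException {
    Document updated = new Document();
    updated.add(new StringField("path", "report.pdf", Field.Store.YES)); // illustrative identifier field
    updated.add(new TextField("text", "newly extracted content", Field.Store.YES));
    // updateDocument atomically deletes any document matching the term and adds the new one
    writer.updateDocument(new Term("path", "report.pdf"), updated);
    // deleteDocuments removes every document containing the given term
    writer.deleteDocuments(new Term("path", "obsolete.doc"));
    writer.commit();
}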

As a particularity of the above code, the Analyzer parameter can be an instance of any class that extends Analyzer; the downloadable project includes a factory class, com.retriever.lucene.index.IndexCreatorLanguageFactory, that calls the above method with RomanianAnalyzer, EnglishAnalyzer, and FrenchAnalyzer. For proof-of-concept purposes we used two analyzers that handle text written with diacritics (RomanianAnalyzer and FrenchAnalyzer). One drawback of RomanianAnalyzer is that its default stopword list is not stored in UTF-8 format, so a different stopwords file was given to the analyzer.
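A hedged sketch of how such a custom stopword set can be supplied (the Romanian words below are only a few illustrative entries; the project loads its own UTF-8 stopwords file):

public static RomanianAnalyzer createAnalyzerWithCustomStopwords() {
    // CharArraySet built from an illustrative, hard-coded word list; the project reads a UTF-8 file instead
    CharArraySet stopwords = new CharArraySet(Version.LUCENE_48,
            Arrays.asList("și", "în", "de", "la"), true); // ignoreCase = true
    return new RomanianAnalyzer(Version.LUCENE_48, stopwords);
}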

Each file is passed to the indexFile method and then processed using the class com.retriever.lucene.index.utils.IndexUtils.
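The actual indexFile implementation belongs to the downloadable project; a simplified, hedged sketch of the Tika extraction step it builds on (using org.apache.tika.parser.AutoDetectParser and org.apache.tika.sax.BodyContentHandler; the field names "path" and "text" are illustrative, not the project's constants) could look like this:

public static Document extractWithTika(File file) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 lifts the default content length limit
    Metadata metadata = new Metadata();
    try (InputStream in = new FileInputStream(file)) {
        parser.parse(in, handler, metadata); // Tika detects the file type and extracts plain text
    }
    Document doc = new Document();
    doc.add(new StringField("path", file.getAbsolutePath(), Field.Store.YES));
    doc.add(new TextField("text", handler.toString(), Field.Store.YES));
    return doc;
}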

Lesson 3: Increase search efficiency through stemming, boosting, and scoring

Results of a search can be adjusted if indexing is customized. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form (see http://en.wikipedia.org/wiki/Word_stem). The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if the stem is not itself a valid root. In Lucene 4.8 stemming is included in each language analyzer, but before version 3.1 a dedicated snowball analyzer (org.apache.lucene.analysis.snowball.SnowballAnalyzer) was used for stemming.
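To see the stemming an analyzer applies, the tokens it produces can be printed directly. A minimal sketch using the EnglishAnalyzer (the field name and sample phrase are arbitrary, and the printed stems are only the typical output of the Porter stemmer):

public static void printStems() throws IOException {
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_48);
    TokenStream ts = analyzer.tokenStream("text", new StringReader("retrieving indexed documents"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString()); // typically prints "retriev", "index", "document"
    }
    ts.end();
    ts.close();
}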

Scoring shows how relevant a given Document is to a user's query. Document scoring can be altered, and words from the document can be made to appear more important when they are searched. Lucene allows influencing search results by "boosting" at more than one level:

  • Field-level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
  • Query-level boosting - during search - by setting a boost on a query clause, calling Query.setBoost() (a short sketch follows this list).
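For instance, a hedged sketch of a clause-level boost (the field names and the search term are illustrative): a BooleanQuery combines two TermQuery clauses, and the clause against the abstract field is boosted so that matches there weigh more than matches in the rest of the text.

public static BooleanQuery buildClauseBoostedQuery() {
    TermQuery inAbstract = new TermQuery(new Term("text_leading", "lucene"));
    inAbstract.setBoost(2.0f); // matches in the abstract count twice as much
    TermQuery inBody = new TermQuery(new Term("text", "lucene"));
    BooleanQuery query = new BooleanQuery();
    query.add(inAbstract, BooleanClause.Occur.SHOULD);
    query.add(inBody, BooleanClause.Occur.SHOULD);
    return query;
}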

In our sample, the first 100 distinct words from each document were "boosted" to double their score value in order to be retrieved ahead of their occurrences in other parts of the document(s).

Listing 3.1 Indexing file content while applying boosting/scoring

public static DocumentWithAbstract applyCustomIndexing(Document doc, String text) {
    DocumentWithAbstract abDoc = new DocumentWithAbstract(doc, calculateLeadingLength(text));
    int leadingLength = abDoc.getAbstractOfWords().length() + 100;
    // index the first 100 words of the document content and boost them
    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    type.setStoreTermVectors(true); // store the term vectors
    type.setTokenized(true);
    type.setStoreTermVectorOffsets(true);
    Field fieldLeading = new Field(ISearchConstants.FIELD_ABSTRACT_TEXT,
            text.substring(0, leadingLength), type);
    fieldLeading.setBoost(2); // boost the first 100 words of the document
    // index and store the remaining words of the document content
    Field otherField = new TextField(ISearchConstants.FIELD_TEXT,
            text.substring(leadingLength), Field.Store.YES);
    doc.add(fieldLeading);
    doc.add(otherField);
    return abDoc;
}

The search side relies on three Lucene classes (a minimal usage sketch follows this list):

  • org.apache.lucene.search.IndexSearcher - the central class that exposes several search methods on an index; it is accessed via an IndexReader.
  • org.apache.lucene.search.Query - the abstract query class. Concrete subclasses represent specific types of queries, e.g., matching terms in fields, Boolean queries, phrase queries.
  • org.apache.lucene.queryparser.classic.QueryParser - parses a textual representation of a query into a Query instance.
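The following hedged sketch shows the minimal flow through these classes using the classic single-field QueryParser (the index path, field name, and query string are placeholders; the project itself uses MultiFieldQueryParser, shown in Listing 3.5):

public static void simpleSearch() throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new QueryParser(Version.LUCENE_48, "text",
            new StandardAnalyzer(Version.LUCENE_48)).parse("information retrieval");
    for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        // load the stored fields of each hit and print an illustrative identifier
        System.out.println(searcher.doc(hit.doc).get("path") + " score=" + hit.score);
    }
    reader.close();
}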

Listing 3.2 Searching through indexed documents (with or without query boosting)

public static String search(String indexDirectory, String searchPhrase,
        SearchIndexOptions options) throws Exception {
    Directory index = FSDirectory.open(new File(indexDirectory));
    // Build a Query object
    Query query = SearchUtils.buildQuery(searchPhrase,
            options.isDefaultStopWords());

    // in Lucene 4.x index readers are opened through DirectoryReader
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            ISearchConstants.MAXIMUM_RESULTS_PER_SEARCH, true);
    // boosted search
    Float boost = options.getBoost();
    if (boost != null && boost > 0) {
        if (options.getScore()) {
            CustomScoreQuery customQuery = new SpecifiedScoreQuery(query);
            customQuery.setBoost(boost);
            searcher.search(customQuery, collector);
        } else {
            query.setBoost(boost);
            searcher.search(query, collector);
        }
    } else {
        searcher.search(query, collector);
    }

    return SearchUtils.processResults(reader, searcher, query, collector)
            .toString();
}

The method described above is the core of the class com.retriever.lucene.index.IndexFinder; if a boosted search is preferred, then documents matching the query will (in addition to the normal weightings) have their score multiplied by the value of Float boost = options.getBoost(); in this manner the matching documents are highlighted when retrieved.
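SearchUtils.processResults is part of the downloadable project; as a hedged sketch, reading the collector's hits typically looks like this (the "path" field and the output format are illustrative only):

public static String listHits(IndexSearcher searcher, TopScoreDocCollector collector)
        throws IOException {
    StringBuilder sb = new StringBuilder();
    for (ScoreDoc hit : collector.topDocs().scoreDocs) {
        Document doc = searcher.doc(hit.doc); // load the stored fields of the matched document
        sb.append(doc.get("path")).append(" (score=").append(hit.score).append(")<br />");
    }
    return sb.toString();
}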

Also, if custom query scoring is desired, then a separate implementation of CustomScoreQuery is used. A developer can write his/her own implementation of CustomScoreQuery in order to highlight different aspects of the search.

Implementing a CustomScoreQuery consists of two steps:

  • Extend the org.apache.lucene.queries.CustomScoreQuery class and override the method getCustomScoreProvider.

Listing 3.3 Extending org.apache.lucene.queries.CustomScoreQuery

public class SpecifiedScoreQuery extends CustomScoreQuery {

    public SpecifiedScoreQuery(Query subQuery) {
        super(subQuery);
    }

    @Override
    protected CustomScoreProvider getCustomScoreProvider(
            AtomicReaderContext context) throws IOException {
        return new SpecifiedScoreProvider(context);
    }
}

  • Extend the org.apache.lucene.queries.CustomScoreProvider class in order to provide a different scoring for the documents found at search time. In the implementation below, the document score is doubled for those documents that have the search phrase within their abstract zone (the abstract being the area defined by the first 100 distinct words of the document).

Listing 3.4 Extending org.apache.lucene.queries.CustomScoreProvider

public class SpecifiedScoreProvider extends CustomScoreProvider {
    // one provider is created per segment, so the reader is kept as an instance field
    private final AtomicReader atomicReader;

    public SpecifiedScoreProvider(AtomicReaderContext context) {
        super(context);
        atomicReader = context.reader();
    }

    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScores[])
            throws IOException {
        Document docAtHand = atomicReader.document(doc);
        String[] itemOrigin = docAtHand.getValues("text_leading");
        for (int counter = 0; counter < itemOrigin.length; counter++) {
            if (itemOrigin[counter] != null && doc < 1) {
                return 2.0f * subQueryScore;
            }
        }
        return subQueryScore;
    }
}

Listing 3.5 Building the Query

public static Query buildQuery(String searchPhrase,
        SearchIndexOptions options) throws ParseException, IOException {
    String[] fields = {ISearchConstants.FIELD_ABSTRACT_TEXT,
            ISearchConstants.FIELD_TEXT};
    return new MultiFieldQueryParser(Version.LUCENE_48, fields,
            options.getAnalyzer()).parse(searchPhrase);
}
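As a closing note, MultiFieldQueryParser can also accept a per-field boost map, which is an alternative, parse-time way to favor the abstract field. A hedged sketch (not the project's implementation, merely reusing the project's ISearchConstants field names):

public static Query buildBoostedQuery(String searchPhrase, Analyzer analyzer)
        throws ParseException {
    String[] fields = {ISearchConstants.FIELD_ABSTRACT_TEXT, ISearchConstants.FIELD_TEXT};
    Map<String, Float> boosts = new HashMap<String, Float>();
    boosts.put(ISearchConstants.FIELD_ABSTRACT_TEXT, 2.0f); // favor matches in the abstract
    boosts.put(ISearchConstants.FIELD_TEXT, 1.0f);
    return new MultiFieldQueryParser(Version.LUCENE_48, fields, analyzer, boosts)
            .parse(searchPhrase);
}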
