Lucene 'MoreLikeThis' Example Code
We've all seen how important it is for sites like StackOverflow and DZone Links to prevent duplicate questions or links. Lucene is well equipped to handle this sort of problem using its 'MoreLikeThis' feature. The following is a post from Mark Shead's blog that gives you a good use case and an example of where this feature can be used.
I was recently working on a simple application where the user will enter famous quotations.
Obviously we want to avoid duplicates so I needed a way to check for
quotations that were substantially similar before a new quote was added
to the database.
The idea was to show the top 5 most similar quotes before letting the user save the new quotation to the database. I used Lucene for this, which allowed me to punt on the more difficult task of figuring out whether two quotes were similar. I left that up to Lucene and only had to worry about how to get my information in and out of Lucene in a usable manner.
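To make "substantially similar" a little more concrete, here is a minimal sketch of the kind of ranking involved, written in plain Java with no Lucene dependency. It is not how Lucene scores documents; it assumes a much cruder measure (Jaccard overlap between the sets of words in two quotes), and the class and method names (`NaiveQuoteSimilarity`, `mostSimilar`) are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a naive word-overlap ranking, NOT Lucene's scoring.
class NaiveQuoteSimilarity {

    // Lower-case the text and split it into a set of distinct words.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: shared words divided by all distinct words in either set.
    // Ranges from 0.0 (nothing in common) to 1.0 (identical word sets).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Return up to `limit` existing quotes, most similar to `candidate` first.
    static List<String> mostSimilar(String candidate, List<String> existing, int limit) {
        Set<String> candidateWords = words(candidate);
        List<String> ranked = new ArrayList<>(existing);
        ranked.sort(Comparator.comparingDouble(
                (String q) -> jaccard(candidateWords, words(q))).reversed());
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList(
                "The only thing we have to fear is fear itself.",
                "I have a dream.",
                "Ask not what your country can do for you.");
        for (String quote : mostSimilar("We have nothing to fear but fear itself", existing, 2)) {
            System.out.println(quote);
        }
    }
}
```

A measure this simple treats every word as equally important, so common words like "the" count as much as rare ones. Handing the problem to Lucene avoids having to solve that by hand.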
Below is the interesting method that uses Lucene to build an index of
all the quotes in the system and then returns the five quotes that are
most similar to the new quote text. Obviously creating a new index each
time a quote is added isn’t particularly efficient, but it makes the
example easier to follow, and processor efficiency isn’t much of an
issue with this particular task.
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.similar.MoreLikeThis;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // Note: quote, session (Hibernate) and logger are fields of the enclosing class.
    public List<Quote> getSimilarQuotes() throws CorruptIndexException, IOException {
        String quoteText = quote.getText();

        logger.info("creating RAMDirectory");
        RAMDirectory idx = new RAMDirectory();
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(
                Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        IndexWriter writer = new IndexWriter(idx, indexWriterConfig);

        List<Quote> quotes = session.createCriteria(Quote.class).list();

        // Create a Lucene document for each quote and add them to the
        // RAMDirectory index. We include the db id so we can retrieve the
        // similar quotes before returning them to the client.
        for (Quote quote : quotes) {
            Document doc = new Document();
            doc.add(new Field("contents", quote.getText(),
                    Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("id", quote.getId().toString(),
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        // We are done writing documents to the index at this point
        writer.close();

        // Open the index
        IndexReader ir = IndexReader.open(idx);
        logger.info("ir has " + ir.numDocs() + " docs in it");
        IndexSearcher is = new IndexSearcher(idx, true);

        MoreLikeThis mlt = new MoreLikeThis(ir);

        // Lower some settings so MoreLikeThis will work with very short
        // quotations
        mlt.setMinTermFreq(1);
        mlt.setMinDocFreq(1);

        // We need a Reader to create the Query so we'll create one
        // using the string quoteText.
        Reader reader = new StringReader(quoteText);

        // Create the query that we can then use to search the index
        Query query = mlt.like(reader);

        // Search the index using the query and get the top 5 results
        TopDocs topDocs = is.search(query, 5);
        logger.info("found " + topDocs.totalHits + " topDocs");

        // Create a list to hold the quotes we are going to
        // pass back to the client
        List<Quote> foundQuotes = new ArrayList<Quote>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // This retrieves the actual Document from the index using
            // the document number (scoreDoc.doc is an int that is the
            // doc's id).
            Document doc = is.doc(scoreDoc.doc);

            // Get the id that we previously stored in the document from
            // Hibernate and parse it back to a long.
            String idField = doc.get("id");
            long id = Long.parseLong(idField);

            // Retrieve the quote from Hibernate so we can pass
            // back a List of actual Quote objects.
            Quote thisQuote = (Quote) session.get(Quote.class, id);

            // Add the quote to the list we'll pass back to the client
            foundQuotes.add(thisQuote);
        }

        // Release the searcher and reader now that we have our results
        is.close();
        ir.close();

        return foundQuotes;
    }
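The reason for lowering `setMinTermFreq` and `setMinDocFreq` is easier to see with a small experiment. The sketch below is plain Java, not Lucene code, and only models one of the filters: keeping words that occur at least a minimum number of times in the input text. It assumes, to the best of my knowledge, that MoreLikeThis defaults to a minimum term frequency of 2; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a minimum-term-frequency filter.
// Lucene's MoreLikeThis applies further criteria that are not modeled here.
class TermFrequencyFilter {

    // Count how many times each lower-cased word occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only the words that occur at least minTermFreq times, sorted alphabetically.
    static Set<String> termsWithMinFrequency(String text, int minTermFreq) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : termFrequencies(text).entrySet()) {
            if (entry.getValue() >= minTermFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String quote = "I have a dream.";
        // With a threshold of 2, no word in this short quote qualifies.
        System.out.println(termsWithMinFrequency(quote, 2));
        // With a threshold of 1, every word qualifies.
        System.out.println(termsWithMinFrequency(quote, 1));
    }
}
```

In a one-sentence quotation almost no word repeats, so a threshold above 1 can leave nothing to build a query from. That is the situation the two setter calls in the method are meant to avoid.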
Source: http://blog.markwshead.com/966/lucene-morelikethis-example-code/