Lucene 'MoreLikeThis' Example Code

by Mitch Pronschinske · Oct. 18, 11 · Interview

We've all seen how important it is for sites like StackOverflow and DZone Links to prevent duplicate questions or links.  Lucene is well equipped to handle this sort of problem using its 'MoreLikeThis' feature.  The following is a post from Mark Shead's blog that gives a good use case and an example of where this feature can be used.

I was recently working on a simple application where the user enters famous quotations.  Obviously we want to avoid duplicates, so I needed a way to check for substantially similar quotations before a new quote was added to the database.

The idea was to show the top 5 most similar quotes before letting the user save the new quotation to the db. I used Lucene for this, which allowed me to punt on the more difficult task of figuring out whether two quotes were similar. I left that up to Lucene and only had to worry about how to get my information in and out of Lucene in a usable manner.
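To make "figuring out whether two quotes were similar" a little more concrete, here is a minimal sketch in plain Java with no Lucene dependency. It is not Lucene's algorithm, only a toy stand-in for the general idea: split each quote into terms, give rarer shared terms more weight, and return the best-scoring quotes first. The class name `QuoteSimilaritySketch`, the method names, and the scoring formula are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for "find the most similar quotes"; NOT how Lucene scores documents. */
class QuoteSimilaritySketch {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns up to 'limit' indexes into 'quotes', best match first; non-matching quotes are left out. */
    static List<Integer> mostSimilar(String candidate, List<String> quotes, int limit) {
        // Document frequency: in how many quotes does each term appear?
        Map<String, Integer> docFreq = new HashMap<>();
        List<Set<String>> termSets = new ArrayList<>();
        for (String q : quotes) {
            Set<String> terms = new HashSet<>(tokenize(q));
            termSets.add(terms);
            for (String t : terms) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }

        Set<String> queryTerms = new HashSet<>(tokenize(candidate));
        double[] scores = new double[quotes.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < quotes.size(); i++) {
            for (String t : queryTerms) {
                if (termSets.get(i).contains(t)) {
                    // A shared term that few quotes contain counts for more
                    // than one that nearly every quote contains.
                    scores[i] += Math.log(1.0 + (double) quotes.size() / docFreq.get(t));
                }
            }
            if (scores[i] > 0) {
                hits.add(i);
            }
        }

        // Highest score first, then cut the list down to the requested size.
        hits.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return new ArrayList<>(hits.subList(0, Math.min(limit, hits.size())));
    }
}
```

Calling `QuoteSimilaritySketch.mostSimilar(newQuoteText, existingQuotes, 5)` would give the positions of up to five candidates to show the user. Lucene does the same job with real analyzers, an inverted index, and a tuned scoring model, which is why handing the problem to it is attractive.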

Below is the interesting method that uses Lucene to build an index of all the quotes in the system and then returns the five quotes that are most similar to the new quote text.  Obviously creating a new index each time a quote is added isn’t particularly efficient, but it makes it easier to demonstrate how it works, and processor efficiency isn’t much of an issue with this particular task.

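//Imports this method needs (Lucene 3.1; MoreLikeThis ships in the
//contrib "queries" jar). Quote, session, logger and the quote field
//come from the surrounding class and are not shown here.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public List<Quote> getSimilarQuotes() throws CorruptIndexException, IOException {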
 
    String quoteText = quote.getText();
    logger.info("creating RAMDirectory");
    RAMDirectory idx = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    IndexWriter writer = new IndexWriter(idx, indexWriterConfig);
 
    List<Quote> quotes =  session.createCriteria(Quote.class).list();
 
    //Create a Lucene document for each quote and add it to the
    //RAMDirectory index.  We include the db id so we can retrieve the
    //similar quotes from the database before returning them to the client.
    for (Quote quote : quotes) {
        Document doc = new Document();
        doc.add(new Field("contents", quote.getText(),Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("id", quote.getId().toString() ,Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
 
    //We are done writing documents to the index at this point
    writer.close();
 
    //Open the index and search it through the same reader
    IndexReader ir = IndexReader.open(idx);
    logger.info("ir has " + ir.numDocs() + " docs in it");
    IndexSearcher is = new IndexSearcher(ir);
 
    MoreLikeThis mlt = new MoreLikeThis(ir);
 
    //Lower some settings so MoreLikeThis will work with very short
    //quotations
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);
 
    //We need a Reader to create the Query so we'll create one
    //using the string quoteText.
    Reader reader = new StringReader(quoteText);
 
    //Create the query that we can then use to search the index
    Query query = mlt.like( reader);
 
    //Search the index using the query and get the top 5 results
    TopDocs topDocs = is.search(query,5);
    logger.info("found " + topDocs.totalHits + " topDocs");
 
    //Create a list to hold the quotes we are going to
    //pass back to the client
    List<Quote> foundQuotes = new ArrayList<Quote>();
    for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
        //This retrieves the actual Document from the index using
        //the document number (scoreDoc.doc is an int that is the
        //doc's id in the index, not the database id).
        Document doc = is.doc( scoreDoc.doc );
 
        //Get the id that we previously stored in the document from
        //hibernate and parse it back to a long.
        String idField =  doc.get("id");
        long id = Long.parseLong(idField);
 
        //Retrieve the quote from Hibernate so we can pass
        //back a list of actual Quote objects.
        Quote thisQuote = (Quote)session.get(Quote.class, id);
 
        //Add the quote to the list we'll pass back to the client
        foundQuotes.add(thisQuote);
    }
 
    //Release the searcher and the reader now that we are done with them
    is.close();
    ir.close();

    return foundQuotes;
}
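
The two setter calls in the middle of the method matter more than they might seem to. As far as I know, MoreLikeThis by default skips any term that appears fewer than 2 times in the source text or in fewer than 5 indexed documents, and with short quotations and a small index that can leave no terms to query with. The sketch below imitates only that filtering step in plain Java so the effect is easy to see; the class `TermThresholdSketch`, its method, and the document-frequency numbers in the usage note are hypothetical and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Imitates the minTermFreq / minDocFreq filter only; not Lucene's actual code. */
class TermThresholdSketch {

    /**
     * Returns the terms of 'text' that occur at least minTermFreq times in the
     * text itself and in at least minDocFreq documents according to 'docFreq'.
     */
    static Set<String> selectTerms(String text, Map<String, Integer> docFreq,
                                   int minTermFreq, int minDocFreq) {
        // Count how often each term occurs in the new text.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                termFreq.merge(t, 1, Integer::sum);
            }
        }

        // Keep only the terms that pass both thresholds (sorted for readability).
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (e.getValue() >= minTermFreq && df >= minDocFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

Take the text "I think, therefore I am" and made-up document frequencies of i=3, think=1, therefore=1, am=2. With thresholds of 2 and 5, nothing survives: "i" occurs twice in the text but in only 3 documents, and every other term occurs once. With both thresholds set to 1, as in the method above, all four terms are kept.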

    Source: http://blog.markwshead.com/966/lucene-morelikethis-example-code/