How to Handle Stop Words in Hibernate Search 5.5.2 / Apache Lucene 5.4.x

Make sure that stop words aren't an issue when indexing and querying your database with Hibernate Search.

By Sumith Puri · May. 06, 16 · Big Data Zone · Tutorial

Lucene's default English stop words are ["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]. Their presence in the terms, database columns, or files that Lucene indexes or searches can lead to any of the following:

1. Stop Words being Ignored/Filtered during the Lucene Indexing Process

2. Stop Words being Ignored/Filtered during the Lucene Querying Process

3. No Result for Queries that Include, Start With or End With any Stop Word

The way to handle stop words during both indexing and searching is as follows. The method explained here is especially suitable if you are using Hibernate Search 5.5.2, which in turn uses Apache Lucene 5.3.x/5.4.x.

1. Define Your Custom Analyzer, Adapted From the Standard Analyzer

You need to include only two filters, 'LowerCaseFilterFactory' and 'StandardFilterFactory', in the analyzer definition alongside the tokenizer. The filter factory deliberately left out here is 'StopFilterFactory'. Without it, stop words are indexed like any other English word.

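import javax.persistence.Entity;
import javax.persistence.Table;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.standard.StandardFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

@Entity
@Indexed
@Table(name="table_name", catalog="catalog_name")
// No StopFilterFactory is listed, so stop words are indexed like any other term.
@AnalyzerDef(name = "fedexTextAnalyzer",
   tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
   filters = {
     @TokenFilterDef(factory = LowerCaseFilterFactory.class),
     @TokenFilterDef(factory = StandardFilterFactory.class)
})
// ... the entity class declaration follows these annotations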
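
To make the effect of leaving out the stop filter concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class `StopWordModel`, the method `analyze`, and the shortened stop word set are all hypothetical, and the tokenizing rule (split on non-alphanumeric characters, then lower-case) is only a rough stand-in for a real analyzer chain. It shows the same text analyzed with and without a stop word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordModel {

    // A few of the default English stop words (shortened for illustration).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "for", "of", "the", "to", "with"));

    // Simplified stand-in for an analyzer chain: split on non-alphanumeric
    // characters, lower-case each token, then drop tokens found in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{Alnum}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "Computer Science Books for Beginners";
        // With a stop word set, "for" never reaches the index.
        System.out.println(analyze(title, STOP_WORDS));
        // prints [computer, science, books, beginners]

        // With an empty stop word set, "for" is kept as an ordinary term.
        System.out.println(analyze(title, Collections.<String>emptySet()));
        // prints [computer, science, books, for, beginners]
    }
}
```

The second call models the custom analyzer defined above: because nothing is filtered, a stop word such as "for" ends up in the index like any other term.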


2. Mark the Field With Relevant Annotations (@Analyzer on @Field)

Along with the @Field annotation on each entity property that maps a table column, reference the analyzer defined above through @Analyzer.

@Column(name="fedex_cs_product_name", nullable=false, length=100) 
@Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO, 
       analyzer=@Analyzer(definition = "fedexTextAnalyzer")) 
public String getFedexCsItemName() {    
  return this.fedexCsItemName;
}


3. Use WhitespaceAnalyzer to Query So That Stop Words Are 'Processed' By Default

Although the official documentation says to use 'StandardAnalyzer' with CharArraySet.EMPTY_SET passed as the stop word set, I found that the query still did not retrieve any results. On inspecting it with Luke, I found that in queries such as 'Computer Science Books for Beginners', the 'for' was being ignored. When I replaced it with WhitespaceAnalyzer, I found that the query works for all 'Stop Words' and all 'Cases'.
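
If Luke is not at hand, the terms an analyzer produces for a query string can also be printed directly. The following is a minimal sketch, assuming the Lucene 5.x core and analyzers-common jars are on the classpath. The class name `QueryAnalysisCheck` and the field name "productName" are placeholders, not part of the original setup. It runs the sample query through both analyzers mentioned above so that their output can be compared.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class QueryAnalysisCheck {

    // Runs the text through the analyzer and collects the emitted terms.
    static List<String> terms(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        // The field name is only a label here; "productName" is a placeholder.
        try (TokenStream ts = analyzer.tokenStream("productName", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String query = "Computer Science Books for Beginners";
        try (Analyzer whitespace = new WhitespaceAnalyzer();
             Analyzer standardNoStops = new StandardAnalyzer(CharArraySet.EMPTY_SET)) {
            // Compare the terms each analyzer would hand to the query.
            System.out.println("Whitespace: " + terms(whitespace, query));
            System.out.println("Standard (no stop words): " + terms(standardNoStops, query));
        }
    }
}
```

Printing the terms this way makes it easy to check whether a stop word such as 'for' survives query-time analysis before the query is executed against the index.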

I have found that the above is the best/minimal way to fix this issue. Also, our QA has verified that it works for all 'Stop Word' cases! Hope this helps you.

Published at DZone with permission of Sumith Puri. See the original article here.
