Searching With Hierarchical Fields Using Solr

By John Berryman · Jun. 04, 13

In our recent and continuing effort to make the world a better place, we have been working with the illustrious Waldo Jaquith on a project called StateDecoded. Basically, we’re making laws easily searchable and accessible by the layperson. Here, check out the predecessor project, Virginia Decoded. StateDecoded will be similar to the predecessor but with extended and matured search capabilities. And instead of just Virginia state law, with StateDecoded any municipality will be able to download the open source project, index their own laws, and give their citizens better visibility into the rules that govern them.

For this post, though, I want to focus on one of the good Solr riddlers that we encountered, related to the hierarchical nature of the documents being indexed. Laws are divided into sections, chapters, and paragraphs, and we have documents at every level. In our Solr index, this hierarchy is captured in a field labeled “section”. For instance, here are three examples of this section field:

  • <field name="section">30</field> – A document that contains information specific to section 30.
  • <field name="section">30.4</field> – A document that contains information specific to section 30, chapter 4.
  • <field name="section">30.4.15</field> – A document that contains information specific to section 30, chapter 4, paragraph 15.

And our goal for this field is that if anyone searches for a particular section of law, they will be given the law most specific to their request, followed by the laws that are less specific. For instance, if a user searches for “30.4”, then the results should contain the documents for section 30, section 30.4, section 30.4.15, section 30.4.16, etc., and the first result should be for 30.4. Other documents such as 40.4 should not be returned.

Initially we tackled the problem with the PathHierarchyTokenizerFactory.

<field name="section" type="path" indexed="true" stored="true"/>

and

<fieldType name="path" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." />
  </analyzer>
</fieldType>

Using this analyzer, here’s how several example documents get tokenized:

30      --> <30>
30.4    --> <30> <30.4>
30.4.15 --> <30> <30.4> <30.4.15>
30.4.16 --> <30> <30.4> <30.4.16>
30.5    --> <30> <30.5>
40.5    --> <40> <40.5>

And when anyone searches for a law, the same tokenization occurs. So if someone wants to look at law “30.4”, the query gets tokenized to <30> and <30.4>. In this case the first five documents above match. In particular, documents 30.4, 30.4.15, and 30.4.16 all have two matching tokens, while 30 and 30.5 each have one matching token. Additionally, because of field length normalization, a match in a shorter field (say 30.4, with two tokens) gets boosted higher than a match in a field with three tokens, such as 30.4.15 or 30.4.16.

All things considered, the resulting documents should come in the following order:

  • 30.4
  • 30.4.15
  • 30.4.16
  • 30
  • 30.5

BUT, as fate would have it, our perfect plan has a hole in it somewhere, and the documents come back in this order:

  • 30.4.15
  • 30.4
  • 30.4.16
  • 30
  • 30.5

So… what’s up?! (Solr experts reading this, do you know where our theory breaks down?)

Our theory is actually sound, but the problem lies in the practical matter of storing the field norm. Each field of every document is stored with an additional piece of information called the field norm. Ideally, the field norm would hold very specific information about how long a field is, but the more accurate the field norm is, the more space you need to store it. In the end, the Lucene developers decided to store the field norm as a single byte, which means the field norm can take on exactly 256 different values. And if you look at the Lucene TFIDFSimilarity class (the class responsible for dealing with field norms and scoring documents), you can see exactly where these byte values get translated into floating point values. There’s even a comment that gives a hint about what our problem might be:

The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7×10^9 to 2×10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

Because of this lack of precision, the field norm for fields of length 2 and length 3 is the same! … And that’s why we get the weird ordering. The first three documents have the same score and are therefore returned in index ordering.
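
If you want to see this on your own index, one quick check (not shown in the original post) is to add debugQuery=true to the request; Solr’s “explain” output then lists the decoded fieldNorm for each hit, so you can verify whether 30.4, 30.4.15, and 30.4.16 really do carry the same norm. The core name below is just a placeholder:

http://localhost:8983/solr/statedecoded/select?q=section:30.4&fl=section,score&debugQuery=true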

So, how do we fix this? Well, one way would be to implement our own Similarity class that uses a different byte-to-float conversion than the default. But one of the main goals for this particular project is to keep this installation of Solr as simple as possible so that anyone can set it up. Therefore, we found that an easier solution was to use a slightly different analysis technique. Consider the following schema fragments:

<field name="section_descendent" type="descendent_path" indexed="true" stored="true"/>
<field name="section_ancestor" type="ancestor_path" indexed="true" stored="true"/>
and

<fieldType name="descendent_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

<fieldType name="ancestor_path" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="." />
  </analyzer>
</fieldType>

In this case we’re splitting section up into two fields, section_descendent and section_ancestor, which are tokenized slightly differently than before. The difference is that in the previous example, the PathHierarchyTokenizer was used at both index and query time. But here, descendent_path uses the PathHierarchyTokenizer only at index time, and ancestor_path uses it only at query time. (The KeywordTokenizer tokenizes the entire field as a single token.) As a result, section_descendent will only match documents whose section is the queried section or one of its descendents: a query for 30.4 will match 30.4, 30.4.15, and 30.4.16, but not 30. Similarly, section_ancestor will only match documents whose section is the queried section or one of its ancestors: a query for 30.4 will match 30.4 and 30, but not 30.4.15.
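
One simple way to populate both fields, assuming incoming documents still carry the plain section value, is a pair of copyField directives in the schema:

<copyField source="section" dest="section_descendent"/>
<copyField source="section" dest="section_ancestor"/>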

The final piece of the puzzle is the request handler. Here we simply use the edismax query parser with qf (query fields) set to include both section_descendent and section_ancestor. Now, whenever someone queries for 30.4, section_descendent matches 30.4, 30.4.15, and 30.4.16, while section_ancestor matches 30.4 and 30. The only document that matches in both fields is 30.4, so it gets an extra boost and appears at the top of the search results, followed by the other documents. Problem solved!
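
A minimal handler definition along these lines in solrconfig.xml might look like the following (the handler name /laws is our placeholder; only defType and qf matter here):

<requestHandler name="/laws" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">section_descendent section_ancestor</str>
  </lst>
</requestHandler>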

Now it is interesting to note that we’ve lost sibling documents (for instance, 30.5 in this example). Personally, that seems fine to me, but if we really wanted them back, all we have to do is again include the original section field, path-tokenized at both index and query time, in the qf, as sketched below.
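
For example, assuming the original section field keeps the path type from the first attempt, the qf line would simply grow to three fields:

<str name="qf">section section_descendent section_ancestor</str>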


Published at DZone with permission of John Berryman, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
