Searching relational content with Lucene's BlockJoinQuery

By Michael McCandless · Jan. 09, 12
Lucene's 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.

Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have a separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.

Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join.

If such content is so common then how do search applications handle it today?

One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form of text search, some even using Lucene.

If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.
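
To make the duplication concrete, here is a minimal sketch in plain Java, with no Lucene involved. The DenormalizeSketch class, its flatten method and the sample field values are hypothetical and exist only for this illustration; each flattened row stands in for one denormalized song document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: models up-front denormalization with plain maps, no Lucene.
class DenormalizeSketch {

    // Builds one flat row per child, copying every parent field into each row.
    static List<Map<String, String>> flatten(Map<String, String> parentFields,
                                             List<Map<String, String>> children) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Map<String, String> child : children) {
            // The parent (album/artist) fields are duplicated into every row.
            Map<String, String> row = new LinkedHashMap<>(parentFields);
            row.putAll(child);
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> album = new LinkedHashMap<>();
        album.put("artist", "Example Band");
        album.put("album", "Example Album");

        List<Map<String, String>> songs = new ArrayList<>();
        songs.add(Map.of("song", "First Track"));
        songs.add(Map.of("song", "Second Track"));

        // Each printed row repeats the artist and album fields.
        for (Map<String, String> row : flatten(album, songs)) {
            System.out.println(row);
        }
    }
}
```

Every row carries its own copy of the album and artist fields, which is where the wasted index space comes from when those fields are large.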

Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries.
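
The two-pass idea can be sketched in plain Java over in-memory lists. This is only a model of the approach, not Lucene code: the TwoPassJoinSketch class, its Song and Album types and both methods are hypothetical, and a simple substring test stands in for the real song query.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: a do-it-yourself join over two in-memory collections.
class TwoPassJoinSketch {

    static final class Song {
        final String title;
        final int albumId;
        Song(String title, int albumId) { this.title = title; this.albumId = albumId; }
    }

    static final class Album {
        final int id;
        final String name;
        Album(int id, String name) { this.id = id; this.name = name; }
    }

    // Pass 1: run the song query and gather the album IDs of every hit.
    static Set<Integer> albumIdsOfMatchingSongs(List<Song> songs, String titleTerm) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (Song s : songs) {
            if (s.title.contains(titleTerm)) ids.add(s.albumId);
        }
        return ids;
    }

    // Pass 2: the follow-on query; the ID set plays the role of the large OR'd list.
    static List<Album> albumsWithIds(List<Album> albums, Set<Integer> ids) {
        List<Album> hits = new ArrayList<>();
        for (Album a : albums) {
            if (ids.contains(a.id)) hits.add(a);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Song> songs = List.of(
            new Song("wolf howl", 1), new Song("wolf moon", 2), new Song("quiet night", 3));
        List<Album> albums = List.of(new Album(1, "A"), new Album(2, "B"), new Album(3, "C"));

        Set<Integer> ids = albumIdsOfMatchingSongs(songs, "wolf");
        System.out.println(ids.size() + " album IDs feed the second query");
        for (Album a : albumsWithIds(albums, ids)) System.out.println(a.name);
    }
}
```

The size of the ID set grows with the number of first-pass hits, which is why the follow-on query can become immense.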

Yet another approach is to use a software package that has already implemented one of these approaches for you! Elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way.

With BlockJoinQuery you can now directly search relational content yourself!

Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color).

Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color):


  • small, blue
  • small, black
  • medium, black
  • large, gray


Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query:
   name:wolf AND size=small AND color=blue
which should match this shirt. name is a shirt field while the size and color are SKU fields.

But if the user drills down instead on a small gray shirt:
   name:wolf AND size=small AND color=gray
then this shirt should not match because the small size only comes in blue and black.
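
The difference between the two queries is that both SKU constraints must hold on the same SKU. Here is a minimal plain-Java sketch of that requirement. The SkuMatchSketch class and its methods are hypothetical and this is not how Lucene evaluates queries; matchesFlattened models what could happen if all SKU values were simply merged into multi-valued fields on the shirt.

```java
import java.util.List;

// Hypothetical illustration of why size and color must match on the same SKU.
class SkuMatchSketch {

    static final class Sku {
        final String size;
        final String color;
        Sku(String size, String color) { this.size = size; this.color = color; }
    }

    // Join semantics: a single SKU must satisfy both constraints.
    static boolean matchesPerSku(List<Sku> skus, String size, String color) {
        for (Sku s : skus) {
            if (s.size.equals(size) && s.color.equals(color)) return true;
        }
        return false;
    }

    // Flattened semantics: the constraints may be satisfied by different SKUs.
    static boolean matchesFlattened(List<Sku> skus, String size, String color) {
        boolean sizeSeen = false;
        boolean colorSeen = false;
        for (Sku s : skus) {
            sizeSeen |= s.size.equals(size);
            colorSeen |= s.color.equals(color);
        }
        return sizeSeen && colorSeen;
    }

    public static void main(String[] args) {
        List<Sku> wolfShirt = List.of(
            new Sku("small", "blue"), new Sku("small", "black"),
            new Sku("medium", "black"), new Sku("large", "gray"));

        System.out.println(matchesPerSku(wolfShirt, "small", "blue"));    // true
        System.out.println(matchesPerSku(wolfShirt, "small", "gray"));    // false
        System.out.println(matchesFlattened(wolfShirt, "small", "gray")); // true, a false positive
    }
}
```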

How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block, with the SKUs first and the shirt as the last document in the block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents.
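
To see why adjacent document IDs matter, here is a simplified model in plain Java using java.util.BitSet. It assumes each block lists its SKUs first and ends with the shirt, so the parent of any child is the next parent bit at or after the child's document ID. The BlockLayoutSketch class and parentsOf method are hypothetical, and this is a sketch of the idea rather than Lucene's implementation.

```java
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of a document block: children first, parent last, adjacent docIDs.
class BlockLayoutSketch {

    // Maps each matching child docID to its parent docID:
    // the first set bit in parentBits at or after the child.
    static Set<Integer> parentsOf(int[] matchingChildDocs, BitSet parentBits) {
        Set<Integer> parents = new LinkedHashSet<>();
        for (int child : matchingChildDocs) {
            int parent = parentBits.nextSetBit(child);
            if (parent >= 0) parents.add(parent);
        }
        return parents;
    }

    public static void main(String[] args) {
        // Block 1: docs 0..3 are SKUs, doc 4 is their shirt.
        // Block 2: docs 5..6 are SKUs, doc 7 is their shirt.
        BitSet parentBits = new BitSet();
        parentBits.set(4);
        parentBits.set(7);

        // SKU hits at docs 0, 1 and 6 map up to shirts 4 and 7.
        System.out.println(parentsOf(new int[] {0, 1, 6}, parentBits)); // [4, 7]
    }
}
```

In this model the parent bit set plays the role of the Filter identifying the parent documents; no per-document lookup table is needed because the block layout itself encodes the relationship.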

To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter:
  Filter shirts = new CachingWrapperFilter(
                    new QueryWrapperFilter(
                      new TermQuery(
                        new Term("type", "shirt"))));
Create this filter once, up front and re-use it any time you need to perform this join.

Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:
  BooleanQuery skuQuery = new BooleanQuery();
  skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);
  skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);
Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space:
  BlockJoinQuery skuJoinQuery = new BlockJoinQuery(
    skuQuery, 
    shirts,
    ScoreMode.None);
The ScoreMode enum decides how scores for multiple SKU hits are aggregated into the score for the corresponding shirt hit. This query doesn't need scores from the SKU matches, but if it did, you could aggregate with Avg, Max or Total instead.
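
As a rough picture of what the four modes mean, here is a plain-Java sketch that aggregates a shirt's child scores. The ScoreModeSketch class, its Mode enum and the aggregate method are hypothetical stand-ins, not Lucene's code; in particular, returning 0 for NONE is just this sketch's choice.

```java
// Hypothetical illustration of aggregating child (SKU) scores into one parent (shirt) score.
class ScoreModeSketch {

    enum Mode { NONE, AVG, MAX, TOTAL }

    static float aggregate(float[] childScores, Mode mode) {
        // Assumption for this sketch: NONE (or no children) yields a score of 0.
        if (mode == Mode.NONE || childScores.length == 0) return 0f;

        float total = 0f;
        float max = Float.NEGATIVE_INFINITY;
        for (float s : childScores) {
            total += s;
            max = Math.max(max, s);
        }
        switch (mode) {
            case AVG: return total / childScores.length;
            case MAX: return max;
            default:  return total; // TOTAL
        }
    }

    public static void main(String[] args) {
        float[] skuScores = {1f, 3f};
        for (Mode m : Mode.values()) {
            System.out.println(m + " -> " + aggregate(skuScores, m));
        }
    }
}
```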

You are now free to build up an arbitrary shirt query using skuJoinQuery as a clause:
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);
  query.add(skuJoinQuery, Occur.MUST);
You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields.

Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector:
  BlockJoinCollector c = new BlockJoinCollector(
    Sort.RELEVANCE, // sort
    10,             // numHits
    true,           // trackScores
    false           // trackMaxScore
    );
  searcher.search(query, c);
The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:
  TopGroups hits = c.getTopGroups(
    skuJoinQuery,
    skuSort,
    0,   // offset
    10,  // maxDocsPerGroup
    0,   // withinGroupOffset
    true // fillSortFields
  );
Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true then each SKU hit will have values for the fields from skuSort.
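
One way to picture the three paging parameters is the plain-Java model below, which pages over groups that have already been collected. The GroupPagingSketch class and its page method are hypothetical, and the exact way Lucene combines withinGroupOffset with maxDocsPerGroup may differ from this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of paging over grouped hits (SKUs grouped by shirt).
class GroupPagingSketch {

    // Skips the first `offset` groups; within each remaining group, skips
    // `withinGroupOffset` docs and keeps at most `maxDocsPerGroup` of the rest.
    static <T> List<List<T>> page(List<List<T>> groups, int offset,
                                  int maxDocsPerGroup, int withinGroupOffset) {
        List<List<T>> out = new ArrayList<>();
        for (int g = offset; g < groups.size(); g++) {
            List<T> docs = groups.get(g);
            int from = Math.min(withinGroupOffset, docs.size());
            int to = Math.min(from + maxDocsPerGroup, docs.size());
            out.add(new ArrayList<>(docs.subList(from, to)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> skusByShirt = List.of(
            List.of("small/blue", "small/black", "medium/black"),
            List.of("large/gray"));

        // Second page of shirts (offset 1), up to 10 SKUs each.
        System.out.println(page(skusByShirt, 1, 10, 0));
        // First page of shirts, skipping the first SKU and keeping at most 2.
        System.out.println(page(skusByShirt, 0, 2, 1));
    }
}
```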

The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt.

You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).

However, there are some important limitations of index-time joins:


  • The join must be computed at index-time and "compiled" into the index, in that all joined child documents must be indexed along with the parent document, as a single document block.
  • Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices.
  • If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases, for example if you index "user reviews" as child documents then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews.
  • There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields.
  • The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist.
  • The join is a one (parent) to many (children), inner join.


As usual, patches are welcome!

There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.

Source:  http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html


