Block Join Faceting: Implementation
In the previous post, we talked about the business motivation behind the support of structured documents in Solr/Lucene index and unique requirements to the faceting engine which is created by such an approach to modeling data. We introduced SOLR-5743, and now it is time to take a deep dive into implementation details.
Join the DZone community and get the full member experience.
Join For FreeIn the previous post, we talked about the business motivation behind the support of structured documents in Solr/Lucene index and unique requirements to the faceting engine which is created by such an approach to modeling data. We introduced SOLR-5743, and now it is time to take a deep dive into implementation details.
First, let us recap how Solr deals with structured data. Solr supports the search of hierarchical documents using Block Join Query (BJQ). Using this query requires a special way of indexing documents. BJQ relies on special positioning of the documents in the index: all documents belonging to same hierarchy have to be indexed together, starting from child documents followed by their parent document. BJQ works as a bridge between levels of document hierarchy, e.g. it transforms matches on child documents to the matches on parent documents. When we search using BJQ, we provide a child query and a parent filter as parameters. Child query represents what we are looking for among child documents, and parent filter tells BJQ how to distinguish parent documents from child documents in the index. For each matched child document, BJQ scans ahead in the index until it finds the nearest parent document, which is sent into the collector chain instead of the child document. This trick of relying on the relative document positioning in the index, or “index-time join”, is the secret behind the high performance of BJQ.
Now, we are ready to look into faceting of BJQ results. We will use an example of an eCommerce catalog with Product-SKU parent-child relationship for our modeling example.
The main idea is quite straightforward. We consider each hierarchy of matched documents separately. As we are using BJQ, each hierarchy is represented in the index as a document block, or DocSet Slice as we call it.
First, we calculate facets based on matched SKUs from our block. Then, we aggregate obtained SKU counts into Product-level facet counts increasing product level facet count by only 1 for every matched block, irrespective to the number of matched SKUs within the block. For example, if we are searching by COLOR:Blue, even though two Blue SKU were found within a block, aggregated Product-level counts will be increased only by 1.
This solution is implemented inside BlockJoinFacetComponent which extends standard SolrSearchComponent. BlockJoinFacetComponent validates the query and injects specialBlockJoinFacetCollector into Solr post-filter collectors chain. When BlockJoinFacetCollector is invoked, it receives a parent document, since the BlockJoinFacetComponent ensures that onlyToParentBlockJoinQuery is allowed as a top-level query.
Now, to adjust facet counts based on this parent, we need to identify the children documents which matched within the current block. To do that we request BlockJoinScorer to trackPendingChildHits() and to swapChildDocs(). Our next step is to calculate facets on the obtained doc slice. Initially, we just invoked standard faceting routine e.g. employed DocValuesFacet.getCounts() providing it DocSet representing matched child documents from current index block. Finally, facet counts calculated on slice were aggregated into global ones counting multiple hits on the child documents as one hit on parent document.
We developed randomized test case BlockJoinFacetRandomTest to ensure functional correctness of this implementation. This test checks facet calculation with different combinations of Products, SKUs, and random filters.
Also, BlockJoinFacetComponent was extended to work with Solr Cloud and supplied withBlockJoinFacetDistribTest as well.
All looked good except one issue: faceting was relatively slow.
Profiling showed that utility DocValuesFacet.getCounts() was developed with the assumption that it should be called only once for each particular field. However, with BlockJoinFacetComponent, we invoked this method for each matched parent document. As a result, lots of work was unnecessarily repeated many times, such as: determining field type, allocating in-memory objects, determining top and segment SortedSetDocValues, migration from segment facet counts to global ones. Thus, the first step of performance optimization was the introduction of BlockJoinFieldFacetAccumulator which is responsible for accumulating facet counts for a particular field. It performs all mentioned tasks during initialization or switching to next index reader. Those improvements showed 25 x performance gain in local benchmarks.
Next improvement was inspired by Reverse Nested Aggregation Elastic Search solution. Suppose, we need to calculate facets for some DocValues field. For each segment, we know SortedSetDocValues of this field, which contains unique values of this field. For each of those values, we need to track a facet count, where each term has corresponding long value positioned in the parallel array of facet counters. Here we use another trick to save memory and ensure CPU cache locality: each long value in our facet counter array combines aggregated (parent-level) facet count for particular term (highest 4 bytes) and position of most recently matched parent document which has a block with hit for this term (lowest 4 bytes).
When BlockJoinFieldFacetAccumulator is requested to update facet counts with a particular parent document and an array of matched child documents, we:
- iterate over matched children documents;
- for each child document we identify terms contained in the document
- for each term we find a corresponding array entry long and check if its encoded parent position matches the provided parent document. If it matches this mean that we have a duplicated hit within the same block, so we just ignore it and proceed with iteration over terms. If positions do not match, we have a first hit in a new block and an encoded long value is updated to increment facet value counter and update the last seen parent position.
Thus, when all matched parent documents are processed, higher halves of the long values in accumulator array will contain required facet counts. This optimization delivered 2x performance boost.
The third optimization was suggested by Mikhail Khludnev. He proposed to calculate doc slice as an intersection of all matched child documents with all children of the current parent document. This approach is implemented as a separate BlockJoinDocSetFacetComponent. On the local benchmark, it’s faster thanBlockJoinFacetComponent by 40%, but we believe that exact circumstances in which one component is better than another requires further analysis. We recommend you experiment with macro benchmarks using both implementations.
Components described in this blog post are committed into Solr codebase with SOLR-5743 in December 2015 and are available since Solr 5.5. Corresponding documentation is available at Solr Wiki . In order to utilize one of the BJQ faceting components, you need to configure them in solrconfig.xml and introduce appropriate search handlers, for example:
<searchComponent name="blockJoinFacet"
class="org.apache.solr.search.join.BlockJoinFacetComponent"/>
<requestHandler name="/blockJoinFacetRH"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="shards.qt">blockJoinFacetRH</str>
</lst>
<arr name="last-components">
<str>blockJoinFacet</str>
</arr>
</requestHandler>
<searchComponent name="blockJoinDocSetFacet"
class="org.apache.solr.search.join.BlockJoinDocSetFacetComponent"/>
<requestHandler name="/blockJoinDocSetFacetRH"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="shards.qt">blockJoinDocSetFacetRH</str>
</lst>
<arr name="last-components">
<str>blockJoinDocSetFacet</str>
</arr>
</requestHandler>
Since only docValues fields could be used for BJQ faceting, you need to update the corresponding fields configuration in schema.xml file, for example:
<field name="COLOR_s" type="string" indexed="true"
stored="true" docValues="true"/>
<field name="SIZE_s" type="string" indexed="true"
stored="true" docValues="true"/>
Then, after indexing some set of hierarchical documents like:
<doc>
<field name="id">10</field>
<field name="type_s">parent</field>
<field name="BRAND_s">Nike</field>
<doc>
<field name="id">11</field>
<field name="type_s">child</field>
<field name="COLOR_s">Red</field>
<field name="SIZE_s">XL</field>
</doc>
<doc>
<field name="id">12</field>
<field name="type_s">child</field>
<field name="COLOR_s">Blue</field>
<field name="SIZE_s">XL</field>
</doc>
</doc>
you need to pass required ToParentBlockJoinQuery to the configured request handler and request calculating facets using child.facet.field parameter, for example:
If everything is configured correctly, then the Solr response should look like:
{
"responseHeader":{
"status":0,
"QTime":82},
"response":{"numFound":1,"start":0,"docs":[{
"id":"10",
"type_s":["parent"],
"BRAND_s":["Nike"],
"_version_":1526159505694392320}]
}, "facet_counts":{
"facet_queries":{},
"facet_fields":{
"COLOR_s":[
"Blue",1,
"Red",1],
"SIZE_s":[
"XL",1]},
"facet_dates":{},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}
}
}
You may also consider an alternative way for calculating facets on hierarchical product structures. Solr JSON Facet API introduced by Yonik Seeley allows to solve the same task in a bit of a different way. This is a new and really powerful feature appeared in Solr 5.4. It provides a very flexible way to control the facets to be calculated. Particularly, it’s possible to calculate facets on child documents with aggregation by unique _root_ field. For example, for request:
curl http://localhost:8983/solr/collection1/query -d
'q=(type_s:child)&rows=0&
json.facet={
colors:{
type : terms,
field : COLOR_s,
facet: {
productsCount: "unique(_root_)"
}
},
sizes:{
type : terms,
field : SIZE_s,
facet: {
productsCount: "unique(_root_)"
}
}
}
'
Solr returns a response that contains expected counts (check productCount field):
"facets":{
"count":2,
"colors":
{
"buckets":[
{
"val":"Blue",
"count":1,
"productsCount":1
},
{
"val":"Red",
"count":1,
"productsCount":1
}
]
},
"sizes":
{
"buckets":
[
{
"val":"XL",
"count":2,
"productsCount":1
}
]
}
}
This way has obvious advantages: it’s easier to programmatically control which facets are to be calculated and how they should be presented in response, includes statistics, nested facet commands, etc. The drawback here is that the search query in the above example, i.e. q=(type_s:child), yields child documents, while facets are calculated on parent ones. So, to present parent documents both in search results and in calculated facets, we need to execute 2 separate requests. Another reason to choose one of the BlockJoinFacetComponents is for much better performance. On my local test data BlockJoinDocSetFacetComponent is about twice as fast as JSON Facet API.
BlockJoinFacetComponent can be further improved by replacing iteration over matched child documents with bitwise operations over whole DocSets. The main idea here is to be able to obtain for each term a DocSet of matched parents with particular term in child documents. In this case, facet count of the term should correspond to the size of this DocSet.
Happy (block) faceting!
Published at DZone with permission of Oleg Savrasov, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Turbocharge Ab Initio ETL Pipelines: Simple Tweaks for Maximum Performance Boost
-
Why I Prefer Trunk-Based Development
-
Java Concurrency: Condition
-
Software Development: Best Practices and Methods
Comments