Block Join Faceting: Introduction
The problem of searching structured data is addressed in Solr with a powerful, high performance, and robust solution: Block Join Query. Read on for more.
Every software application is created to bring business value. Typically, the software development process starts with understanding business requirements and creating a domain model. Such a model is very helpful when communicating with business stakeholders and makes it possible to clearly understand their needs and restrictions. Additionally, a simple and flexible domain model is a strong basis for creating an effective and extensible software architecture that meets the customer's requirements.
Normally, business modeling starts with identifying entities and the relationships between them. Relationships can be associations or compositions, and they have different cardinalities, e.g. one-to-one, one-to-many, and many-to-many. Relationships are so important that they are first-class citizens in relational databases and in the majority of data-related specifications and frameworks, such as JPA or Hibernate.
However, when we deal with search engines like Solr, we see that the domain models readily supported by the framework are quite simple. Each entity is represented as a document with some set of fields. That's it. It looks like Solr takes only basic steps toward supporting the full variety of possible relationships between indexed documents, leaving the rest to the application developer.
At the same time, for some business areas, support for relationships is very important. In particular, such relationships introduce new challenges to the problem of facet calculation. As an example, let's consider e-commerce platforms where each Product in the catalog has several so-called Stock Keeping Units (SKUs). Each SKU defines a different flavor of the same item. Even though customers purchase SKUs, i.e. a concrete flavor of the product, typical e-commerce businesses merchandise in terms of the product.
The screenshot above is taken from one of the online retailers. As we can see, a dress may come in blue, pink, or red, and for the blue dress, only sizes XS and S are available. However, for the seller and the customers, it's just a single product. So, when a customer navigates the site, she should see all SKUs belonging to the same product as a single product. This means that our facet counts should represent products, not SKUs. Thus, we need to find some approach to aggregate SKU-level facets into product-level ones.
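To make the difference between SKU-level and product-level counts concrete, here is a minimal Python sketch. The catalog rows, field names, and the two helper functions are hypothetical and exist only for illustration; this is not how Solr computes facets internally.

```python
from collections import defaultdict

# Hypothetical catalog: each SKU row carries the id of its parent product.
SKUS = [
    {"product": "Product_1", "color": "Blue", "size": "XS"},
    {"product": "Product_1", "color": "Blue", "size": "S"},
    {"product": "Product_1", "color": "Red", "size": "M"},
    {"product": "Product_2", "color": "Blue", "size": "M"},
]


def sku_level_counts(skus, field):
    """Count every SKU that carries a given facet value."""
    counts = defaultdict(int)
    for sku in skus:
        counts[sku[field]] += 1
    return dict(counts)


def product_level_counts(skus, field):
    """Count distinct parent products that own at least one SKU with the value."""
    products_per_value = defaultdict(set)
    for sku in skus:
        products_per_value[sku[field]].add(sku["product"])
    return {value: len(products) for value, products in products_per_value.items()}


print(sku_level_counts(SKUS, "color"))      # {'Blue': 3, 'Red': 1}
print(product_level_counts(SKUS, "color"))  # {'Blue': 2, 'Red': 1}
```

With this sample data, the 'Blue' facet shows 3 when counting SKUs but 2 when counting products, and the second number is the one a shopper browsing products expects to see.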
A pretty common solution here is to propagate properties from the SKU level to the product level and produce a single product document with multivalued fields aggregated from its SKUs. With this approach, our aggregated product will look as follows:
However, this approach creates the possibility of false positive matches on combinations of SKU-level fields. For example, if a customer decides to filter by color 'Blue' and size 'M', Product_1 will be considered a valid match, even though there is no SKU in the original catalog which is both 'Blue' and 'M'. This happens because, when we aggregate values from the SKU level, we lose the information about which value comes from which SKU. Even though this situation looks like an edge case, in a real-life application it can cause a really bad customer experience. Imagine a customer who searches for a particular item, filtering by colors and sizes, only to discover on the checkout page that no such item is available in the catalog. This really frustrates customers and negatively impacts customer loyalty, which is not good for the business.
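The false positive described above can be reproduced with a small Python sketch. The SKU rows are made-up sample data, chosen so that Product_1 has blue SKUs and a size M SKU but no SKU that is both, and the helper functions are hypothetical illustrations rather than Solr code.

```python
# Hypothetical SKUs of Product_1: no single SKU is both Blue and M.
SKUS = [
    {"product": "Product_1", "color": "Blue", "size": "XS"},
    {"product": "Product_1", "color": "Blue", "size": "S"},
    {"product": "Product_1", "color": "Red", "size": "M"},
]


def flatten(skus):
    """Aggregate SKU fields into one product document with multivalued fields."""
    doc = {"color": set(), "size": set()}
    for sku in skus:
        doc["color"].add(sku["color"])
        doc["size"].add(sku["size"])
    return doc


def matches_flattened(doc, color, size):
    """Each filter is checked independently against the aggregated values."""
    return color in doc["color"] and size in doc["size"]


def matches_per_sku(skus, color, size):
    """Both filters must hold for the same SKU."""
    return any(sku["color"] == color and sku["size"] == size for sku in skus)


product_doc = flatten(SKUS)
print(matches_flattened(product_doc, "Blue", "M"))  # True: false positive
print(matches_per_sku(SKUS, "Blue", "M"))           # False: no such SKU exists
```

The flattened document matches because 'Blue' and 'M' each appear somewhere in the product, while the per-SKU check correctly rejects the combination.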
Getting back to technology, this means that we should carefully preserve the catalog structure when searching and faceting products. The problem of searching structured data is already addressed in Solr with a powerful, high-performance, and robust solution: Block Join Query. We have written about this approach extensively on this blog.
However, the problem of faceting structured data requires further work. So, we created SOLR-5743 in February 2014 and have worked on it ever since. Now, we are happy to report that the first robust and high-performance implementation is committed to trunk.
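As a rough illustration of what such a search looks like, the sketch below builds a query string for Solr's block join parent query parser (`{!parent which=...}`), which returns parent documents whose child documents match the inner query. The field names `type`, `color`, and `size` and the helper function are assumptions made for this example; a real schema will differ, and the products and their SKUs must have been indexed together as blocks for the query to work.

```python
def block_join_parent_query(parent_filter, child_clauses):
    """Build a Solr block join query string that selects parent documents.

    parent_filter: a query matching ALL parent documents (assumed: type:product).
    child_clauses: field -> value pairs that a single child (SKU) must satisfy.
    """
    # Every clause is mandatory (+), so one child document must match them all.
    child_query = " ".join(
        f'+{field}:"{value}"' for field, value in child_clauses.items()
    )
    return f'{{!parent which="{parent_filter}"}}{child_query}'


query = block_join_parent_query("type:product", {"color": "Blue", "size": "M"})
print(query)  # {!parent which="type:product"}+color:"Blue" +size:"M"
```

Because both clauses are evaluated against the same child document, a product is returned only if one of its SKUs is both 'Blue' and 'M', which avoids the false positive of the flattened approach.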
We will describe the new BJQ faceting component and related algorithms in our next blog post. Stay tuned and happy faceting!
Published at DZone with permission of Oleg Savrasov, DZone MVB. See the original article here.