We recently assisted a leading UK media publisher with a search relevancy issue. Their e-commerce package used the Apache Solr search service, but the implementation suffered from a few common problems. On top of these were further challenges caused by the umbrella package and, in particular, how it abstracted Solr.
The first iteration needed to focus on quick wins that didn't involve 'rip & replace' of portions of the e-commerce framework code.
In summary, the issues that were identified were:
- Search response efficiency
- Poor relevancy (using standard search handler without phrase matches or boosting)
- Using the default example configuration
- Not using synonyms to match aliases or catch common typos
- Over-reliance on dynamic fields
- Clumsily hiding SolrJ
- Not staying up to date
Let's have a look at the issues first and then explore ways in which they can be tackled.
1. Search response efficiency
The most efficient way to use a search engine is to deliver the results straight from the index (with appropriate styling applied). Therefore, you should store the fields you need for the results and, additionally, for each query type, you should specify the fields that you want returned using the field list (fl) parameter.
In this case, the application was storing almost all the fields and not specifying which fields it required. To make matters worse, it only really required a 'pk' field back as it was performing a database read over the resulting primary keys.
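As a rough illustration of the field list parameter, the sketch below builds a select URL that asks Solr for the 'pk' field only. The host, port and path in SOLR_SELECT_URL and the helper name build_pk_only_query are assumptions made for the example; only the fl parameter and the 'pk' field come from the case described above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port and path for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_pk_only_query(user_query, rows=10):
    """Return a select URL that asks Solr to return the 'pk' field only."""
    params = {
        "q": user_query,
        "fl": "pk",   # field list: nothing but the primary key comes back
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_pk_only_query("harry potter"))
```

If the results page can be rendered straight from the index, the same parameter would instead list the stored display fields, and the database read disappears altogether.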
2. Using standard search handler without phrase matches or boosting
First, there is nothing fundamentally wrong with the standard search handler. However, in a lot of cases the dismax handler (or edismax in newer releases) is a better choice for user queries. In this implementation the poor relevancy for multi-word queries was primarily a result of not using phrase matches.
If a user entered 'Harry Potter', the application was generating an OR clause such as (name_text_en:Harry Potter).
If we ignore the analysis applied to the field, what the query parser actually constructs in this case depends upon the defaultSearchField setting in schema.xml. Only the first term is bound to name_text_en; the second falls through to the default field, e.g. name_text_en:Harry text:Potter
The relevancy was chiefly addressed by using phrase matching, with slop (term proximity) where appropriate, in conjunction with boosting.
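To make the phrase, slop and boost combination concrete, here is a minimal sketch of two ways it might be expressed: an explicit phrase clause for the standard query parser, and a set of dismax request parameters. The helper names and the boost and slop values are illustrative assumptions, not the values used on the project; the field names name_text_en and text are taken from the example above.

```python
from urllib.parse import urlencode


def phrase_clause(field, text, slop=0, boost=None):
    """Build a quoted phrase clause for the standard query parser."""
    # Escape backslashes and double quotes so the phrase stays intact.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    clause = f'{field}:"{escaped}"'
    if slop:
        clause += f"~{slop}"    # slop: allowed distance between the terms
    if boost is not None:
        clause += f"^{boost}"   # boost: weight given to this clause
    return clause


def dismax_params(user_query):
    """Request parameters that let dismax handle phrases and boosts."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": "name_text_en^2 text",  # fields searched term by term
        "pf": "name_text_en^10",      # extra boost when the whole phrase matches
        "ps": "2",                    # slop applied to the pf phrase match
    }


print(phrase_clause("name_text_en", "Harry Potter", slop=2, boost=10))
# name_text_en:"Harry Potter"~2^10
print(urlencode(dismax_params("Harry Potter")))
```

With dismax, the qf, pf and ps values can also live in the handler defaults in solrconfig.xml, which keeps relevancy tuning out of the application code.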
3. Using the default example configuration
Solr provides an example configuration, schema and sample data to make it easy to evaluate.
Like any enterprise-grade tool, it requires environment- and workload-specific configuration to get the best out of it. Whilst most of this is contained within the solrconfig.xml, attention should also be given to amending the logging configuration from the default 'INFO' level.
The installation in question appeared to have an out-of-the-box example configuration with a lightly modified schema where the example fields had been removed. For instance, the dismax handler configuration still referenced the now-missing example fields in query fields, boost fields etc. We were able to remove most of the request handlers after examining usage statistics. This led to a significantly smaller and more maintainable configuration with the added benefit of improved capacity.
Another configuration improvement was the addition of cache-warming with specific queries relating to the site's use of faceted navigation.
4. No synonyms to match aliases or catch common typos
Let's imagine that you are selling sudoku books and games. Ideally you should serve the results that the user expects: they may not realise that 'suduko' is the incorrect spelling and they'll probably want to see 'Su Doku' results too. These cases were addressed using the SynonymFilterFactory, but watch out for synonym terms containing spaces as these may change your expansion strategy!
See the Solr Wiki for more details.
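The idea behind an explicit synonym mapping can be sketched in a few lines. This is a toy model only, not how SynonymFilterFactory is implemented: the rule text follows the 'a, b => c' explicit mapping style of a synonyms file, and the function names are invented for the example. It also shows why terms containing spaces need care, since 'su doku' only matches if both tokens are seen together.

```python
# Toy rules in the explicit-mapping style of a synonyms file.
SYNONYM_RULES = """
# misspellings and aliases map to the canonical term
suduko, su doku => sudoku
"""


def parse_rules(text):
    """Map each left-hand phrase (as a token tuple) to its replacement tokens."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # this sketch handles explicit mappings only
        left, right = line.split("=>", 1)
        target = right.lower().split()
        for phrase in left.split(","):
            tokens = tuple(phrase.lower().split())
            if tokens:
                mapping[tokens] = target
    return mapping


def apply_synonyms(query, mapping):
    """Replace the longest matching token sequence at each position."""
    tokens = query.lower().split()
    longest = max((len(key) for key in mapping), default=1)
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + size])
            if key in mapping:
                out.extend(mapping[key])
                i += size
                break
        else:
            out.append(tokens[i])  # no rule matched at this position
            i += 1
    return out


rules = parse_rules(SYNONYM_RULES)
print(apply_synonyms("Su Doku books", rules))  # ['sudoku', 'books']
print(apply_synonyms("suduko", rules))         # ['sudoku']
```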
5. Relying on dynamic fields
This application was initially using approximately 20 fields, of which 5 had been defined as static fields. All of the dynamic fields were configured to be stored - whilst this was very useful in enabling the extraction of a sample dataset in XML using a Groovy script to walk the index, it is inadvisable from a performance standpoint. Using static fields gave far more granular control.
6. Clumsily hiding SolrJ
The architects of the e-commerce package had decided that they would abstract the underlying search implementation - whilst this is a sensible design decision, the way it was developed caused a number of issues.
The search query object had a slightly leaky abstraction so that you could add Solr parameters - this, combined with the ability to add a raw query string, actually helped greatly for the first iteration. The class responsible for constructing the SolrJ (Solr Java client) query object is proprietary and privately instantiated. Furthermore, it always set two AND clauses and then ANDed the raw query; this prevented fine-grained query control and the use of function sub-queries such as date boosting. Note that the AND clauses the package dictated should actually have been applied as filter queries (fq).
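The difference between ANDing restrictions into the main query and passing them as filter queries can be sketched as follows. The restriction clauses catalog:books and in_stock:true are hypothetical stand-ins, as the actual clauses the package added are not given here; the function names are likewise invented for the example.

```python
from urllib.parse import urlencode

# Hypothetical restrictions standing in for the two clauses the package added.
RESTRICTIONS = ["catalog:books", "in_stock:true"]


def anded_params(raw_query, restrictions):
    """What the package effectively did: everything ANDed into q."""
    clauses = list(restrictions) + [f"({raw_query})"]
    return [("q", " AND ".join(clauses))]


def filtered_params(raw_query, restrictions):
    """The suggested shape: q carries the user query, fq the restrictions."""
    params = [("q", raw_query)]
    params.extend(("fq", clause) for clause in restrictions)
    return params


print(urlencode(anded_params("harry potter", RESTRICTIONS)))
print(urlencode(filtered_params("harry potter", RESTRICTIONS)))
```

Filter queries restrict the result set without contributing to the score and are cached independently of the main query, so the second shape leaves q free for phrase matching, boosting and function queries.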
With the benefit of experience, the package should have looked to use the dismax handler. This would have enabled a clean abstraction, simplified the search client code and given us the ability to control boosting, phrase slop etc. within the configuration. Additionally, had they followed Spring Framework best practice, by coding to interfaces and utilising dependency injection, then it would have been easy to provide a different implementation for constructing the SolrJ query object.
The implications of this were that with the next iteration, we would need to look at sub-classing the whole e-commerce package search service implementation to regain absolute control of the Solr queries.
7. The installation was using Solr 1.4
This release isn't even readily available from the archives (1.4.1 is the oldest release); not staying up to date means that they are missing out on important bug fixes, performance improvements and new features. An alternative would be to upgrade to Lucid Works Enterprise.
Reviewing a Solr implementation
You may find the following checklist of warning signs helpful:
- irrelevant search results
- searching and then performing a database lookup
- storing data unnecessarily
- not setting the defaultSearchField appropriately
- using the default logging configuration
- using the example configuration