Eating Dogfood with Lucene
Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.
So, for the past few weeks I've been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.
It's currently running at jirasearch.mikemccandless.com and it's still quite rough (feedback welcome!).
It's a good showcase of a number of Lucene features:
- Highlighting using the new PostingsHighlighter; for example, try searching for
- Autosuggest, using the not-yet-committed
- Sorting by various fields, including a blended recency and relevance sort.
- A few synonym examples, for example try searching for
- Near-real-time searching, and controlled searcher versions: the server uses
- ToParentBlockJoinQuery: each issue is indexed as a parent document, and each comment on the issue is indexed as a separate child document. This lets the server know which specific comment, along with its metadata, matched the query; if you click on that comment in the highlighted results, it takes you to that comment in Jira. This is very helpful for mega-issues!
- Okapi BM25 for ranking.
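To make the parent/child layout concrete, here is a minimal plain-Java sketch of the block-join idea. It is a toy model, not Lucene code: the class `BlockJoinSketch`, its methods, and the `ISSUE-n` keys are all hypothetical. It assumes the layout Lucene's block join relies on, where the children of a block are added first and the parent is the last document of the block, so a matching child can be joined to its parent by looking for the next parent at or after it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of block-join indexing; illustrative only, not Lucene's API.
class BlockJoinSketch {
    private final List<String> docs = new ArrayList<>(); // doc ID -> text
    private final BitSet parents = new BitSet();         // set bit = parent (issue) doc

    // Index one issue as a block: comments (children) first, the issue (parent) last.
    void addIssue(String issueKey, List<String> comments) {
        docs.addAll(comments);
        parents.set(docs.size());
        docs.add(issueKey);
    }

    // Join a child doc ID to its parent: the next parent doc at or after it.
    int parentOf(int childDocId) {
        return parents.nextSetBit(childDocId);
    }

    // For every comment containing the term, report "issueKey <- comment".
    List<String> searchComments(String term) {
        List<String> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!parents.get(id) && docs.get(id).contains(term)) {
                hits.add(docs.get(parentOf(id)) + " <- " + docs.get(id));
            }
        }
        return hits;
    }
}
```

Because each hit carries both the parent and the specific child that matched, a result page built this way can link straight to the matching comment rather than only to the issue.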
The drill-downs on the left also show a number of features from Lucene's facet module:
- Drill sideways for all fields, so that the field does not disappear when you drill down on it.
- Dynamic range faceting: the Updated drill-down is computed dynamically, e.g. all issues updated in the past week.
- Hierarchical fields, which are simple since the Lucene facet module supports hierarchy natively. Only the Component dimension is hierarchical, e.g. look at the Component drill-down for all Lucene core issues.
- Multi-select faceting (hold down the shift key when clicking on a value), e.g. all improvements and new features.
- Multi-valued fields (e.g.
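The drill-sideways behavior can be illustrated with a small plain-Java sketch. This is not the facet module's API and says nothing about how Lucene computes the counts efficiently; the class `DrillSidewaysSketch`, the helper `newIssue`, and the two dimensions are hypothetical. The point it shows is which issues each dimension's counts are taken over: the counts for a dimension ignore the drill-down on that same dimension but respect all the others, and several selected values within one dimension are OR'ed together (multi-select).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy drill-sideways counting over in-memory issues; not Lucene's DrillSideways API.
class DrillSidewaysSketch {

    // Hypothetical helper: an issue with two facet dimensions.
    static Map<String, String> newIssue(String type, String status) {
        Map<String, String> fields = new HashMap<>();
        fields.put("Type", type);
        fields.put("Status", status);
        return fields;
    }

    // True if the issue passes every drill-down except the one on skipDim.
    // Several selected values within one dimension are OR'ed (multi-select).
    static boolean matches(Map<String, String> issue,
                           Map<String, Set<String>> selections,
                           String skipDim) {
        for (Map.Entry<String, Set<String>> sel : selections.entrySet()) {
            if (sel.getKey().equals(skipDim)) {
                continue;
            }
            String value = issue.get(sel.getKey());
            if (value == null || !sel.getValue().contains(value)) {
                return false;
            }
        }
        return true;
    }

    // Counts for one dimension, taken over issues that match all OTHER drill-downs.
    static Map<String, Integer> counts(List<Map<String, String>> issues,
                                       Map<String, Set<String>> selections,
                                       String dim) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, String> issue : issues) {
            String value = issue.get(dim);
            if (value != null && matches(issue, selections, dim)) {
                result.merge(value, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

With four issues (Bug/Open, Bug/Closed, Improvement/Open, New Feature/Open) and a drill-down on Type = Bug, the Status counts cover only the two bugs, while the Type counts still list Bug, Improvement and New Feature, which is why the field does not disappear after you click on it.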
This is really eating two different dog foods: first, as a developer I see what one must go through to build a search server on top of Lucene's APIs, but second, as an end user, I experience the resulting search user interface whenever I need to find a Lucene or Solr issue. It's like having to eat both wet and dry dog food at once, and both kinds of dog food have uncovered numerous issues!
The issues ranged from outright bugs, such as PostingsHighlighter picking the worst snippets instead of the best (LUCENE-4826), to missing features like dynamic numeric range facets (LUCENE-4965), to issues that make consuming Lucene's APIs awkward, especially when mixing different features, such as the difficulty of mixing non-range and range facets with DrillSideways (LUCENE-4980) and the difficulty of using NRTManager with both a taxonomy index and a search index (LUCENE-4967), to plain inefficiencies, such as the inability to customize how PostingsHighlighter loads its field values (LUCENE-4846).
The process is far from done! There are still a number of issues that need fixing. For example, it's not easy to mix ToParentBlockJoinQuery and grouping, which is frustrating because fields like who reported an issue, severity, issue status and component would all be natural group-by fields. Some issues, such as the inability of PostingsFormatter to render directly to a JSONObject (LUCENE-4896), are still open because they are challenging to fix cleanly. Others, such as the infix suggester (LUCENE-4845), are in limbo because of disagreements on the best approach, while still others, like the BlendedComparator used to sort by mixed relevance and recency, I just haven't pushed back into Lucene yet.
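The post does not describe how the blended sort combines relevance and recency, so the following is only one hypothetical way to do it, written in plain Java rather than against Lucene's comparator API. The class `BlendedSortSketch`, the `Hit` type and the half-life parameter are all assumptions: each hit's relevance score is multiplied by an exponential decay, so that it halves for every `halfLifeDays` of age.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical relevance/recency blend; not the actual BlendedComparator.
class BlendedSortSketch {

    static final class Hit {
        final String key;       // issue key
        final double relevance; // score from the ranking function
        final double ageDays;   // days since the issue was last updated

        Hit(String key, double relevance, double ageDays) {
            this.key = key;
            this.relevance = relevance;
            this.ageDays = ageDays;
        }
    }

    // Assumed blend: relevance halves for every halfLifeDays of age.
    static double blendedScore(Hit hit, double halfLifeDays) {
        return hit.relevance * Math.pow(0.5, hit.ageDays / halfLifeDays);
    }

    // Return a copy of the hits ordered by blended score, best first.
    static List<Hit> sort(List<Hit> hits, double halfLifeDays) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator
                .comparingDouble((Hit h) -> blendedScore(h, halfLifeDays))
                .reversed());
        return sorted;
    }
}
```

A multiplicative decay keeps the ordering driven by relevance among issues of similar age, while a short half-life pushes the sort toward pure recency. An additive blend or a hard recency cutoff would be equally reasonable choices.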
There are plenty of ways to provoke an error from the server; here are two fun examples: try fielded search such as summary:python, or a multi-select drilldown on the
Much work remains and I'll keep on eating both wet and dry dog food: it's a very productive way to find problems in Lucene!
Published at DZone with permission of Michael McCandless, DZone MVB. See the original article here.