Over a million developers have joined DZone.

Apache Lucene: Fast Range Faceting Using Segment Trees and the Java ASM Library

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets.

Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit, we first calculate the value (the distance, the age, the price) to be aggregated, and then lookup which ranges match that value and increment its counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges.

But this is inefficient!

Segment Trees

There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges).

If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges so we can't do that.

Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, each elementary interval, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper number is exclusive:

 0:  0 – 10
 1:  0 – 20
 2: 10 – 30
 3: 15 – 50
 4: 40 – 70

The elementary intervals (think Venn diagram!) are:

  -∞ – 0
   0 – 10
  10 – 15
  15 – 20
  20 – 30
  30 – 40
  40 – 50
  50 – 70
  70 – ∞

Finally, you build a binary tree on top of the elementary ranges, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary range is assigned the follow range indices as outputs:

  -∞ – 0 →
   0 – 10 → 0
  10 – 15 → 1, 2
  15 – 20 → 1, 2, 3
  20 – 30 → 2, 3
  30 – 40 → 3
  40 – 50 → 3, 4
  50 – 70 → 4
  70 – ∞  →

Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some, 2 in this example, may have no matching input ranges.

Looking Up Matched Ranges

I've pushed all sources described below to new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks.

I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method:

  private int lookup(Node node, long v, int[] answers, int upto) {
    if (node.outputs != null) {
      for(int range : node.outputs) {
        answers[upto++] = range;
      }
    }
    if (node.left != null) {
      if (v <= node.left.end) {
        upto = lookup(node.left, v, answers, upto);
      } else {
        upto = lookup(node.right, v, answers, upto);
      }
    }

    return upto;
  }

This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower.

After that I tested a code specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges:

  void lookup(long v, int[] answers) {
    int upto = 0;
    if (v <= 19) {
      if (v <= 9) {
        if (v >= 0) {
          answers[upto++] = 0;
          answers[upto++] = 1;
        }
      } else {
        answers[upto++] = 1;
        answers[upto++] = 2;
        if (v >= 15) {
          answers[upto++] = 3;
        }
      }
    } else {
      if (v <= 39) {
        answers[upto++] = 3;
        if (v <= 29) {
          answers[upto++] = 2;
        }
      } else {
        if (v <= 49) {
          answers[upto++] = 3;
          answers[upto++] = 4;
        } else {
          if (v <= 69) {
            answers[upto++] = 4;
          }
        }
      }
    }
  }

Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases).

As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or less), its performance is often better than the Java segment tree.

The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >=0 in your usage), which can save an if statement or two in the lookup method.

Counting All Matched Ranges

While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committee Robert Muir, we realized Lucene's range faceting does not need to know the ranges for each value; instead, it only requires the aggregated counts for each range in the end, after seeing many values.

This leads to an optimization: compute counts for each elementary interval and then in the end, roll up those counts to get the count for each range. This will only work for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit.

So based on that approach, I created a new LongRangeCounter abstract base class, and the SimpleLongRangeCounter Java implementation, and also the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene.

Segment trees are normally always "perfectly" balanced but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now.

Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can be become quite large (but, still O(N*log(N)) in size), which may cause other problems. On balance I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Michael Mccandless, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}