Inside the Apache Solr JSON Facet API
Solr 5 includes a re-written faceted search and analytics module with a structured JSON API to control the faceting and analytics commands. Here’s how it works.
Since I joined Cloudera a few years ago to help bring search-powered analytics to Cloudera’s platform, I’ve been working actively upstream alongside the rest of the Solr community to develop new functionality that will drive more interesting applications on Cloudera Search (which is based on an integration of Solr with the Apache Hadoop ecosystem). The following is a re-post from my personal blog, written at the time of code check-in, describing one of these features: improved support for nested facets via JSON. (Note: this feature is targeted for a future release of Cloudera Enterprise, and thus is not yet supported for production use.)
Why JSON?
The structured nature of nested sub-facets is more naturally expressed in a nested structure like JSON than in the flat structure that normal query parameters provide. For that reason, starting in 5.0, Solr includes a JSON Facet API. The Facet API is now part of the JSON Request API, so a complete request may be expressed in JSON.
Goals of the new faceting module include:
- First-class JSON support
- Easier programmatic construction of complex, nested facet commands
- Support for a much more canonical response format that is easier for clients to parse
- First-class analytics support
- Ability to sort facet buckets by any calculated metric
- A cleaner way to do distributed faceting
- Better integration with other search features
Of course, if you prefer to use Solr’s existing faceting capabilities, that’s fine, too. (You can even use both simultaneously if you want to!)
Next, let’s get into the details. (Note: Some examples here use syntax supported only in later Solr 5 releases, or even Solr 6.)
Ease of Use
Some of the ease-of-use enhancements over traditional Solr faceting come from the inherently nested structure of JSON.
As an example, here is the faceting command for two different range facets using Solr’s flat API:
&facet=true
&facet.range={!key=age_ranges}age
&f.age.facet.range.start=0
&f.age.facet.range.end=100
&f.age.facet.range.gap=10
&facet.range={!key=price_ranges}price
&f.price.facet.range.start=0
&f.price.facet.range.end=1000
&f.price.facet.range.gap=50
And here is the equivalent faceting command in the new JSON Faceting API:
{
  age_ranges:{
    type:range,
    field:age,
    start:0,
    end:100,
    gap:10
  },
  price_ranges:{
    type:range,
    field:price,
    start:0,
    end:1000,
    gap:50
  }
}
These aren’t even nested facets, but already, one can see how much nicer the JSON API looks. With deeply nested sub-facets and statistics, the clarity of the inherently nested JSON API only grows.
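For programmatic construction, the same two range facets can be built as an ordinary nested data structure and then serialized. The following is a minimal sketch in Python using only the standard library; the variable names are illustrative and not part of Solr.

```python
import json

# Each key is a facet name; each value describes one range facet.
facets = {
    "age_ranges": {"type": "range", "field": "age", "start": 0, "end": 100, "gap": 10},
    "price_ranges": {"type": "range", "field": "price", "start": 0, "end": 1000, "gap": 50},
}

# Serialize to strict JSON, ready to be sent as the value of json.facet.
json_facet = json.dumps(facets)
print(json_facet)
```

Adding or removing a facet is then a dictionary update rather than a change to several coordinated `f.<field>.facet.range.*` parameters.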
JSON Extensions
A number of JSON extensions have been implemented to further increase the clarity and ease of constructing a JSON faceting command by hand. For example:
{ // this is a single-line comment, which can help add clarity to large JSON commands
  /* traditional C-style comments are also supported */
  x:"avg(price)", // Simple strings, such as the key x, can occur unquoted
  y:'unique(manu)' // Strings can also use single quotes (easier to embed in another String)
}
Debugging JSON
Nicely-indented JSON is very easy to understand. If you get a large piece of non-indented JSON somehow and are trying to make sense of it, you can cut and paste into an online validator like JSON Lint or JSON Formatter.
Both of these validators will indent your JSON, even when it contains extensions unsupported by them (such as comments or bare strings).
Facet Types
There are two types of facets: one that breaks up the domain into multiple buckets, and aggregations or facet functions that provide information about the set of documents belonging to each bucket.
Faceting can be nested. Any bucket produced by faceting can further be broken down into multiple buckets by a subfacet.
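As a rough illustration of nesting, the sketch below builds a terms facet whose buckets are each broken down further by a range sub-facet and also carry one statistic. It assumes the `facet` key for sub-facets and the `cat` and `price` fields used elsewhere in this article; the facet names `categories`, `avg_price`, and `price_ranges` are arbitrary labels chosen for the example.

```python
import json

# Outer facet: one bucket per category.
# Inner "facet" block: evaluated separately for every category bucket.
nested = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "facet": {
            # A statistic computed per category bucket.
            "avg_price": "avg(price)",
            # A sub-facet that splits each category bucket into price ranges.
            "price_ranges": {
                "type": "range",
                "field": "price",
                "start": 0,
                "end": 100,
                "gap": 20,
            },
        },
    }
}

print(json.dumps(nested, indent=2))
```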
Statistics Are Facets
Statistics are now fully integrated into faceting. Since we start off with a single facet bucket with a domain defined by the main query and filters, we can even ask for statistics for this top-level bucket, before breaking up into further buckets via faceting. Example:
json.facet={
  x:"avg(price)", // the average of the price field will appear under "x"
  y:"unique(manufacturer)" // the number of unique manufacturers will appear under "y"
}
See facet functions for a complete list of the available aggregation functions.
JSON Facet Syntax
The general form of a JSON facet command is:
<facet_name>:{<facet_type>:<facet_parameter(s)>}
Example:
top_authors:{terms:{field:authors,limit:5}}
After Solr 5.2, a flatter structure with a “type” field may also be used:
<facet_name>:{"type":<facet_type>,<other_facet_parameter(s)>}
Example:
top_authors:{type:terms,field:authors,limit:5}
The results will appear in the response under the facet name specified. Facet commands are specified using json.facet request parameters.
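The two forms carry the same information, which a small conversion makes concrete. The helper below is a hypothetical illustration (it is not part of Solr or any client library) and only handles the case where the facet parameters are given as an object, not the short string form.

```python
def to_flat_form(facet):
    """Convert {"<facet_type>": {params}} into {"type": "<facet_type>", **params}."""
    if "type" in facet:
        return dict(facet)  # already in the flat form
    # The nested form has exactly one key: the facet type.
    (facet_type, params), = facet.items()
    return {"type": facet_type, **params}


nested_form = {"terms": {"field": "authors", "limit": 5}}
flat_form = to_flat_form(nested_form)
print(flat_form)  # {'type': 'terms', 'field': 'authors', 'limit': 5}
```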
Test Using Curl
To test out different facet requests by hand, it’s easiest to use curl from the command line. Example:
$ curl http://localhost:8983/solr/query -d 'q=*:*&rows=0&
json.facet={
  categories:{
    type:terms,
    field:cat,
    sort:{x:desc},
    facet:{
      x:"avg(price)",
      y:"sum(price)"
    }
  }
}
'
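The same request can be prepared from a script. The sketch below uses Python's standard library to form-encode the parameters that curl sends with `-d`; it builds the request object without sending it, since sending requires a running Solr instance at the URL shown in the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

facet = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "sort": {"x": "desc"},
        "facet": {"x": "avg(price)", "y": "sum(price)"},
    }
}

# Form-encode the same three parameters as the curl command.
body = urlencode({"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}).encode("utf-8")

# Supplying data makes urllib issue a POST, as curl does with -d.
request = Request("http://localhost:8983/solr/query", data=body)

# To actually send it against a running Solr instance:
# from urllib.request import urlopen
# with urlopen(request) as resp:
#     result = json.load(resp)
```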
Terms Facet
The terms facet, or field facet, produces buckets from the unique values of a field. The field needs to be indexed or have docValues.
The simplest form of the terms facet:
top_genres:{terms:genre_field}
An expanded form allows for more parameters:
top_genres:{
  type:terms,
  field:genre_field,
  limit:3,
  mincount:2
}
Example response:
"top_genres":{
  "buckets":[
    {
      "val":"Science Fiction",
      "count":143},
    {
      "val":"Fantasy",
      "count":122},
    {
      "val":"Biography",
      "count":28}
  ]
}
Parameters:
Query Facet
The query facet produces a single bucket that matches the specified query.
Here’s an example of the simplest form of the query facet:
high_popularity:{query:"popularity:[8 TO 10]"}
An expanded form allows for more parameters (or sub-facets/facet functions):
high_popularity:{
  type:query,
  q:"popularity:[8 TO 10]",
  facet:{average_price:"avg(price)"}
}
Example response:
"high_popularity":{
  "count":147,
  "average_price":74.25
}
Range Facet
The range facet produces multiple range buckets over numeric fields or date fields.
Range facet example:
prices:{
  type:range,
  field:price,
  start:0,
  end:100,
  gap:20
}
Example response:
"prices":{
  "buckets":[
    {
      "val":0.0, // the bucket value represents the start of each range. This bucket covers 0-20
      "count":5},
    {
      "val":20.0,
      "count":3},
    {
      "val":40.0,
      "count":2},
    {
      "val":60.0,
      "count":1},
    {
      "val":80.0,
      "count":1}
  ]
}
To ease migration, these parameter names, values, and semantics were taken directly from the old-style (non-JSON) Solr range faceting.
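Because each bucket in the range response reports only the start of its range, a client that wants to display full ranges has to add the gap itself. A minimal sketch, assuming the bucket values from the example response and the gap of 20 from the request:

```python
gap = 20  # the gap used in the range facet request

buckets = [
    {"val": 0.0, "count": 5},
    {"val": 20.0, "count": 3},
    {"val": 40.0, "count": 2},
    {"val": 60.0, "count": 1},
    {"val": 80.0, "count": 1},
]

# Each bucket's "val" is the start of its range; the end is start + gap.
ranges = [(b["val"], b["val"] + gap, b["count"]) for b in buckets]

for start, end, count in ranges:
    print(f"{start:g}-{end:g}: {count}")
```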
Parameters:
Common Parameters
Parameters that all faceting methods have in common include:
- domain: facet domain transformations, to change the incoming domain of the facet command before faceting is executed. This is useful for multi-select faceting and nested document (block join) faceting.
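As one possible illustration of a domain transformation for multi-select faceting, the sketch below tags a filter and then excludes that tag from a facet's domain, so the facet still counts values the filter would otherwise hide. It assumes the `{!tag=...}` filter syntax and the `excludeTags` domain option, whose availability depends on the Solr version; the `color` field and the `COLOR` tag are hypothetical names chosen for the example.

```python
import json

# Hypothetical scenario: the user has filtered on color, but the color facet
# should still show counts for every color.
params = {
    "q": "*:*",
    # Tag the filter so a facet can refer to it by name.
    "fq": "{!tag=COLOR}color:Blue",
    "json.facet": json.dumps({
        "colors": {
            "type": "terms",
            "field": "color",
            # Change this facet's domain by ignoring filters tagged COLOR.
            "domain": {"excludeTags": "COLOR"},
        }
    }),
}

print(params["json.facet"])
```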
Conclusion
Hopefully, you now have a good understanding of the JSON API introduced in Solr 5. Again, this feature is scheduled to ship/be certified in a future Cloudera release but is not yet supported for production use.
Yonik Seeley is a Software Engineer at Cloudera, a committer and PMC member for Apache Lucene, and the creator of Solr. Previously, he was chief open source architect and cofounder at LucidWorks.
Published at DZone with permission of Yonik Seeley. See the original article here.