Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

# Use Prefix Operators instead of Boolean Operators

· Java Zone ·
Free Resource

Comment (1)

Save
{{ articles[0].views | formatCount}} Views

Get the Edge with a Professional Java IDE. 30-day free trial.

The following is written with Solr users in mind, but the principles apply to Lucene users as well.

I really dislike the so called “Boolean Operators” (“AND”, “OR”, and “NOT”) and generally discourage people from using them. It’s understandable that novice users may tend to think about the queries they want to run in those terms, but as you become more familiar with IR concepts in general, and what Solr specifically is capable of, I think it’s a good idea to try to “set aside childish things” and start thinking (and encouraging your users to think) in terms of the superior “Prefix Operators” (“+”, “-”).

## Background: Boolean Logic Makes For Terrible Scores

Boolean Algebra is (as my father would put it) “pretty neat stuff” and the world as we know it most certainly wouldn’t exist with out it. But when it comes to building a search engine, boolean logic tends to not be very helpful. Depending on how you look at it, boolean logic is all about truth values and/or set intersections. In either case, there is no concept of “relevancy” — either something is true or it’s false; either it is in a set, or it is not in the set.

When a user is looking for “all documents that contain the word ‘Alligator’” they aren’t going to very be happy if a search system applied simple boolean logic to just identify the unordered set of all matching documents. Instead algorithms like TF/IDF are used to try and identify the ordered list of matching documents, such that the “best” matches come first. Likewise, if a user is looking for “all documents that contain the words ‘Alligator’ or ‘Crocodile’”, a simple boolean logic union of the sets of documents from the individual queries would not generate results as good as a query that took into the TF/IDF scores of the documents for the individual queries, as well as considering which documents matches both queries. (The user is probably more interested in a document that discusses the similarities and differences between Alligators to Crocodiles then in documents that only mention one or the other a great many times).

This brings us to the crux of why I think it’s a bad idea to use the “Boolean Operators” in query strings: because it’s not how the underlying query structures actually work, and it’s not as expressive as the alternative for describing what you want.

## BooleanQuery: Great Class, Bad Name

To really understand why the boolean operators are inferior to the prefix operators, you have to start by considering the underlying implementation. The BooleanQuery class is probably one of the most misleading class names in the entire Lucene code base because it doesn’t model simple boolean logic query operations at all. The basic function of a BooleanQuery is:

1. A BooleanQuery consists of one or more BooleanClauses, each of which contains two pieces of information:

• A nested Query
• An Occur flag, which has one of three values

• MUST – indicating that documents must match this nested Query in order for the document to match the BooleanQuery, and the score from this subquery should contribute to the score for the BooleanQuery
• MUST_NOT – indicating that documents which match this nested Query are prohibited from matching the BooleanQuery
• SHOULD – indicating that documents which match this nested Query should have their score from the nested query contribute to the score from the BooleanQuery, but documents can be a match for the BooleanQuery even if they do not match the nested query

2. If a BooleanQuery contains no MUST BooleanClauses, then a document is only considered a match against the BooleanQuery if one or more of the SHOULD BooleanClauses is a match.
3. The final score of a document which matches a BooleanQuery is based on the sum of the scores from all the matching MUST and SHOULD BooleanClauses, multiplied by a “coord factor” based on the ratio of the number of matching BooleanClauses to the total number of BooleanClauses in the BooleanQuery.

These rules are not exactly simple to understand. They are certainly more complex then boolean logic truth tables, but that’s because they are more powerful. The examples below show how easy it is to implement “pure” boolean logic with BooleanQuery objects, but they only scratch the surface of what is possible with the BooleanQuery class:

• Conjunction: (X ∧ Y)

```BooleanQuery q = new BooleanQuery();
```
• Disjunction: (X ∨ Y)

```BooleanQuery q = new BooleanQuery();
```
• Negation: (X ¬ Z)

```BooleanQuery q = new BooleanQuery();
```

## Query Parser: Prefix Operators

In the Lucene QueryParser (and all of the other parsers that are based on it, like DisMax and EDisMax) the “prefix” operators “+” and “-” map directly to the Occur.MUST and Occur.MUST_NOT flags, while the absence of a prefix maps to the Occur.SHOULD flag by default. (If you have any suggestions for a one character prefix syntax that could be used to explicitly indicate Occur.SHOULD, please comment with your suggestions, I’ve been trying to think of a good one for years.) So using the prefix syntax, you can express all of the permutations that the BooleanQuery class supports — not just simple boolean logic:

• (+X +Y) … Conjunction, ie: (X ∧ Y)

```BooleanQuery q = new BooleanQuery();
```
• (X Y) … Disjunction, ie: (X ∨ Y)

```BooleanQuery q = new BooleanQuery();
```
• (+X -Z) … Negation, ie: (X ¬ Z)

```BooleanQuery q = new BooleanQuery();
```
• ((X Y) -Z) … ((X ∨ Y) ¬ Z)

```BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
```
• (X Y -Z) … Not expressible in simple boolean logic

```BooleanQuery q = new BooleanQuery();
```

Note in particular the differences between the last two examples. (X Y -Z) is a single BooleanQuery object containing three clauses, while ((X Y) -Z) is a BooleanQuery containing two clauses, one of which is a nested BooleanQuery containing two clauses. In both cases a document must match either “X” or “Y” and it can not match against “Z” (so the set of documents matched by each query will be the same) and in both cases the score of a document will be higher if it matches both the “X” and “Y” clauses; but because of the difference in their structures, the scores for these queries will be different for every document. In particular, the coord factor will cause documents matching only one of “X” or “Y” (but not both) to have extremely different scores from each of these queries. (This assumes that the DefaultSimilarity is being used; it would be possible to write a custom Similarity to force the scores to be equivalent)

## Query Parser: “Boolean Operators”

The query parser also supports the so called “boolean operators” which can also be used to express boolean logic, as demonstrated in these examples:

• (X AND Y) … Conjunction, ie: (X ∧ Y)

```BooleanQuery q = new BooleanQuery();
```
• (X OR Y) … Disjunction, ie: (X ∨ Y)

```BooleanQuery q = new BooleanQuery();
```
• (X NOT Z) … Negation, ie: (X ¬ Z)

```BooleanQuery q = new BooleanQuery();
```
• ((X AND Y) OR Z) … ((X ∧ Y) ∨ Z)

```BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
```
• ((X OR Y) AND Z) … ((X ∨ Y) ∧ Z)

```BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
```
• (X AND (Y NOT Z)) … (X ∧ (Y ¬ Z))

```BooleanQuery q = new BooleanQuery();
BooleanQuery inner = new BooleanQuery();
```

Please note how import it is to use parentheses to combine multiple operators in order in order to generate queries that correctly model boolean logic. As mentioned before, the BooleanQuery class supports an arbitrary number of clauses, so (X OR Y OR Z) is a single BooleanQuery with three clauses — it is not equivalent to either ((X OR Y) OR Z) or (X OR (Y OR Z)) because those result in a BooleanQuery with two clauses, one of which is a nested BooleanQuery. As mentioned above when discussing the prefix operators, the scores from each of those queries will all be different depending on which clauses are matched by each document.

Things definitely get very confusing when these “boolean operators” are used in ways other then those described above. In some cases this is because the query parser is trying to be forgiving about “natural language” style usage of operators that many boolean logic systems would consider a parse error. In other cases, the behavior is bizarrely esoteric:

• Queries are parsed left to right
• NOT sets the Occurs flag of the clause to it’s right to MUST_NOT
• AND will change the Occurs flag of the clause to it’s left to MUST unless it has already been set to MUST_NOT
• AND sets the Occurs flag of the clause to it’s right to MUST
• If the default operator of the query parser has been set to “And”: OR will change the Occurs flag of the clause to it’s left to SHOULD unless it has already been set to MUST_NOT
• OR sets the Occurs flag of the clause to it’s right to SHOULD

Practically speaking this means that NOT takes precedence over AND which takes precedence over OR — but only if the default operator for the query parser has not been changed from the default (“Or”). If the default operator is set to “And” then the behavior is just plain weird.

## In Conclusion

I won’t try to defend or justify the way the query parser behaves when it encounters these “boolean operators”, because in many cases I don’t understand or agree with the behavior myself — but that’s not really the point of this article. My goal isn’t to convince you that the behavior of these operators makes sense, quite the contrary my goal today is to point out that regardless of how these operators are parsed, they aren’t a good representation of the underlying functionality available in the BooleanQuery class.

Do yourself a favor and start thinking about BooleanQuery as a container of arbitrary nested queries annotated with MUST, MUST_NOT, or SHOULD and discover the power that is available to you beyond simple boolean logic.

Many thanks to Bill Dueber whose recent related blog post reminded me that I had some draft notes on this subject floating around my laptop waiting to finished up and posted online.

Source:  http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/

Get the Java IDE that understands code & makes developing enjoyable. Level up your code with IntelliJ IDEA. Download the free trial.

Topics:

Comment (1)

Save
{{ articles[0].views | formatCount}} Views

Opinions expressed by DZone contributors are their own.