Full-Text Search Fundamentals in RavenDB
This article addresses the full-text search functionality in RavenDB; how it works, how to use it, and what bonus features are available.
Join the DZone community and get the full member experience.
Join For FreeRavenDB is a NoSQL database system with full-text search capabilities out of the box. While most databases offer some kind of search functionality, the issues of speed and relevancy are often treated as an afterthought, if they’re addressed at all.
In RavenDB, speed and relevancy are on par with top-tier search engines, querying is highly customizable, and there’s extensive support for extra features like wildcards and suggestions.
In this article, I’ll go through how RavenDB conducts full-text searches, how to adjust the process to suit your needs, and some of the extra features that are available.
The First Step - Indexing
Full-text search should be lightning-fast, even on large text fields. We achieve this using indexing; a process in which we create data structures called indexes designed to facilitate our full-text queries.
Indexes in databases are similar to the ones you find in the back of books. They make the books a bit longer, but they tell you exactly where topics are mentioned and save you from having to search through the entire text. Likewise, a database index requires some extra storage space, but makes queries much, much faster. The more text there is to search, the greater the relative speed and efficiency of using the index
In RavenDB indexing is performed in the background whenever data is added or changed. This allows the server to respond to queries quickly, even after large-scale changes.
Read more about deploying indexes in RavenDB.
The Inverted Index
There are various types of indexes; the kind RavenDB uses is called an "inverted index". To create an inverted index, a text field is split into smaller sections called tokens or terms. Each unique token, with references to its occurrences, is stored in the index.
An inverted index stores all the terms in alphabetical order grouped by the field they appear in. This dictionary or vocabulary of words shows how many times and where each term appears in the original text.
Creating Indexes with Analyzers
The process of splitting the field into tokens is called tokenization, and how tokenization should be performed is determined by the analyzer used.
An analyzer is an object that determines where to split the text to create tokens.
Given the text “Email me, sample@example.com” one analyzer might split the text at whitespace and produce tokens like so:
[Email] [me,] [sample@example.com]
While another might split it on all non-alpha characters and produce these tokens:
[Email] [me] [sample] [example] [com]
Analyzers can also perform a process called filtering in which various filters remove or alter the tokens in different ways. Alterations may include normalizing characters to all lower case, or to versions without diacritics. Punctuation, and stopwords like “the” and “is” are often removed, as are other unhelpful tokens that might impact the quality of the search.
Harnessing the Power of Lucene
RavenDB uses Lucene to power its indexing and full-text search.
Lucene is described on its website as a “Java library providing powerful indexing and search features, as well as spell checking, hit highlighting and advanced analysis/tokenization capabilities.”
Lucene powers search for many large companies such as LinkedIn and Twitter. It’s very fast, efficient, and packed with features. It’s also constantly being updated/improved.
Lucene’s StandardAnalyzer
is perfect for a general search, but many other analyzers are available. You can even create your own if you need something unique.
An important feature of Lucene is that it automatically scores documents on their relevancy to the user’s query and ranks them accordingly in the search results. It’s extremely fast and requires no input or complex understanding from the user, though modifications are possible.
Enabling Full-Text Search in RavenDB
RavenDB by default indexes using LowerCaseKeywordAnalyzer
, which is a custom analyzer that sets the entire field as a single token, then converts all upper case characters to lower case.
Obviously, this isn’t useful for full-text search, so for that, we need to tell RavenDB to use a different analyzer.
Selecting an Analyzer
Selecting an analyzer in the management studio is as simple as editing the index definition and entering the fully qualified name of the analyzer class in the Analyzer field.
To add an analyzer using code is also straightforward. For a field called Content we would use:
Indexes.Add(x => x.Content, FieldIndexing.Search);
which makes the Content field searchable by applying Lucene’s StandardAnalyzer
.
If we wanted to add a specific analyzer, such as SimpleAnalyzer
, we would use the line:
Analyzers.Add(x => x.Content, "SimpleAnalyzer");
Bonus Full-Text Search Features
Anyone who’s used a search engine should be familiar with the useful additional functions that can complement full-text search.
The ability to detect and correct typos or suggest similar terms. The use of prefixes or wildcards. The ability to preview snippets of text with keywords highlighted. These powers and many more are available in RavenDB, and you can read about how to apply them here.
In all, with the help of Lucene, RavenDB offers powerful and fast full-text search capabilities, a simple setup, and extensive customization options. Google eat your heart out.
For more detailed information on functionality or implementation, you can check the RavenDB documentation or read up on Lucene.
Happy searching.
Opinions expressed by DZone contributors are their own.
Comments