Introducing Full-Text Search Capability via the Query Interface
Take a look at Couchbase's full-text search capabilities via the query interface.
Couchbase provides N1QL, a database query language that mirrors SQL for JSON data. Couchbase also provides a Full-Text Search (FTS) engine for text search over JSON data.
Topics This Article Will Cover
- What's good with N1QL?
- What about FTS?
- But why FTS within N1QL?
- Basic N1QL + FTS queries
- Deploying N1QL + FTS
- Syntax(es)
- Abilities and limitations
- N1QL + FTS internals
- Covered-index queries vs non-covered-index queries
- More N1QL + FTS query examples
- What's next?
1. N1QL
N1QL is Couchbase's SQL offering for manipulating JSON data stored within the Couchbase server.
N1QL statements can create, modify, and drop indexes, and can select, insert, update, delete, and upsert JSON documents.
N1QL expressions cover aggregate, arithmetic, collection, comparison, conditional, and other operations.
All of this is backed by global secondary indexes (GSI), which make these operations efficient.
2. FTS
Couchbase's Full-Text Search offering provides extensive capabilities for natural language querying and is meant to enable users to search text across multiple fields in JSON documents stored within a Couchbase server.
It supports language-aware searching and scoring results based on relevancy which can be configured by the user.
FTS sets up fast indexes designed specifically to handle a wide range of text search workloads efficiently.
3. Why Support FTS via N1QL?
N1QL queries can search strings, numbers, arrays, and so on, and with secondary indexing (a B-tree index) they support point lookups and range scans as well. FTS, with its underlying inverted index, is what delivers performance for text search (simple and compound queries).
Applications need to leverage both capabilities through a single API and language.
Compound or complex operations, such as applying aggregations, arithmetic, and other SQL operations over FTS results, become available, which eases development.
FTS becomes accessible through more than just the SDK, curl, or Couchbase's UI.
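To make the inverted-index point concrete, here is a toy sketch in Python. It is not how FTS is implemented: the tokenizer (lowercase, split on whitespace), the function names, and the sample documents are all assumptions made for illustration. It only shows why a term-to-document map answers text lookups without scanning every document.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each lowercased term found in `field` to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in str(doc.get(field, "")).lower().split():
            index[term].add(doc_id)
    return index


def match_all(index, text):
    """Return the IDs of documents that contain every term of `text` (order ignored)."""
    terms = text.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, not taken from travel-sample.
docs = {
    "hotel_1": {"city": "San Francisco"},
    "hotel_2": {"city": "San Diego"},
    "hotel_3": {"city": "Paris"},
}
city_index = build_inverted_index(docs, "city")
print(sorted(match_all(city_index, "san francisco")))
```

A real phrase match would also need term positions; this sketch only intersects posting sets.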
4. N1QL + FTS
Couchbase 6.5 will support the proposed interface between N1QL and FTS. The user will still be required to set up FTS indexes to support their use case just like with GSI. With this new interface, users will be allowed to merge and execute FTS queries from within N1QL queries seamlessly.
A new SEARCH(..) predicate will now be supported as part of the N1QL query syntax. Before getting into the internals of what goes on when a N1QL query with the SEARCH(..) predicate is executed, here is documentation on how to create and manage FTS indexes and here are a few sample queries.
Say I have an FTS index set up over some travel documents and I want to fetch the FTS results (just document IDs) for all documents carrying "San Francisco" in their city field:
SELECT meta().id
FROM `travel-sample` as t
WHERE SEARCH(t, 'city:"San Francisco"');
As you can see above, the FTS query string 'city:"San Francisco"' is embedded within the N1QL query. Alternatively, the N1QL query will also accept an FTS query object as is:
SELECT meta().id
FROM `travel-sample` as t
WHERE SEARCH(t, {"match_phrase": "San Francisco", "field": "city"})
LIMIT 10;
The above example will limit the FTS result set to 10.
Or even an FTS search request object:
SELECT meta().id
FROM `travel-sample` as t
WHERE SEARCH(t, {"query": {"match_phrase": "San Francisco", "field": "city"}, "limit": 10});
The above example also limits the result set to 10, but here FTS will enforce the limit.
OFFSET/LIMIT filters can be set either within the N1QL query syntax or within the FTS search request object. If these parameters are set within the FTS search request, FTS will stream only the requested number of results, and any N1QL parameters will be applied to whatever results FTS has sent. If they are set only within the N1QL query and not within the FTS object, FTS will stream results to N1QL until N1QL has received all the results it needs.
Also, like in the last 3 examples, one doesn't have to explicitly specify the name of the FTS index to pick — N1QL will determine which is the best index among available ones to run the FTS query against. Should one want N1QL to use a specific index (for example — an index named "travel"), here's how:
SELECT meta().id
FROM `travel-sample` as t
WHERE SEARCH(t, {"match_phrase": "San Francisco", "field": "city"}, {"index": "travel"});
Note that for all the example queries above, results are streamed and are not sorted by relevance (tf-idf score, which is the default FTS behavior). Sorting can be achieved by explicitly stating it within the search request or by using N1QL's ORDER BY clause. Pagination of results can likewise be achieved either by explicitly stating it within the search request or by using N1QL's OFFSET and LIMIT clauses.
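When the FTS query is assembled by an application, it can be convenient to build the object form in code and serialize it into the statement. The following Python sketch shows one way to do that with the standard library. The helper name `build_search_statement` and its parameters are hypothetical, and it only handles the object form of the query, not the query-string form.

```python
import json


def build_search_statement(keyspace, fts_query, options=None, limit=None):
    """Return a N1QL statement with an FTS query object embedded in SEARCH(..).

    `fts_query` and `options` are plain dicts; json.dumps renders them with
    straight double quotes, which is the form used by the examples above.
    """
    alias = "t"
    args = [alias, json.dumps(fts_query)]
    if options is not None:
        args.append(json.dumps(options))
    statement = (
        f"SELECT meta().id FROM `{keyspace}` AS {alias} "
        f"WHERE SEARCH({', '.join(args)})"
    )
    if limit is not None:
        statement += f" LIMIT {int(limit)}"
    return statement + ";"


statement = build_search_statement(
    "travel-sample",
    {"match_phrase": "San Francisco", "field": "city"},
    options={"index": "travel"},
    limit=10,
)
print(statement)
```

The keyspace is interpolated directly into the string, so this sketch should only be used with trusted names.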
5. Deploying N1QL + FTS
To allow N1QL queries with the SEARCH(..) capability, the Couchbase cluster needs to have at least 1 node that runs the search service and 1 node that runs the query service (both these services can be configured on the same node as well).
FTS indexes are to be set up by the user to index the necessary content that one wants to search over.
If no FTS indexes were found by N1QL to execute the query, it searches for GSI indexes that can potentially handle the query. In this case, the SEARCH(..) predicate is applied on the intermediate results obtained. While this would work, it isn't the recommended approach since SEARCH(..) evaluation can be expensive.
N1QL queries with SEARCH(..) can be run from the query workbench, curl (via REST), SDK or the command line interface that Couchbase offers.
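As an illustration of the REST route, here is a minimal Python sketch that builds a request for the query service using only the standard library. It assumes the query service listens on its default port 8093 at the `/query/service` endpoint and accepts the statement as a form parameter; the host, user name, and password are placeholders, and the helper name is hypothetical. The request is built but not sent, so the sketch runs without a cluster.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(statement, host="localhost", port=8093,
                        user="Administrator", password="password"):
    """Build (but do not send) a POST request carrying a N1QL statement."""
    url = f"http://{host}:{port}/query/service"
    body = urllib.parse.urlencode({"statement": statement}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_query_request(
    "SELECT meta().id FROM `travel-sample` AS t "
    "WHERE SEARCH(t, 'city:\"San Francisco\"') LIMIT 10;"
)
print(request.full_url)

# To run it against a live cluster (placeholder credentials must be replaced):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```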
5.1 Search Syntax
Here is what's supported within the SEARCH(<field>, <FTS query>, [options]) function:
Argument | Type | Required
field | string | Mandatory
FTS query | string/object | Mandatory
options | object | Optional
5.2 Abilities and Limitations
All the features that the FTS search requests offer will be supported by the N1QL queries.
Here are some highlights on what to throw in the "query" section within the SEARCH(..) function.
FTS indexes that support multiple type mappings will be disallowed in the first release of N1QL + FTS interface so that false-positives wouldn't sneak into the result set.
6. Internal Implementation
Internally, the N1QL + FTS interface supports 4 APIs that the N1QL service will invoke during its prepare and execution phases for queries with the SEARCH(..) predicate.
Phase | API | Description
Prepare | Sargable | An FTS index that can support all fields requested in the search query
Prepare | Pageable | An FTS index that would handle pagination and ordering for the search query
Execution | Search | The FTS index's search function; FTS will return document IDs as the result set
Execution | Verify:Evaluate | Evaluation of documents fetched from the database
Before describing the above APIs in a little more detail, here's a flow chart of operations supported within the interface:
6.1 Sargable
This FTS Index API will be used to determine whether the index is capable of handling the query request without returning false negatives. In the first release, the index is only chosen if all the query fields are indexed within it, or the index has within its definition a dynamic mapping that would cater to all the requested fields. If multiple indexes are sargable for a given query, one that has the least number of fields indexed satisfying the sargability clause is chosen (for performance reasons).
6.2 Pageable
This API will be used to determine whether the results obtained from the underlying FTS index will be pageable. If the index is pageable, N1QL will apply the filters (offset, limit, order by clauses) to the FTS request prior to issuing the Search(..); otherwise the filters are applied to the result set after FTS has shipped it to N1QL.
Note that this API is not invoked if there are no filters (offset, limit, order by) in the N1QL query.
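The decision described in this section can be sketched as a small Python function. This is a conceptual model only, not the actual interface: `plan_pagination` and the shape of the returned plan are hypothetical, and `index_is_pageable` stands in for the Pageable API as a zero-argument callable.

```python
def plan_pagination(index_is_pageable, offset=None, limit=None, order_by=None):
    """Decide which side applies the paging clauses of a N1QL query.

    `index_is_pageable` is only called when at least one clause is present,
    mirroring the note that Pageable is skipped for queries without filters.
    """
    candidates = {"offset": offset, "limit": limit, "order_by": order_by}
    clauses = {name: value for name, value in candidates.items() if value is not None}
    plan = {"pushed_to_fts": {}, "applied_by_n1ql": {}}
    if not clauses:
        return plan  # nothing to page, so the Pageable check never runs
    target = "pushed_to_fts" if index_is_pageable() else "applied_by_n1ql"
    plan[target] = clauses
    return plan


print(plan_pagination(lambda: True, limit=10, order_by="score DESC"))
print(plan_pagination(lambda: False, offset=20, limit=10))
```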
6.3 Search
This API is invoked for the most sargable index and is essentially responsible for getting the search request through to the FTS index and streaming results back from it via a channel. If the amount of data to stream exceeds the available buffer size at N1QL's end of the channel, FTS will backfill the data to a file and a separate routine is responsible for streaming this content to N1QL. This is done so that FTS's resources aren't held up by a slow connection from N1QL. Internally, the gRPC protocol is used for streaming data from FTS to N1QL.
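The backfill idea can be illustrated with a toy Python model. It is a sketch under simplifying assumptions, not the FTS implementation: a bounded `queue.Queue` stands in for the channel, a temporary file stands in for the backfill file, hits are JSON lines, and everything runs sequentially rather than in a separate routine. Once the buffer fills, all later hits go to the file so that the original order is preserved.

```python
import json
import queue
import tempfile


def stream_hits(hits, channel):
    """Send hits to a bounded channel; spill whatever does not fit to a temp file."""
    backfill = tempfile.TemporaryFile(mode="w+")
    spilled = 0
    for hit in hits:
        if spilled == 0:
            try:
                channel.put_nowait(hit)
                continue
            except queue.Full:
                pass  # buffer is full: start spilling from this hit onwards
        backfill.write(json.dumps(hit) + "\n")
        spilled += 1
    backfill.seek(0)
    return backfill, spilled


def drain(channel, backfill):
    """Yield buffered hits first, then the spilled ones, in their original order."""
    while not channel.empty():
        yield channel.get_nowait()
    for line in backfill:
        yield json.loads(line)


hits = [{"id": f"doc{i}"} for i in range(5)]  # hypothetical hits
channel = queue.Queue(maxsize=2)              # deliberately tiny buffer
backfill, spilled = stream_hits(hits, channel)
received = list(drain(channel, backfill))
backfill.close()
print(spilled, len(received))
```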
6.4 Verify:Evaluate
This API is used by N1QL to ensure that the results/hits returned by an FTS index are indeed valid for the query. FTS returns only the document keys and some FTS-related metadata (if requested), such as the score. N1QL fetches documents from KV and invokes Verify for them if and only if the SELECT clause requests some document fields.
7. Covered-Index Queries vs Non-Covered-Index Queries
If the SELECT statement's predicate requests only keys or the metadata, the Verify API isn't invoked at all. This kind of query is referred to as a covered-index query. If the request is for some other document content, N1QL will use the keys returned by FTS to fetch the document data from the Key-Value (KV) store, after which it invokes the Verify method for each of the fetched documents to re-check whether the documents retrieved are indeed valid matches. This kind of query is referred to as a non-covered-index query. Non-covered-index queries tend to have higher latencies as they involve the KV fetch and verify for each hit.
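The difference between the two paths can be sketched in Python. This is a simplified model with hypothetical names: a dict stands in for the KV store, `verify` is any predicate over a fetched document, and the sample documents are invented for illustration.

```python
def execute_search(hit_ids, kv_store, wanted_fields, verify):
    """Model the covered and non-covered paths for a list of FTS hit IDs."""
    if not wanted_fields:
        # Covered: keys (and FTS metadata) are enough, so no fetch and no verify.
        return [{"id": doc_id} for doc_id in hit_ids]
    rows = []
    for doc_id in hit_ids:
        doc = kv_store.get(doc_id)  # KV fetch, one per hit
        if doc is not None and verify(doc):
            row = {"id": doc_id}
            row.update({name: doc.get(name) for name in wanted_fields})
            rows.append(row)
    return rows


# Hypothetical documents and hits.
kv_store = {
    "landmark_1": {"country": "United States", "content": "A"},
    "landmark_2": {"country": "France", "content": "B"},
}
hits = ["landmark_1", "landmark_2"]


def verify(doc):
    return doc["country"].lower() == "united states"


print(execute_search(hits, kv_store, [], verify))           # covered path
print(execute_search(hits, kv_store, ["content"], verify))  # non-covered path
```

The extra latency of the non-covered path comes from the per-hit fetch and verify inside the loop.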
If the user requires other fields of the documents to be part of the result set, a faster approach is to tune the FTS index definition to store the desired fields. Then, in the search request within the SEARCH(..) function, include a section called "fields": ["*"] to fetch all stored fields as part of the result set. This way N1QL does not have to do a separate document fetch and can skip the verify as well, essentially converting a non-covered-index query into a covered-index query at the cost of a larger FTS index.
Consider the following FTS index definition with fields "country" and "content" indexed:
{
"name": "sample",
"type": "fulltext-index",
"params": {
"mapping": {
"default_mapping": {
"enabled": true,
"dynamic": true,
"properties": {
"country": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "country",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"docvalues": true
}
]
},
"content": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "content",
"type": "text",
"store": false,
"index": true,
"docvalues": true
}
]
}
}
}
}
}
}
Here's an example of a slower non-covered index query that fetches the document field "content" for documents that have "united states" in their "country" field:
SELECT meta().id, t.content
FROM `travel-sample` as t
WHERE SEARCH(t, {"query": {"match_phrase": "united states", "field": "country"}})
ORDER BY search_score() DESC
LIMIT 10;
Now, let's update the index definition to store the field "content" which is of interest:
{
"name": "sample",
"type": "fulltext-index",
"params": {
"mapping": {
"default_mapping": {
"enabled": true,
"dynamic": true,
"properties": {
"country": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "country",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"docvalues": true
}
]
},
"content": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "content",
"type": "text",
"store": true,
"index": true,
"docvalues": true
}
]
}
}
}
}
}
}
Now here's the same query that qualifies as a covered-index query:
SELECT meta().id, search_meta().fields.content
FROM `travel-sample` as t
WHERE SEARCH(t, {"query": {"match_phrase": "united states", "field": "country"}, "fields": ["content"]})
ORDER BY search_score() DESC
LIMIT 10;
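The only difference between the two index definitions above is the "store" flag on the "content" field. As a sketch, that change could be scripted in Python as follows. The helper name `with_stored_field` is hypothetical, the sample definition is an abbreviated copy of the one above, and pushing the updated definition back to the cluster is left out.

```python
import copy


def with_stored_field(index_def, field_name):
    """Return a copy of an FTS index definition with `store` enabled for one field."""
    updated = copy.deepcopy(index_def)
    properties = updated["params"]["mapping"]["default_mapping"]["properties"]
    for field in properties[field_name]["fields"]:
        if field["name"] == field_name:
            field["store"] = True
    return updated


# Abbreviated version of the "sample" index definition shown above.
sample = {
    "name": "sample",
    "type": "fulltext-index",
    "params": {"mapping": {"default_mapping": {
        "enabled": True,
        "dynamic": True,
        "properties": {
            "country": {"fields": [
                {"name": "country", "type": "text", "store": False, "index": True}]},
            "content": {"fields": [
                {"name": "content", "type": "text", "store": False, "index": True}]},
        },
    }}},
}

updated = with_stored_field(sample, "content")
props = updated["params"]["mapping"]["default_mapping"]["properties"]
print(props["content"]["fields"][0]["store"])
```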
8. More N1QL + FTS Examples
8.1 More Complex Queries
Run a compound conjunction/disjunction FTS query within a N1QL query: fetch the top 100 document IDs whose category is "landmark" and whose country is "United States", ordered from highest to lowest score (tf-idf, which is FTS's default scoring algorithm).
SELECT meta().id, search_score() as score
FROM `travel-sample` as t
WHERE SEARCH(t, {"conjuncts": [{"match": "landmark", "field": "category"}, {"match_phrase": "united states", "field": "country"}]})
ORDER BY score DESC
LIMIT 100;
Here's an equivalent query with embedded FTS settings within the SEARCH(..) function:
SELECT meta().id, search_score() as score
FROM `travel-sample` as t
WHERE SEARCH(t, {
"query": {
"conjuncts": [
{"match": "landmark", "field": "category"},
{"match_phrase": "united states", "field": "country"}
]
},
"sort": ["-_score"],
"limit": 100
});
Running another query with FTS settings embedded within the SEARCH(..) function... fetch all document IDs that contain within their description field the term gothic without considering score. Here we can optimize the FTS search request to not determine the score at all.
SELECT meta().id
FROM `travel-sample` as t
WHERE SEARCH(t, {"query": {"match": "gothic", "field": "description"}, "score": "none"});
Various supported FTS query types are described in more detail here.
8.2 Query Sargability vs Index Definitions
Before we jump into some examples on how queries are deemed sargable for FTS index definitions, learn more about FTS index definitions here.
Consider the following query, which looks for the term "gothic" in the field "description":
SELECT meta().id, search_score()
FROM `travel-sample`
WHERE SEARCH(`travel-sample`, {"query": {"match": "gothic", "field": "description"}});
In the Couchbase system at hand, let's assume there are several FTS indexes defined.
The first FTS index we encounter has the following definition:
{
"type": "fulltext-index",
"params": {
"mapping": {
"default_mapping": {
"enabled": true,
"dynamic": true
}
}
}
}
This index is what is referred to as a default dynamic index, which covers all fields available across all documents and also includes content within the default field "_all". This default field is what is searched when an FTS query does not carry "field" information for a search criterion. This index is deemed SARGABLE for the above query.
A second FTS index is found to have the following definition:
{
"type": "fulltext-index",
"params": {
"mapping": {
"default_mapping": {
"enabled": true,
"dynamic": false,
"properties": {
"description": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "description",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"include_in_all": false,
"docvalues": true
}
]
}
}
}
}
}
}
This index only has the field "description" indexed, which would cater to the query's request and hence the index is SARGABLE for this query.
A third FTS index is found to have the following definition:
{
"type": "fulltext-index",
"params": {
"mapping": {
"default_mapping": {
"enabled": true,
"dynamic": false,
"properties": {
"city": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "city",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"include_in_all": true,
"docvalues": true
}
]
},
"country": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "country",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"include_in_all": true,
"docvalues": true
}
]
},
"name": {
"enabled": true,
"dynamic": false,
"fields": [
{
"name": "name",
"type": "text",
"store": false,
"index": true,
"include_term_vectors": true,
"include_in_all": false,
"docvalues": true
}
]
}
}
}
}
}
}
This index has a few fields indexed but none of them match the request field "description". The index is deemed NOT-SARGABLE for the query.
N1QL now has the option to select from the first 2 indexes for the query which will be able to deliver accurate results for the query. Since the number of fields indexed within the second index is precise and smaller (and therefore since the search across this index would be faster), N1QL chooses the second index for the execution of the query.
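The selection among the three indexes above can be sketched in Python. This is a simplified model of the rule as described in this article, not the actual Sargable API: it only inspects the default mapping, the index definitions are abbreviated, the function names are hypothetical, and ranking dynamic indexes after explicitly mapped ones is an assumption made to reflect that a dynamic index covers every field.

```python
def text_field(name):
    """Abbreviated property entry for an indexed text field."""
    return {"fields": [{"name": name, "type": "text", "index": True}]}


def indexed_fields(index_def):
    """Return (is_dynamic, explicitly indexed field names) for the default mapping."""
    mapping = index_def["params"]["mapping"]["default_mapping"]
    names = set()
    for prop in mapping.get("properties", {}).values():
        for field in prop.get("fields", []):
            if field.get("index"):
                names.add(field["name"])
    return mapping.get("dynamic", False), names


def is_sargable(index_def, query_fields):
    """An index qualifies if it is dynamic or indexes every queried field."""
    dynamic, names = indexed_fields(index_def)
    return dynamic or set(query_fields) <= names


def choose_index(index_defs, query_fields):
    """Pick the sargable index with the fewest indexed fields (dynamic ones last)."""
    candidates = []
    for name, index_def in index_defs.items():
        if is_sargable(index_def, query_fields):
            dynamic, names = indexed_fields(index_def)
            candidates.append((dynamic, len(names), name))
    return min(candidates)[2] if candidates else None


def definition(dynamic, field_names):
    props = {name: text_field(name) for name in field_names}
    return {"type": "fulltext-index", "params": {"mapping": {"default_mapping": {
        "enabled": True, "dynamic": dynamic, "properties": props}}}}


index_defs = {
    "first": definition(True, []),
    "second": definition(False, ["description"]),
    "third": definition(False, ["city", "country", "name"]),
}
print(choose_index(index_defs, ["description"]))
```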
9. Future
Establishing sargability more flexibly, for example by allowing FTS indexes that don't cover all the requested fields.
Support for FTS indexes with multiple type mappings.
Extending N1QL query interface to create and edit FTS indexes.
Thanks for reading!