More Than LIKE: Efficient JSON Searching With N1QL

See how the latest iteration of Couchbase allows you to split and search JSON data with tokens.

Keshav Murthy

CORE ·

Nov. 30, 16 · Tutorial

Likes (6)

Comment

Save

12.2K Views

Enterprise applications require both exact search (equality, range predicate) and pattern search (LIKE or text search). The B-Tree based indexes in databases help perform an exact search in milliseconds. Pattern search is a whole different problem. Predicates of patterns (name LIKE "%turin%") will scan the whole index — unsuitable for application use. Recognizing the need for speed, we've introduced a TOKENS() functions in Couchbase 4.6 which retuns an array of tokens from each field or document. We then index the array using the array indexing in Couchbase N1QL. This transforms the pattern search problem to a simple equality look up, improves the performance by orders of magnitude.

Introduction

Whether you’re booking a hotel in Turin or Timbuktu, Milan or Malibu, you need to search based on various criteria: location, dates, rooms, number of people, their age and more. The numerical requirements (number of rooms, number of people in the party), date ranges (check-in and checkout dates) are exact predicates. The target place to visit could be a search within a city name, region name, nearby places, or interesting landmarks.

Here is the problem.

How do you make this query run really, really quickly?
How do you get thousands of users searching and buying your product and service concurrently?

Search is a broad term. So, we divide searches into distinct types.

Exact search with one or more predicates.
Pattern search to find a word or a pattern within a larger string.
Fuzzy search — text search with a lot of text data.

Of course, all of these search operations have to come back with results in milliseconds and not seconds.

In Couchbase:

For exact searches, you can use B-tree- or skip-list-based global secondary indexes with a simple or composite index to achieve your goals.
For full-text search, which requires heavy text processing — stemming, scoring, and more — you can use Couchbase full-text search (FTS).

The hotel search problem doesn’t fall nicely into either category completely. Exact predicates require exact index; search predicates require FTS index. Using only index scans, you’ll get too much data, and you’d need filter further. On the other hand, if you search using full-text index, you’ll need to filter further for dates, rooms, etc. This second level of filtering will slow things down significantly.

Real world problems like you saw above are not always clean cut. You want the exact search for numerical and date range; you want the string search with smaller data like city, neighborhood, region.

Here is the problem. How do you make this query run really, really quickly? How do you get thousands of users searching and buying your product and service concurrently?

Let’s use sample Hotel documents in the Couchbase travel-sample data set to explore the use case.

select hotel, meta().id as dockey 
from `travel-sample` hotel 
where hotel.type = "hotel"
limit 1;

[
  {
     "dockey": "hotel_10025",
     "hotel": {
      "address": "Capstone Road, ME7 3JE",
      "alias": null,
      "checkin": null,
      "checkout": null,
      "city": "Medway",
      "country": "United Kingdom",
      "description": "40 bed summer hostel about 3 miles from Gillingham, housed in a distinctive converted Oast House in a semi-rural setting.",
      "directions": null,
      "email": null,
      "fax": null,
      "free_breakfast": true,
      "free_internet": false,
      "free_parking": true,
      "geo": {
        "accuracy": "RANGE_INTERPOLATED",
        "lat": 51.35785,
        "lon": 0.55818
      },
      "id": 10025,
      "name": "Medway Youth Hostel",
      "pets_ok": true,
      "phone": "+44 870 770 5964",
      "price": null,
      "public_likes": [
        "Julius Tromp I",
        "Corrine Hilll",
        "Jaeden McKenzie",
        "Vallie Ryan",
        "Brian Kilback",
        "Lilian McLaughlin",
        "Ms. Moses Feeney",
        "Elnora Trantow"
      ],
      "reviews": [
        {
          "author": "Ozella Sipes",
          "content": "This was our 2nd trip here and we enjoyed it as much or more than last year. Excellent location across from the French Market and just across the street from the streetcar stop. Very convenient to several small but good restaurants. Very clean and well maintained. Housekeeping and other staff are all friendly and helpful. We really enjoyed sitting on the 2nd floor terrace over the entrance and \"people-watching\" on Esplanade Ave., also talking with our fellow guests. Some furniture could use a little updating or replacement, but nothing major.",
          "date": "2013-06-22 18:33:50 +0300",
          "ratings": {
            "Cleanliness": 5,
            "Location": 4,
            "Overall": 4,
            "Rooms": 3,
            "Service": 5,
            "Value": 4
          }
        },
        {
          "author": "Barton Marks",
          "content": "We found the hotel de la Monnaie through Interval and we thought we'd give it a try while we attended a conference in New Orleans. This place was a perfect location and it definitely beat staying downtown at the Hilton with the rest of the attendees. We were right on the edge of the French Quarter within walking distance of the whole area. The location on Esplanade is more of a residential area so you are near the fun but far enough away to enjoy some quiet downtime. We loved the trolly car right across the street and we took that down to the conference center for the conference days we attended. We also took it up Canal Street and nearly delivered to the WWII museum. From there we were able to catch a ride to the Garden District - a must see if you love old architecture - beautiful old homes(mansions). We at lunch ate Joey K's there and it was excellent. We ate so many places in the French Quarter I can't remember all the names. My husband loved all the NOL foods - gumbo, jambalaya and more. I'm glad we found the Louisiana Pizza Kitchen right on the other side of the U.S. Mint (across the street from Monnaie). Small little spot but excellent pizza! The day we arrived was a huge jazz festival going on across the street. However, once in our rooms, you couldn't hear any outside noise. Just the train at night blowing it's whistle! We enjoyed being so close to the French Market and within walking distance of all the sites to see. And you can't pass up the Cafe du Monde down the street - a busy happening place with the best French doughnuts!!!Delicious! We will definitely come back and would stay here again. We were not hounded to purchase anything. My husband only received one phone call regarding timeshare and the woman was very pleasant. The staff was laid back and friendly. My only complaint was the very firm bed. Other than that, we really enjoyed our stay. Thanks Hotel de la Monnaie!",
          "date": "2015-03-02 19:56:13 +0300",
          "ratings": {
            "Business service (e.g., internet access)": 4,
            "Check in / front desk": 4,
            "Cleanliness": 4,
            "Location": 4,
            "Overall": 4,
            "Rooms": 3,
            "Service": 3,
            "Value": 5
          }
        }
      ],
      "state": null,
      "title": "Gillingham (Kent)",
      "tollfree": null,
      "type": "hotel",
      "url": "http://www.yha.org.uk",
      "vacancy": true
    }
  }
]

Now, let’s see some common questions from your application and how these questions can be translated to queries.

1. Find all the hotel information for "Medway Youth Hostel"

* This is an exact search of the predicate (hotel.name = "Medway Youth Hostel")

Query:

SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel"
AND hotel.name = "Medway Youth Hostel";

Index:

CREATE INDEX idx_hotel_typename 
ON `travel-sample`(name) WHERE type = "hotel";

2. Find all the hotels in "Medway", "United Kingdom"

* This is an exact search with the predicate: ( hotel.country = “United Kingdom”
AND hotel.city = “Medway” )

Query:

SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel" 
AND hotel.country = "United Kingdom"
AND hotel.city = "Medway"

Index:

CREATE INDEX idx_hotel_typecountryname 
ON `travel-sample`(country, name) WHERE type = "hotel"

3. Find all the hotels liked (in the field public_field) by the a person with a name (first or last) Vallie

* public_likes is an array of field. It has names of the peopel who liked the hotel.

* This is a pattern search. We're looking for a first name or last name Vallie.

      "public_likes": [
        "Julius Tromp I",
        "Corrine Hilll",
        "Jaeden McKenzie",
        "Vallie Ryan",
        "Brian Kilback",
        "Lilian McLaughlin",
        "Ms. Moses Feeney",
        "Elnora Trantow"
      ]

So we use the array predicate to search through array elements. To make this search faster, we create an array index on public_likes. This index scan will still fetch all the elements from the index and apply the LIKE "%vallie%" on top of it.

Query:

SELECT hotel 
FROM `travel-sample` hotel
WHERE hotel.type = "hotel"
AND ANY p IN hotel.public_likes 
        SATISFIES LOWER(p) LIKE "%vallie%"
    END

Index:

CREATE INDEX idx_publiclikes ON `travel-sample` 
(DISTINCT ARRAY LOWER(p) FOR p IN public_likes END) 
WHERE type = "hotel";

Explain:

        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "idx_publiclikes",
            "index_id": "9a10e62d4ad4387a",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "[]"
                  ],
                  "Inclusion": 1,
                  "Low": [
                    "\"\""
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        }

4. Find all the hotels that mention architecture anywhere in the document. This anywhere -- could be in keyname (attribute name), any value — in name, description, reviews, etc.

This has a pattern search. Search for architecture anywhere in the document is a pattern search.Index: This query can use a primary index.

SELECT COUNT(1) totalcount FROM `travel-sample` hotel
WHERE ANY v WITHIN OBJECT_PAIRS(hotel) 
          SATISFIES LOWER(v)  LIKE "%architecture%"
      END
{
    "requestID": "426b993e-a214-4e5f-824e-aef2ebfd3262",
    "signature": {
        "$1": "number"
    },
    "results": [
        {
            "$1": 94
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "12.130759027s",
        "executionTime": "12.130687174s",
        "resultCount": 1,
        "resultSize": 32
    }
}

While this query works, it takes 12.13 seconds to find the 41 documents which have the word architecture somewhere in the document. Too slow and too expensive.

Here is where the TOKENS come into the picture. TOKENS() function takes a JSON document or an expression, analyzes all of name value pairs, arrays, objects and returns an array of values of basic types: numeric, strings, boolean, null.

SELECT TOKENS(hotel)
FROM `travel-sample` hotel
WHERE hotel.type = 'hotel' 
LIMIT 1;

[
  {
    "$1": [
      null,
      false,
      true,
      51.35785,
      0.55818,
      10025,
      4,
      3,
      5,
      "friendly",
      "distance",
      "once",
      "rooms",
      "close",
      "Hilll",
      "all",
      "Monnaie",
      "homes",
      "huge",
      "Kilback",
      "vacancy",
      "Value",
      "Sipes",
      "call",
      "converted",
      "770",
…
      "Marks",
      "Youth",
      "watching",
      "Hilton",
      "noise",
      "received"
    ]
  }
]

Once we get this array, we can use the array indexing, to index the result of TOKENS and query them:

CREATE INDEX idx_travel_tokens ON `travel-sample`
(DISTINCT ARRAY s for s IN TOKENS(`travel-sample`, {"case":"lower", "names": true}) END) 

select count(1) FROM `travel-sample` h 
WHERE ANY s IN tokens(h, {"case":"lower", "names": true}) SATISFIES s = 'architecture' END;
{
    "requestID": "dd6cfb24-ae5f-4667-a332-fc21ec48f808",
    "signature": {
        "$1": "number"
    },
    "results": [
        {
            "$1": 94
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "33.398256ms",
        "executionTime": "33.359819ms",
        "resultCount": 1,
        "resultSize": 32
    }
}

Let’s look at the query plan. It chose the tokens right index and more importantly, the predicates were pushed correctly into index scans (reported here as spans). The appropriate index is selected and the predicate looking for architecture is pushed down to index scan.

      "~children": [
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "idx_travel_tokens",
            "index_id": "9f370cd078ebd2eb",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"architecture\""
                  ],
                  "Inclusion": 3,
                  "Low": [
                    "\"architecture\""
                  ]
                }
              }
            ],
            "using": "gsi"
          }
        }

Finally, let’s put all of this together to answer this question: Find all the hotels in United Kingdom which have vacancies and with a website.
- We can check for (country = ‘United Kingdom’ and vacancy = true) easily via composite index.
- Having a “website” needs to be defined well.
- When you get information from hotels, unless you go through a ETL process to save the information in an exact field and format, you’d have this information anywhere in the hotel document.

Example:
"url": "http://www.yha.org.uk",
"Website": "http://www.yha.org.uk”
“Http”: "www.yha.org.uk"
“Http address”: "www.yha.org.uk"
“Web” : "Our website: http://www.yha.org.uk "
“Contact” : 
     “Phone” : "+44 482 492 292",
     “Web” : "http://www.yha.org.uk",
     “Mail” : "help@yha.org.uk"
   }
“Info”: " +44 482 492 292 http://www.yha.org.uk"
“Info”: ["+44 482 492 292”,  “http://www.yha.org.uk"

Problem: You’d need to search in every field name, every object, every array.

Query:
EXPLAIN 
SELECT * 
FROM `travel-sample` hotel 
WHERE country = "United Kingdom"
AND vacancy = true
AND ANY v IN TOKENS(hotel, {"case":"lower", "name":true})
        SATISFIES v = "http"
    END
AND type = "hotel"


Index
CREATE INDEX idx_countryvacancytoken ON `travel-sample`    
  (country, 
   vacancy,
   DISTINCT 
     ARRAY v for v IN 
     tokens(`travel-sample`, {"case":"lower", "name":true})      
     END
  ) WHERE type = "hotel";

Plan
====
        {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "idx_countryvacancytoken",
            "index_id": "c44e9b1d7102a736",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"United Kingdom\"",
                    "true",
                    "\"http\""
                  ],
                  "Inclusion": 3,
                  "Low": [
                    "\"United Kingdom\"",
                    "true",
                    "\"http\""
                  ]
                }
              }
            ],

In this case, we push down all of the predicates to the index scan.

The query runs in about 233 milliseconds!

SELECT * 
FROM `travel-sample` hotel 
WHERE country = "United Kingdom" 
AND vacancy = true 
AND ANY v IN TOKENS(hotel, {"case":"lower", "name":true}) SATISFIES v = "http" 
    END 
AND type = "hotel";

Results (partially shows for brevity):

{
    "requestID": "c4a30037-d808-419c-a691-ab783f46866c",
    "signature": {
        "*": "*"
    },
    "results": [
        {
            "hotel": {
                "address": "Capstone Road, ME7 3JE",
                "alias": null,
                "checkin": null,
                "checkout": null,
                "city": "Medway",
                "country": "United Kingdom",
                "description": "40 bed summer hostel about 3 miles from Gillingham, housed in a districtive converted Oast House in a semi-rural setting.",
                "directions": null,
                "email": null,
                "fax": null,
                "free_breakfast": true,
                "free_internet": false,
                "free_parking": true,
                "geo": {
                    "accuracy": "RANGE_INTERPOLATED",
                    "lat": 51.35785,
                    "lon": 0.55818
                },
                "id": 10025,
                "name": "Medway Youth Hostel",
                "pets_ok": true,
                "phone": "+44 870 770 5964",
                "price": null,
                "public_likes": [
                    "Julius Tromp I",
                    "Corrine Hilll",
                    "Jaeden McKenzie",
                    "Vallie Ryan",
                    "Brian Kilback",
                    "Lilian McLaughlin",
                    "Ms. Moses Feeney",
                    "Elnora Trantow"
                ],
                "reviews": [
                    {
                        "author": "Ozella Sipes",
                        "content": "This was our 2nd trip here and we enjoyed it as much or more than last year. Excellent location across from the French Market and just across the street from the streetcar stop. Very convenient to several small but good restaurants. Very clean and well maintained. Housekeeping and other staff are all friendly and helpful. We really enjoyed sitting on the 2nd floor terrace over the entrance and \"people-watching\" on Esplanade Ave., also talking with our fellow guests. Some furniture could use a little updating or replacement, but nothing major.",
                        "date": "2013-06-22 18:33:50 +0300",
                        "ratings": {
                            "Cleanliness": 5,
                            "Location": 4,
                            "Overall": 4,
                            "Rooms": 3,
                            "Service": 5,
                            "Value": 4
                        }
                    },
                    {
                        "author": "Barton Marks",
                        "content": "We found the hotel de la Monnaie through Interval and we thought we'd give it a try while we attended a conference in New Orleans. This place was a perfect location and it definitely beat staying downtown at the Hilton with the rest of the attendees. We were right on the edge of the French Quarter withing walking distance of the whole area. The location on Esplanade is more of a residential area so you are near the fun but far enough away to enjoy some quiet downtime. We loved the trolly car right across the street and we took that down to the conference center for the conference days we attended. We also took it up Canal Street and nearly delivered to the WWII museum. From there we were able to catch a ride to the Garden District - a must see if you love old architecture - beautiful old homes(mansions). We at lunch ate Joey K's there and it was excellent. We ate so many places in the French Quarter I can't remember all the names. My husband loved all the NOL foods - gumbo, jambalya and more. I'm glad we found the Louisiana Pizza Kitchen right on the other side of the U.S. Mint (across the street from Monnaie). Small little spot but excellent pizza! The day we arrived was a huge jazz festival going on across the street. However, once in our rooms, you couldn't hear any outside noise. Just the train at night blowin it's whistle! We enjoyed being so close to the French Market and within walking distance of all the sites to see. And you can't pass up the Cafe du Monde down the street - a busy happenning place with the best French dougnuts!!!Delicious! We will defintely come back and would stay here again. We were not hounded to purchase anything. My husband only received one phone call regarding timeshare and the woman was very pleasant. The staff was laid back and friendly. My only complaint was the very firm bed. Other than that, we really enjoyed our stay. Thanks Hotel de la Monnaie!",
                        "date": "2015-03-02 19:56:13 +0300",
                        "ratings": {
                            "Business service (e.g., internet access)": 4,
                            "Check in / front desk": 4,
                            "Cleanliness": 4,
                            "Location": 4,
                            "Overall": 4,
                            "Rooms": 3,
                            "Service": 3,
                            "Value": 5
                        }
                    }
                ],
                "state": null,
                "title": "Gillingham (Kent)",
                "tollfree": null,
                "type": "hotel",
                "url": "http://www.yha.org.uk",
                "vacancy": true
            }

.....

    "status": "success",
    "metrics": {
        "elapsedTime": "233.475262ms",
        "executionTime": "233.397438ms",
        "resultCount": 185,
        "resultSize": 1512687
    }
}

TOKENS() Function

This function is explained in detail in Couchbase documentation. The important thing to note is that first expression passed into token can be anything: constant literal, simple JSON value, JSON key name or the whole document itself.

TOKENS (Expression, Options)

You can invoke TOKENS() simply with the expressions or with the option. The function always returns an array of tokens it extracted for the given (implied) options.

TOKENS(address)

This returns an array of values including the attribute (key) names. It retrieves the data in their basic types and returns

Options can take the following options. Each invocation of TOKENS() can choose one or more of the options, passed in as JSON.

{"name": true}: Valid values are true or false. Default is true.

{"case":"lower"}: Valid values are "upper" and "lower". Default is neither — return the case in the data.

{"specials": true}: Preserves strings with special characters such as email addresses, URLs, and hyphenated phone numbers. The default is false.

Let's use all of the options on TOKENS() to see what it produces.

Query:

SELECT 
TOKENS ({"info":"Name is keshav email is keshav@couchbase.com Blog is http://planetnosql.com twitter handle is @rkeshavmurthy"}, 
{"specials":true, "names":true, "case":"lower"}) AS words

Result:

[
  {
    "words": [
      "info",
      "planetnosql",
      "http://planetnosql.com",
      "rkeshavmurthy",
      "name",
      "@rkeshavmurthy",
      "com",
      "http",
      "couchbase",
      "blog",
      "twitter",
      "is",
      "keshav",
      "handle",
      "email",
      "keshav@couchbase.com"
    ]
  }
]

With names set to true, you'll get the field name, in this case info.

With specials set to true, you'll get delimited words as well as special works like keshav@couchbase.com or @rkeshavmurthy as a single word to index.

With case set to lower, all of the results will be in the lower case

Let's look at an example usage on the travel-sample data.

SELECT TOKENS(url) AS defaulttoken, 
       TOKENS(url, {"specials":true, "case":"UPPER"}) AS specialtoken
FROM `travel-sample` h WHERE h.type = 'hotel' 
LIMIT 1;

[
  {
    "defaulttoken": [
      "uk",
      "www",
      "http",
      "yha",
      "org"
    ],
    "specialtoken": [
      "ORG",
      "WWW",
      "HTTP://WWW.YHA.ORG.UK",
      "YHA",
      "UK",
      "HTTP"
    ]
  }
]

In this case, TOKENS() converted all of the URL data into UPPER case and added the full URL in addition to delimited words.

Use {"case":"lower"} or {"case":"upper"} to have case insensitive search. Index creation and querying can use this and other parameters in combination. As you saw in the examples, these parameters should be passed within the query predicates as well. The parameters and values have to match exactly for N1QL to pick up and use the index correctly.

Here is an example of how you can create the index and use it your applicaiton.

Index definition:

CREATE INDEX idx_url_upper_special on `travel-sample`(
  DISTINCT ARRAY v for v in 
       tokens(url, {"specials":true, "case":"UPPER"}) 
  END ) where type = 'hotel' ;

Query:

SELECT name, address, url
FROM `travel-sample` h
WHERE ANY  v in tokens(url, {"specials":true, "case":"UPPER"})
     SATISFIES v = "HTTP://WWW.YHA.ORG.UK" 
    END  
AND h.type = 'hotel' ;

Results:

[
  {
    "address": "Capstone Road, ME7 3JE",
    "name": "Medway Youth Hostel",
    "url": "http://www.yha.org.uk"
  }
]

Query Plan:

       {
          "#operator": "DistinctScan",
          "scan": {
            "#operator": "IndexScan",
            "index": "idx_url_upper_special",
            "index_id": "ec73b96583ac2edf",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "Range": {
                  "High": [
                    "\"HTTP://WWW.YHA.ORG.UK\""
                  ],
                  "Inclusion": 3,
                  "Low": [
                    "\"HTTP://WWW.YHA.ORG.UK\""
                  ]
                }
              }
            ],

Summary

Couchbase 4.6 has introduced a new way to split the JSON document into simpler values using TOKENS(). Using the flexible array indexing in Couchbase, you can now create secondary indexes for scalar and array values. This is a high-performance alternative to the LIKE predicate. Its performance is so much more likable.

References

Couchbase : http://www.couchbase.com
N1QL: http://query.couchbase.com
Couchbase Full Text Search: http://developer.couchbase.com/documentation/server/current/fts/full-text-intro.html
SPLIT and CONQUER: https://dzone.com/articles/split-and-conquer-efficient-string-search-with-n1q
A Couchbase Index Technique for LIKE Predicates With Wildcard: https://dzone.com/articles/a-couchbase-index-technique-for-like-predicates-wi

JSON Database Data structure Document Data (computing)

Opinions expressed by DZone contributors are their own.

Related

Trending