COUNT and GROUP Faster With N1QL

Couchbase 4.5 has put work into optimizing queries. See how you can use COUNT, MIN, and other parameters to more effectively search your data.

Keshav Murthy

CORE ·

Oct. 11, 16 · Tutorial

Likes (8)

Comment

Save

12.0K Views

Count Group

Humans have counted things for a very long time. In database applications, COUNT() is frequently used in various contexts. The COUNT performance affects both application performance and user experience. Keeping this mind, Couchbase supported generalized COUNT() and has improved its performance in Couchbase 4.5 release.

Two Couchbase 4.5 Features

There are two features in Couchbase 4.5 helping the COUNT performance.

When the query is interested only in the COUNT of a range of data that’s indexed, the indexer does the counting itself. In other words, the query pushes the counting to the index. This reduces the amount of data exchanged between indexer and query, improving the query speed.

CREATE INDEX idxname ON `travel-sample` (name);
SELECT COUNT(name) 
FROM `travel-sample` 
WHERE name = 'Air Alaska';

First, when all of the predicates can be pushed down to index scan and we know index scan won’t result in any false positives, the index can simply do the COUNT() of qualifying values without returning the qualifying data to the query engine. This improves the COUNT performance. When the COUNT() is pushed to the index scan, the optimizer will denote this by choosing the IndexCountScan method in the query plan. The qualifying queries benefit directly, without any effort from the application developer.

The predicate (name = 'Air Alaska') is pushed down to the index scan in addition to the COUNT() aggregate as well.

{
    "#operator": "IndexCountScan",
    "covers": [
        "cover ((`travel-sample`.`name`))",
        "cover ((meta(`travel-sample`).`id`))"
    ],
    "index": "ixname",
    "index_id": "ea81dd71a0a98351",
    "keyspace": "travel-sample",
    "namespace": "default",
    "spans": [
        {
            "Range": {
                "High": [
                    "\"Air Alaska\""
                ],
                "Inclusion": 3,
                "Low": [
                    "\"Air Alaska\""
                ]
            }

Second, the optimal execution of the MIN() aggregate when the query is covered and can be pushed down to index.

SELECT MIN(name) 
FROM `travel-sample 
WHERE prodname > 'Air Alaska';

For this query, the first name that’s greater than “Air Alaska” is determined by the index scan. Because we’re interested in the just one minimum value and therefore, in the query plan, we push the {limit:1} parameter to the index scan. This works very quickly compared to getting all the qualifying values from the index scan to determine the minimum value.

{
    "#operator": "IndexScan",
    "covers": [
        "cover ((`travel-sample`.`name`))",
        "cover ((meta(`travel-sample`).`id`))"
    ],
    "index": "ixname",
    "index_id": "ea81dd71a0a98351",
    "keyspace": "travel-sample",
    "limit": "1",
    "namespace": "default",
    "spans": [
    {
        "Range": {
        "Inclusion": 0,
        "Low": [
            "\"Air Alaska\""
        ]
    }

Use Cases

Let’s walk through common use cases and how these features work together to speed up many queries. The examples use travel-sample data set shipped along with Couchbase server.

There are some interesting uses with a little bit of help from the application developer. Since multiple customers are using this approach, it’s worth checking out if this is going to help you as well.

Let’s look at common questions and how to improve the performance with Couchbase 4.5.

Use Case 1

What’s the total number of documents in the bucket?

cbq> select count(*) from `travel-sample`;
{
    "requestID": "15549343-54ec-44d8-b62a-a4d98445b484",
    "signature": {
        "$1": "number"
    },
    "results": [
        {
            "$1": 31591
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "7.724136ms",
        "executionTime": "7.693342ms",
        "resultCount": 1,
        "resultSize": 35
    }
}

Counting of the whole bucket runs in a few milliseconds because we can directly get the total number of documents directly from the bucket metadata.

{
    "#operator": "CountScan",
    "keyspace": "travel-sample",
    "namespace": "default"
}

Use Case 2

Let’s see how many document types we have in the travel-sample bucket:

> select distinct type from `travel-sample`;
{
    "requestID": "584a7785-95fe-4d4a-b72b-7e161858bf2a",
    "signature": {
        "type": "json"
    },
    "results": [
        {
            "type": "airline"
        },
        {
            "type": "route"
        },
        {
            "type": "hotel"
        },
        {
            "type": "landmark"
        },
        {
            "type": "airport"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "1.031288771s",
        "executionTime": "1.031253954s",
        "resultCount": 5,
        "resultSize": 202
    }
}

One full second for this seems quite expensive. Let’s see the query plan.

{
    "#operator": "PrimaryScan",
    "index": "def_primary",
    "keyspace": "travel-sample",
    "namespace": "default",
    "using": "gsi"
}

To determine the number of distinct items, this plan will scan the whole bucket, which determines the distinct types. Let’s see how we can improve this.

The MIN() optimization in Couchbase 4.5 can be exploited to optimize this. We can evaluate if the field is the leading key of any index, as the index has pre-sorted data. Using this, we can write a small script to calculate the distinct values much faster.

Step 1

Get the first entry in the index.

      SELECT MIN(type) 
      FROM `travel-sample` 
      WHERE type IS NOT MISSING;

{
"results": [
        {
            "$1": "airline"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "15.050713ms",
        "executionTime": "15.023588ms",
        "resultCount": 1,
        "resultSize": 39
    }
}

The Query Plan shows that this is done by index scan only. The scan parameter "limit": "1" shows why we get the results so fast. Because the index stores data in a sorted order, the first item will be the lowest value.

        {   
            "#operator": "IndexScan",
            "covers": [
                "cover ((`travel-sample`.`type`))",
                "cover ((meta(`travel-sample`).`id`))"
            ],  
            "index": "def_type",
            "index_id": "7cea57503ecfe0d3",
            "keyspace": "travel-sample",
            "limit": "1",
            "namespace": "default",
            "spans": [
                {   
                    "Range": {
                        "Inclusion": 0,
                        "Low": [
                            "null"
                        ]   
                    }   
                }   
            ],  
            "using": "gsi"
        }

Step 2

Now we use the index to find the next value for the type.

Use the predicate (type > "airline"). Now you get airport:

SELECT MIN(type)
FROM `travel-sample` 
WHERE type > "airline";
{
"results": [
        {
            "$1": "airport"
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "10.370383ms",
        "executionTime": "10.346036ms",
        "resultCount": 1,
        "resultSize": 39
    }
}

Repeat this step 2 using the previous value in the predicate until you get NULL as the result.

A simple Python script will get the job done.

import requests
import json

url="http://localhost:8093/query"
s = requests.Session()
s.keep_alive = True
s.auth = ('Administrator','password')

query = {'statement':'SELECT MIN(type) minval FROM `travel-sample` WHERE type IS NOT MISSING ;'}
r = s.post(url, data=query, stream=False, headers={'Connection':'close'})
result = r.json()['results'][0]
lastval = result['minval']

while lastval != None:
    print lastval
    stmt = 'SELECT MIN(type) minval FROM `travel-sample` WHERE type > "' + lastval + '";'; 
    query = {'statement':stmt}
    r = s.post(url, data=query, stream=False, headers={'Connection':'close'})
    result = r.json()['results'][0]
    lastval = result['minval']

Pro Tip

Use a prepared statement for the SELECT statement in the while loop to improve the performance even further.

Use Case 3

Grouping works closely with counting. It’s common to group the data by type, name, and date and count them. If you have a dataset with millions of documents, the index scan itself may take minutes to scan and retrieve all the data.

Here is a simple example on travel-sample.

SELECT type, count(type) 
FROM `travel-sample`
GROUP BY type;

    "metrics": {
        "elapsedTime": "904.607551ms",
        "executionTime": "904.585043ms",
        "resultCount": 5,
        "resultSize": 321
    }

This scans the entire bucket to calculate the groups and counts. You can improve the performance by using the index scan.

SELECT type, count(type) 
FROM `travel-sample`
WHERE type IS NOT MISSING
GROUP BY type;

The query plan uses  the index now.
        {
          "#operator": "IndexScan",
          "covers": [
            "cover ((`travel-sample`.`type`))",
            "cover ((meta(`travel-sample`).`id`))"
          ],
          "index": "def_type",
          "index_id": "a1ae13a30ad4408",
          "keyspace": "travel-sample",
          "namespace": "default",
 ...

This query runs faster as well.

    "metrics": {
        "elapsedTime": "212.792255ms",
        "executionTime": "212.771933ms",
        "resultCount": 5,
        "resultSize": 321
    }

Let’s see if we can make this faster.

Using the combination of MIN() and COUNT() optimization features in 4.5, we can improve this further by writing a small script.

Here is the outline.

Step 1: Get the first entry in the index for the type.

Step 2: Then, COUNT() from the data set where type = first-value.

Step 3: Now we use the index to find the next value for type.

Step 4: Repeat step 2 and 3 for all the values of type.

Here’s the actual Python script:

import requests
import json

url="http://localhost:8093/query"
s = requests.Session()
s.keep_alive = True
s.auth = ('Administrator','password')

query = {'statement':'SELECT MIN(type) minval FROM `travel-sample` WHERE type IS NOT MISSING ;'}
r = s.post(url, data=query, stream=False, headers={'Connection':'close'})
result = r.json()['results'][0]
lastval = result['minval']

while lastval != None:
    stmt = 'SELECT COUNT(type) tcount FROM `travel-sample` WHERE type = "' + lastval + '";';
    query = {'statement':stmt}
    r = s.post(url, data=query, stream=False, headers={'Connection':'close'})
    result = r.json()['results'][0]
    tcount = result['tcount']
    print lastval, tcount

    stmt = 'SELECT MIN(type) minval FROM `travel-sample` WHERE type > "' + lastval + '";'; 
    query = {'statement':stmt}
    r = s.post(url, data=query, stream=False, headers={'Connection':'close'})
    result = r.json()['results'][0]
    lastval = result['minval']

Let's run this!

$ time python ./group.py 
airline 187
airport 1968
hotel 917
landmark 4495
route 24024

real0m0.372s
user0m0.079s
sys0m0.036s
$

This is run on travel-sample with about 31,000 documents. The Python script itself runs about 10 SELECT statements to retrieve this result. As the number of documents in the dataset increases, the script approach will run much faster to a single statement. Further, using the prepared statement, you can save more CPU cycles.

Conclusion

The MIN() and COUNT() optimizations in 4.5 improves query performance. Using additional logic, you can use these optimizations to improve the performance of a variety of queries.

Database

Opinions expressed by DZone contributors are their own.

Related

Trending