
Query Performance Monitoring Made Easy With Couchbase N1QL


If you're using Couchbase Server, here's how you can make sure your N1QL queries are up to snuff.


This article discusses the new monitoring capabilities of N1QL introduced in Couchbase Server 4.5.

One of the major breakthroughs of Couchbase Server 4.0 was the introduction of a SQL-based query language specifically created to handle JSON documents: N1QL.

Extolling the virtues of N1QL is not the aim of this article; its capabilities have been amply showcased elsewhere.

What we are going to concentrate on here is how to use N1QL monitoring capabilities introduced in version 4.5, specifically with a goal to discover and fix poorly performing requests.

Let's face it — even with all the good will in the world, there will be times when things don't work as expected, and when this happens you need tools to discover the culprit and investigate the erroneous behaviour.

More importantly, and this is almost invariably overlooked, monitoring is not only for rainy days, but for good weather too!

When the sun shines, you can use monitoring tools preventively to find the likely direction the next rain is going to come from.

But let's start from the beginning.

N1QL’s Monitoring Infrastructure

In Couchbase Server 4.5, N1QL exposes both the active requests executing in the query service and the prepared statements dataset, via N1QL itself as well as via dedicated administrative REST endpoints.

More interestingly, the query service can now track requests that have executed for longer than a predetermined amount of time in the completed requests log, again exposed via a REST endpoint and N1QL.

All three endpoints and keyspaces report a host of useful request-related data which, when aggregated, can give good insight into how the N1QL service resources are being used.

Lifecycle of a N1QL Request

Before we delve into the nitty-gritty of monitoring, it may be worth painting the relationship between the life of a N1QL request and the various monitoring datasets.

A picture is worth a thousand words, so rather than describing in detail how things change in time, I’ll let some fancy infographics do the talking:

[Infographics omitted: the lifecycle of a N1QL request across the monitoring datasets]

REST Endpoints

  • http://<N1QL node IP address>:8093/admin/vitals

  • http://<N1QL node IP address>:8093/admin/active_requests

  • http://<N1QL node IP address>:8093/admin/completed_requests

  • http://<N1QL node IP address>:8093/admin/prepareds

To be clear, each endpoint only returns information collected from the specific query node addressed in the URL, not from the entire Couchbase cluster: to collect information from all nodes via REST, you will have to loop through all query nodes yourself.
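Such a loop is easy to script. A minimal Python sketch, assuming you maintain the list of query node addresses yourself (the helper names here are mine, not part of any Couchbase SDK):

```python
import json
import urllib.request

def admin_url(node, endpoint):
    """Build the admin URL for a given query node and endpoint name."""
    return "http://%s:8093/admin/%s" % (node, endpoint)

def collect_vitals(nodes):
    """Fetch /admin/vitals from every query node; returns {node: vitals dict}."""
    results = {}
    for node in nodes:
        with urllib.request.urlopen(admin_url(node, "vitals")) as response:
            results[node] = json.load(response)
    return results
```

The same loop works for active_requests, completed_requests, and prepareds simply by changing the endpoint name.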

You will note that an endpoint we have not yet mentioned has sneakily made its way into the endpoint list above: /admin/vitals.

Vitals

Vitals, as the name implies, returns vital signs of the local query node.

Among the information provided is the node's uptime, local time, version, and more interesting stuff like memory and CPU usage, cores and allocated threads, garbage collector operation, request counts and request time percentiles.

curl -X GET http://localhost:8093/admin/vitals

Which produces documents of this sort:

{"uptime":"66h6m9.965439555s","local.time":"2016-11-04 11:55:31.340325632 +0000 GMT",
 "version":"1.5.0","total.threads":213,"cores":8,"gc.num":8379306,
 "gc.pause.time":"3.500846278s","gc.pause.percent":0,"memory.usage":6710208,
 "memory.total":36982771064,"memory.system":52574456,"cpu.user.percent":0,
 "cpu.sys.percent":0,"request.completed.count":61,"request.active.count":0,
 "request.per.sec.1min":0,"request.per.sec.5min":0,"request.per.sec.15min":0,
 "request_time.mean":"365.708343ms","request_time.median":"784.055µs",
 "request_time.80percentile":"417.369525ms",
 "request_time.95percentile":"2.347575326s",
 "request_time.99percentile":"5.669627409s","request.prepared.percent":0}

The only supported method is GET, and the output is not pretty-formatted. (By way of clarification: pretty is an attribute of the JSON documents returned by the N1QL service which determines whether the output is formatted. It can be set to true or false at the service or request level, but for admin endpoints it is always false.)
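Note that the timing fields in the output above are Go-style duration strings. If you collect vitals over REST for trending, you will want to convert them to numbers first; a small illustrative helper (the function name and unit table are mine, not part of any Couchbase API):

```python
import re

# Multipliers, in seconds, for the units used by Go duration strings.
_UNITS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}

def parse_go_duration(text):
    """Convert a Go duration string such as "66h6m9.965439555s" to seconds."""
    total = 0.0
    # Longer unit names come first in the alternation so "ms" is not read as "m".
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ns|µs|us|ms|s|m|h)", text):
        total += float(value) * _UNITS[unit]
    return total
```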

Active Requests

The active_requests endpoint provides information about all active requests...

  • GET http://<N1QL node IP address>:8093/admin/active_requests

...individual active requests...

  • GET http://<N1QL node IP address>:8093/admin/active_requests/<request id>

...and, more importantly, the ability to cancel requests for whatever reason.

  • DELETE http://<N1QL node IP address>:8093/admin/active_requests/<request id>

As an example:

curl -X GET http://localhost:8093/admin/active_requests

Again, the resulting document is not pretty-formatted.

[{"elapsedTime":"1.525668112s","executionTime":"1.525635694s",
  "phaseCounts":{"Fetch":11280,"PrimaryScan":11716},
  "phaseOperators":{"Fetch":1,"PrimaryScan":1},
  "requestId":"0b1d33a7-db3f-4a8e-86ff-63aee969fbbe",
  "requestTime":"2016-11-04T12:03:13.25494174Z",
  "scanConsistency":"unbounded","state":"running",
  "statement":"select *  from `travel-sample`;"}]


Completed Requests

This endpoint reports the same information as active_requests, but for the completed requests cache.

  • GET http://<N1QL node IP address>:8093/admin/completed_requests

  • GET http://<N1QL node IP address>:8093/admin/completed_requests/<request id>

  • DELETE http://<N1QL node IP address>:8093/admin/completed_requests/<request id>

Unlike active_requests, a DELETE on completed_requests only removes an entry from the cache.

The completed request cache by default logs the last 4000 requests lasting more than 1 second that have run on the local node.

The cache size and time threshold can be changed through the /admin/settings endpoint via the "completed-limit" and "completed-threshold" fields, for example:

curl http://localhost:8093/admin/settings -u Administrator:password -d '{ "completed-limit": 10000 }'
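The time threshold can be adjusted the same way; for example, to also capture requests taking longer than 100ms (the threshold value is expressed in milliseconds):

```shell
curl http://localhost:8093/admin/settings -u Administrator:password -d '{ "completed-threshold": 100 }'
```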

Prepared Statements

...and again for prepared statements:

  • GET http://<N1QL node IP address>:8093/admin/prepareds

  • GET http://<N1QL node IP address>:8093/admin/prepareds/<prepared statement name>

  • DELETE http://<N1QL node IP address>:8093/admin/prepareds/<prepared statement name>

Getting Monitoring Information via System Keyspaces

A more interesting way to access the N1QL caches is via N1QL itself.

For each of the admin REST endpoints (active_requests, completed_requests, prepareds) there is a corresponding keyspace in the system namespace.

So, for example, active requests can be directly accessed from the query service via a simple

select * from system:active_requests;

While the REST API can provide monitoring information at a relatively low cost, using system keyspaces offers the opportunity to unleash all of N1QL's query capabilities over the monitoring information.

We will explore a few more advanced examples to make the point.

The examples below are specific to Couchbase Server 4.5.

Preparation

We will use the beer-sample bucket in order to provide some load for the following examples.

Create a secondary index on `beer-sample`(city)

Start up a few client applications preparing

select * 
  from `beer-sample`
  use keys["21st_amendment_brewery_cafe-amendment_pale_ale"];

and

select * 
  from `beer-sample` 
  where city = "Lawton";

Then execute the statements above, say, 10,000 times per client.

Too Much Load on the Node

Among the useful information reported by the system:active_requests keyspace there are two pieces of timing information:

  • Elapsed time

  • Execution time

The first starts accruing as soon as the N1QL service becomes aware of the request; the second only once the request starts executing.

Under normal circumstances there should be very little difference between the two times, so the following query...

select avg(100*
           (str_to_duration(ElapsedTime) - str_to_duration(ExecutionTime))/
           str_to_duration(ElapsedTime)), count(*)
  from system:active_requests;

...should return only a few percentage points (say 1 to 3%):

{
    "requestID": "137b5777-d420-45e9-bbef-0b15197c9a93",
    "signature": {
        "$1": "number",
        "$2": "number"
    },
    "results": [
        {
            "$1": 3.210601953400783,
            "$2": 8
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "1.938543ms",
        "executionTime": "1.910804ms",
        "resultCount": 1,
        "resultSize": 68
    }
}

The query itself should execute quite quickly. When the server is under stress, requests will most probably be queued before being executed, and the ratio of elapsed to execution time will increase. To put it in perspective, a value of 50 means that the request is spending just as much time queued as being executed.

In general terms, once the node is under stress, the query above will plateau at a specific percentage point.

On my development machine this is about 25% (although higher spikes make an appearance from time to time). However, a telltale sign of the node being under stress is that the elapsed-to-execution ratio of the query itself is quite high:

{
    "requestID": "8cafcd14-fbdb-48bd-a35d-9f0fb03752ae",
    "signature": {
        "$1": "number",
        "$2": "number"
    },
    "results": [
        {
            "$1": 35.770020645707156,
            "$2": 94
        }
    ],
    "status": "success",
    "metrics": {
        "elapsedTime": "29.999017ms",
        "executionTime": "3.864563ms",
        "resultCount": 1,
        "resultSize": 48
    }
}

A similar query can be run against the system:completed_requests keyspace. Again, a high elapsed-to-execution ratio is an indication that the system is under stress: long-running queries on a node with spare capacity instead show a high execution time with a very low queueing ratio.
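For reference, the queueing percentage computed by the queries above boils down to simple arithmetic; a minimal sketch over times already converted to seconds (the function is illustrative, not part of any API):

```python
def queueing_percent(elapsed_s, execution_s):
    """Percentage of elapsed time spent queued rather than executing.

    50 means the request spent as long queued as it did executing;
    values near 0 indicate the node is keeping up with the load.
    """
    if elapsed_s <= 0:
        return 0.0
    return 100.0 * (elapsed_s - execution_s) / elapsed_s
```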

Prepared Statements With Highest Total Execution Time

The system:prepareds keyspace returns average execution time and number of executions for each prepared statement.

Since the same statement can be prepared anonymously by several clients under several names, summing average service times multiplied by uses and grouped by statement text gives a good idea of the total execution cost of each statement.

select sum(str_to_duration(avgServiceTime)*uses/1000000000) ttime, statement
  from system:prepareds
  where avgServiceTime is not missing
  group by statement
  order by ttime;

The query above returns the overall execution time in seconds for each statement:

[
  {
    "statement": "PREPARE SELECT * FROM `beer-sample` USE KEYS[\"21st_amendment_brewery_cafe-amendment_pale_ale\"];",
    "ttime": 515.6239400000001
  },
  {
    "statement": "PREPARE select * from `beer-sample` where city = \"Lawton\";",
    "ttime": 1506.407181327
  }
]


Prepared Statements With Highest Execution Count

Extending the concept further, it’s possible to determine the prepared statements that have been used the most.

For example...

select sum(uses) usecount, statement
  from system:prepareds 
  where uses is not missing 
  group by statement 
  order by usecount desc;

...returns:

[
  {
    "statement": "PREPARE SELECT * FROM `beer-sample` USE KEYS[\"21st_amendment_brewery_cafe-amendment_pale_ale\"];",
    "usecount": 200000
  },
  {
    "statement": "PREPARE select * from `beer-sample` where city = \"Lawton\";",
    "usecount": 5393
  }
]


Prepared Statements With Highest Prepare Count

...or the statements that have been prepared by the most clients:

select count(*) prepcount, statement 
  from system:prepareds 
  group by statement 
  order by prepcount desc;

which yields:

[
  {
    "prepcount": 20,
    "statement": "PREPARE SELECT * FROM `beer-sample` USE KEYS[\"21st_amendment_brewery_cafe-amendment_pale_ale\"];"
  },
  {
    "prepcount": 2,
    "statement": "PREPARE select * from `beer-sample` where city = \"Lawton\";"
  }
]


Requests Using Suboptimal Paths

Another important piece of information reported by both system:active_requests and system:completed_requests is the number of documents processed by each execution phase.

This information is interesting for at least two different reasons.

The first is that, even though the system keyspaces do not report a request plan (so individual query paths are not known), we can still infer a fair amount from the fact that specific phases are used.

For example, primary scans are the N1QL equivalent of an RDBMS's sequential scans: unless very small buckets are involved, where an ad hoc index adds no performance benefit, primary scans are best avoided.

Similarly, you may have already heard that covering indexes (indexes whose keys encompass all the expressions in the select list) are preferable to non-covering indexes. When the chosen index is not covering, request execution requires a Fetch phase after the index scan in order to access each document, which clearly demands further effort from the N1QL service.

The mere presence of a Fetch phase is a dead giveaway that covering indexes are not being used.

The second, more obvious reason is that document counts instantly give a sense of the cost of individual phases and, by extension, of individual requests.

We will now explore a few use cases exploiting phase counts.
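These checks can also be scripted client-side against the request documents themselves. A hypothetical Python helper (PhaseCounts, PrimaryScan, IndexScan, and Fetch are the field names from the system keyspace output; the function and flag names are mine):

```python
def analyze_phases(request_doc):
    """Classify a request document from system:completed_requests by its phase counts.

    Returns flags for primary scans and non-covering index scans, plus the
    total phase-count "cost" of the request (sum of all phase counts).
    """
    phases = request_doc.get("PhaseCounts", {})
    return {
        "uses_primary_scan": "PrimaryScan" in phases,
        "non_covering_scan": "IndexScan" in phases and "Fetch" in phases,
        "cost": sum(phases.values()),
    }
```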

Preparation

From cbq, execute the following a few times:

select * from `travel-sample`;


Requests Using Primary Scans

As stated earlier, all we are looking for is requests where the PhaseCounts subdocument has a PrimaryScan field:

select *
  from system:completed_requests
  where PhaseCounts.`PrimaryScan` is not missing;

Which in our case yields:

[
  {
    "completed_requests": {
      "ClientContextID": "02a9db12-1bb8-4669-b2c6-c6e1d50525d2",
      "ElapsedTime": "35.638999968s",
      "ErrorCount": 0,
      "PhaseCounts": {
        "Fetch": 31591,
        "PrimaryScan": 31591
      },
      "PhaseOperators": {
        "Fetch": 1,
        "PrimaryScan": 1
      },
      "RequestId": "ade384a9-367b-4241-8307-68cf923f9b8c",
      "ResultCount": 31591,
      "ResultSize": 107829252,
      "ServiceTime": "35.638946836s",
      "State": "completed",
      "Statement": "select * from `travel-sample`",
      "Time": "2016-11-10 13:26:53.80370308 -0800 PST"
    }
  },
  {
    "completed_requests": {
      "ClientContextID": "0afc6ce3-5c09-4fbd-9cec-6f8719a3dff1",
      "ElapsedTime": "42.179613979s",
      "ErrorCount": 0,
      "PhaseCounts": {
        "Fetch": 31591,
        "PrimaryScan": 31591
      },
      "PhaseOperators": {
        "Fetch": 1,
        "PrimaryScan": 1
      },
      "RequestId": "874d4493-af76-49e6-9b89-e9323c33915f",
      "ResultCount": 31591,
      "ResultSize": 107829252,
      "ServiceTime": "42.17955482s",
      "State": "completed",
      "Statement": "select * from `travel-sample`",
      "Time": "2016-11-10 13:27:45.641104619 -0800 PST"
    }
  },
  {
    "completed_requests": {
      "ClientContextID": "6850f34d-5d9c-49c1-91ac-88ce6265c5ac",
      "ElapsedTime": "36.769111048s",
      "ErrorCount": 0,
      "PhaseCounts": {
        "Fetch": 31591,
        "PrimaryScan": 31591
      },
      "PhaseOperators": {
        "Fetch": 1,
        "PrimaryScan": 1
      },
      "RequestId": "a6ebd56f-9562-4532-8d97-d26bacc2ce81",
      "ResultCount": 31591,
      "ResultSize": 107829252,
      "ServiceTime": "36.769063345s",
      "State": "completed",
      "Statement": "select * from `travel-sample`",
      "Time": "2016-11-10 13:26:05.510344804 -0800 PST"
    }
  }
]


Preparation

The example below requires a secondary index on `travel-sample`(type).

Execute the following query:

select id
  from `travel-sample`
  where type="route" and any s in schedule satisfies s.flight="AZ917" end;


Requests not Using Covering Indexes

In a similar way, we can find requests that use non-covering indexes by stipulating that the requests we are looking for have both index scans and fetches in the PhaseCounts subdocument:

select *
  from system:completed_requests 
  where PhaseCounts.`IndexScan` is not missing
    and PhaseCounts.`Fetch` is not missing;

Which, in our case, yields:

[
  {
    "completed_requests": {
      "ClientContextID": "e5f0e69a-ff3e-47e4-90c7-881772c2e1b7",
      "ElapsedTime": "2.877312961s",
      "ErrorCount": 0,
      "PhaseCounts": {
        "Fetch": 24024,
        "IndexScan": 24024
      },
      "PhaseOperators": {
        "Fetch": 1,
        "IndexScan": 1
      },
      "RequestId": "12402dd8-430a-4c32-84c1-a59b88695e66",
      "ResultCount": 10,
      "ResultSize": 350,
      "ServiceTime": "2.8772514s",
      "State": "completed",
      "Statement": "select id from `travel-sample` where type=\"route\" and any s in schedule satisfies s.flight=\"AZ917\" end;",
      "Time": "2016-11-10 13:10:00.04988112 -0800 PST"
    }
  }
]


Requests Using a Poorly Selective Index

Loosely speaking, selectivity is the property of a predicate that indicates how good the predicate is at isolating the documents relevant to the query result set.

In general terms, when using an index scan, we would like to have the scan return only the relevant document keys, and no more, so that later stages of the plan (eg, fetch or filter) only have to operate on relevant documents.

A simple measure of poor selectivity of a predicate is the total number of keys (or documents) considered by the predicate divided by the number of documents that satisfy it: the higher the number, the worse the selectivity. (In general terms, selectivity is defined as the number of qualifying items divided by the total number of items.)

Clearly, when an index path is in use, we would like the filters to be as selective as possible, and we can determine whether this is the case by simply measuring the number of documents fetched against the number of documents returned.

The query below finds requests that do use indexes and that have a number of fetches to results ratio greater than ten.

This means that the request fetches ten documents for each document that satisfies all the predicates in the where clause: your threshold may very well be different (and lower, I would suggest!).

select PhaseCounts.`Fetch` / ResultCount efficiency, *
  from system:completed_requests
  where PhaseCounts.`IndexScan` is not missing
   and PhaseCounts.`Fetch` / ResultCount > 10
  order by efficiency desc;

In our case, this yields:

[
  {
    "completed_requests": {
      "ClientContextID": "e5f0e69a-ff3e-47e4-90c7-881772c2e1b7",
      "ElapsedTime": "2.877312961s",
      "ErrorCount": 0,
      "PhaseCounts": {
        "Fetch": 24024,
        "IndexScan": 24024
      },
      "PhaseOperators": {
        "Fetch": 1,
        "IndexScan": 1
      },
      "RequestId": "12402dd8-430a-4c32-84c1-a59b88695e66",
      "ResultCount": 10,
      "ResultSize": 350,
      "ServiceTime": "2.8772514s",
      "State": "completed",
      "Statement": "select id from `travel-sample` where type=\"route\" and any s in schedule satisfies s.flight=\"AZ917\" end;",
      "Time": "2016-11-10 13:10:00.04988112 -0800 PST"
    },
    "efficiency": 2402.4
  }
]
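The same fetch-to-result ratio can be computed directly from a request document; a small sketch using the threshold of ten from above (the function names and the infinity convention for zero-result requests are mine):

```python
def fetch_efficiency(request_doc):
    """Documents fetched per document returned; higher means a less selective path."""
    fetches = request_doc.get("PhaseCounts", {}).get("Fetch", 0)
    results = request_doc.get("ResultCount", 0)
    # A request that fetched documents but returned none is maximally inefficient.
    return fetches / results if results else float("inf")

def poorly_selective(request_doc, threshold=10):
    """Flag requests fetching more than `threshold` documents per result."""
    return fetch_efficiency(request_doc) > threshold
```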


Determining Requests With the Highest Cost

As we were saying earlier, PhaseCounts give a ready sense of the cost of individual phases. Sum them up, and you get a measure of the cost of individual requests.

In the statement below we turn the PhaseCounts subdocument into an array in order to tot up the cost:

select Statement, array_sum(object_values(PhaseCounts)) cost
  from system:completed_requests
  order by cost desc;

This returns the statements we ran earlier in order of decreasing cost:

[
        {
            "Statement": "select * from `travel-sample`",
            "cost": 63182
        },
        {
            "Statement": "select * from `travel-sample`",
            "cost": 63182
        },
        {
            "Statement": "select * from `travel-sample`",
            "cost": 63182
        },
        {
            "Statement": "select id from `travel-sample` where type=\"route\" and any s in schedule satisfies s.flight=\"AZ917\" end",
            "cost": 48048
        }
]


Aggregating Execution Times for Completed Requests

Determining total execution times for individual statement texts is not the sole preserve of prepared statements.

A very similar technique can be used on system:completed_requests.

Further, this can be combined with filtering for suboptimal paths.

The following query looks for requests having used a primary scan, returning several aggregates for each individual statement:

select count(*) num,
       min(ServiceTime) minEx, max(ServiceTime) maxEx,
       avg(str_to_duration(ServiceTime)/1000000000) avgEx,
       Statement
  from system:completed_requests
  where PhaseCounts.PrimaryScan is not missing
  group by Statement;

Note that execution times are strings, and in principle they should be turned into durations in order to be aggregated properly; this is what we are doing for avg(), while min() and max() above still compare the raw strings:

[
  {
    "Statement": "select * from `travel-sample`",
    "avgEx": 38.19585500033333,
    "maxEx": "42.17955482s",
    "minEx": "35.638946836s",
    "num": 3
  }
]


Canceling Requests

We have seen before that a rogue request can be cancelled using the REST API.

This can be achieved through N1QL directly by simply deleting from system:active_requests:

delete from system:active_requests
  where RequestId="d3b607d4-1ff1-49c8-994e-8b7e6828f6a0";

(replace the request id in quotes with the RequestId of the request you wish to cancel).

But while REST can operate only on a request-by-request basis, N1QL allows for more creative operations. For example, we can stop all executing instances of a specific statement:

delete from system:active_requests
  where Statement = "select * from `travel-sample`";

Similarly, we can stop all instances of a specific prepared statement:

delete from system:active_requests
  where `Prepared.Name` = "s1";

Further, we can prevent all instances of the culprit prepared statement from executing again (at least, until it is prepared again!) by removing said instances from the prepared statement cache:

delete from system:prepareds
  where `statement` = "prepare select * from `travel-sample`";

In all of the examples above, please be mindful that we are using an equality predicate on the statement text: if you have several variations of the same statement (e.g. lower case vs. upper case, or different amounts of spacing, or…), you'll either have to match them individually or use a more comprehensive filter.

Summary

N1QL offers a wide array of powerful monitoring tools which can quickly identify poorly performing statements or statements that are likely to become a bottleneck.

Examples have been provided to detect several different types of requests which are interesting for resource planning purposes.

Author’s Note

The examples provided are specific to Couchbase Server 4.5. Future versions will introduce additional system keyspaces and contents, and some field names may be changed to a consistent format.
