Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Create the Right Index, Get the Right Performance, Part 2

DZone's Guide to

Create the Right Index, Get the Right Performance, Part 2

We take a look at some of the rules you can follow to create indexes with Couchbase to optimize its performance for your development team.

· Performance Zone ·
Free Resource

Sensu is an open source monitoring event pipeline. Try it today.

Rules for Creating the Indexes.

So far, we looked at the types of indexes. Let's now look at how we go about designing the indexes for your workload.

Rule #1: Use Keys

In Couchbase, each document in a bucket has a user generated unique key. The documents are distributed among different nodes by hashing this key (we use consistent hashing ). When you have the document key, you can fetch the documents directly from the Applications (via SDKs). Even when you have the document keys, you may want to do fetch and do some post-processing via N1QL. That's when you use the USE KEYS method.

Example:


SELECT name, address
FROM `travel-sample` h USE KEYS [
  "hotel_10025",
  "hotel_10026",
  "hotel_10063",
  "hotel_10064",
  "hotel_10138",
  "hotel_10142",
  "hotel_10158",
  "hotel_10159",
  "hotel_10160",
  "hotel_10161",
  "hotel_10180",
  "hotel_10289",
  "hotel_10290",
  "hotel_10291",
  "hotel_1072",
  "hotel_10848",
  "hotel_10849",
  "hotel_10850",
  "hotel_10851",
  "hotel_10904"
]
WHERE  h.country = "United Kingdom"
AND ARRAY_LENGTH(public_likes) > 3;

The USE KEYS access method can be used even when you do joins. Here's an example

SELECT * FROM ORDERS o USE KEYS ["ord::382"]  INNER JOIN CUSTOMER c ON KEYS o.id;

In Couchbsae 5.0, indexes are used only to process the first keyspace (bucket) of each FROM clause. Subsequent keyspaces are processed via direct fetch of the document.

SELECT * FROM ORDERS o INNER JOIN CUSTOMER c ON KEYS o.id WHERE o.state = "CA";

In this statement, we process the ORDERS keyspace via an index on (state) if it's available. Otherwise, we use the primary index to scan ORDERS. We then fetch the CUSTOMER documents matching the id in the ORDERS document.

Rule #2: Use Covering Index

We discussed types of index earlier in the article. The right index serves two purposes:

  1. Reduce the working set for the query to speed up query performance.
  2. Store and provide additional data even.

When a query can be answered completely by the data stored in the index, the query is said to be the covering index. You should try to have most, if not all, of your queries covered. This will reduce the processing burden on the query service and reduce additional fetch from the data service.

The index selection is still done based on the predicates in the query. Once the index selection is done, the optimizer will evaluate to see if the index contains all the required attributes for the query and creates a covered index path access.

Examples:

CREATE INDEX idx_cx3 ON CUSTOMER(state, city, name.lastname) WHERE status = 'premium';

/* The query below won’t be covered since you said: SELECT * */
SELECT * FROM CUSTOMER
WHERE state = 'CA’ AND status = 'premium';

/* The index has all three fields required by the query. */
/* Query will be covered, as shown in the explain plan. */

SELECT status, state, city FROM CUSTOMER
WHERE state = 'CA' AND status = 'premium';


        "#operator": "IndexScan2",
        "covers": [
          "cover ((`CUSTOMER`.`state`))",
          "cover ((`CUSTOMER`.`city`))",
          "cover (((`CUSTOMER`.`name`).`lastname`))",
          "cover ((meta(`CUSTOMER`).`id`))"
        ],
        "filter_covers": {
          "cover ((`CUSTOMER`.`status`))": "premium"
        },
        "index": "idx_cx3",
        "index_id": "18f8209144215971",
        "index_projection": {
          "entry_keys": [
        0,
        1
          ]
        }

Note that the status field in the WHERE clause of the index (status = 'premium') is also covered. We know every document in the index has a field called status with a value 'premium.' We can simply project this value. The "Filter_covers" field in the explanation shows this information.

As long as the index has the field, a query can do additional filtering, joins, aggregation, and pagination after fetching the data from the indexer without fetching the full document.

Rule #3: Use the Index Replication

In a Couchbase cluster, you have multiple index services. Prior to Couchbase 5.0, you could manually create replica (equivalent) indexes to improve throughput, load balancing, and high availability.

Prior to 5.0:

CREATE INDEX idx1 ON CUSTOMER(state, city, name.lastname) WHERE status = 'premium';
CREATE INDEX idx2 ON CUSTOMER(state, city, name.lastname) WHERE status = 'premium';
CREATE INDEX idx3 ON CUSTOMER(state, city, name.lastname) WHERE status = 'premium';

We recognize the equivalency of these three indexes because the key expressions and the WHERE clause are exactly same.

During the query optimization phase, N1QL engine picks up one of the three indexes for index scan (assuming other requirements are met) to create the query plan. During query execution, query prepares the scan package and sends an index scan-request. During this process, based on the load statistics, we send the request to one of them. The idea is, over time, each of them will have a similar load.

This process of creating replica indexes (equivalent indexes) is made easier with a simple parameter.

CREATE INDEX idx1 ON CUSTOMER(state, city, name.lastname) WHERE status = 'premium' WITH {"num_replica":2};

This is the same as creating three distinct, but equivalent indexes.

Rule #4: Index by Workload, Not by Bucket/Keyspace

Consider the whole application workload and service level agreements (SLAs) for each of the queries. The queries that have millisecond latency requirements with high throughput will require customized and replica indexes, whereas others could share indexes.

There may be keyspaces on which you simply do set and get operations or can issue queries with USE KEYS. These keyspaces won't need any indexes.

Analyze the queries to find the common predicates, projections from a keyspace. You can optimize the number of indexes based on common predicates. If one of your queries does not have a predicate on the leading key or keys, see if adding (field IS NOT MISSING) makes sense to that the index can be shared.

It's fine to have a primary index while developing your application or queries. But, before you test, create the right indexes and drop the primary index from your system, unless your application has use cases described in the "Primary Index" section. If you do have a primary index in production and queries end up doing a full primary scan with a full span on the index, you're asking for trouble. In Couchbase, the primary index indexes all the documents in the bucket.

Every secondary index in Couchbase should have a WHERE clause, with at least a condition on the document type. This isn't enforced by the system, but it's a good design.

CREATE INDEX def_route_src_dst ON `travel-sample`  (`sourceairport`, `destinationairport`)   WHERE (`type` = "route");

Creating the right indexes is one of the best practices for your performance optimization. This isn't the only thing you'd need to do to get the best performance. Cluster configuration, tuning, SDK configuration, and the use of prepared statements all play a significant role.

Rule #5: Index by Predicate, Not by Projection

This seems like an obvious rule. But, I come across folks making this mistake every now and then.

Consider the query:

SELECT city, state, status FROM CUSTOME WHERE state = 'CA' AND status = 'premium';

Any of the following indexes can be used by the query:

Create index i1 on CUSTOMER(state);
Create index i2 on CUSTOMER(status);
Create index i3 on CUSTOMER(state, status);
Create index i4 on CUSTOMER(status, state);
Create index i5 on CUSTOMER(state) WHERE status = “premium”;
Create index i6 on CUSTOMER(status) WHERE status = “CA”;
  1. To make the index completely cover the query, simply add the city field to index 3-6.

  2. However, if you have an index that has the city as the leading key, optimizer won’t pick up the index.

See the detailed article on how the index scan work in various scenarios to optimize the index.

Rule #6: Add Indexes to Meet the SLAs

For relational databases, three things were most important: performance, performance, performance.

For NoSQL databases, three things matter most: performance at scale, performance at scale, performance at scale.

Your queries running basic performance tests on your laptop is one thing, running the high throughput, low latency queries on the cluster is another thing. Fortunately, in Couchbase, it's easy to identify and scale the bottleneck resources independently, thanks to the multi-dimensional scaling. Each of the services in Couchbase is abstracted into a distinct service: data, index, query. The Couchbase console has statistics on each of the services independently.

After you've created indexes for your queries, and optimized the indexes for the workload, you can add additional replica (equivalent) indexes to improve the latency because we load balance the scans among the replica indexes.

Rule #7: Index to Avoid Sorting

Index already has the data in the sorted order of the index keys. After the scan, index returns the results in the index key order.

CREATE INDEX idx3 ON `travel-sample`(state, city, name.lastname) WHERE status = 'premium';

The data is stored and returned in the order: state, city, name.lastname. So, if you have a query, that expects the data in the order of state, city, name.lastname, the index will help you to avoid the sort.

In the example below, the results are ordered by name.lastname, the third key of the index. Therefore, it's necessary to sort the resultset on name.lastname. EXPLAIN will tell you if the plan requires this sort.

EXPLAIN SELECT state, city, name.lastname 
FROM `travel-sample`
WHERE status = ‘premium’ AND state = ‘CA’ AND city LIKE ‘san%’
ORDER BY name.lastname;

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan2",
            "covers": [
              "cover ((`travel-sample`.`state`))",
              "cover ((`travel-sample`.`city`))",
              "cover (((`travel-sample`.`name`).`lastname`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "filter_covers": {
              "cover ((`travel-sample`.`status`))": "premium"
            },
            "index": "idx3",
            "index_id": "19a5aed899d281fe",
            "index_projection": {
              "entry_keys": [
                0,
                1,
                2
              ]
            },
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "\"CA\"",
                    "inclusion": 3,
                    "low": "\"CA\""
                  },
                  {
                    "high": "\"sao\"",
                    "inclusion": 1,
                    "low": "\"san\""
                  }
                ]
              }
            ],
            "using": "gsi"
          },
          {
            "#operator": "Parallel",
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "(((cover ((`travel-sample`.`status`)) = \"premium\") and (cover ((`travel-sample`.`state`)) = \"CA\")) and (cover ((`travel-sample`.`city`)) like \"san%\"))"
                },
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "expr": "cover ((`travel-sample`.`state`))"
                    },
                    {
                      "expr": "cover ((`travel-sample`.`city`))"
                    },
                    {
                      "expr": "cover (((`travel-sample`.`name`).`lastname`))"
                    }
                  ]
                }
              ]
            }
          }
        ]
      },
      {
        "#operator": "Order",
        "sort_terms": [
          {
            "expr": "cover (((`travel-sample`.`name`).`lastname`))"
          }
        ]
      },
      {
        "#operator": "FinalProject"
      }
    ]
  },
  "text": "SELECT state, city, name.lastname \nFROM `travel-sample`\nWHERE status = 'premium' AND state = 'CA' AND city LIKE 'san%'\nORDER BY name.lastname;"
}

The query below has the perfect match for the index keys. So, the sort is unnecessary. In the EXPLAIN output, the order operator is missing.

EXPLAIN SELECT state, city, name.lastname 
FROM `travel-sample`
WHERE status = ‘premium’ AND state = ‘CA’ AND city LIKE ‘san%’
ORDER BY state, city, name.lastname;


{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan2",
            "covers": [
              "cover ((`travel-sample`.`state`))",
              "cover ((`travel-sample`.`city`))",
              "cover (((`travel-sample`.`name`).`lastname`))",
              "cover ((meta(`travel-sample`).`id`))"
            ],
            "filter_covers": {
              "cover ((`travel-sample`.`status`))": "premium"
            },
            "index": "idx3",
            "index_id": "19a5aed899d281fe",
            "index_projection": {
              "entry_keys": [
                0,
                1,
                2
              ]
            },
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "\"CA\"",
                    "inclusion": 3,
                    "low": "\"CA\""
                  },
                  {
                    "high": "\"sao\"",
                    "inclusion": 1,
                    "low": "\"san\""
                  }
                ]
              }
            ],
            "using": "gsi"
          },
          {
            "#operator": "Parallel",
            "maxParallelism": 1,
            "~child": {
              "#operator": "Sequence",
              "~children": [
                {
                  "#operator": "Filter",
                  "condition": "(((cover ((`travel-sample`.`status`)) = \"premium\") and (cover ((`travel-sample`.`state`)) = \"CA\")) and (cover ((`travel-sample`.`city`)) like \"san%\"))"
                },
                {
                  "#operator": "InitialProject",
                  "result_terms": [
                    {
                      "expr": "cover ((`travel-sample`.`state`))"
                    },
                    {
                      "expr": "cover ((`travel-sample`.`city`))"
                    },
                    {
                      "expr": "cover (((`travel-sample`.`name`).`lastname`))"
                    }
                  ]
                },
                {
                  "#operator": "FinalProject"
                }
              ]
            }
          }
        ]
      }
    ]
  },
  "text": "SELECT state, city, name.lastname \nFROM `travel-sample`\nWHERE status = 'premium' AND state = 'CA' AND city LIKE 'san%'\nORDER BY state, city, name.lastname;"
}

Exploiting the index sort order may not seem important until you see the pagination use case. When the query has specified OFFSET and LIMIT, an index can be used for efficiently eliminating the documents which the application does not care about or need. See the article on pagination for details on this.

N1QL optimizer first selects the index based on predicates in the query (filters) and then verifies if the index can cover all the query references in projection and order. After that, the optimizer tries to eliminate the sorting and decide on the OFFSET and LIMIT pushdown. The below code shows if the OFFSET and LIMIT were pushed to the index scan.

              "keyspace": "travel-sample",
              "limit": "20",
              "namespace": "default",
              "offset": "100",
              "spans": [
                {
                  "exact": true,
                  "range": [
                    {
                      "high": "\"CA\"",
                      "inclusion": 3,
                      "low": "\"CA\""
                    },
                    {
                      "high": "\"sao\"",
                      "inclusion": 1,
                      "low": "\"san\""
                    }
                  ]
                }
              ],

Rule #8: Number of Indexes

There's no artificial limit on the number of indexes you can have in the system. If you're creating a large number of indexes on a bucket that has the data, use the deferred build option so the data transfer between the data service and index service is efficient.

Rule #9: Index During INSERT, DELETE, UPDATE

The index is maintained asynchronously. Your data updates via key-value API or any N1QL statements and only update the documents in the bucket. The index receives the notification of changes via the stream and applies the changes to the index. Here is the sequence of operations for an UPDATE statement. The statement uses the index to qualify the documents to update; fetch the documents and update them; then write the documents back and return any data requested from the UPDATE statement.

Rule #10: Index Key Order and Predicate Types

Index scan requests created by the query users first N consecutive keys of the index. So, the order of the index key is important.

Consider a query with various predicates:

SELECT cid, address FROM CUSTOMER
WHERE state = ‘CA’
AND type = ‘premium’
AND zipcode IN [29482, 29284, 29482, 28472]
AND salary < 50000 AND age > 45 ;

These are general rules for the order of keys in the index. Keys can be simpler scalar attributes or expressions which return scalar values: e.g. UPPER(name.lastname).

  1. First priority is the keys with equality predicates. In this query, it's on state and type. When there are multiple predicates of the same type, choose any combination.
  2. The second priority is the keys with IN predicates. In this query, it's on zipcode.
  3. The third priority is the less than (<) predicates. In this case, it's on salary.
  4. The fourth priority is the between predicates. This query does not have a between predicate.
  5. The fifth priority is the greater than (>) predicates. In this query, it's on age.
  6. The sixth priority is the array predicates: ANY, EVERY AND ANY, predicates after UNNEST.
  7. Look to add additional fields for an index to cover the query.
  8. After doing this analysis, look for any expressions that can be moved to the WHERE clause. For example, in this case, type = "premium" can be moved because the type field is designated by the users to identify the type of customers.

With this, we come up with the following index.

CREATE INDEX idx_order ON CUSTOMER 
( state, zipcode, salary, age, address, cid) 
WHERE type = "premium";

Rule #11: Understand How to Read EXPLAIN and PROFILING

No matter how many rules you follow, you'll have to understand query plans and profile, monitor the system under load, and tune it. The ability to understand and analyze query planning and profiling information is the key to tuning a query and a workload. There are two fine articles on those topics. Go through and try out the examples.

  1. https://dzone.com/articles/understanding-index-scans-in-couchbase-50-n1ql-que
  2. https://blog.couchbase.com/profiling-monitoring-update-2/

References

Sensu: workflow automation for monitoring. Learn more—download the whitepaper.

Topics:
performance ,database performance ,couchbase

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}