Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Create the Right Index, Get the Right Performance, Part 1

DZone's Guide to

Create the Right Index, Get the Right Performance, Part 1

In this post, we take a look at some of the ways you can work with Couchbase to optimize its performance for your development team.

· Performance Zone ·
Free Resource

Sensu is an open source monitoring event pipeline. Try it today.

For database systems in general, there are three important things:  performance, performance, performance.   For NoSQL database systems, there are three important things: performance at scale, performance at scale, performance at scale.

Understanding the index options, creating the right index, with the right keys, right order, and right expression is critical to query performance and performance at scale on Couchbase. We've discussed data modeling for JSON and querying on JSON earlier. In this article, we'll discuss indexing options for JSON in Couchbase.

Couchbase 5.0 has three types of index categories. Each Couchbase cluster can only have one category of index, either a standard global secondary index or a memory optimized global secondary index.

Standard Secondary: Release 4.0 and above

*Based on ForestDB

*Released with Couchbase 4.0

Memory Optimized Index: 4.5 and above

*100% of the index is in memory

*Index is written to disk for recovery only

*Predictable Performance

*Better mutation rate

Standard Secondary: Release 5.0

* Uses the lockless skiplist based Plasma storage engine for enterprise edition

* Uses ForestDB storage engine for community edition

* Released with Couchbase 5.0.

* Designed to handle very large datasets

The standard secondary index (from 4.0 to 4.6.x) stores uses the ForestDB storage engine to store the B-Tree index and keeps the optimal working set of data in the buffer. That means, the total size of the index can be much bigger than the amount of memory available in each index node.

A memory-optimized index uses a novel lock-free skiplist to maintain the index and keeps 100% of the index data in memory. A memory-optimized index (MOI) has better latency for index scans and can also process the mutations of the data much faster.

The standard secondary index in 5.0 uses the plasma storage engine in the enterprise edition, which uses the lock-free skip list like MOI, but supports large indexes that don't fit in memory.

All three types of indexes implement multi-version concurrency control (MVCC) to provide consistent index scan results and high throughput. During cluster installation, choose the type of index.

The goal is to give you an overview of various indices you create in each of these services so that your queries can execute efficiently. The goal of this article is not to describe or compare and contrast these two types of index services. It does not cover the Full Text Search Index (FTS), released in Couchbase 5.0.

Let's take a travel-sample dataset shipped with Couchbase 4.5 to walk through this. To try out these indices, install Couchbase 4.5. On your web console, go to Settings->Sample Buckets to install travel-sample.

Here are the various indices you can create.

  • Primary Index
  • Named primary index
  • Secondary index
  • Composite Secondary Index
  • Functional index
  • Array Index
  • ALL array
  • ALL DISTINCT array
  • Partial Index
  • Adaptive Index
  • Duplicate Indices
  • Covering Index

Background

Couchbase is a distributed database. It supports a flexible data model using JSON. Each document in a bucket will have a user-generated unique document key. This uniqueness is enforced during insertion of the data.

Here's an example document.

select meta().id, travel
from `travel-sample` travel
where type = 'airline' limit 1;
[
  {
      "id": "airline_10",
      "travel": {
          "callsign": "MILE-AIR",
          "country": "United States",
          "iata": "Q5",
          "icao": "MLA",
          "id": 10,
          "name": "40-Mile Air",
          "type": "airline"
          }
      }
  ]

Types of Indexes

1. Primary Index

Create the primary index on 'travel-sample.'

The primary index is simply the index on the document key on the whole bucket. The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index in Couchbase, is maintained asynchronously. You set the recency of the data by setting the consistency level for your query.

Here is the metadata for this index:

select * from system:indexes where name = ‘#primary’;
"indexes": {
  "datastore_id": "http://127.0.0.1:8091",
  "id": "f6e3c75d6f396e7d",
  "index_key": [],
  "is_primary": true,
  "keyspace_id": "travel-sample",
  "name": "#primary",
  "namespace_id": "default",
  "state": "online",
  "using": "gsi"
  }

The metadata gives you additional information on the index: Where the index resides (datastore_id), its state (state), and the indexing method (using).

The primary index is used for full bucket scans (primary scans) when the query does not have any filters (predicates) or no other index or access path can be used. In Couchbase, you store multiple keyspaces (documents of a different type, customer, orders, inventory, etc.) in a single bucket. So, when you do the primary scan, the query will use the index to get the document-keys and fetch all the documents in the bucket and then apply the filter. So, this is VERY EXPENSIVE.

The document key design is somewhat like primary key design with multiple parts.

Lastname:firstname:customerid

Example: smith:john:X1A1849 

In Couchbase, it's a best practice to prefix the key with the type of the document. Since this is a customer document, let's prefix it with CX. Now, the key becomes:

Example: CX:smith:john:X1A1849

So, in the same bucket, there will be other types of documents.

ORDERS type:  OD:US:CA:294829

These are simply best practices. There is no restriction on the format or structure of the document key in Couchbase, except they have to be unique within a bucket.

Now, if you have documents with various keys and have a primary index, you can use the following queries efficiently.

Example 1: Looking for a specific document key.

SELECT * FROM  sales WHERE META().id = “CX:smith:john:X1A1849”;

      {
        "#operator": "IndexScan2",
        "index": "#primary",
        "index_id": "4c92ab0bcca9690a",
        "keyspace": "sales",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CX:smith:john:X1A1849\"",
                "inclusion": 3,
                "low": "\"CX:smith:john:X1A1849\""
              }
            ]
          }
        ],

If you do know the full document key, you can use the following statement and avoid the index access altogether.

SELECT * FROM sales USE KEYS [“CX:smith:john:X1A1849”] 

You can get more than one document in a statement.

SELECT * FROM sales USE KEYS [“CX:smith:john:X1A1849”, “CX:smithjr:john:X2A1492”]

Example 2: Look for a pattern. Get ALL the customer documents.

SELECT * FROM  sales WHERE META().id LIKE “CX:%”;
      {
        "#operator": "IndexScan2",
        "index": "#primary",
        "index_id": "4c92ab0bcca9690a",
        "keyspace": "sales",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CX;\"",
                "inclusion": 1,
                "low": "\"CX:\""
              }
            ]
          }
        ],

Example 3: Get all the customers with Smith as their last name.

The following query uses the primary index efficiently, only fetching the customers with a particular range. Note: This scan is case sensitive. To do a case-insensitive scan, you have to create a secondary index with UPPER() or LOWER() of the document key.

SELECT * FROM  sales WHERE META().id LIKE "CX:smith%";

      {
        "#operator": "IndexScan2",
        "index": "#primary",
        "index_id": "4c92ab0bcca9690a",
        "keyspace": "sales",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CX:smiti\"",
                "inclusion": 1,
                "low": "\"CX:smith\""
              }
            ]
          }
        ],

Example 4: It's common for some applications to use email address as part of the document key since they're unique. In that case, you need to find out all of the customers with gmail.com. If this is a typical requirement, then, store the REVERSE of the email address as the key and simply do the scan of leading string pattern.

Email:johnsmith@gmail.com;   key:reverse("johnsmith@gmail.com") => moc.liamg@htimsnhoj  

Email: janesnow@yahoo.com  key: reverse("janesnow@yahoo.com") =>moc.oohay@wonsenaj 


2. Named Primary Index

In Couchbase 5.0, you can create multiple replicas of any index with a simple parameter to CREATE INDEX. The following will create 3 copies of the index and there has to be a minimum of 3 index nodes in the cluster.

CREATE PRIMARY INDEX ON 'travel-sample' WITH {"num_replica":2}; 
CREATE PRIMARY INDEX `def_primary` ON `travel-sample` ;

You can also name the primary index. For the rest of the features of the primary index are the same, except the index is named. A good side effect of this is that you can have multiple primary indices in Couchbase versions before 5.0 using different names. Duplicate indices help with high availability as well as query load distribution throughout them. This is true for both primary indices and secondary indices.

select meta().id as documentkey, `travel-sample` airline
from `travel-sample`
where type = 'airline' limit 1;
{
  "airline": {
    "callsign": "MILE-AIR",
    "country": "United States",
    "iata": "Q5",
    "icao": "MLA",
    "id": 10,
    "name": "40-Mile Air",
    "type": "airline"
  },
  "documentkey": "airline_10"
}

3. Secondary Index

The secondary index is an index on any key-value or document-key. This index can be any key within the document. The key can be of any time: scalar, object, or array. The query has to use the same type of object for the query engine to exploit the index.

CREATE INDEX travel_name ON `travel-sample`(name);

name is a simple scalar value.
{    "name": "Air France"  }

CREATE INDEX travel_geo on `travel-sample`(geo);

geo is an object embedded within the document.  Example:
  "geo": {
      "alt": 12,
      "lat": 50.962097,
      "lon": 1.954764
      }

Creating indexes on keys from nested objects is straightforward.
CREATE INDEX travel_geo on `travel-sample`(geo.alt);
CREATE INDEX travel_geo on `travel-sample`(geo.lat);

schedule is an array of objects with flight details. This indexes on the complete array. Not exactly useful unless you're looking for the whole array.

CREATE INDEX travel_schedule ON `travel-sample`(schedule);

Example:

"schedule": [
      {
          "day": 0,
          "flight": "AF198",
          "utc": "10:13:00"
          },
      {
          "day": 0,
          "flight": "AF547",
          "utc": "19:14:00"
          },
      {
          "day": 0,
          "flight": "AF943",
          "utc": "01:31:00"
          },
      {
          "day": 1,
          "flight": "AF356",
          "utc": "12:40:00"
          },
      {
          "day": 1,
          "flight": "AF480",
          "utc": "08:58:00"
          },
      {
          "day": 1,
          "flight": "AF250",
          "utc": "12:59:00"
          }
  ]

4. Composite Secondary Index

It's common to have queries with multiple filters (predicates). So, you want the indices with multiple keys so the indices can return only the qualified document keys. Additionally, if a query is referencing only the keys in the index, the query engine will simply answer the query from the index scan result without going to the data nodes. This is a commonly exploited performance optimization.

CREATE INDEX idx_stctln ON `travel-sample` (state, city, name.lastname)

Each of the keys can be a simple scalar field, object, or an array. For the index filtering to be exploited, the filters have to use respective object type in the query filter. The keys to the secondary indices can include document keys (meta().id) explicitly if you need to filter on it in the index.

Let's look at the queries that exploit and cannot exploit the index.

1.SELECT * FROM `travel-sample` WHERE state = 'CA';
The predicate matches the leading key of the index. So, this query uses the index to fully evaluate the predicate (state = ‘CA’). 

2.SELECT * FROM `travel-sample` WHERE state = 'CA' AND city = 'Windsor';
The predicates match the leading two keys.  So this is good fit as well.

3.SELECT * FROM `travel-sample` WHERE state = 'CA' AND city = 'Windsor' AND name.lastname = 'smith';

The three predicates in this query matches the three index keys perfectly. So, this is a good match.

4.SELECT * FROM `travel-sample` WHERE city = 'Windsor' AND name.lastname = 'smith';

In this query, although predicates match two of the index keys, the leading key isn’t matched.  So, the index cannot and is not used for this query plans.

5.SELECT * FROM `travel-sample` WHERE name.lastname = 'smith';
Similar to previous query, this query has the predicate on the third key of the index.  So, this index cannot be used.

6.SELECT * FROM `travel-sample` WHERE state = 'CA' AND name.lastname = 'smith';

This query has predicate on first and the third key.  While this index is and can be chosen, we cannot push down the predicate after skipping an index key (second key in this case).   So, only the first predicate (state = "CA") will be pushed down to index scan.
        "#operator": "IndexScan2",
        "index": "idx_stctln",
        "index_id": "dadbb12da565ed28",
        "index_projection": {
          "primary_key": true
        },
        "keyspace": "travel-sample",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"CA\"",
                "inclusion": 3,
                "low": "\"CA\""
              }
            ]
          }

7.SELECT * FROM `travel-sample` WHERE state IS NOT MISSING AND city = 'Windsor' AND name.lastname = 'smith';

This is a modified version of query 4 above.  To use this index, query needs to have additional predicate (state IS NOT MISSING) assuming that represents your application requirement.

5. Functional (Expression) Index

It's common to have names in the database with a mix of upper and lower cases. When you need to search, "John," you want it to search for any combination of "John," "john," etc. Here's how you do it.

CREATE INDEX travel_cxname ON `travel-sample`(LOWER(name));

Provide the search string in lowercase and the index will efficiently search for already lowercased values in the index.

EXPLAIN SELECT * FROM `travel-sample` WHERE LOWER(name) = "john";
{
  "#operator": "IndexScan",
  "index": "travel_cxname",
  "index_id": "2f39d3b7aac6bbfe",
  "keyspace": "travel-sample",
  "namespace": "default",
  "spans": [
  {
      "Range": {
          "High": [
          "\"john\""
          ],
          "Inclusion": 3,
          "Low": [
          "\"john\""
          ]
      }
  }
  ],

You can use complex expressions in this functional index.

CREATE INDEX travel_cx1 ON `travel-sample`(LOWER(name), length*width, round(salary));

You'll also see that an array of indexes can be created on an expression that returns an array in the next section.

6. Array Index

JSON is hierarchical. At the top level, it can have scalar fields, objects, or arrays. Each object can nest other objects and arrays. Each array can have other objects and arrays. And so on. The nesting continues.

When you have this rich structure, here's how you index a particular array or a field within the sub-object.

Consider the array, schedule:

schedule:
[
  {
      "day" : 0,
      "special_flights" :
      [
      {
          "flight" : "AI111", "utc" : ”1:11:11"
          },
      {
          "flight" : "AI222", "utc" : ”2:22:22"
          }  
          ]
      },
  {
      "day": 1,
      "flight": "AF552",
      "utc": "14:41:00”
      }
  ]

CREATE INDEX travel_sched ON `travel-sample`
(ALL DISTINCT ARRAY v.day FOR v IN schedule END)

This index key is an expression on the array to clearly reference only the elements needed to be indexed. schedule is the array we're dereferencing into. v is the variable we've implicitly declared to reference each element/object within the array: schedule v.day refers to the element within each object of the array schedule.

The query below will exploit the array index.

EXPLAIN SELECT * FROM `travel-sample`
WHERE ANY v IN SCHEDULE SATISFIES v.day = 2 END;
{
  "#operator": "DistinctScan",
  "scan": {
      "#operator": "IndexScan",
      "index": "travel_sched",
      "index_id": "db7018bff5f10f17",
      "keyspace": "travel-sample",
      "namespace": "default",
      "spans": [
      {
          "Range": {
              "High": [
              "2"
              ],
              "Inclusion": 3,
              "Low": [
              "2"
              ]
              }
          }
      ],
      "using": "gsi"
  }

Because the key is a generalized expression, you get the flexibility to apply additional logic and processing on the data before indexing. For example, you can create functional indexing on elements of each array. Because you're referencing individual fields of the object or element within the array, the index creation, size, and search are efficient. The index above stores only the distinct values within an array. To store all elements of an array in an index, use the DISTINCT modifier to the expression.

CREATE INDEX travel_sched ON `travel-sample` (ALL DISTINCT ARRAY v.day FOR v IN schedule END)

Array Index can be created on static values (like above) or an expression that returns an array. The below code is one such expression, returning an array of tokens from an object. You can create an index on this array and search using the index.

Couchbase 5.0 makes it simpler to create and match the array indexes. Providing ALL (or ALL DISTINCT) prefix to the key will make it an array key.

CREATE INDEX idx_cx6 ON `travel-sample`(ALL TOKENS(public_likes)) WHERE type = ‘hotel’;

select t.name, t.country, t.public_likes
FROM `travel-sample` t
WHERE t.type = 'hotel’
AND ANY p IN TOKENS(public_likes) SATISFIES p = 'Vallie' END;

Array indexes can be created on elements within arrays of arrays as well. There is no limit to the nested level of the array expression. The query expression does has to match the index expression.

7. Partial Index

So far, the indices we've created will create indices on the whole bucket. Because the Couchbase data model is JSON and JSON schema are flexible, an index may not contain entries to documents with absent index keys. That's expected. Unlike relational systems, where each type of row is in a distinct table, Couchbase buckets can have documents of various types. Typically, customers include a type field to differentiate distinct types.

{
  "airline": {
      "callsign": "MILE-AIR",
      "country": "United States",
      "iata": "Q5",
      "icao": "MLA",
      "id": 10,
      "name": "40-Mile Air",
      "type": "airline"
      },
  "documentkey": "airline_10"
  }

When you want to create an index of airline documents, you can simply add the type field for the WHERE clause of the index.

CREATE INDEX travel_info ON `travel-sample`(name, id, icoo, iata) WHERE type = 'airline';

This will create an index only on the documents that have (type = 'airline'). In your queries, you'd need to include the filter (type = 'airline') in addition to other filters so this index qualifies.

You can use complex predicates in the WHERE clause of the index. Various use cases to exploit partial indexes are:

  1. Partitioning a large index into multiple indices using the mod function.
  2. Partitioning a large index into multiple indices and placing each index into distinct indexer nodes.
  3. Partitioning the index based on a list of values. For example, you can have an index for each state.
  4. Simulating index range partitioning via a range filter in the WHERE clause. One thing to remember is Couchbase N1QL queries will use one partitioned index per query block. Use UNION ALL to have a query exploit multiple partitioned indices in a single query.

8. Adaptive Index

Adaptive index creates a single index on the whole document or set of fields in a document. This is a form or array index using {"key":value} pair as the single index key. The purpose is to avoid the bane of the query having to match the leading keys of the index in traditional indexes.

There are two advantages with Adaptive index:

  • Multiple predicates on the keyspace can be evaluated using different sections of the same index.
  • Avoid creating multiple indexes just to reorder the index keys.
  • Avoid the index key-order.

Example:

 CREATE INDEX `ai_self`
   ON `travel-sample`(DISTINCT PAIRS(ai_self))
   WHERE type = "airport";

   EXPLAIN SELECT * FROM `travel-sample`
   WHERE faa = "SFO" AND `type` = "airport";


     {
       "#operator": "IntersectScan",
       "scans": [
         {
           "#operator": "IndexScan2",
           "index": "ai_self",
           "index_id": "c564a55225d9244c",
           "index_projection": {
             "primary_key": true
           },
           "keyspace": "travel-sample",
           "namespace": "default",
           "spans": [
             {
               "exact": true,
               "range": [
                 {
                   "high": "[\"faa\", \"SFO\"]",
                   "inclusion": 3,
                   "low": "[\"faa\", \"SFO\"]"
                 }
               ]
             }
           ],
           "using": "gsi"
         }
...

The same index can be used for queries with other predicates as well. This reduces the number of indexes you'd need to create as the document grows.

EXPLAIN SELECT * FROM `travel-sample`  WHERE city = "Seattle" AND `type` = "airport";

Considerations for usage:

  1. Since each attribute field has an index entry, the size of the indexes can be huge.
  2. An adaptive index is an array index. It's bound by the restriction of the array indexes.

Please see the detailed documentation on adaptive index in Couchbase's documentation.

9. Duplicate Index

This isn't really a special type of index, but a feature of Couchbase indexing. You can create duplicate indexes with distinct names.

CREATE INDEX i1 ON `travel-sample`(LOWER(name),id, icoo) WHERE type = ‘airline’;

CREATE INDEX i2 ON `travel-sample`(LOWER(name),id, icoo) WHERE type = ‘airline’;

CREATE INDEX i3 ON `travel-sample`(LOWER(name),id, icoo) WHERE type = ‘airline’;

All three indices have identical keys, an identical WHERE clause; the only difference is the name of the indices. You can choose their physical location using the WITH clause of the CREATE INDEX. During query optimization, the query will choose one of the names. You see that in your plan. During query runtime, these indices are used in a round-robin fashion to distribute the load. This gives you scale-out, multi-dimensional scaling, performance, and high availability. Not bad!

Couchbase 5.0 makes the duplicate index SIMPLER. Instead of creating multiple indexes with distinct names, you can simply specify the number of replica indexes you require.

CREATE INDEX i1 ON `travel-sample`(LOWER(name),id, icoo) WHERE type = ‘airline’ WITH {"num_replica" : 2 };

This will create 2 additional copies of the index in addition to the index i1. Load balancing and HA features are the same as an equivalent index.

10. Covering Index

Index selection for a query solely depends on the filters in the WHERE clause of your query. After the index selection is made, the engine analyzes the query to see if it can be answered using only the data in the index. If it does, the query engine skips retrieving the whole document. This is a performance optimization to consider while designing the indices.

All Together Now!

Let's put together a partitioned composite functional array index now!

CREATE INDEX travel_all ON `travel-sample`(
iata,
LOWER(name),
UPPER(callsign),
ALL DISTINCT ARRAY p.model FOR p IN jets END),
TO_NUMBER(rating),
meta().id
)
WHERE LOWER(country) = "united states" AND type = "airline"
WITH {"num_replica" : 2}

References

Tune in tomorrow when we'll cover rules for creating indexes!

Sensu: workflow automation for monitoring. Learn more—download the whitepaper.

Topics:
couchbase ,database performance ,performance

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}