Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Indexing Best Practices

DZone's Guide to

Indexing Best Practices

Look at fourteen best practices four creating and managing indexes in order to deliver the best possible customer experiences.

· Database Zone
Free Resource

Traditional relational databases weren’t designed for today’s customers. Learn about the world’s first NoSQL Engagement Database purpose-built for the new era of customer experience.

Global secondary indexes in Couchbase can be created, updated, and deleted without impacting the reads and writes on the JSON documents in data nodes. This means that index inserts/updates/deletes happen asynchronously, and index workloads are isolated from the rest of the system.

Indexes are directly related to the N1QL queries that are run. N1QL and GSI tango together. Indexes are to be created with N1QL queries in mind, as indexes are meant to be steroids for queries, to reduce the query latency costs, and to increase throughput. Indexes also demand their own storage, but the risk of losing business due to slow/shoddy customer engagement experience is higher than the associated cost of having indexes. Also, indexes in most cases live outside the confines of an application, which helps in managing the lifecycle of indexes suitably.

Let’s look at some best practices in creating and managing indexes for delivering the best customer experiences.

1. Run Index Service on Its Own Set of Nodes

Though all services (data, query, index, search, etc.) in Couchbase can be run on all nodes, we recommend that individual workloads run on their own set of nodes. This affords workload isolation and independent scaling. Also, hardware can be apportioned suitably based on the nature of the workload. For example, indexes are generally memory-intensive and queries are CPU-intensive. You can have different hardware for these different services. Independent scalability for best computational capacity per service is achieved by the architectural nicety that multi-dimensional-scaling (MDS) offers us.

When all services are run on all available nodes

When individual Services are run on their own dedicated nodes

2. MOI vs. Standard GSI

Couchbase 5.0 introduced Plasma as a new storage engine for GSI. Plasma is the underlying storage engine when the standard global secondary is chosen at the time of setup. Both storage types have different characteristics. When the use-case demands that entire index be memory-resident (stricter SLAs, lower latencies, higher throughputs), then choose MOI. Standard GSI (Plasma) is extremely useful when the entire index cannot be memory-resident; we call this a data-greater-than-memory (DGM) scenario. Also, this is useful when memory costs are a factor in deciding the index type. While MOI might get into a "pause" mode when memory is completely used up (i.e. index updates stop, though queries will be serviced), Plasma suitably spills over to disk and operates (and indexes do get updated) with ease. In Couchbase 5.0, Plasma works well up to 20% DGM scenarios (i.e. 20% of the index data is in-memory). If the query accesses keys across both memory and disk, then there is a suitable impact to query performance due to the obvious disk access while querying.

Due to the nature of being completely memory resident, MOI is generally much faster than Standard GSI (especially the former ForestDB). As of now, it is not possible to have both index types reside in the same cluster.

The following schematic explains the different index storage engines available and their high-level features:

3. Use Index Replicas

Replicas in GSI are active replicas; i.e. they serve the twin purpose of load balancing N1QL queries and also take in the traffic if the other index replica fails.

create index idx on bucket(field1) with {“num_replica”: 2}

Or:

create index idx on bucket(field1) with {“nodes”:[“1.2.3.1:8091”, “1.2.3.2:8091”, “1.2.3.3:8091”]}

Both copies of the index are automatically and asynchronously updated as the updates to documents happen on the data nodes. Always have at least one replica so that there can be a minimum of two index nodes for servicing N1QL queries. Due to the support of swap rebalance in 5.0, if an index node goes down and a new node is added back, then the topology of index nodes is maintained. This is extremely helpful for scale-up/down operations due to seasonalities in queries when you want to move between larger nodes and smaller nodes.

If you're already using equivalent indexes, then transition to index replicas. Read more about this process here.

4. Indexes Variants

GSI has different variants based on different use cases. These different variants were purposely built for the nature of the queries and hence it is extremely important to understand the behavior of the queries and leverage these index variants.

Primary Index Functional Index
Named Primary Index Array Index
Composite Index Covered Index
Partial Index Adaptive Index

Do check out this DZone article and this documentation for more on the above.

For example, let’s look at the covered index. This index variant contains predicates and all attributes that were indexed in the definition, because of which additional hops to data nodes are avoided. Query latencies are also significantly lowered.

For example, if we have the following index:

CREATE INDEX `idx_ts_type_iata` ON `travel-sample`(`type`,`iata`);

And use the query:

select iata from `travel-sample` where type="airline" and iata = "TQ"

The EXPLAIN plan will reveal that the query is being "covered" by the index:

"~children": [      
  {        
    "#operator": "IndexScan2",       
      "covers": [          
      "cover ((`travel-sample`.`type`))",          
      "cover ((`travel-sample`.`iata`))",          
      "cover ((meta(`travel-sample`).`id`))"        
  ],        
  "index": "idx_ts_type_iata",

And if we try to select "all" (by using select *) attributes:

<span style="font-weight: 400">select * from `travel-sample` where type="airline" and iata = "TQ";</span>

The EXPLAIN plan reveals that the query is not covered (missing the covers field), as the query service has to hop to the Data Service to fetch all the attributes:

"~children": [      
  {        
    "#operator": "IndexScan2",        
    "index": "idx_ts_type_iata",        
    "index_id": "240cf64d8c6ddce3",        
    "index_projection": {
          "primary_key": true        
     },

Similarly, array indexes were created mainly to help our customers query JSON data, wherein arrays are very common. A detailed post on this is coming soon!

5. Avoid Primary Keys in Production

Unexpected full primary scans are a possibility and should be removed by avoiding primary indexes altogether in production. N1QL Index Selection is a rule-based system that checks for a possible index that will satisfy the query, and if there is no such thing, then it resorts to using the primary index. The primary index has all the keys of the documents, and hence the query will fetch all keys from the primary index and then hop to the Data Service to fetch the documents and then apply filters. As you can see, this is a very expensive operation and should be avoided at all costs.

If there are no primary indexes created and the query is not able to find a matching index to serve the query, then the query service errors with the following message. This is helpful and should help you in creating the required secondary index suitably:

“No index available on keyspace travel-sample that matches your query. Use CREATE INDEX or CREATE PRIMARY INDEX to create an index, or check that your expected index is online.”

Also, as an inherent best practice, partitioning of primary indexes is not supported by Couchbase. Unlike in many RDBMSes, primary keys are optional in Couchbase.

6. Use EXPLAIN PLANs

To validate if the N1QL query is indeed using the indexes created, check the EXPLAIN PLAN results. In the Couchbase Admin Console, this can easily be obtained by pasting the query in the code editor and clicking the Explain button. Look out for the #operator and index attributes of the result to confirm the index usage. 

7. Index by Predicate

The WHERE clause in a query is called the predicate and the fields/attributes selected in the SELECT clause are called the projection. Indexes should always be created with the predicate clause in mind. This is because index selection happens based on the leading key of the index present in the predicate.

For example, if we have the following index on four attributes:

CREATE INDEX `idx_ts_type_iata_name_icao` ON `travel-sample`(`type`,`iata`, `name`,`icao`);

And fire the following query that actually skips the icao attribute while querying, the query engine is smart enough to know to use the above index for best query performance.

select name from `travel-sample` where icao="MLA" and type="airline";

The index selected can be seen in the EXPLAIN plan below. It is to be noted that the query becomes a covering query, as name, despite not being in predicate, is in projection, and hence the hop to Data Service is avoided.

"~children": [      
   {        
      "#operator": "IndexScan2",        
      "covers": [          
        "cover ((`travel-sample`.`type`))",          
        "cover ((`travel-sample`.`iata`))",          
        "cover ((`travel-sample`.`name`))",          
        "cover ((`travel-sample`.`icao`))",          
        "cover ((meta(`travel-sample`).`id`))"       
      ],        
      "index": "idx_ts_type_iata_name_icao",

8. Use a Leading Key to Force Index Selection

An index is not automatically selected for a query if the predicate used in the query does not match the leading key of the index. If you happen to see that the EXPLAIN plan does not force any index to be selected then use IS NOT MISSING or IS NOT NULL clause to force the index to be selected.

For example, either of the following queries:

select count(1) from `travel-sample` where type IS NOT NULL; 
select count(1) from `travel-sample` where type IS NOT MISSING;

Will use the following index, as the leading key of the index is type:

CREATE INDEX `idx_ts_type_iata` ON `travel-sample`(`type`,`iata`);

To choose only an already created INDEX, use the USE INDEX directive as part of the N1QL query. This is helpful in cases wherein you know that the index mentioned in USE INDEX has better selectivity than the one chosen by the N1QL rule-based optimizer:

select count(1) from `travel-sample` USE INDEX (idx_ts_type_iata) where type="airline";

9. Use Partial Indexes

Sometimes, the predicate to be indexed on might not fit into one node due to size limitations. GSIs in Couchbase are not automatically partitioned for now. This mandates that the administrator creates partial indexes; N1QL queries are smart enough to choose the appropriate index based on the kind of predicate used in the query when partial indexes are present.

For example, we create the following two indexes based on the name being in two different ranges:

CREATE INDEX `idx_ts_name_ak` ON `travel-sample`(`name`) WHERE name BETWEEN "A" AND "K"; 
CREATE INDEX `idx_ts_name_kz` ON `travel-sample`(`name`) WHERE name BETWEEN "K" AND "Z";

Now, following queries will automatically choose the suitable index as is evidenced in the respective EXPLAIN PLANS:

select * from `travel-sample` where name="Astraeus"; 

EXPLAIN PLAN : 
"~children": [      
   {        
     "#operator": "IndexScan2",        
     "index": "idx_ts_name_ak",
select * from `travel-sample` where name="Texas Wings"; 

EXPLAIN PLAN : 
"~children": [      
  {        
     "#operator": "IndexScan2",        
     "index": "idx_ts_name_kz",

Indexes are suitably chosen when using the LIKE clause in the predicate. For example, let’s say we want to get all names that sound like a French name (starting with “L”):

select * from `travel-sample` where name like "L'%";
EXPLAIN PLAN : 
"~children": [      
  {        
     "#operator": "IndexScan2",        
     "index": "idx_ts_name_kz",

10. Consistency Options

Due to its asynchronous nature, GSIs in Couchbase are eventually consistent by default. And, as already mentioned, they are asynchronously updated. However, using change feeds (DCP), we update indexes as quickly possible, so it is very much possible that certain document mutations haven’t been updated in the indexes. If the query semantics demand stricter data consistency, Couchbase offers tunable consistency models at the time of querying.

The three consistency options available in Couchbase are:

  1. scan_consistency=not_bounded
  2. scan_consistency=at_plus
  3. scan_consistency=request_plus

For more, see here.

Though request_plus semantics enforce data integrity, there is an impact on performance as query latencies increase. A query waits for the relevant index to catch up with the latest mutations before the data is returned. not_bounded (the default consistency option) is the fastest of all three consistency options.

11. Monitor Index Catchup

Generally, the index service catches up with document mutations very quickly so as to leave little to no user impact. But as an administrator, if you want to make sure that the document mutations (to be updated in the index) are as minimal as possible and does not keep increasing, then look at the ItemsRemaining metric under the index name.

12. Use Defer Builds

Defer builds offers a two-stage process of creating indexes. It is recommended that defer builds be put to optimum use always, as the same change feed is used to create indexes on a node. If defer builds are not used, the change feed from the data nodes has to be accessed multiple times, leading to more data transfer across the network and a slightly increased load on the data nodes.

Example:

CREATE INDEX `idx_ts_type_iata` ON `travel-sample`(`type`,`iata`) WITH { "defer_build":true }; 
BUILD INDEX ON `travel-sample`(`idx_ts_type_iata`);

For more on CREATE INDEX syntax, please refer to the documentation.

13. Avoid Large Keys to Index

Prior to 5.0, there was a limitation on the key size for indexes(maximum 4k). This limitation has been removed in 5.0. It should be noted that indexes are meant for data access path and hence the data model and query(with indexes) should be structured to get the necessary information in shortest time. Though customers can have any number of fields in a composite index, the index key size also grows proportionally. Really large key size might impact performance. As a general rule of thumb, prefer to have 1kB as the combined size of all fields in a composite index; and if that is not possible, refactor the queries suitably.

14. USE KEYs Avoid Indexes

It is not necessary that all N1QL queries require indexes. If your N1QL queries can work independently of indexes by directly querying the documents using keys, then the directive USE KEYS is helpful.

For example:

SELECT * FROM `travel-sample` USE KEYS ["landmark_37588"];

The resulting explain plan will show a KeyScan being performed (without any mention of an IndexScan):

"~children": [
      {
        "#operator": "KeyScan",        
        "keys": "[\"landmark_37588\"]"      
      }

This is more of a good thing to know than a best practice, as USE KEYS does not use indexes for the results to be returned from the query service. Though it's highly unlikely that customers can have only queries that always use USE KEYS, this is helpful in the edge cases that mandate such behavior.

This was a long post, but I hope it was helpful!

Learn how the world’s first NoSQL Engagement Database delivers unparalleled performance at any scale for customer experience innovation that never ends.

Topics:
database ,indexing ,tutorial ,asynchronous ,couchbase

Published at DZone with permission of Venkat Subramanian, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}