Monitoring a Couchbase Cluster
Originally written by Justin Michaels
Couchbase is a distributed, high-performance cache and NoSQL database in a single cluster architecture. Despite the similar name and shared heritage, Couchbase is a very different product from CouchDB or any other NoSQL offering. Being able to monitor and profile Couchbase performance alongside application metrics is critical: over time, monitoring is a key element of a successful deployment of any mission-critical system, and this is even more true in distributed computing environments. It’s the only way to ensure long-term success. For this discussion we want to focus on monitoring, but it’s important to note that Couchbase also provides verbose logging to facilitate application troubleshooting. These logs are stored in ‘/opt/couchbase/var/lib/couchbase/logs’. For more insight into these logs, please review the Couchbase flight recorder (https://blog.couchbase.com/couchbase-server-recorder).
Couchbase provides an information-rich administration console, and everything captured in the console is also available through command-line utilities and our REST interface. As a result, we not only provide the needed statistics but also facilitate third-party tools monitoring mission-critical production clusters. While the cluster stores historical information, it aggregates the data over time to save space, so third-party tools can poll on regular intervals to preserve full historical detail.
Monitoring Best Practices
So let’s get in the weeds …
To monitor Couchbase efficiently we need two different perspectives. The cluster as a whole is made up of individual nodes, or servers, and each node provides compute capacity for any application that leverages the database distributed across the cluster. As a result, we want to monitor resource consumption and available compute capacity per node. Within any specific application we also want to know how many requests are being served from the cache, and be sure the disk subsystem is capable of persisting data quickly. As we walk through the metrics below, keep in mind that cluster behavior will depend on what the application demands. Our goal is to manage an in-memory working set and catch early warning events that indicate additional nodes are needed in the cluster.
Leveraging our REST API (http://<ip>:8091/pools/default) gives us insight into compute resources consumed per node.
- Node State (clusterMembership) - The JSON returned by this endpoint includes the state of each node via clusterMembership. Monitor each node for a value of ‘active’ to guarantee the node is participating in the cluster. A value of “inactiveFailed” is a critical event: the node has failed and administrator intervention is needed.
- System Statistics (systemStats) - This section of the endpoint provides basic capacity-consumption statistics for CPU (cpu_utilization), swap (swap_used), and free memory (mem_free). If any of these resources show constraints on a given node, we will want to address that node and evaluate whether additional nodes are merited.
- Couchbase Specific (interestingStats) - The final section provides additional insight into the resources consumed by each individual node: the disk consumed by Couchbase (couch_docs_actual_disk_size), the physical memory used within the node (mem_used), and the number of background fetches (ep_bg_fetched), i.e. data not in cache that had to be pulled from disk.
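As a sketch of how these node-level statistics could be consumed by a monitoring script, the snippet below checks clusterMembership and systemStats values from a /pools/default response. The JSON here is a hand-written illustrative sample, and the warning thresholds are arbitrary assumptions; exact field names can vary by server version, so verify against your own cluster’s output.

```python
import json

# Illustrative sample of a /pools/default response (abbreviated);
# field names follow the statistics described above.
SAMPLE = json.loads("""
{
  "nodes": [
    {"hostname": "10.0.0.1:8091",
     "clusterMembership": "active",
     "systemStats": {"cpu_utilization": 22.5, "swap_used": 0, "mem_free": 8589934592}},
    {"hostname": "10.0.0.2:8091",
     "clusterMembership": "inactiveFailed",
     "systemStats": {"cpu_utilization": 97.0, "swap_used": 1048576, "mem_free": 134217728}}
  ]
}
""")

def failed_nodes(pool):
    """Nodes whose clusterMembership signals a critical event."""
    return [n["hostname"] for n in pool["nodes"]
            if n["clusterMembership"] == "inactiveFailed"]

def constrained_nodes(pool, max_cpu=90.0, min_mem_free=1 << 30):
    """Nodes crossing arbitrary CPU / free-memory warning thresholds."""
    return [n["hostname"] for n in pool["nodes"]
            if n["systemStats"]["cpu_utilization"] > max_cpu
            or n["systemStats"]["mem_free"] < min_mem_free]

print(failed_nodes(SAMPLE))       # ['10.0.0.2:8091']
print(constrained_nodes(SAMPLE))  # ['10.0.0.2:8091']
```

In a real deployment, the SAMPLE dict would come from polling the REST endpoint on an interval and the results would feed an alerting system.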
Leveraging our REST API (http://<ip>:8091/pools/default/buckets/<bucket_name>/stats) provides insight into the health of the bucket itself. A given bucket might be constrained even though each individual node has additional compute capacity to offer. As a result, keeping a pulse on bucket statistics will provide the most insight into application health.
- Operations Per Second (ops) - This is a fundamental measure of the total number of gets, sets, increments, and decrements that occur for a given bucket. While views are not factored into the metric, it does provide a very quick measure of the load per application. In short, all resources may be allocated to the cluster, but this provides insight into the load a given application is producing.
- Cache Miss (ep_cache_miss_rate) - This metric is a good example of something that might or might not be problematic. Fundamentally, it is the ratio of requests that have to be fetched from disk to total requests. For example, if ten requests entered the database and one needed to be retrieved from disk, our miss rate would be 10%. The real question … is this a problem? That depends on what we expect to hold in memory, with the best performance coming from a cluster that keeps this number as close to 0 as possible.
- Fragmentation (couch_docs_fragmentation) - Couchbase stores data on disk in an append-only format; as a result, we need to keep an eye on fragmentation occurring within a cluster. This is particularly important to measure should auto-compaction be set on a schedule, since it shows whether the schedule runs long enough, and frequently enough, to keep your database healthy.
- Working Set (ep_bg_fetched and vb_active_resident_items_ratio) - You can use the ep_cache_miss_rate mentioned above in conjunction with the resident items ratio and memory headroom metrics to understand whether your bucket has enough capacity to store the most requested objects in memory. More importantly, you can forecast the need for additional nodes to expand the cluster’s memory capacity.
- Disk Drain (ep_queue_size) - One of the most important metrics to monitor, regardless of what your application is doing, is the drain rate. Keep careful watch on the number of changes pending in the queue. Additional information can be found in the command-line utility below. From a REST standpoint, we can monitor both how quickly the queue fills (ep_diskqueue_fill) and how quickly it drains (ep_diskqueue_drain) to track the trend over time.
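To make the cache-miss arithmetic and the fill/drain comparison above concrete, here is a minimal sketch; the helper names are our own, not part of the Couchbase API, and the inputs stand in for values polled from the bucket stats endpoint.

```python
def cache_miss_rate(disk_fetches, total_reads):
    """Percentage of reads that had to be fetched from disk."""
    return 0.0 if total_reads == 0 else 100.0 * disk_fetches / total_reads

# Ten requests, one retrieved from disk -> 10% miss rate, as in the example above.
print(cache_miss_rate(1, 10))   # 10.0

def queue_trend(fills, drains):
    """Positive result: the disk-write queue is growing (fill outpacing drain)."""
    return sum(fills) - sum(drains)

# Samples of ep_diskqueue_fill / ep_diskqueue_drain over a polling window.
print(queue_trend([500, 700, 900], [600, 600, 600]))  # 300
```

A persistently positive trend is the early warning the text describes: the cluster is taking writes faster than it can persist them.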
A whole volume of additional information can be monitored through the REST interface. We won’t cover everything available in the API here, but mainly focus on key statistics for keeping a healthy cluster. In addition to REST, we can also run scripted monitoring via the couchbase-cli and cbstats utilities to capture operations occurring within each node of the cluster.
The cbstats utility can be found in ‘/opt/couchbase/bin’ and provides insight into much of what’s occurring within a cluster. Below are some of the key metrics and what they’re telling you:
- By tracking the number of open and rejected connections via curr_connections and rejected_conns statistics we can understand if any connection requests were rejected by this node.
- Each time an object is requested by an application and not found in the cache, Couchbase will find the object on disk. This cache miss requires a background fetch and is measured per item fetched from disk by ep_bg_fetched. If we’re managing to a 100% working set this could be a sign of a cluster under stress; alternatively, it may not be an issue if we have a smaller working set. In either scenario, understanding what’s typical in an environment is important, as a large increase provides an early warning signal.
- The number of items queued for persistence is an important area to monitor to understand whether you have adequate I/O resources to keep up with your application. While your application will always be served via our caching tier, one of the great benefits of Couchbase is our ability to also provide data durability by persisting to disk. Should this asynchronous operation become overloaded, we could impact application performance. As a result, especially in write-heavy systems, ep_queue_size and ep_flusher_todo will be important to keep an eye on. We never want to reach 1 million items and will likely want to flag a warning around 500,000 to 800,000, especially if this is an upward trend over time.
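A rough sketch of the warning bands described above; treating the sum of ep_queue_size and ep_flusher_todo as the backlog is our own simplification, and the thresholds are the rough figures from the text, not Couchbase defaults.

```python
def disk_queue_status(ep_queue_size, ep_flusher_todo,
                      warn_at=500_000, crit_at=1_000_000):
    """Classify the pending-persistence backlog using the rough bands above."""
    backlog = ep_queue_size + ep_flusher_todo
    if backlog >= crit_at:
        return "critical"
    if backlog >= warn_at:
        return "warning"
    return "ok"

print(disk_queue_status(400_000, 250_000))  # warning
print(disk_queue_status(100_000, 50_000))   # ok
```

As the text notes, the trend matters as much as the absolute number: a “warning” that keeps climbing poll after poll is the real signal to add I/O capacity or nodes.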
- Following the vb_num_eject_replicas statistic gives us the number of times Couchbase ejected replica values from memory, which indicates a specific bucket has reached its low water mark. Simply reaching this threshold might not be problematic, as the cluster is just freeing memory resources, but consistently seeing this behavior or an increasing trend could be. More importantly, this is a way to head off out-of-memory (ep_oom_errors/ep_tmp_oom_errors) scenarios, which we never want to see in our production clusters.
- Couchbase by design avoids stale-cache scenarios by performing a ‘warmup’ process at startup of a node. Warmup is the process of reading objects from disk and pre-loading the cache. Monitoring warmup provides visibility into how quickly a Couchbase node will complete its startup process and be available to support load within the cluster. Warmup is complete when ep_warmup_value_count is equal to vb_active_curr_items; more granular information is provided by ep_warmup_state. Below are the seven warmup states. A node will not complete until it reaches the ‘done’ state.
- Initial - Start warmup processes.
- EstimateDatabaseItemCount - Estimating database item count.
- KeyDump - Begin loading keys and metadata, but not documents, into RAM.
- CheckForAccessLog - Determine if an access log is available. This log indicates which keys have been frequently read or written.
- LoadingAccessLog - Load information from access log.
- LoadingData - The server is loading data first for keys listed in the access log, or if no log available, based on keys found during the ‘Key Dump’ phase.
- Done - The server is ready to handle read and write requests.
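As an illustration, warmup progress could be tracked by parsing cbstats-style ‘key: value’ output. The parsing helper and the sample text below are hypothetical, and the exact state strings may differ by server version, so treat this as a sketch rather than a definitive parser.

```python
def parse_stats(output):
    """Turn 'key: value' lines (as printed by cbstats) into a dict."""
    stats = {}
    for line in output.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            stats[key.strip()] = value.strip()
    return stats

def warmup_done(stats):
    """A node is ready to take load only once ep_warmup_state reports 'done'."""
    return stats.get("ep_warmup_state", "").lower() == "done"

# Hypothetical sample output from a node still in the LoadingData phase.
sample = """ ep_warmup_state: loading data
 ep_warmup_value_count: 1200
"""
print(warmup_done(parse_stats(sample)))  # False
```

A startup health check could poll this on an interval and hold the node out of the load balancer until warmup_done returns True.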
- Because Couchbase is not only a NoSQL persistence engine but also a cache, we want to understand the memory consumption of the Couchbase server process (ep_engine). This can be monitored via mem_used.
- It’s worth noting the cbstats statistics covered above are also available via our REST interface (http://<ip>:8091/pools/default/buckets/<bucket_name>/stats):
- ops, ep_cache_miss_rate, couch_docs_fragmentation, ep_queue_size, vb_active_resident_items_ratio, curr_connections, curr_items_tot, ep_bg_fetched, ep_diskqueue_drain, ep_diskqueue_fill, vb_replica_eject, ep_oom_errors, ep_tmp_oom_errors, mem_used, and more.
The intent here is to provide some guidance on where to start in developing a Couchbase monitoring strategy. This is a summary of basic best practices we’ve seen customers implement. Each customer is unique and additional metrics might be needed based on your application.