Redis is an in-memory database that provides blazingly fast performance. This makes it a compelling alternative to disk-based databases when performance is a concern. You might already be using Redis to power your performance-sensitive applications. How do you ensure that your Redis deployment is healthy and meeting your requirements?
You need to know which Redis metrics to watch, and you need a tool to monitor these critical server metrics. Redis returns a long list of database metrics when you run the info command in the Redis shell. You can pick a smart selection of relevant metrics from this list. These can help you ensure your system’s health and quickly perform root cause analysis of any performance-related issue you might encounter.
This blog post lists the important database metrics to monitor. We will look at each metric from a database performance perspective and discuss the common issues and solutions associated with them.
1. Performance Metric: Throughput
Throughput tells you how many database operations your server is performing in a particular time duration. It is dependent upon your application workload and its business logic. By looking at the history of throughput, you can infer the pattern of load on a server e.g. peak load, the frequency of peak load, the time frames of peak load, average load etc.
You can collect throughput metric values for all the commands run on the Redis server by executing
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_get:calls=797,usec=4041,usec_per_call=5.07
cmdstat_append:calls=797,usec=4480,usec_per_call=5.62
cmdstat_expire:calls=797,usec=5471,usec_per_call=6.86
cmdstat_auth:calls=147,usec=288,usec_per_call=1.96
cmdstat_info:calls=46,usec=902,usec_per_call=19.61
cmdstat_config:calls=2,usec=130,usec_per_call=65.00
cmdstat_eval:calls=796,usec=36950,usec_per_call=46.42
cmdstat_command:calls=796,usec=8578,usec_per_call=10.78
Redis groups its various commands into connection, server, cluster, generic, etc. ScaleGrid Redis monitoring aggregates the throughput of various commands into one of the above-mentioned groups. The throughput is represented as a stacked area graph, where the height of each colored area provides the throughput of a group of commands.
Reduced throughput generally indicates that the server is receiving fewer queries. It can also point to a problem, such as an expensive query that is holding up other commands. Similarly, increased throughput signifies a more intensive workload on the server and can lead to higher latency.
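The calls counters in info commandstats are cumulative, so throughput over a time window is the difference between two samples divided by the elapsed time. The Python sketch below illustrates that calculation. The second sample and the 10-second interval are made-up values, and the helper names are hypothetical.

```python
import re

# Matches lines such as "cmdstat_get:calls=797,usec=4041,usec_per_call=5.07".
CMDSTAT_LINE = re.compile(r"cmdstat_([^:]+):calls=(\d+)")


def parse_commandstats(text):
    """Return {command: cumulative call count} from `info commandstats` text."""
    calls = {}
    for line in text.splitlines():
        match = CMDSTAT_LINE.match(line.strip())
        if match:
            calls[match.group(1)] = int(match.group(2))
    return calls


def ops_per_second(earlier, later, interval_seconds):
    """Per-command throughput between two cumulative samples."""
    return {
        command: (count - earlier.get(command, 0)) / interval_seconds
        for command, count in later.items()
    }


# First sample: two lines taken from the output shown above.
sample_start = """# Commandstats
cmdstat_get:calls=797,usec=4041,usec_per_call=5.07
cmdstat_append:calls=797,usec=4480,usec_per_call=5.62
"""
# Second sample: hypothetical values, assumed to be read 10 seconds later.
sample_end = """# Commandstats
cmdstat_get:calls=1297,usec=6541,usec_per_call=5.04
cmdstat_append:calls=897,usec=5040,usec_per_call=5.62
"""

rates = ops_per_second(
    parse_commandstats(sample_start), parse_commandstats(sample_end), 10
)
print(rates)
```

Storing these per-command rates over time gives you the throughput history from which peak and average load can be read.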
2. Memory Utilization
Memory is a critical resource for Redis performance. Used memory (used_memory) is the total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc).
You can collect all memory utilization metrics data for a Redis instance by running
127.0.0.1:6379> info memory
# Memory
used_memory:1007280
used_memory_human:983.67K
used_memory_rss:2002944
used_memory_rss_human:1.91M
used_memory_peak:1008128
used_memory_peak_human:984.50K
When Redis is configured with no max memory limit, memory usage can grow until it exhausts system memory, at which point the server starts throwing “Out of Memory” errors. At other times, Redis is configured with a max memory limit but with the noeviction policy. The server then does not evict any keys and rejects writes until memory is freed. The solution to both problems is to configure Redis with a max memory limit (maxmemory) and an eviction policy (maxmemory-policy) other than noeviction. The server then starts evicting keys according to that policy as memory usage reaches the limit.
Memory RSS (Resident Set Size) is the number of bytes that the operating system has allocated to Redis. If the ratio of used_memory_rss to used_memory is greater than ~1.5, it signifies memory fragmentation. The fragmented memory can be recovered by restarting the server.
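The fragmentation check can be computed directly from the used_memory_rss and used_memory fields of info memory. The following Python sketch does so for the sample output shown earlier; parse_info and fragmentation_ratio are hypothetical helper names, not part of any Redis client.

```python
def parse_info(text):
    """Turn `info` output into a dict, skipping section headers and blanks."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def fragmentation_ratio(info):
    """Ratio of memory held by the OS for Redis to memory Redis allocated."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])


# Values copied from the `info memory` output shown earlier.
memory_info = parse_info("""# Memory
used_memory:1007280
used_memory_rss:2002944
used_memory_peak:1008128
""")

FRAGMENTATION_THRESHOLD = 1.5  # the ~1.5 rule of thumb from the text

ratio = fragmentation_ratio(memory_info)
print(f"fragmentation ratio: {ratio:.2f}")
print("fragmented" if ratio > FRAGMENTATION_THRESHOLD else "ok")
```

For the sample values the ratio comes out at about 1.99. Redis also reports this value itself as mem_fragmentation_ratio in the info memory output, so in practice you may only need to read that field.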
3. Cache Hit Ratio
The cache hit ratio represents the efficiency of cache usage. Mathematically, it is defined as (total key hits) / (total key hits + total key misses).
The info stats command provides the keyspace_hits and keyspace_misses metrics, from which you can calculate the cache hit ratio for a running Redis instance.
127.0.0.1:6379> info stats
# Stats
.............
sync_partial_err:0
expired_keys:10
evicted_keys:12
keyspace_hits:4
keyspace_misses:15
pubsub_channels:0
pubsub_patterns:0
.............
If the cache hit ratio is lower than ~0.8, then a significant number of the requested keys have been evicted, have expired, or do not exist at all. It is crucial to watch this metric when using Redis as a cache. A lower cache hit ratio results in higher latency, because most requests have to fetch their data from the slower backing store, such as a disk-based database. It can indicate that you need to increase the size of your Redis cache to improve your application’s performance.
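As a worked example, the formula can be applied to the keyspace_hits and keyspace_misses values from the sample output. This is a minimal Python sketch; the function name is hypothetical, and returning None for an instance that has served no lookups yet is a design choice made here to avoid dividing by zero.

```python
def cache_hit_ratio(keyspace_hits, keyspace_misses):
    """hits / (hits + misses), or None when no lookups have happened yet."""
    total = keyspace_hits + keyspace_misses
    if total == 0:
        return None
    return keyspace_hits / total


# Values from the `info stats` sample output (4 hits, 15 misses).
ratio = cache_hit_ratio(keyspace_hits=4, keyspace_misses=15)
print(f"cache hit ratio: {ratio:.3f}")

LOW_HIT_RATIO = 0.8  # the ~0.8 guideline from the text
if ratio is not None and ratio < LOW_HIT_RATIO:
    print("hit ratio is below the guideline")
```

With 4 hits out of 19 lookups the ratio is about 0.21, well below the ~0.8 guideline.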
4. Active Connections
The number of connections is a limited resource, capped either by the operating system or by the Redis configuration. Monitoring active connections helps you ensure that you have enough connections available to serve all your requests at peak times.
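One way to track this is to compare the current connection count with the configured limit. The Python sketch below assumes the count comes from the connected_clients field of info clients and the limit from the maxclients configuration setting. Neither name appears in the text above, the numbers are hypothetical, and the 0.8 alert threshold is an arbitrary assumption.

```python
def connection_utilization(connected_clients, maxclients):
    """Fraction of the configured connection limit currently in use."""
    if maxclients <= 0:
        raise ValueError("maxclients must be positive")
    return connected_clients / maxclients


# Hypothetical readings: 8200 connected clients against a limit of 10000.
utilization = connection_utilization(connected_clients=8200, maxclients=10000)
ALERT_THRESHOLD = 0.8  # assumed alerting threshold, tune for your workload

print(f"connection utilization: {utilization:.0%}")
if utilization >= ALERT_THRESHOLD:
    print("approaching the connection limit")
```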
5. Evicted and Expired Keys
Redis supports various eviction policies that the server uses when memory usage hits the max limit. A persistently positive number of evicted keys (the evicted_keys metric) is an indication that you need to scale the memory up.
127.0.0.1:6379> info stats
# Stats
..............
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
..............
Redis supports a TTL (time to live) property for each key. The server deletes a key once its TTL has elapsed. If the application does not set this property, stale data piles up in memory. A positive value of the expired_keys metric shows that your expired data is being cleaned up properly.
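Because evicted_keys and expired_keys are cumulative counters, a monitoring tool usually looks at how much they grow between samples rather than at their absolute value. The Python sketch below flags sustained eviction from a series of readings. The sample values, the one-minute spacing and the function names are all hypothetical.

```python
def counter_deltas(samples):
    """Differences between consecutive readings of a cumulative counter."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def is_persistently_evicting(evicted_samples):
    """True when evicted_keys grew in every sampling interval."""
    deltas = counter_deltas(evicted_samples)
    return bool(deltas) and all(delta > 0 for delta in deltas)


# Hypothetical evicted_keys readings taken one minute apart.
under_pressure = [12, 40, 95, 160]
healthy = [12, 12, 12, 12]

print(counter_deltas(under_pressure))
print(is_persistently_evicting(under_pressure))
print(is_persistently_evicting(healthy))
```

The same counter_deltas helper can be applied to expired_keys readings to confirm that keys with a TTL are being cleaned up over time.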
6. Replication Metrics
The connected_slaves metric, reported on a master, tells you how many slave servers are currently connected to it. An unreachable slave could lead to higher read latency, depending on the read load on the server.
127.0.0.1:6379> info replication
# Replication
role:master/slave
connected_slaves:0/master_last_io_seconds_ago:0
master_repl_offset:0
..............
The master_last_io_seconds_ago metric, reported on a slave, tells you how many seconds have elapsed since the slave’s last interaction with the master. A higher value can indicate issues on the master or the slave, or a network problem. It also means that the slave may be serving stale data.
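Both replication checks can be automated from the info replication output. The Python sketch below is written under some assumptions: Redis names the slave-side field master_last_io_seconds_ago, the sample outputs are invented, the 10-second threshold and the expected slave count are arbitrary, and all function names are hypothetical.

```python
def parse_info(text):
    """Turn `info` output into a dict of field name to string value."""
    pairs = (line.strip().partition(":") for line in text.splitlines())
    return {key: value for key, sep, value in pairs if sep}


def missing_slaves(info, expected_slaves):
    """On a master, how many expected slaves are not currently connected."""
    if info.get("role") != "master":
        return 0
    return max(expected_slaves - int(info["connected_slaves"]), 0)


def slave_is_lagging(info, max_idle_seconds=10):
    """On a slave, True when the master has been silent for too long."""
    if info.get("role") != "slave":
        return False
    return int(info["master_last_io_seconds_ago"]) > max_idle_seconds


# Hypothetical outputs, one from a master and one from a slave.
master_info = parse_info("""# Replication
role:master
connected_slaves:1
master_repl_offset:0
""")
slave_info = parse_info("""# Replication
role:slave
master_last_io_seconds_ago:42
""")

print(missing_slaves(master_info, expected_slaves=2))
print(slave_is_lagging(slave_info))
```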
We have covered some of the important metrics that provide good visibility into the health and performance of your database server. There could be others that are relevant to your particular database servers and use cases. We recommend going through and understanding all the metrics reported by the info command.