One of the things that often confuses users who are familiar with other databases when they try Redis is the lack of visibility into the database: There are no sets of tables or collections to see, just a plain, flat key space that could (potentially) have millions of keys. The ability to iterate cheaply over this key space makes it very important to familiarize oneself with the database contents.
Iterating over the Redis key space has other important use cases, too. A few that come to mind are:
- garbage collection or cleaning up keys that match a certain pattern
- data movement and schema changes or moving a certain set of keys to another data structure
- debugging, data sampling, data fixes or finding and fixing all keys which were messed up by a recent change
In this post, we will dig deep into various key space iteration options available in Redis.
O(N) Iterators: KEYS
The Redis KEYS command returns all the keys in the database that match a pattern (or all the keys in the key space). Similar commands for fetching all the fields stored in a hash is HGETALL and for all fetching the members of a SMEMBERS. The keys in Redis are stored in a dictionary (AKA a hash table). The KEYS command works by iterating over this dictionary and sending everything that matches the pattern out as a single array reply. The other commands work similarly.
The performance of such an operation depends on the size of the collection, i.e. O(N). Thus, the use of KEYS is severely discouraged in production environments with a large number of keys. Redis, being single-threaded, gets blocked during this iteration, thus blocking other operations.
Thus, KEYS should be used only for debugging and other special occasions where performance is not a concern (like when the database has been brought offline to apply a data fix). The other important thing to remember about this algorithm is that it sends out all the matching keys together as a single response. This might be extremely convenient when the key space is small but will create multiple issues on a large key space. KEYS, however, is a favorite command among developers in their own dev environments.
KEYS in Action:
127.0.0.1:6379> MSET one 1 two 2 three 3 four 4OK# All the keys127.0.0.1:6379> keys *1) "four"2) "three"3) "two"4) "one"# keys that begin with the letter 't'127.0.0.1:6379> keys t*1) "three"2) "two"# keys that have a 'ee' in them127.0.0.1:6379> keys *ee*1) "three"
Cursor-Based Iterators: SCAN
The SCAN and its sister commands, SSCAN (for sets), HSCAN (for hashes), and ZSCAN (for sorted sets) provide the cursor-based approach to iterating over Redis data structures. They have been available in Redis since 2.8.0.
Keys are returned in incremental iterations with constant time guarantees for each iteration. A cursor (an integer is this case) is returned when the iteration is initialized, and an updated cursor is returned for every iteration. An iteration cycle begins when the cursor is set to 0 in the SCAN request and terminates when the cursor returned by the server is 0. Due to the nuances of Redis' architecture and the cursor algorithm implementation, here are some peculiarities of this approach:
- A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration.
- A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration.
- A given element may be returned multiple times. It is up to the application to handle the case of duplicated elements
- Elements that were not constantly present in the collection during a full iteration, may be returned or not: it is undefined.
- A number of elements returned during each count varies and can be 0 too. However, the iteration is not complete until the server returns the cursor value of 0.
- The COUNT option can be used to limit the number of elements returned in each iteration. The default value is 10. However, it is considered only a suggestion and not enforced in all cases. The COUNT value can be changed during each iteration call.
- The MATCH option allows specifying of patterns, like the KEYS command allows.
- The cursor implementation is completely stateless on the server side. That allows (potentially) infinite iterations to start in parallel. Also, there are no requirements of ensuring that an iteration continues up to the end and can be stopped anytime.
Despite its peculiarities, SCAN is a very useful command and the right command to pick for key space iterations for most use cases.
SCAN in Action
127.0.0.1:6379> flushdbOK127.0.0.1:6379> keys *(empty list or set)127.0.0.1:6379> debug populate 33OK127.0.0.1:6379> scan 0 COUNT 51) "4"2) 1) "key:1" 2) "key:9" 3) "key:13" 4) "key:29" 5) "key:23"127.0.0.1:6379> scan 4 1) "42"2) 1) "key:24" 2) "key:28" 3) "key:18" 4) "key:16" 5) "key:12" 6) "key:2" 7) "key:6" 8) "key:31" 9) "key:27" 10) "key:19"127.0.0.1:6379> scan 421) "9"2) 1) "key:3" 2) "key:4" 3) "key:20" 4) "key:8" 5) "key:32" 6) "key:5" 7) "key:26" 8) "key:10" 9) "key:21" 10) "key:14"127.0.0.1:6379> scan 9 COUNT 1001) "0"2) 1) "key:25" 2) "key:30" 3) "key:22" 4) "key:17" 5) "key:15" 6) "key:0" 7) "key:11" 8) "key:7"
Under the Hood
The algorithm that SCAN (and its sister commands) use to scan through is an intriguing one and leads to some of the characteristics of the command that we described above. Antirez described it at a high level in his blog post, and it is explained (slightly better) in the comments above the implementation (function dictScan). Describing it in detail would make this post too long so I will give enough description to make it’s implications evident.
- Most Redis data structures are internally represented as dictionaries (at least partially in the case of sorted sets). They are implemented as power-of-two sized hash tables with chaining for collisions. The challenge in writing a cursor based iterative algorithm here is to be able to deal with growing and shrinking of the hash without sacrificing the Redis principles of simplicity (of the API) and speed.
- SCAN basically scans a bunch of hash buckets every iteration and returns the elements matching the pattern in those. Since it looks at only a fixed list of buckets some time iterations might return no values at all.
- The cursor that is returned is basically an offset into the hash table being iterated. It deals with the growing and shrinking of hash tables (i.e. rehashing) by clever manipulation of higher level bits of the offset while increasing the offset along with the properties of the hash table. Implications of this approach are that new elements added during the iteration may or may not be returned. However, the cursor itself, wouldn’t need to restart on a change in the size of the hash table.
- A given bucket must be visited just once and all its keys must be returned in a single go. This is again to ensure that hash resizing (i.e. rehashing) doesn’t complicate iteration progress. However, this leads to the COUNT argument being not strictly enforceable.
- Since the above approach is completely stateless on the server side, it basically implies that iterations can be stopped or a huge number of iterations can be started in parallel without no increased memory usage.
At a high level, two choices are available to iterate over the Redis key space are:
- Use KEYS when performance is not a concern or when the key space is reasonably sized.
- At all other times, use SCAN.