
Lessons Learned from Picking a Java-Based Driver for Amazon ElastiCache for Redis: Part 1

In this article, I share the lessons I learned while picking a Java-based driver for Amazon ElastiCache for Redis, with visuals and helpful code examples.

By Jeroen Reijn · Feb. 10, 2023 · Opinion

In my day-to-day job, I support teams at different organizations and help them with their AWS challenges. One of the teams I recently supported was using Amazon ElastiCache for Redis as a storage/caching layer for their primary workload. They were validating their production setup and testing several failure scenarios. In this article, I will share some of the lessons learned. Keep in mind that the cases described in this article are very context-specific and might not reflect your use case, so my advice is to always do your own tests.

Initial Architecture

The team built a REST-based service by using API Gateway, AWS Lambda, and Amazon ElastiCache for Redis.

[Image: AWS Cloud architecture diagram]

Amazon ElastiCache for Redis was set up with cluster mode enabled, as this gave the team the flexibility to scale both reads and writes. The initial cluster consisted of just a single shard, with the option to scale up/out if needed.

When the Redis client writes to or reads from the cluster, it first determines which shard the data belongs to and sends the read or write command to a node in that particular shard.
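
Under the hood, Redis Cluster does this by hashing the key to one of 16,384 hash slots and routing the command to the shard that owns that slot. As a small illustration (not part of the team's code), you can compute the slot for a key with the CRC16 helper that ships with the Jedis versions we used:

import redis.clients.jedis.util.JedisClusterCRC16;

public class HashSlotExample {
  public static void main(String[] args) {
    // Redis Cluster divides the keyspace into 16384 hash slots;
    // each shard owns a range of those slots.
    String key = "test";
    int slot = JedisClusterCRC16.getSlot(key);
    System.out.println("Key '" + key + "' maps to hash slot " + slot);
    // With a single shard, all slots live on that one shard, so every
    // command ends up on its primary node (or a replica for reads).
  }
}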

[Image: Redis cluster shards]

When you set up ElastiCache for Redis with “Cluster Mode” enabled in AWS, the recommended way to connect to the cluster from your client is via the cluster configuration endpoint. The configuration endpoint keeps track of all the changes in the Redis cluster and can always give you an up-to-date topology.


The AWS Lambda function was written in Java, so the team looked at different Java-based Redis drivers and eventually decided to use Jedis, a Java client for Redis designed for performance and ease of use. One of the reasons they chose Jedis was that it proved to be a lightweight library. Instantiating the client connection takes somewhere between 200 and 600 ms (depending on the Lambda memory settings), which impacts the cold start of the Lambda function:

 
private JedisRedisService() {
  logger.info("Instantiating Redis Cluster connection");
  HostAndPort hostAndPort = new HostAndPort(REDIS_HOST, Integer.valueOf(REDIS_PORT));

  // Keep the connection pool small; a Lambda container handles one request at a time.
  ConnectionPoolConfig connectionPoolConfig = new ConnectionPoolConfig();
  connectionPoolConfig.setMaxTotal(5);
  connectionPoolConfig.setMaxIdle(5);

  DefaultJedisClientConfig defaultJedisClientConfig = DefaultJedisClientConfig.builder()
    .connectionTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .socketTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .blockingSocketTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .user(REDIS_USERNAME)
    .password(REDIS_PASSWORD)
    .ssl(true)
    .build();

  // Connect via the cluster configuration endpoint (REDIS_HOST).
  jedisCluster = new JedisCluster(
    hostAndPort,
    defaultJedisClientConfig,
    JedisCluster.DEFAULT_MAX_ATTEMPTS,
    connectionPoolConfig
  );
  logger.info("Instantiated Redis Cluster connection");
}


Jedis has been shown to be the fastest driver for specific scenarios, according to the “Optimize Redis Client Performance for Amazon ElastiCache and MemoryDB” post on the AWS blog. Performance is very important for a cache, but we also wanted to look at other aspects of the driver.

Expect the Unexpected

Werner Vogels often states that you should always expect the unexpected if you’re building in the cloud. Things can break, and it’s best to prepare for that as much as possible.

Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures. This is a given, whether you are using the highest quality hardware or lowest cost components. – Source: “10 Lessons from 10 Years of Amazon Web Services”

To learn how Jedis would handle failure scenarios, we decided to test at least the failures we could expect.

[Image: Everything fails (Source: AWS)]

Two scenarios we could think of were:

  1. A primary node failure.
  2. A maintenance upgrade that would rotate and update the nodes to a new version of Redis.

We performed these tests with Jedis version 4.2.3 and later retried them with 4.3.1, which contained a couple of fixes and improvements regarding Redis clusters.

To test the impact, we created a simple Lambda function that performs GET operations on a specific set of keys within our cache. As we wanted to test under “real-world” circumstances, we created a small load test with Artillery.io that sent a constant stream of requests.

As a response, we return the value of the GET operation and list the nodes the client knows about in the cluster:

 
{
  "nodes": [
    "test-cluster-0001-001.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379",
    "test-cluster-0001-002.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379",
    "test-cluster-0001-003.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379"
  ],
  "test": "test value"
}
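
A handler producing that response could look roughly like the sketch below. This is not the team's actual function: the class name, the response shape, and the use of getClusterNodes() to list the node addresses the client currently knows about are assumptions for illustration.

import java.util.Map;
import java.util.TreeSet;
import redis.clients.jedis.JedisCluster;

// Hypothetical sketch of the test handler; names and wiring are assumptions.
public class RedisTestHandler {

  // Reuse the long-lived JedisCluster built in the constructor shown earlier,
  // so warm Lambda invocations skip the 200-600 ms connection setup.
  private final JedisCluster jedisCluster;

  public RedisTestHandler(JedisCluster jedisCluster) {
    this.jedisCluster = jedisCluster;
  }

  public Map<String, Object> handleRequest() {
    // GET one of the keys used in the load test.
    String value = jedisCluster.get("test");

    // The node addresses ("host:port") the Jedis client currently knows about.
    TreeSet<String> nodes = new TreeSet<>(jedisCluster.getClusterNodes().keySet());

    return Map.of("nodes", nodes, "test", value);
  }
}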


Primary Node Failover

As ElastiCache for Redis was the primary data source for the service, the team wanted to see the impact of a primary node failure. The service was expected to have a read-heavy workload, and our small test cluster consisted of one primary node and two replica nodes. The team expected to have a similar production setup when the service launched. Testing a primary node failure is quite simple in ElastiCache for Redis. From the AWS console, you can trigger the failover with the click of a button:

[Image: Failover trigger in the AWS console]

Or from the command line by executing:

 
aws elasticache test-failover \
    --replication-group-id "test-cluster" \
    --node-group-id "0001"


Once the failover was triggered, we could track its progress in the ElastiCache event log. The event log can be found in the AWS Console or retrieved from the AWS CLI by executing:

 
aws elasticache describe-events --max-items 5


This will result in a JSON or table-based response that contains a list of events in the order they took place.

Source ID | Type | Date | Event
test-cluster-0001-001 | cache-cluster | November 12, 2022, 14:43:40 (UTC+01:00) | Finished recovery for cache nodes 0001.
test-cluster-0001-001 | cache-cluster | November 12, 2022, 14:33:22 (UTC+01:00) | Recovering cache nodes 0001.
test-cluster | replication-group | November 12, 2022, 14:31:59 (UTC+01:00) | Failover to replica node test-cluster-0001-002 completed.
test-cluster | replication-group | November 12, 2022, 14:30:56 (UTC+01:00) | Test Failover API called for node group 0001.

As you can see from the above event log, the cluster first performs a failover, after which it recovers the old primary node and adds it back into the cluster. If you look closely, you can see that the time between triggering the failover action and the actual failover is about a minute. We ran this test multiple times and saw this vary from close to a minute up to two minutes.

From the moment the primary node was taken down, our service started throwing errors. In the CloudWatch logs, we noticed several Jedis-related errors:

 
Cluster retry deadline exceeded.: redis.clients.jedis.exceptions.JedisClusterOperationException
redis.clients.jedis.exceptions.JedisClusterOperationException: Cluster retry deadline exceeded.
    at redis.clients.jedis.executors.ClusterCommandExecutor.executeCommand(ClusterCommandExecutor.java:88)
    at redis.clients.jedis.UnifiedJedis.executeCommand(UnifiedJedis.java:148)
    at redis.clients.jedis.UnifiedJedis.get(UnifiedJedis.java:570)


And:

 
2022-10-16 19:03:28 7e6d77f1-eea2-4c9e-ab3c-da108d1c9b16 DEBUG ClusterCommandExecutor - Failed connecting to Redis: null
redis.clients.jedis.exceptions.JedisConnectionException: Failed to connect to any host resolved for DNS name.
    at redis.clients.jedis.DefaultJedisSocketFactory.connectToFirstSuccessfulHost(DefaultJedisSocketFactory.java:63) ~[task/:?]
    at redis.clients.jedis.DefaultJedisSocketFactory.createSocket(DefaultJedisSocketFactory.java:87) ~[task/:?]
    at redis.clients.jedis.Connection.connect(Connection.java:180) ~[task/:?]
    at redis.clients.jedis.Connection.initializeFromClientConfig(Connection.java:338) ~[task/:?]
    at redis.clients.jedis.Connection.<init>(Connection.java:53) ~[task/:?]
    at redis.clients.jedis.ConnectionFactory.makeObject(ConnectionFactory.java:71) ~[task/:?]


The exceptions in our log spanned about six minutes in total. As the load stayed the same during the failover, we did not see any Lambda cold starts in that period.

During that time, our service was mostly responding with errors and not servicing our clients:

[Image: Errors returned by the service during the failover]

Finding out what was causing this was not simple, as Jedis does not write much log information, even with the log level set to debug. From what we could tell from the Jedis source code, when the Jedis client connects to the configuration endpoint, it fetches all known hosts, stores them in a local cache, and determines the primary node from that set of nodes. Once the primary node is known, it executes operations against it by connecting to the Amazon-generated domain name. When that endpoint is down or unreachable, Jedis does not appear to go back to the configuration endpoint to rediscover the nodes or fetch the new cluster topology. We initially thought this had to do with the JVM's DNS caching, so we also tried disabling that caching, but the change had no effect:

 
// Disable JVM-level DNS caching; in our tests this did not change the behavior.
java.security.Security.setProperty("networkaddress.cache.ttl", "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");


From the Jedis source code, we learned that a bit of randomization is applied once a node is marked as “unavailable.” This seems to happen only when the connection attempt has timed out after several retries. Jedis will then randomly connect to one of the remaining nodes and, by chance, it might find the new primary node and forget about the old one. We found this out by looking at the code and by printing the list of nodes available to the client, as shown in the snippet below.
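
Printing that list is straightforward; the snippet below is an illustrative sketch (not our exact code), reusing the jedisCluster instance and logger from the constructor shown earlier:

// Log every node address ("host:port") that the Jedis client currently holds
// in its local view of the cluster topology.
jedisCluster.getClusterNodes().keySet()
    .forEach(node -> logger.info("Known cluster node: " + node));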

We also noticed that even when the old primary is added back to the cluster as a replica, our application does not pick it up. Because we were using AWS Lambda, and Lambda containers are rotated at random moments in time, a cold start would eventually solve this; in the meantime, however, one of the nodes remains missing from the client's set of nodes, which could impact the availability of our service.

As we were experiencing issues while rotating nodes, we were wondering about the impact of a version upgrade. A version upgrade would probably also take down nodes, so that’s what we tested next.

Cluster Engine Version Upgrade

The next scenario was a cluster engine upgrade. To validate the impact of a version upgrade, we migrated the cluster from Redis 6.0.x to 6.2.x.
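
How the upgrade is triggered does not matter much for this test; you can start it from the AWS console, the CLI, or programmatically. Purely as an illustration (not the exact steps we took), the sketch below requests the upgrade with the AWS SDK for Java v2:

import software.amazon.awssdk.services.elasticache.ElastiCacheClient;
import software.amazon.awssdk.services.elasticache.model.ModifyReplicationGroupRequest;

// Illustrative sketch only: request a rolling engine upgrade for the replication group.
public class UpgradeReplicationGroup {
  public static void main(String[] args) {
    try (ElastiCacheClient elastiCache = ElastiCacheClient.create()) {
      elastiCache.modifyReplicationGroup(ModifyReplicationGroupRequest.builder()
          .replicationGroupId("test-cluster") // the replication group from our tests
          .engineVersion("6.2")               // target Redis engine version
          .applyImmediately(true)             // start the upgrade right away
          .build());
    }
  }
}

Once the rolling upgrade was underway, the event log again showed us the timing of the procedure: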

Date | Message | Source Identifier | Source Type
2023-01-30T12:48:35.546000+00:00 | Cache cluster patched | test-cluster-0001-001 | cache-cluster
2023-01-30T12:47:42.064000+00:00 | Cache node 0001 restarted | test-cluster-0001-001 | cache-cluster
2023-01-30T12:47:40.055000+00:00 | Cache node 0001 shutdown | test-cluster-0001-001 | cache-cluster
2023-01-30T12:46:28.167000+00:00 | Failover from master node test-cluster-0001-001 to replica node test-cluster-0001-002 completed | test-cluster-0001-001 | cache-cluster
2023-01-30T12:46:28.167000+00:00 | Failover from master node test-cluster-0001-001 to replica node test-cluster-0001-002 completed | test-cluster | replication-group
2023-01-30T12:46:26.946000+00:00 | Failover to replica node test-cluster-0001-002 completed | test-cluster | replication-group
2023-01-30T12:45:40.725000+00:00 | Cache cluster patched | test-cluster-0001-003 | cache-cluster
2023-01-30T12:44:47.350000+00:00 | Cache node 0001 restarted | test-cluster-0001-003 | cache-cluster
2023-01-30T12:44:45.348000+00:00 | Cache node 0001 shutdown | test-cluster-0001-003 | cache-cluster
2023-01-30T12:43:32.918000+00:00 | Cache cluster patched | test-cluster-0001-002 | cache-cluster
2023-01-30T12:42:39.710000+00:00 | Cache node 0001 restarted | test-cluster-0001-002 | cache-cluster
2023-01-30T12:42:37.707000+00:00 | Cache node 0001 shutdown | test-cluster-0001-002 | cache-cluster

To our surprise, this did not result in the same behavior. We saw the nodes being patched and restarted one by one, but our service did not generate any errors during that time:

[Image: Upgrading nodes]

Conclusion

Testing the above scenarios helped us get a better understanding of how our driver would handle specific cluster-wide events. Having a few minutes of downtime/errors during a primary failover event is not ideal, and whether that is acceptable depends on your use case and workload. While researching mitigation scenarios, we learned that the Tinder engineering team discovered similar behavior with Jedis in production, which they described in their post “Taming ElastiCache with Auto-discovery at Scale.” This triggered us to do some testing with a different Java-based Redis driver. You can find more about that in part 2.


Published at DZone with permission of Jeroen Reijn. See the original article here.

Opinions expressed by DZone contributors are their own.
