
Lessons Learned from Picking a Java-Based Driver for Amazon ElastiCache for Redis: Part 1

In this article, I share the lessons I learned while picking a Java-based driver for Amazon ElastiCache for Redis, with visuals and helpful code examples.

By Jeroen Reijn · Feb. 10, 2023 · Opinion

In my day-to-day job, I support teams at different organizations and help them with their AWS challenges. One of the teams I recently supported was using Amazon ElastiCache for Redis as a storage/caching layer for their primary workload. They were validating their production setup and testing several failure scenarios. In this article, I will share some of the lessons learned. Keep in mind that the cases described in this article are very context-specific and might not reflect your use case, so my advice is to always do your own tests.

Initial Architecture

The team built a REST-based service by using API Gateway, AWS Lambda, and Amazon ElastiCache for Redis.

[Image: AWS Cloud architecture diagram]

Amazon ElastiCache for Redis was set up with cluster mode enabled, as this gave the team the flexibility to scale both reads and writes. The initial cluster consisted of just a single shard, with the option to scale up/out if needed.

When the Redis client writes to or reads from the cluster, it first determines which shard the data belongs to and sends the read or write command to a node in that particular shard.
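
Under the hood, Redis Cluster does this by hashing the key to one of 16,384 hash slots and routing the command to the shard that owns that slot. As a small illustration (not part of the team's code), you can compute the slot for a key with the CRC16 helper that ships with the Jedis versions we used:

import redis.clients.jedis.util.JedisClusterCRC16;

public class HashSlotExample {
  public static void main(String[] args) {
    // Redis Cluster divides the keyspace into 16384 hash slots;
    // each shard owns a range of those slots.
    String key = "test";
    int slot = JedisClusterCRC16.getSlot(key);
    System.out.println("Key '" + key + "' maps to hash slot " + slot);
    // With a single shard, all slots live on that one shard, so every
    // command ends up on its primary node (or a replica for reads).
  }
}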

[Image: Redis cluster shards]

When you set up ElastiCache for Redis with “Cluster Mode” enabled in AWS, the recommended way to connect to the cluster from your client is via the cluster configuration endpoint. The configuration endpoint keeps track of all the changes in the Redis cluster and can always give you an up-to-date topology.


The AWS Lambda function was written in Java, so the team looked at different Java-based Redis drivers and eventually decided to use Jedis, a Java client for Redis designed for performance and ease of use. One of the reasons they chose Jedis was that it proved to be a lightweight library. Instantiating the client connection takes somewhere between 200 and 600 ms (depending on the Lambda memory settings), which impacts the cold start of the Lambda function:

 
private JedisRedisService() {
  logger.info("Instantiating Redis Cluster connection");
  HostAndPort hostAndPort = new HostAndPort(REDIS_HOST, Integer.valueOf(REDIS_PORT));

  // Keep the connection pool small; a Lambda container handles one request at a time.
  ConnectionPoolConfig connectionPoolConfig = new ConnectionPoolConfig();
  connectionPoolConfig.setMaxTotal(5);
  connectionPoolConfig.setMaxIdle(5);

  DefaultJedisClientConfig defaultJedisClientConfig = DefaultJedisClientConfig.builder()
    .connectionTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .socketTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .blockingSocketTimeoutMillis(JedisCluster.DEFAULT_TIMEOUT)
    .user(REDIS_USERNAME)
    .password(REDIS_PASSWORD)
    .ssl(true)
    .build();

  // Connect via the cluster configuration endpoint (REDIS_HOST).
  jedisCluster = new JedisCluster(
    hostAndPort,
    defaultJedisClientConfig,
    JedisCluster.DEFAULT_MAX_ATTEMPTS,
    connectionPoolConfig
  );
  logger.info("Instantiated Redis Cluster connection");
}


Jedis has been shown to be the fastest driver for specific scenarios, according to the “Optimize Redis Client Performance for Amazon ElastiCache and MemoryDB” post on the AWS blog. Performance is very important for a cache, but we also wanted to look at other aspects of the driver.

Expect the Unexpected

Werner Vogels often states that you should always expect the unexpected if you’re building in the cloud. Things can break, and it’s best to prepare for that as much as possible.

Failures are a given and everything will eventually fail over time: from routers to hard disks, from operating systems to memory units corrupting TCP packets, from transient errors to permanent failures. This is a given, whether you are using the highest quality hardware or lowest cost components. – Source: “10 Lessons from 10 Years of Amazon Web Services”

To learn how Jedis would handle failure scenarios, we decided to test at least the failures we could expect.

[Image: Everything fails (Source: AWS)]

Two scenarios we could think of were:

  1. A primary node failure.
  2. A maintenance upgrade that would rotate and update the nodes to a new version of Redis.

We performed these tests with Jedis version 4.2.3 and later retried them with 4.3.1, which contained a couple of fixes and improvements regarding Redis clusters.

To test the impact, we created a simple Lambda function that performs GET operations on a specific set of keys within our cache. As we wanted to test under “real-world” circumstances, we created a small load test with Artillery.io that sent a constant stream of requests.

As a response, we return the value of the GET operation and list the nodes the client knows about in the cluster:

 
{
  "nodes": [
    "test-cluster-0001-001.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379",
    "test-cluster-0001-002.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379",
    "test-cluster-0001-003.test-cluster.ifrpha.euw1.cache.amazonaws.com:6379"
  ],
  "test": "test value"
}
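
A handler producing that response could look roughly like the sketch below. This is not the team's actual function: the class name, the response shape, and the use of getClusterNodes() to list the node addresses the client currently knows about are assumptions for illustration.

import java.util.Map;
import java.util.TreeSet;
import redis.clients.jedis.JedisCluster;

// Hypothetical sketch of the test handler; names and wiring are assumptions.
public class RedisTestHandler {

  // Reuse the long-lived JedisCluster built in the constructor shown earlier,
  // so warm Lambda invocations skip the 200-600 ms connection setup.
  private final JedisCluster jedisCluster;

  public RedisTestHandler(JedisCluster jedisCluster) {
    this.jedisCluster = jedisCluster;
  }

  public Map<String, Object> handleRequest() {
    // GET one of the keys used in the load test.
    String value = jedisCluster.get("test");

    // The node addresses ("host:port") the Jedis client currently knows about.
    TreeSet<String> nodes = new TreeSet<>(jedisCluster.getClusterNodes().keySet());

    return Map.of("nodes", nodes, "test", value);
  }
}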


Primary Node Failover

As ElastiCache for Redis was the primary data source for the service, the team wanted to see the impact of a primary node failure. The service was expected to have a read-heavy workload, and our small test cluster consisted of one primary node and two replica nodes. The team expected to have a similar production setup when the service launched. Testing a primary node failure is quite simple in ElastiCache for Redis. From the AWS console, you can trigger the failover with the click of a button:

[Image: Failover trigger in the AWS console]

Or from the command line by executing:

 
aws elasticache test-failover \
    --replication-group-id "test-cluster" \
    --node-group-id "0001"


Once the failover was triggered, we could track its progress in the ElastiCache event log. The event log can be found in the AWS Console or retrieved from the AWS CLI by executing:

 
aws elasticache describe-events --max-items 5


This will result in a JSON or table-based response that contains a list of events in the order they took place.

Source ID | Type | Date | Event
test-cluster-0001-001 | cache-cluster | November 12, 2022, 14:43:40 (UTC+01:00) | Finished recovery for cache nodes 0001.
test-cluster-0001-001 | cache-cluster | November 12, 2022, 14:33:22 (UTC+01:00) | Recovering cache nodes 0001.
test-cluster | replication-group | November 12, 2022, 14:31:59 (UTC+01:00) | Failover to replica node test-cluster-0001-002 completed.
test-cluster | replication-group | November 12, 2022, 14:30:56 (UTC+01:00) | Test Failover API called for node group 0001.

As you can see from the above event log, the cluster first performs a failover, after which it recovers the old primary node and adds it back into the cluster. If you look closely, you can see that the time between triggering the failover action and the actual failover is about a minute. We ran this test multiple times and saw this vary from close to a minute up to two minutes.

From the moment the primary node was taken down, our service started throwing errors. In the CloudWatch logs, we noticed several Jedis-related errors:

 
Cluster retry deadline exceeded.: redis.clients.jedis.exceptions.JedisClusterOperationException
redis.clients.jedis.exceptions.JedisClusterOperationException: Cluster retry deadline exceeded.
    at redis.clients.jedis.executors.ClusterCommandExecutor.executeCommand(ClusterCommandExecutor.java:88)
    at redis.clients.jedis.UnifiedJedis.executeCommand(UnifiedJedis.java:148)
    at redis.clients.jedis.UnifiedJedis.get(UnifiedJedis.java:570)


And:

 
2022-10-16 19:03:28 7e6d77f1-eea2-4c9e-ab3c-da108d1c9b16 DEBUG ClusterCommandExecutor - Failed connecting to Redis: null
redis.clients.jedis.exceptions.JedisConnectionException: Failed to connect to any host resolved for DNS name.
    at redis.clients.jedis.DefaultJedisSocketFactory.connectToFirstSuccessfulHost(DefaultJedisSocketFactory.java:63) ~[task/:?]
    at redis.clients.jedis.DefaultJedisSocketFactory.createSocket(DefaultJedisSocketFactory.java:87) ~[task/:?]
    at redis.clients.jedis.Connection.connect(Connection.java:180) ~[task/:?]
    at redis.clients.jedis.Connection.initializeFromClientConfig(Connection.java:338) ~[task/:?]
    at redis.clients.jedis.Connection.<init>(Connection.java:53) ~[task/:?]
    at redis.clients.jedis.ConnectionFactory.makeObject(ConnectionFactory.java:71) ~[task/:?]


The exceptions in our log spanned about six minutes in total. As the load stayed the same during the failover, we did not see any Lambda cold starts in that period.

During that time, our service was mostly responding with errors and not servicing our clients:

[Image: Errors returned by the service during the failover]

Finding out what was causing this was not simple, as Jedis does not write much log information, even with the log level set to debug. From what we could tell from the Jedis source code, when the Jedis client connects to the configuration endpoint, it fetches all known hosts, stores them in a local cache, and determines the primary node from that set of nodes. Once the primary node is known, it executes operations against it by connecting to the Amazon-generated domain name. When that endpoint is down or unreachable, Jedis does not appear to go back to the configuration endpoint to rediscover the nodes or fetch the new cluster topology. We initially thought this had to do with the JVM's DNS caching, so we also tried disabling that caching, but the change had no effect:

 
// Disable JVM-level DNS caching; in our tests this did not change the behavior.
java.security.Security.setProperty("networkaddress.cache.ttl", "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0");


From the Jedis source code, we learned that a bit of randomization is applied once a node is marked as “unavailable.” This seems to happen only when the connection attempt has timed out after several retries. Jedis will then randomly connect to one of the remaining nodes and, by chance, it might find the new primary node and forget about the old one. We found this out by looking at the code and by printing the list of nodes available to the client, as shown in the snippet below.
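
Printing that list is straightforward; the snippet below is an illustrative sketch (not our exact code), reusing the jedisCluster instance and logger from the constructor shown earlier:

// Log every node address ("host:port") that the Jedis client currently holds
// in its local view of the cluster topology.
jedisCluster.getClusterNodes().keySet()
    .forEach(node -> logger.info("Known cluster node: " + node));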

We also noticed that even when the old primary is added back to the cluster as a replica, our application does not pick it up. Because we were using AWS Lambda, and Lambda containers are rotated at random moments in time, a cold start would eventually solve this; in the meantime, however, one of the nodes remains missing from the client's set of nodes, which could impact the availability of our service.

As we were experiencing issues while rotating nodes, we were wondering about the impact of a version upgrade. A version upgrade would probably also take down nodes, so that’s what we tested next.

Cluster Engine Version Upgrade

The next scenario was a cluster engine upgrade. To validate the impact of a version upgrade, we migrated the cluster from Redis 6.0.x to 6.2.x.
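
How the upgrade is triggered does not matter much for this test; you can start it from the AWS console, the CLI, or programmatically. Purely as an illustration (not the exact steps we took), the sketch below requests the upgrade with the AWS SDK for Java v2:

import software.amazon.awssdk.services.elasticache.ElastiCacheClient;
import software.amazon.awssdk.services.elasticache.model.ModifyReplicationGroupRequest;

// Illustrative sketch only: request a rolling engine upgrade for the replication group.
public class UpgradeReplicationGroup {
  public static void main(String[] args) {
    try (ElastiCacheClient elastiCache = ElastiCacheClient.create()) {
      elastiCache.modifyReplicationGroup(ModifyReplicationGroupRequest.builder()
          .replicationGroupId("test-cluster") // the replication group from our tests
          .engineVersion("6.2")               // target Redis engine version
          .applyImmediately(true)             // start the upgrade right away
          .build());
    }
  }
}

Once the rolling upgrade was underway, the event log again showed us the timing of the procedure: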

Date | Message | Source Identifier | Source Type
2023-01-30T12:48:35.546000+00:00 | Cache cluster patched | test-cluster-0001-001 | cache-cluster
2023-01-30T12:47:42.064000+00:00 | Cache node 0001 restarted | test-cluster-0001-001 | cache-cluster
2023-01-30T12:47:40.055000+00:00 | Cache node 0001 shutdown | test-cluster-0001-001 | cache-cluster
2023-01-30T12:46:28.167000+00:00 | Failover from master node test-cluster-0001-001 to replica node test-cluster-0001-002 completed | test-cluster-0001-001 | cache-cluster
2023-01-30T12:46:28.167000+00:00 | Failover from master node test-cluster-0001-001 to replica node test-cluster-0001-002 completed | test-cluster | replication-group
2023-01-30T12:46:26.946000+00:00 | Failover to replica node test-cluster-0001-002 completed | test-cluster | replication-group
2023-01-30T12:45:40.725000+00:00 | Cache cluster patched | test-cluster-0001-003 | cache-cluster
2023-01-30T12:44:47.350000+00:00 | Cache node 0001 restarted | test-cluster-0001-003 | cache-cluster
2023-01-30T12:44:45.348000+00:00 | Cache node 0001 shutdown | test-cluster-0001-003 | cache-cluster
2023-01-30T12:43:32.918000+00:00 | Cache cluster patched | test-cluster-0001-002 | cache-cluster
2023-01-30T12:42:39.710000+00:00 | Cache node 0001 restarted | test-cluster-0001-002 | cache-cluster
2023-01-30T12:42:37.707000+00:00 | Cache node 0001 shutdown | test-cluster-0001-002 | cache-cluster

To our surprise, this did not result in the same behavior. We saw the nodes being patched and restarted one by one, but our service did not generate any errors during that time:

[Image: Upgrading nodes]

Conclusion

Testing the above scenarios helped us get a better understanding of how our driver would handle specific cluster-wide events. Having a few minutes of downtime/errors during a primary failover event is not ideal, and whether that is acceptable depends on your use case and workload. While researching mitigation scenarios, we learned that the Tinder engineering team discovered similar behavior with Jedis in production, which they described in their post “Taming ElastiCache with Auto-discovery at Scale.” This triggered us to do some testing with a different Java-based Redis driver. You can find more about that in part 2.


Published at DZone with permission of Jeroen Reijn. See the original article here.

Opinions expressed by DZone contributors are their own.
