This post concludes this week’s series on API design choices for handling partial failure scenarios in a sharded cluster. In my previous post, I discussed my issues with a local solution to the problem.
The way we ended up solving this issue is actually quite simple. We apply a global solution to a global problem: we added the ability to inject error handling logic deep into the execution pipeline of the sharding implementation, like this:
In this case, as you can see, we allow requests to fail if we are querying (because we can probably still get something useful from the other servers), but if you are requesting something by id and it generates an error, we will propagate this error. Note that in our implementation, we call a user-defined “NotifyUserInterfaceAboutServerFailure”, which will let the user know about the error.
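To make the policy described above concrete, here is a minimal sketch in Python. It is not the actual sharding API: `execute_on_shards`, `handle_shard_error`, the `QUERY` / `LOAD_BY_ID` constants, and the handler's return-a-bool convention are all hypothetical names and assumptions made for illustration. Only the idea (swallow query failures after notifying the UI, propagate failures of by-id requests) comes from the text.

```python
from typing import Callable, Dict, List

# Hypothetical request kinds; the real sharding API is not shown here.
QUERY = "query"
LOAD_BY_ID = "load_by_id"

# Stand-in for whatever the UI would display about partial results.
ui_warnings: List[str] = []


def notify_user_interface_about_server_failure(server: str, error: Exception) -> None:
    """Stand-in for the user-defined notification hook mentioned in the text."""
    ui_warnings.append(f"{server}: {error}")


def handle_shard_error(kind: str, server: str, error: Exception) -> bool:
    """Global error policy: return True to swallow the error, False to propagate it."""
    if kind == QUERY:
        # Other shards can probably still return something useful.
        notify_user_interface_about_server_failure(server, error)
        return True
    # A by-id request has no partial answer, so let the error surface.
    return False


def execute_on_shards(
    kind: str,
    operations: Dict[str, Callable[[], list]],
    on_error: Callable[[str, str, Exception], bool],
) -> list:
    """Run one operation per shard, consulting the injected handler on failure."""
    results: list = []
    for server, operation in operations.items():
        try:
            results.extend(operation())
        except Exception as error:
            if not on_error(kind, server, error):
                raise
    return results


def unavailable() -> list:
    raise ConnectionError("shard is down")


shards = {"shard-1": lambda: ["users/1"], "shard-2": unavailable}

# A query tolerates the failing shard and returns what the healthy one had.
partial = execute_on_shards(QUERY, shards, handle_shard_error)
```

With the same `shards` mapping, calling `execute_on_shards(LOAD_BY_ID, shards, handle_shard_error)` would raise the `ConnectionError` instead of swallowing it, since the handler returns `False` for anything that is not a query.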
That way, you probably have some warning in the UI about partial information, but you are still functional. This is the proper way to handle this, because you handle it once, in a single place, and can therefore handle it properly, instead of having to do the right thing everywhere.