Today I’d like to share something about my experience using Akka, specifically, Akka remoting for building a distributed application. No super-secret tricks, just what we’ve learned the hard way – by making mistakes. So here’s hoping this blog helps you avoid making the same mistakes.
Akka is a good fit for the Metafor system. Our system decomposes well into the actor model. It’s a large system full of semi-independent actors; each knows how to do its own thing, and how to talk to each other. We also have divisions that are one-to-many in bulk, so it makes sense to spin them off entirely into their own Java virtual machine to allow us a central server with smaller services clustered around it.
A quick overview: our central service is called the Central Provisioning Service (CPS). We only need one CPS, however more of these can be spun up for scaling purposes. Around the CPS is a cluster of Management Consoles (MC). These are what users interact with directly. And finally, we have agents which connect to the MCs. Each of these connections is a one-to-many relationship: one agent speaks to one MC, and one MC speaks to one CPS.
I’ll focus on the relationship between the MC and the CPS — this is where we used Akka remoting.
Given that the machines hosting the CPS and MCs would always be in the same subnet, we felt the limitations of remoting wouldn’t hinder us. Aside from some sharp edges, the Akka remoting solution has served us well.
To set up Akka remoting, you need a few pieces of information:
- Your exact own address is from the perspective of people trying to connect to you
- It must be an IP address you can bind to (which means NAT isn’t going to work)
So we set up the components to allow us to configure the bind address in the config files and away we went.
And it worked brilliantly… right up until we realized we were doing it wrong.
There’s a reason that the authors of Akka go on about how blocking is a bad thing. I’m going to echo this: blocking is bad. The reason is simple. Anywhere an ask is made, timeouts will happen. Timeouts that don’t need to happen.
We got random intermittent failures in our code which were caused by completely unrelated activities, simply because some resource was not as infinite as we assumed (CPU, network bandwidth, thread availability for actors to run in, etc).
Now yes, there are a few places where blocking is necessary. On one side of the application, you may need to block when a customer is making the actual request. Somebody has to either send a response to that HTTP socket or let it know a response isn’t coming. Astute readers will pick up on the possibility of just hanging the socket in some actor’s state and sending the reply later. This hasn’t worked for us in practice, but we also haven’t tried all REST frameworks.
On the other side, blocking will happen any time you access resources external to your system. Databases, disk access, sending network requests, etc – all of these block. Resist the urge to do it the easy way and block. Instead, spin these long running pieces of work off into external tasks such as special purpose actors, external threads, or similar constructs.
The point is this: any actor that expects to receive messages over time should not block. This is particularly relevant with remote actors since there are more things that can go wrong with a remote actor than a local one.
While a local actor can be blocked or slowed down, a remote one can suffer from even more problems: network issues such as latency spikes and packet drops. Using ask/reply with a remote actor will lead to timeouts. It’s better to send the request with a tag of some sort and have the reply include that tag. This allows the actor that’s seeking information to tie the response to whatever caused it to seek information in the first place.
If a timeout is required at the top of the request, that becomes the only timeout in the entire system. For extra robustness, when that timeout triggers a “never mind” message can be sent down the pipe to abort whatever is in progress.
Overall thoughts: Akka remoting is absolutely usable for internal communications between nodes of a distributed system. However, it’s not suitable for communications over an untrusted network because it’s less reliable once latency gets involved.