Load-testing is not trivial. It’s often not just about downloading JMeter or Gatling, recording some scenarios and then running them. Well, it might be just that, but you are lucky if it is. And what may sound like “Captain Obvious speaking”, it’s good to be reminded of some things that can potentially waste time.
So, when you run the tests, eventually you will hit a bottleneck, and then you’ll have to figure out where it is. It can be:
- client bottleneck – if your load-testing tool uses HttpURLConnection, the number of requests sent by the client is quite limited. You have to start from that and make sure enough requests are leaving your load-testing machine(s)
- network bottlenecks – check if your outbound connection allows the desired number of requests to reach the server
- server machine bottleneck – check the number of open files that your (most probably) linux server allows. For example, if the default is 1024, then you can have at most 1024 concurrent connections. So increase that (limits.conf)
- application server bottleneck – if the thread pool that handles requests is too low, requests may be kept waiting. If some other tiny configuration switch (e.g. whether to use NIO, which is worth a separate article) has the wrong value, that may reduce performance. You’d have to be familiar with the performance-related configurations of your server.
- database bottlenecks – check the CPU usage and response times of your database to see if it’s not the one slowing the requests. Misconfiguring your database, or having too small/few DB servers, can obviously be a bottleneck
- application bottleneck – these you’d have to investigate yourself, possibly using some performance monitoring tool (but be careful when choosing one, as there are many “new and cool”, but unstable and useless ones). We can divide this type in two:
- framework bottleneck – if a framework you are using has problems. This might be a web framework, a dependency injection framework, an actor system, an ORM, or even a JSON serialization tool
- application code bottleneck – if you are misusing a tool/framework, have blocking code, or just wrote horrible code with unnecessarily high computational complexity
You’d have to constantly monitor the CPU, memory, network and disk I/O usage of the machines, in order to understand when you’ve hit the hardware bottleneck.
One important aspect is being able to bombard your servers with enough requests. It’s not unlikely that a single machine is insufficient, especially if you are a big company and your product is likely to attract a lot of customers at the start and/or making a request needs some processing power as well, e.g. for encryption. So you may need a cluster of machines to run your load tests. The tool you are using may not support that, so you may have to coordinate the cluster manually.
As a result of your load tests, you’d have to consider how long does it make sense to keep connections waiting, and when to reject them. That is controlled by connect timeout on the client and registration timeout (or pool borrow timeout) on the server. Also have that in mind when viewing the results – too slow response or rejected connection is practically the same thing – your server is not able to service the request.
If you are on AWS, there are some specifics. Leaving auto-scaling apart (which you should probably disable for at least some of the runs), you need to have in mind that the ELB needs warming up. Run the tests a couple of times to warm up the ELB (many requests will fail until it’s fine). Also, when using a load-balancer and long-lived connections are left open (or you use WebSocket, for example), the load balancer may leave connections from itself to the servers behind it open forever and reuse them when a new request for a long-lived connection comes.
Overall, load (performance) testing and analysis is not straightforward, there are many possible problems, but is something that you must do before release. Well, unless you don’t expect more than 100 users. And the next time I do that, I will use my own article for reference, to make sure I’m not missing something.