Unicorn and Puma are two of the most-used application servers for Ruby apps. How similar and how different are they? In the past Unicorn was the most popular; nowadays Puma is the leader.
Processes running on your computer (text editor, browser, ...) require CPU time; without it, the tasks they are meant to perform won't get done. If your machine has X cores, it just means that at any single point in time X processes can be, well... processed. To find out how many cores your computer has, type nproc --all in the terminal. I'm using a fairly old laptop and I got 2 (but hey... it's still more than 1).
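The same number can be read from inside Ruby. A minimal sketch using the standard library's Etc module (the printed value depends on your machine):

```ruby
require "etc"

# Number of online processors visible to this Ruby process.
cores = Etc.nprocessors
puts "This machine reports #{cores} core(s)"
```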
OK... but when we're developing apps we have a text editor running along with many other apps, and the number of processes definitely exceeds 2, yet everything seems to work simultaneously. How is that possible? It's because the system switches between processes and gives each one of them some CPU time. If it's done often enough, everything seems to be executed simultaneously (but in fact only 2 of them can be at any instant). The more CPU-hungry processes you start at once, the less responsive the system becomes, because it has to switch again and again to "satisfy" every process (it tries to be "fair" as far as access to the CPU is concerned).
But how does it apply to our topic?
Unicorn is a multi-process, single-threaded web server. What does that mean? When I start the web server with 2 workers and open htop, I see 3 unicorn processes (1 master process and 2 workers).
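For illustration, a Unicorn setup like this one could be described by a config file along the following lines. This is only a sketch: the file path, port and timeout are assumed values, not the ones used for this post.

```ruby
# config/unicorn.rb (hypothetical path and values)
worker_processes 2   # the master forks 2 workers and supervises them
listen 8080          # assumed port
timeout 30           # seconds before the master kills a stuck worker
```

The server would then be started with something like unicorn -c config/unicorn.rb.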
Every HTTP request that comes from the internet is processed by one of the workers. In the best-case scenario, when 2 clients send requests to the server, each request gets its own worker (process) and both are processed simultaneously (one worker per core). But what happens when a 3rd request comes in? It's queued and waits until one of the workers finishes processing.
If your app is not fast enough, a burst of traffic may leave a lot of requests queued, waiting for an available worker. OK, but can't we just add more workers? We can... but remember the 2 cores I have on board. Even if I increase the number of workers to 10, it doesn't mean I'll be able to process 10 requests at once. It just means the system will be running 2 of them at any moment, and every once in a while it switches to a different worker and gives it a bit of CPU time.
Consider this. You've got 2 routes in your API. The first, slow_IO, takes 5 seconds to complete (essentially it waits for an external system; it's an I/O-bound action, simulated in my example by sleep(5)). The second, fast_CPU, computes something and returns the result almost immediately.
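For reference, the two routes could look roughly like the Rack-style app below. This is a sketch, not the code used for the measurements: the APP constant, the URL paths, the SLOW_IO_SECONDS constant and the arithmetic inside fast_CPU are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for the two routes. Any object that responds to
# #call(env) and returns [status, headers, body] can be served by Unicorn or Puma.
SLOW_IO_SECONDS = 5

APP = lambda do |env|
  case env["PATH_INFO"]
  when "/slow_IO"
    sleep(SLOW_IO_SECONDS)             # simulates waiting for an external system
    [200, { "content-type" => "text/plain" }, ["slow_IO done\n"]]
  when "/fast_CPU"
    result = (1..10_000).reduce(:+)    # small CPU-bound computation
    [200, { "content-type" => "text/plain" }, ["fast_CPU result: #{result}\n"]]
  else
    [404, { "content-type" => "text/plain" }, ["not found\n"]]
  end
end

# In a config.ru file you would finish with: run APP
```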
Unicorn Workers Count == Cores
If I had only 2 Unicorn workers and both were busy with slow_IO requests when 3 new fast_CPU requests appeared, the new requests would be queued and forced to wait until the first slow_IO completes (so they would start being processed only after 5 seconds).
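The queueing described above can be modelled with a few lines of Ruby. This is a deliberately simplified sketch: all requests arrive at the same moment, every worker handles one request at a time, and the 100 ms duration for fast_CPU is an assumed value (the post only says "almost immediately").

```ruby
# Requests are [name, duration_ms] pairs that all arrive at t = 0, in order.
# Each worker handles one request at a time (Unicorn's model); a request
# goes to whichever worker becomes free first.
def schedule(requests, worker_count)
  free_at = Array.new(worker_count, 0)     # when each worker becomes idle (ms)
  requests.map do |name, duration_ms|
    worker = free_at.index(free_at.min)    # earliest-available worker
    started = free_at[worker]
    free_at[worker] = started + duration_ms
    { name: name, started_ms: started, finished_ms: free_at[worker] }
  end
end

REQUESTS = [["slow_IO", 5000], ["slow_IO", 5000],
            ["fast_CPU", 100], ["fast_CPU", 100], ["fast_CPU", 100]]

schedule(REQUESTS, 2).each do |r|
  puts format("%-8s started at %5d ms, finished at %5d ms",
              r[:name], r[:started_ms], r[:finished_ms])
end
```

With 2 workers, the first fast_CPU request in this model cannot start before the 5000 ms mark, which matches the scenario above.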
Unicorn Workers Count > Cores
If we have 3 workers (essentially any number greater than the number of cores), the system switches between them every once in a while. A worker that is waiting on I/O needs almost no CPU time, so even if the "fast" requests reach the server after the slow_IO ones, they get a free worker and some CPU time, and thus should complete before the slow_IO ones. My tests confirm that.
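This effect can be reproduced without a web server. The sketch below is a stand-in for the real test, not the test itself: it forks one child process per job (much like Unicorn forks workers), lets two of them sleep to imitate slow_IO, and reports the order in which the children exit. The job names and durations are made up, and fork requires a Unix-like system.

```ruby
# Runs each job in its own forked process and returns the job names
# in the order in which the processes finished.
def finish_order(jobs)
  names_by_pid = {}
  jobs.each do |name, job|
    pid = fork do
      job.call
      exit!(0)                 # leave the child immediately, skipping at_exit hooks
    end
    names_by_pid[pid] = name
  end
  Array.new(jobs.size) { names_by_pid.fetch(Process.wait) }
end

JOBS = [
  ["slow_IO_1", -> { sleep 0.5 }],                  # imitates waiting on I/O
  ["slow_IO_2", -> { sleep 0.5 }],
  ["fast_CPU",  -> { (1..200_000).reduce(:+) }]     # short CPU-bound work
]

puts finish_order(JOBS).inspect
```

Even though fast_CPU is started last, it should be the first name in the printed list, because the two sleeping processes do not compete for the CPU.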
The Sky Is the Limit
OK, so if adding more workers improved overall performance, can't we just add, let's say, 100 additional worker processes? Will it work? Probably not (the system would spend too much time context switching between processes). However, there is one more factor that comes into play: RAM. Every worker is a process with the whole application loaded. If the application is really big, memory becomes the bottleneck and you'll eventually run out of it. Consider this (primitive) showcase.
a) 5 workers, each loading a tiny app. Snapshot from htop (memory is in KB):
b) the same workers, but I load a relatively big array on app start (config.ru). Clearly, more RAM is consumed.
SOME_CONST = []
(1..3_500_000).each { SOME_CONST << "foo" }
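If you don't have htop at hand, the growth can also be observed from inside a single Ruby process. The sketch below is an assumption-heavy approximation: rss_kb is a helper I made up for this example, it reads /proc on Linux and falls back to ps elsewhere, and the exact numbers will differ between Ruby versions and machines.

```ruby
# Resident set size of the current process in KB.
# Reads /proc on Linux and falls back to `ps` elsewhere (e.g. macOS).
def rss_kb
  status = "/proc/self/status"
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

rss_before = rss_kb
BIG_ARRAY = Array.new(3_500_000) { "foo" }   # one new String object per element
rss_after = rss_kb
puts "RSS before: #{rss_before} KB, after: #{rss_after} KB"
```

Every worker process that loads such an array on start needs room for it, which is why the per-worker memory figure matters when you multiply workers.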
We already know what multi-process means, but what is a multi-threaded application? The simple answer is this: imagine you've got a desktop application in which one of the possible actions is to compute a complex equation. If the app is single-threaded, you'll almost certainly notice a frozen UI, because the thread that controls the UI is now busy with the computation.
Introducing an additional thread gives you the chance to delegate this computation to another thread while still allowing the main thread to handle UI events. Of course, the only way to have both threads do their job truly in parallel is to run the instructions of each one on a separate core. Because threads belong to the same process, they share resources (memory, the process's file descriptors, etc.). As usual, everything comes at a price. Writing multi-threaded applications is more difficult: operating on the same data and changing it from different threads can quickly introduce hard-to-track errors.
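A tiny Ruby sketch of that idea, with no real UI involved: the main thread plays the role of the UI loop and a second thread runs the computation. The method name and the summation used as the "complex equation" are placeholders chosen for this example.

```ruby
# The "UI" loop keeps running while a second thread does the heavy computation.
def compute_in_background
  worker = Thread.new { (1..2_000_000).reduce(:+) }   # stands in for the complex equation
  ui_ticks = 0
  # Thread#join with a time limit returns nil while the worker is still busy,
  # so each iteration here represents a chance to handle UI events.
  ui_ticks += 1 while worker.join(0.001).nil?
  [worker.value, ui_ticks]
end

result, ui_ticks = compute_in_background
puts "result=#{result}, UI loop iterations while waiting: #{ui_ticks}"
```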
Because of the GIL (Global Interpreter Lock) in MRI, it's not possible to run two threads on two cores at the same time within one process. This may change in the future (Ruby 3?); for now, however, the easiest way to get true multi-threading and still use Ruby is to use... JRuby and rely on the JVM. All in all, it's worth knowing that you can get some benefit from multi-threading even when using MRI (essentially it comes down to running one thread within your process at a time and occasionally switching between threads).
Puma is a multi-process, multi-threaded server. When the server starts, you declare the minimum and the maximum number of threads for each of its workers. In the simplest case, you may define N workers with 1 thread each, which essentially should work similarly to N Unicorn workers. The good things happen when you decide to have more threads. Then, when new requests come in, they may be processed by the same worker but by different threads (one thread per request), up to the maximum number of threads per process.
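In a Puma config file this could look roughly as follows. The file path, the port and the concrete numbers are assumptions picked for illustration, not the settings behind the results in this post.

```ruby
# config/puma.rb (hypothetical path and values)
workers 2        # number of forked worker processes
threads 1, 4     # minimum and maximum number of threads per worker
port 3000        # assumed port
```

Setting threads 1, 1 would give the "N workers with 1 thread" case mentioned above.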
Most of the web applications we work on daily spend more time on I/O than on CPU-intensive tasks: requests to external web services, DB calls, etc. When your app is waiting for I/O in thread_1, it's the best moment to switch to another thread (thread_2) and start processing a new request, even if a moment later you switch back to the first thread. Normally thread_1 would just be wasting time while waiting for the I/O to complete, but this way you have at least started processing the second request in thread_2.
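You can see this overlap in plain MRI with sleep standing in for the I/O wait. A minimal sketch; the method name, the number of calls and the 0.2 s wait are arbitrary values chosen for the example.

```ruby
require "benchmark"

# Runs `count` simulated I/O calls (sleep) either one after another or in
# separate threads, and returns the elapsed wall-clock time in seconds.
def timed_io_calls(count, wait_seconds, threaded:)
  Benchmark.realtime do
    if threaded
      Array.new(count) { Thread.new { sleep(wait_seconds) } }.each(&:join)
    else
      count.times { sleep(wait_seconds) }
    end
  end
end

sequential = timed_io_calls(4, 0.2, threaded: false)
threaded   = timed_io_calls(4, 0.2, threaded: true)
puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

The threaded run should take roughly as long as a single wait, because a sleeping thread lets the others run.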
Puma 1 Worker, Multiple Threads
In Unicorn's model, 1 worker would lead to a queue, and requests would be processed sequentially. As we can see from the chart below (1 worker, 4 threads min, 10 threads max), the fast_CPU requests finished before the slow_IO ones, even though the slow_IO requests had been started earlier.
A great benefit of this configuration is lower RAM consumption (Unicorn would require 4 workers).
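The single-worker, multi-thread behaviour can be imitated with a small thread pool inside one Ruby process. This is a rough sketch of the idea, not Puma's actual implementation; the serve method, the request names and the durations are invented for the example.

```ruby
# Serves the given requests with a fixed pool of threads inside one process
# and returns the request names in completion order.
def serve(requests, thread_count)
  pending = Queue.new
  requests.each { |request| pending << request }
  pending.close                      # pop returns nil once the queue is drained
  finished = Queue.new               # Queue is thread-safe
  pool = Array.new(thread_count) do
    Thread.new do
      while (request = pending.pop)
        name, job = request
        job.call
        finished << name
      end
    end
  end
  pool.each(&:join)
  Array.new(finished.size) { finished.pop }
end

POOL_REQUESTS = [
  ["slow_IO", -> { sleep 0.3 }],
  ["slow_IO", -> { sleep 0.3 }],
  ["fast_CPU", -> { (1..50_000).reduce(:+) }]
]

puts serve(POOL_REQUESTS, 4).inspect   # several threads: fast_CPU is not stuck behind slow_IO
puts serve(POOL_REQUESTS, 1).inspect   # one thread: strictly sequential, like a single Unicorn worker
```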
Usually, our actions are a mix of I/O and CPU work. Below is the simulation for a route that includes both I/O and CPU operations (the both_CPU_and_IO route). My configurations were:
1. Unicorn workers = 2 * cores_number
2. Puma workers = 2 * cores_number, each with 2 threads
3. Puma workers = 2 * cores_number, each with 4 threads
Puma's configuration was the most performant. In general, the performance difference between servers depends on many factors. If I ran the tests for CPU-intensive actions, it would probably turn out that the performance difference between the servers is close to zero. As always, everything should be benchmarked before decisions are made.
Multiple threads within a process may lead to problems you are not aware of at the beginning. Because all your threads share memory, changing a global variable (which is a bad idea anyway) instantly affects not only the current thread but also all the others.
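When shared state really has to be modified from several threads, the usual tool is a lock. A minimal sketch with Ruby's built-in Mutex; the method name and the numbers are made up for the example.

```ruby
# Increments one shared counter from several threads. The read-modify-write
# `counter += 1` is wrapped in a Mutex so that no update can be lost.
def count_with_threads(thread_count, increments_per_thread)
  counter = 0
  lock = Mutex.new
  threads = Array.new(thread_count) do
    Thread.new do
      increments_per_thread.times { lock.synchronize { counter += 1 } }
    end
  end
  threads.each(&:join)
  counter
end

puts count_with_threads(4, 10_000)   # prints 40000
```

Without the lock the result is not guaranteed, since a thread switch can happen between reading and writing the counter.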
This post touches on only a few aspects of the topic, which is why I recommend reading more about it and, of course, benchmarking.