HTTP is probably the most popular application-level protocol, and there are many libraries that implement it on top of network I/O, which is a special (stream-oriented) case of general I/O. Since all I/O has much in common, let’s start with some discussion about it.
I’ll concentrate on I/O cases with lots of concurrent HTTP requests, for example micro-services, where a set of higher-level HTTP services invoke several lower-level ones, some concurrently and some sequentially due to data dependencies.
When serving many such requests, the total number of concurrently open connections can become large at times, either because of data dependencies or because the lower-level services are slow (or slowed down by exceptional conditions). So microservice layers tend to require many concurrent, potentially long-lived connections. To see how many open connections we are required to support without crashing, let’s recall Little’s Law, with Ψ being the average number of in-progress requests, ρ the average arrival rate and τ the average completion time:
Ψ = ρ τ
The number of in-progress requests we can support depends on the language runtime, the OS and the hardware; the average request completion time (or latency) depends on what we have to do in order to fulfill the requests, including of course the calls to any lower-level services, access to storage, etc.
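To make Little’s Law concrete, here’s a minimal Java sketch that plugs in a few numbers: the rates and latencies match two of the load-test shapes used in this post, and the 15000 cap is the upper end of the OS thread range quoted here. The class and method names are mine and purely illustrative.

```java
/** Little's Law: in-progress requests = arrival rate x completion time. */
final class LittlesLaw {
    /** Average number of in-progress requests for a given rate (req/s) and latency (ms). */
    static double inProgress(double arrivalRatePerSec, double latencyMillis) {
        return arrivalRatePerSec * latencyMillis / 1000.0;
    }

    /** Highest sustainable arrival rate (req/s) given a cap on in-progress requests. */
    static double maxRate(double maxInProgress, double latencyMillis) {
        return maxInProgress * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        System.out.println(inProgress(1_000, 1_000)); // 1k req/s, each taking 1 s
        System.out.println(inProgress(10_000, 100));  // 10k req/s, each taking 100 ms
        System.out.println(maxRate(15_000, 1_000));   // cap of 15k in-progress, 1 s each
    }
}
```

Read the other way round, the same formula says that with one thread per in-progress request, a cap on threads directly caps the arrival rate you can sustain at a given latency.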
How many concurrent HTTP requests can we support? Each will need an open connection and some runnable primitive that can read/write on it using syscalls. If the memory, I/O subsystem and network bandwidth can keep up, modern OSes can support hundreds of thousands of open TCP connections; the runnable primitives they provide to work on sockets are threads. Threads are much more heavyweight than sockets: a single box running a modern OS can only support 5000-15000 of them.
From 10,000 Feet: I/O Performance on the JVM
Nowadays JDK threads are OS threads on most platforms, but if at any time there are only a few concurrent connections then the “thread-per-connection” model is perfectly fine.
What if not? The answer to this question has changed over time:
- JDK pre-1.4 only had libraries calling into the OS’ thread-blocking I/O (java.io packages), so only the “thread-per-connection” model or thread-pools could be used. If you wanted something better you’d tap into your OS’ additional features through JNI.
- JDK 1.4 added non-blocking I/O or NIO (java.nio packages) to read/write from connections only if it can be done immediately, without putting the thread to sleep. Even more importantly it added a way for a single thread to work effectively on many channels with socket selection, which means asking the OS to block the current thread and unblock it when it is possible to receive/send data immediately from at least one socket of a set.
- JDK 1.7 added NIO.2, also known as asynchronous I/O (still java.nio packages). This means asking the OS to perform I/O tasks completely in the background and wake up a thread with a notification later on, only when the I/O has finished.
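To give a feel for what socket selection looks like in practice, here’s a minimal sketch of a single thread serving any number of connections through a java.nio Selector. It’s a toy echo server, not an HTTP one; the class name, the echo behaviour and the roundTrip helper are my own illustrative choices, and error handling is reduced to the bare minimum.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Illustrative echo server: one thread serves every connection through a Selector. */
final class SelectorEcho implements Runnable, AutoCloseable {
    private final Selector selector;
    private final ServerSocketChannel server;

    SelectorEcho() throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0)); // ephemeral port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() {
        return server.socket().getLocalPort();
    }

    @Override
    public void run() {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        try {
            while (selector.isOpen()) {
                selector.select(); // blocks until at least one registered channel is ready
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buf.clear();
                        if (client.read(buf) < 0) { // peer closed the connection
                            client.close();
                            continue;
                        }
                        buf.flip();
                        // Simplification: a real server would register OP_WRITE instead of looping.
                        while (buf.hasRemaining()) client.write(buf);
                    }
                }
            }
        } catch (IOException | ClosedSelectorException | CancelledKeyException e) {
            // Selector closed or I/O failure: this sketch simply stops serving.
        }
    }

    @Override
    public void close() throws IOException {
        selector.close();
        server.close();
    }

    /** Demo helper: starts the server, sends msg with a plain blocking socket, returns the echo. */
    static String roundTrip(String msg) {
        try (SelectorEcho echo = new SelectorEcho()) {
            Thread t = new Thread(echo, "selector");
            t.setDaemon(true);
            t.start();
            try (Socket s = new Socket(InetAddress.getLoopbackAddress(), echo.port())) {
                byte[] out = msg.getBytes(StandardCharsets.UTF_8);
                s.getOutputStream().write(out);
                byte[] in = new byte[out.length];
                new DataInputStream(s.getInputStream()).readFully(in);
                return new String(in, StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("ping"));
    }
}
```

Even in this tiny form you can see the cost: the control flow is inverted into an event loop, and per-connection state (here just a shared buffer) has to be managed by hand.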
Calling HTTP From the JVM Either Easily or Efficiently: the Thread-blocking and Async Toolboxes
There’s a wide selection of open-source HTTP client libraries available for the JVM. The thread-blocking APIs are easy to use and to maintain but potentially less efficient with many concurrent requests, while the async ones are efficient but harder to use. Asynchronous APIs also virally affect your code with asynchrony: any method consuming asynchronous data must be asynchronous itself, or block and nullify the advantages of asynchrony.
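The “viral” nature of asynchrony is easy to see with the JDK’s own CompletableFuture. In this small sketch, fetchAsync is a hypothetical stand-in for an async HTTP call (it fabricates a body rather than touching the network); any method that uses its result must either return a future itself or block.

```java
import java.util.concurrent.CompletableFuture;

final class AsyncVirality {
    /** Hypothetical async HTTP call; it fabricates a body instead of touching the network. */
    static CompletableFuture<String> fetchAsync(String url) {
        return CompletableFuture.supplyAsync(() -> "body of " + url);
    }

    /** Stays async: the caller gets a future back and has to become async as well. */
    static CompletableFuture<Integer> bodyLengthAsync(String url) {
        return fetchAsync(url).thenApply(String::length);
    }

    /** Looks synchronous, but join() parks the calling thread until the future completes. */
    static int bodyLengthBlocking(String url) {
        return bodyLengthAsync(url).join();
    }

    public static void main(String[] args) {
        bodyLengthAsync("/users").thenAccept(n -> System.out.println("async: " + n)).join();
        System.out.println("blocking: " + bodyLengthBlocking("/users"));
    }
}
```

The second variant is the one that “nullifies the advantages”: every caller of bodyLengthBlocking ties up an OS thread for the whole duration of the request.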
Here’s a selection of open-source HTTP clients for Java and Clojure:
- URLConnection uses traditional thread-blocking I/O.
- Apache HTTP Client uses traditional thread-blocking I/O with thread-pools.
- Apache Async HTTP Client uses NIO.
- Jersey is a ReST client/server framework; the client API can use several HTTP client backends including URLConnection and Apache HTTP Client.
- OkHttp uses traditional thread-blocking I/O with thread-pools.
- Retrofit turns your HTTP API into a Java interface and can use several HTTP client backends including Apache HTTP Client.
- Grizzly is a network framework with low-level HTTP support; it was using NIO but it switched to AIO.
- Netty is a multi-transport network framework with low-level HTTP support; it includes NIO and native transports (the latter uses …).
- Jetty Async HTTP Client uses NIO.
- Async HTTP Client wraps either Netty, Grizzly or JDK’s HTTP support.
- clj-http wraps the Apache HTTP Client.
- http-kit is an async subset of clj-http implemented partially in Java directly on top of NIO.
- http async client wraps the Async HTTP Client for Java.
From 10,000 Feet: Making it Easy
Since Java threads are heavy on resources, if we want to perform I/O and scale to many concurrent connections we have to use either NIO or async NIO; on the other hand they are much more difficult to code and maintain. Is there a solution to this dilemma?
If threads weren’t heavy we could just use straightforward blocking I/O, so our question really is: can we have cheap enough threads that could be created in much larger numbers than OS threads?
At present the JVM itself doesn’t provide lightweight threads but Quasar comes to the rescue with fibers, which are very efficient threads, implemented in userspace.
Calling HTTP From the JVM Both Easily and Efficiently: the Comsat Fiber-blocking Toolbox
Comsat integrates some of the existing libraries with Quasar fibers. The Comsat APIs are identical to the original ones, and the HTTP clients section explains how to hook them in; for the rest, simply ensure you’re running Quasar properly, fire up your fibers when you need to perform a new HTTP call and use one (or more) of the following fiber-blocking APIs (or take inspiration from the templates and examples):
- An extensive subset of the Apache HTTP Client API, integrated by bridging the async one. Apache HTTP Client is mature, efficient, feature-complete and very widely used.
- The fiber-blocking Retrofit API wraps the Apache client. Retrofit is a modern and high-level HTTP client toolkit that has been drawing a lot of interest also for ReST.
- The JAXRS synchronous HTTP client API, integrated by bridging Jersey’s async one. Jersey is a very popular JAXRS-compliant framework for ReST, so several micro-services could decide to use both its server and client APIs.
- The OkHttp synchronous API, integrated by bridging the OkHttp async API. OkHttp performs very well, is cheap on resources and feature-rich yet at the same time it has a very straightforward API for common cases, plus it supports HTTP2 and SPDY as well.
- An extensive subset of the clj-http API, integrated by bridging the async API of http-kit. clj-http is probably the most popular HTTP client API in the Clojure ecosystem.
New integrations can be added easily and of course contributions are always welcome.
Some Load Tests with JBender
JBender is Pinterest’s Quasar-based network load testing framework. It’s efficient and flexible, yet thanks to Quasar’s fiber-blocking its source code is tiny and readable; using it is just as straightforward as using traditional thread-blocking I/O.
Consider this project, which builds on JBender and with a tiny amount of code implements HTTP load test clients for all the Comsat-integrated libraries, both in their original thread-blocking version and in Comsat’s fiber-blocking one.
JBender can use either (plain, heavyweight, OS) threads or fibers to perform requests; both are abstracted by Quasar into a shared abstract class called Strand, so the thread-blocking and fiber-blocking versions share HTTP code. This shows that the Comsat-integrated APIs are exactly the same as the original ones and that fibers and threads are used in exactly the same way.
The load-test clients accept parameters to customize pretty much every aspect of their run but the test cases we’ll consider are the following:
- 41000 long-lived HTTP connections fired at the highest possible rate.
- Executing 10000 requests (plus 1000 of initial client and server warmup) lasting 1 second each with a target rate of 1000 rps.
- Executing 10000 requests (plus 1000 of initial client and server warmup) lasting 100 milliseconds each with a target rate of 10000 rps.
- Executing 10000 requests (plus 1000 of initial client and server warmup) with an immediate reply and a target rate of 100000 rps.
All of the tests have been fired against a server running Dropwizard, optimized to employ fibers on the HTTP server-side with comsat-dropwizard for maximum concurrency. The server simply replies to any request with “Hello!”
Here’s some information about our load test environment:
| Component / setting | Value |
|---|---|
| Parallel Universe Stack | Quasar 0.7.4-SNAPSHOT, Comsat 0.5.0 |
| Fiber Server (comsat-dropwizard 0.5.0, Jetty 9.2.9) | AWS EC2 Linux m4.xlarge (16 GB, 4 vcpus, high net perf) |
| Client (GET “/” -> 204 “Hello”) | AWS EC2 Linux t2.medium (4 GB, 2 vcpus, moderate-to-low net perf) |
| JBender load test suite (server + clients) | https://github.com/circlespainter/comsat-http-client-bench |
| CPU/RAM Monitoring method | JFR |
| CPU/RAM sampling interval | JFR’s default |
| HTTP Client settings | No retries, maximum-sized connection pool, I/O threads (only async) = <cpus> (= 2 for t2.medium), connect/read/write/ttl timeout = 1h |
| Warmup | 1000 reqs, both server and client |
| Request generator buffer | = # reqs with pre-generation for throughput tests, = 1 for concurrency tests |
| Request completion events buffer | = # reqs with throughput tests, = 1 for concurrency tests |
The first important result is that the Comsat-based clients win hands-down, each compared with its respective non-fiber mode: Apache’s for many long-lasting connections and OkHttp’s for lots of short-lived requests with a very high target rate, both with a small and a bigger heap (resp. 990 MiB and 3 GiB, showing just the first one for brevity):
| HTTP Client Load Test | Metric | Regular (thread-blocking) Apache (4.4.1) | Comsat (fiber-blocking) Apache (async 4.1) | Regular (thread-blocking) OkHttp 2.4.0 | Comsat (fiber-blocking) OkHttp 2.4.0 | Regular (thread-blocking) Jersey 2.19 w/JDK connector | Comsat (fiber-blocking) Jersey 2.19 w/JDK connector |
|---|---|---|---|---|---|---|---|
| | | AHC blocking (BIO) | AHC async (NIO) + Quasar fibers | OkHttp blocking (BIO) | OkHttp async (BIO) + Quasar fibers | Jersey blocking | Jersey async + Quasar fibers |
| Long-lived concurrent 41k (maximum rate possible) | Max | 16715 | 41k | 16358 | 16608 | 16713 | 16713 |
| | Error | OOM - thread | - | OOM - thread | OOM - thread | OOM - thread | OOM - thread |
| | Heap max (MiB) | N/A | 702 | N/A | N/A | N/A | N/A |
| | Heap avg (MiB) | N/A | 246 | N/A | N/A | N/A | N/A |
| Throughput with target rate 1k (response after 1s) | Time max (ms) | 7139 | 1138 | 10209 | 1301 | 6341 | 4370 |
| | Time avg (ms) | 2359 | 1002 | 3031 | 1008 | 1902 | 1477 |
| | Heap max (MiB) | 227 | 110 | 125 | 119 | 330 | 342 |
| | Heap avg (MiB) | 61 | 29.8 | 34.9 | 30 | 76.8 | 73.7 |
| Throughput with target rate 10k (response after 100ms) | Time max (ms) | 4898 | 4085 | 7939 | 7079 | 45198 | 14512 |
| | Time avg (ms) | 2479 | 2717 | 3423 | 2125 | 25885 | 7594 |
| | Heap max (MiB) | 338 | 192 | 179 | 165 | 495 | 489 |
| | Heap avg (MiB) | 91.3 | 67.2 | 40.9 | 38.5 | 147 | 155 |
| Throughput with target rate 100k (immediate response) | Time max (ms) | 6937 | 3590 | 4668 | 1793 | 9303 | 9840 |
| | Time avg (ms) | 1468 | 1821 | 1287 | 826 | 1659 | 3442 |
| | Heap max (MiB) | 226 | 188 | 130 | 113 | 354 | 398 |
| | Heap avg (MiB) | 62.2 | 66 | 36.1 | 33.2 | 79.5 | 122 |

Notes: OkHttp doesn’t use NIO but regular blocking I/O under the hood. Jersey uses one thread per connection even in the async case.
OkHttp excels in speed and memory utilization for fast requests. The fiber version for the JVM uses the async API and performs significantly better even though the underlying mechanism is traditional blocking I/O served by a thread pool.
Even more impressive is the margin by which comsat-httpkit wins against a traditional clj-http client (still showing just the small heap):
| HTTP Client Load Test | Metric | clj-http | comsat-httpkit |
|---|---|---|---|
| | | AHC blocking (BIO) | http-kit async (NIO) + Quasar fibers |
| Long-lived concurrent 41k (maximum rate possible) | Max | 15715 | 41k |
| | Error | OOM - thread | - |
| | Heap max (MiB) | N/A | 511 |
| | Heap avg (MiB) | N/A | 127 |
| Throughput with target rate 1k (response after 1s) | Time max (ms) | 19059 | 1102 |
| | Time avg (ms) | 8720 | 1003 |
| | Heap max (MiB) | 405 | 331 |
| | Heap avg (MiB) | 94.4 | 64.3 |
| Throughput with target rate 10k (response after 100ms) | Time max (ms) | 22045 | 5545 |
| | Time avg (ms) | 8960 | 4102 |
| | Heap max (MiB) | 406 | 250 |
| | Heap avg (MiB) | 117 | 52.1 |
| Throughput with target rate 100k (immediate response) | Time max (ms) | 42849 | 3438 |
| | Time avg (ms) | 34750 | 4698 |
| | Heap max (MiB) | 523 | 259 |
| | Heap avg (MiB) | 481 | 50.2 |
There are other Jersey providers as well (Grizzly, Jetty and Apache) but Jersey proved the worst of the bunch with a generally higher footprint and an async interface (used by Comsat’s fiber-blocking integration) that unfortunately spawns and blocks a thread for each and every request; for this reason (and probably also due to each provider’s implementation strategy) the fiber version sometimes provides clear performance benefits and sometimes doesn’t. Anyway these numbers are not as interesting as the Apache, OkHttp and http-kit ones so I’m not including them here, but let me know if you’d like to see them.
(Optional) From 100 < 10,000 Feet: More About I/O Performance on the JVM
So you want to know why fibers are better than threads in highly concurrent scenarios.
When only a few concurrent sockets are open, the OS kernel can wake up blocked threads with very low latency. But OS threads are general purpose and they add considerable overhead for many use cases: they consume a lot of kernel memory for bookkeeping, synchronization syscalls can be orders of magnitude slower than procedure calls, context switching is expensive, and the scheduling algorithm is too generalist. All of this means that at present OS threads are just not the best choice for fine-grained concurrency with significant communication and synchronization, nor for highly concurrent systems in general.
Blocking I/O syscalls can indeed block expensive OS threads indefinitely, so a “thread-per-connection” approach will tear your system down very fast when you’re serving lots of concurrent connections; on the other hand, using a thread-pool will probably make the “accepted” connection queue overflow because we can’t keep up with the arrival pace, or at the very least it will cause unacceptable latencies. A “fiber-per-connection” approach instead is perfectly sustainable because fibers are so lightweight.
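The thread-pool failure mode can be imitated in a few lines with a plain ThreadPoolExecutor. This is only an analogy: the bounded executor queue stands in for the “accepted” connection queue, and a latch that is never released during the run stands in for a slow lower-level service. The class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolOverflow {
    /** Submits `requests` slow tasks to a bounded pool and counts how many get rejected. */
    static int rejected(int threads, int queueSize, int requests) {
        CountDownLatch slowBackend = new CountDownLatch(1); // released only when the run is over
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueSize));
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            slowBackend.await(); // the "request" blocks its thread
                        } catch (InterruptedException ignored) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // no free thread and no room left in the queue
                }
            }
        } finally {
            slowBackend.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(rejected(2, 3, 10));
    }
}
```

With 2 threads and a queue of 3, only 5 of the 10 blocked requests fit; the rest are turned away, which is the in-process equivalent of dropped connections.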
Summing it up: threads can be better at latency with few concurrent connections and fibers are better at throughput with many concurrent connections.
Of course fibers need to run on top of active OS threads because the OS knows nothing about fibers, so fibers are scheduled on a thread pool by Quasar. Quasar is just a library and runs entirely in user-space, which means that a fiber performing a syscall will block its underlying JVM thread for the entire call duration, making it unavailable to other fibers. That’s why it’s important that such calls are as short as possible, and especially that they don’t wait for a long time or, even worse, indefinitely: in practice fibers should only perform non-blocking syscalls. So how can we make blocking HTTP clients run so well on fibers? As those libraries provide a non-blocking (but inconvenient) API as well, we convert that async API into a fiber-blocking one and use it to implement the original blocking API. The new implementation (which is very short and is little more than a wrapper) will:
- Block the current fiber.
- Start an equivalent asynchronous operation, and pass in a completion handler that will unblock the fiber when finished.
From the fiber’s (and programmer’s) perspective the execution will restart after the library call when I/O completes, just like when using a thread and a regular thread-blocking call.
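The shape of such a wrapper can be sketched with nothing but the JDK. Note the loud assumption: this sketch parks a plain thread via LockSupport as a stand-in for parking a fiber, and AsyncCall is a hypothetical callback-style interface, not any real library’s API; it is not Quasar’s actual machinery. The pattern is the same, though: start the async operation with a completion handler that unblocks the caller, then block until it fires.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;
import java.util.function.Consumer;

final class BlockingBridge {
    /** Hypothetical callback-style API, standing in for a library's async client. */
    interface AsyncCall<T> {
        void start(Consumer<T> onComplete);
    }

    /** Starts the async call, then parks the caller until the completion handler fires. */
    static <T> T await(AsyncCall<T> call) {
        Thread caller = Thread.currentThread();
        AtomicReference<T> result = new AtomicReference<>();
        AtomicBoolean done = new AtomicBoolean();

        call.start(value -> {
            result.set(value);
            done.set(true);
            LockSupport.unpark(caller); // safe even if the caller hasn't parked yet
        });

        // The loop guards against spurious wakeups; interrupt handling is omitted for brevity.
        while (!done.get()) {
            LockSupport.park();
        }
        return result.get();
    }

    public static void main(String[] args) {
        // The "async operation" completes on another thread, yet the call reads like blocking code.
        String body = await(cb -> new Thread(() -> cb.accept("Hello!")).start());
        System.out.println(body);
    }
}
```

With a thread as the caller this buys nothing, since an OS thread is still blocked; the whole point of doing it with fibers is that the thing being parked is cheap.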
With Quasar and Comsat you can easily write and maintain highly concurrent and HTTP-intensive code in Java, Clojure or Kotlin and you can even choose your favorite HTTP client library, without any API lock-ins. Do you want to use something else? Let us know, or integrate it with Quasar yourself.
- …and much not in common, for example file I/O (which is block-oriented) supports memory-mapped I/O which doesn’t make sense with stream-oriented I/O.
- Read this blog post for further discussion.
- Not so before 1.2, when it had (only) Green Threads.
- Using thread-pools means dedicating a limited or anyway managed amount (or pool) of threads to fulfill a certain type of tasks, in this case serving HTTP requests: incoming connections are queued until a thread in the pool is free to serve it (as an aside, “connection pooling” is something entirely different and it’s most often about reusing DB connections).
- Have a look at this intro for more information.
- Read for example this, this and this for more information and benchmarks as well as this guest post on ZeroTurnaround RebelLabs’s blog if you want more insight about why and how fibers are implemented.