Go and Quasar: A Comparison of Style and Performance

In this article, Ron Pressler explores the Skynet benchmark, Go, and Quasar — results included!

A user recently made us aware of the Skynet benchmark, a microbenchmark for “extreme” multithreading (1M threads). We are generally wary of such microbenchmarks because they are often tailored to measure a specific strength of a particular platform, without taking into account how relevant that strength is for real applications. For example, a platform with a 1000x faster implementation of sqrt would be hard pressed to yield even a 0.01% improvement in performance when running real applications.

With threads the situation is a bit different: when many threads are active (say, over 10K), processing transactions in short bursts, the kernel's thread-scheduling overhead can become onerous, and your application may then spend a significant portion of its time waiting for the kernel to schedule your code. Lightweight-thread (AKA fiber) implementations, like those provided by Go, Erlang, and, on the JVM, Quasar (and Kilim), can reduce this overhead by two orders of magnitude. This may be the difference between your server application being able to handle 500 or 5,000 requests per second (some benchmarks can be found here and here).
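To make the scale concrete, here is a minimal sketch (not from the article; `spawnAndWait` is an illustrative name) of what lightweight threads buy you: spawning 100,000 goroutines and collecting a result from each completes in milliseconds, because each one starts with only a few KB of stack, whereas the same count of kernel threads would exhaust memory or crawl.

```go
package main

import "fmt"

// spawnAndWait launches n goroutines and sums one result from each.
// The channel is buffered to capacity n, so finished goroutines never
// block waiting for the collector.
func spawnAndWait(n int) int {
	done := make(chan int, n)
	for i := 0; i < n; i++ {
		go func() { done <- 1 }()
	}
	sum := 0
	for i := 0; i < n; i++ {
		sum += <-done
	}
	return sum
}

func main() {
	fmt.Println(spawnAndWait(100000)) // prints 100000
}
```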

However, once the threading overhead is reasonably low – say, less than 1% of the total time – differences between particular implementations matter less and less: even with no overhead at all, the performance improvement would be only 1%. Because the JVM does not yet have built-in fibers, Quasar must implement them in a way that adds more overhead than platforms with native implementations incur. This is why, in a microbenchmark that tests scheduling overhead alone, a generally slow runtime like Erlang's BEAM may outperform a very fast runtime like HotSpot, even though, once there's any actual workload, the JVM quickly makes up the difference and then some (and then a lot, really). To further confuse the picture, some classical scheduling benchmarks, like the ring benchmark, actually reward schedulers that are only good at single-core scheduling and penalize schedulers that are good at sharing load among many cores.
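The single-core bias of the ring benchmark is easy to see in a sketch (illustrative code, not the article's; assumes n >= 2): n-1 goroutines plus the main goroutine form a ring, and a counter is incremented at every hop. Only one message is ever in flight, so the benchmark rewards fast sequential hand-off, never parallel load sharing.

```go
package main

import "fmt"

// ring builds a chain of n-1 relay goroutines; main acts as the final
// node, feeding the token back into the first channel. Each full trip
// around the ring adds n to the token. Returns the token once it
// reaches at least `hops`.
func ring(n, hops int) int {
	first := make(chan int)
	in := first
	for i := 0; i < n-1; i++ {
		out := make(chan int)
		go func(in, out chan int) {
			for v := range in {
				out <- v + 1 // one hop: increment and pass along
			}
			close(out) // propagate shutdown around the ring
		}(in, out)
		in = out
	}
	token := 0
	for token < hops {
		first <- token   // inject the token...
		token = <-in + 1 // ...and count main's own hop
	}
	close(first) // unwind the relay goroutines
	return token
}

func main() {
	fmt.Println(ring(10, 1000)) // prints 1000 (hops is a multiple of n)
}
```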

Skynet, however, has threads fan out to create child threads and synchronize on them in more interesting ways than the crude ring structure. While it is still an “overhead only” benchmark, it at least measures not only the ability to block and unblock a thread but also the ability to make good use of all available processor cores. That's why we decided to give it a try by translating the Go implementation into Java + Quasar. This is the original Go code, taken from here:

package main

import "fmt"
import "time"

func skynet(c chan int, num int, size int, div int) {
	if size == 1 {
		c <- num
		return
	}
	rc := make(chan int)
	var sum int
	for i := 0; i < div; i++ {
		subNum := num + i*(size/div)
		go skynet(rc, subNum, size/div, div)
	}
	for i := 0; i < div; i++ {
		sum += <-rc
	}
	c <- sum
}

func main() {
	c := make(chan int)
	start := time.Now()
	go skynet(c, 0, 1000000, 10)
	result := <-c
	took := time.Since(start)
	fmt.Printf("Result: %d in %d ms.\n", result, took.Nanoseconds()/1e6)
}

And this is the Java code (using Quasar), translated from Go pretty much line by line (taken from here):

import co.paralleluniverse.fibers.*;
import co.paralleluniverse.strands.channels.Channel;
import static co.paralleluniverse.strands.channels.Channels.*;

public class Skynet {
    static void skynet(Channel<Long> c, long num, int size, int div) throws SuspendExecution, InterruptedException {
        if (size == 1) {
            c.send(num);
            return;
        }
        Channel<Long> rc = newChannel(BUFFER);
        long sum = 0L;
        for (int i = 0; i < div; i++) {
            long subNum = num + i * (size / div);
            new Fiber(() -> skynet(rc, subNum, size / div, div)).start();
        }
        for (int i = 0; i < div; i++)
            sum += rc.receive();
        c.send(sum);
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            long start = System.nanoTime();
            Channel<Long> c = newChannel(BUFFER);
            new Fiber(() -> skynet(c, 0, 1_000_000, 10)).start();
            long result = c.receive();
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            System.out.println((i + 1) + ": " + result + " (" + elapsed + " ms)");
        }
    }

    static final int RUNS = 4;
    static final int BUFFER = 0; // >= 0 (fully sync), >= BRANCH (fully async); < 0 means unlimited
}

The first thing to notice is how similar the Java code is to the Go code. Quasar basically imports the entire Go and Erlang programming models into Java, including channel selection from Go, as well as actor supervision, behaviors and hot code reloading from Erlang.
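Channel selection, the Go feature mentioned above that Quasar also offers on its channels, is Go's select statement: it blocks until one of several channel operations is ready. A small sketch (illustrative names, not benchmark code) that takes whichever worker replies first, with a timeout as a fallback:

```go
package main

import (
	"fmt"
	"time"
)

// firstReply races two workers and returns the first answer; if
// neither replies within a second, the time.After case fires instead.
func firstReply() string {
	fast := make(chan string)
	slow := make(chan string)
	go func() { fast <- "fast worker" }()
	go func() {
		time.Sleep(50 * time.Millisecond)
		slow <- "slow worker"
	}()
	select {
	case msg := <-fast:
		return msg
	case msg := <-slow:
		return msg
	case <-time.After(time.Second):
		return "timeout"
	}
}

func main() {
	fmt.Println(firstReply()) // prints "fast worker"
}
```

(Note the slow worker's send is abandoned once the fast one wins; in a long-lived program you'd give losers a way to exit.)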

The initial benchmark results were less than stellar (we also uncovered a hidden bug in the process), but after profiling and making some straightforward improvements we got these average figures on my MacBook laptop (using go1.6.2, java 1.8.0_40 and after dropping the first couple of Java runs, required for JVM warmup):

Go, unbuffered channels:     350 ms
Go, buffered channels:       310 ms

Quasar, unbuffered channels: 900 ms
Quasar, buffered channels:   360 ms

There’s apparently a big difference depending on the kind of Quasar channel used: unbuffered channels introduce many more synchronization events (because every send must wait for a receive) and perform significantly worse than buffered channels, whereas in Go the difference is very small. Careful profiling showed that in the unbuffered case, the bulk of the overhead is indeed spent in the synchronization code, while in the buffered case the overhead was mostly the internal implementation of continuations employed by Quasar. We’ve found more room for improvement in the channel synchronization code and are confident we can get even better results, although it isn’t critically important: in real use cases, the current level of overhead introduced by Quasar is low enough that most workloads – even minor ones – would drown it out completely.
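The semantic difference behind those numbers can be sketched in a few lines (illustrative code, not the article's): a send on a buffered channel completes as soon as there is room, with no rendezvous, while every send on an unbuffered channel is a synchronization event that blocks until a receiver is ready.

```go
package main

import "fmt"

// drainBuffered: with capacity 3, all three sends return immediately,
// before any receiver runs -- no rendezvous needed.
func drainBuffered() []int {
	ch := make(chan int, 3)
	ch <- 1
	ch <- 2
	ch <- 3
	return []int{<-ch, <-ch, <-ch}
}

// rendezvous: on an unbuffered channel the goroutine's send blocks
// until main receives -- sender and receiver must meet.
func rendezvous() int {
	ch := make(chan int)
	go func() { ch <- 42 }()
	return <-ch
}

func main() {
	fmt.Println(drainBuffered()) // prints [1 2 3]
	fmt.Println(rendezvous())    // prints 42
}
```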

One more thing: the above Java code is not how you’d normally write this program. Every Quasar fiber can return a result, and calling Fiber.get() blocks until the fiber returns it (in fact, the Fiber class implements j.u.c.Future). The last code sample, which you’ll find below, is how you’d idiomatically write Skynet in Java with Quasar.

On my machine that code runs in ~300 ms – same as or ahead of Go’s result – with a similar number of synchronization events as the buffered channel case, as the fibers aren’t contending on the same channel to write their results.
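The same contention-avoiding idea can be sketched in Go (skynetOwnChan is a hypothetical name, not part of the benchmark): give each child its own one-slot channel, analogous to each fiber delivering its value through Fiber.get(), so siblings never contend on a shared channel.

```go
package main

import "fmt"

// skynetOwnChan is the Skynet recursion with per-child result
// channels. Each child's one-slot buffer means its single send
// never blocks, and the parent simply drains the channels in order.
func skynetOwnChan(num, size, div int) int {
	if size == 1 {
		return num
	}
	chans := make([]chan int, div)
	for i := 0; i < div; i++ {
		ch := make(chan int, 1)
		chans[i] = ch
		subNum := num + i*(size/div)
		go func(sn int) { ch <- skynetOwnChan(sn, size/div, div) }(subNum)
	}
	sum := 0
	for _, ch := range chans {
		sum += <-ch
	}
	return sum
}

func main() {
	fmt.Println(skynetOwnChan(0, 10000, 10)) // prints 49995000 (sum of 0..9999)
}
```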

This overhead-only microbenchmark, as expected, gives the advantage to the platform that handles the overhead natively but we were surprised by how slight the advantage is, especially as there’s room for improvement in Quasar’s channel implementation. I think this is yet another testament to the versatility of the JVM as a general-purpose, polyglot, high-performance platform.

Here’s the more idiomatic implementation of the Skynet benchmark in Quasar (code taken from here):

import co.paralleluniverse.fibers.*;
import java.util.concurrent.ExecutionException;

public class Skynet {
    static long skynet(long num, int size, int div) throws SuspendExecution, InterruptedException {
        try {
            if (size == 1)
                return num;
            Fiber<Long>[] children = new Fiber[div];
            long sum = 0L;
            for (int i = 0; i < div; i++) {
                long subNum = num + i * (size / div);
                children[i] = new Fiber<>(() -> skynet(subNum, size / div, div)).start();
            }
            for (Fiber<Long> c : children)
                sum += c.get();
            return sum;
        } catch (ExecutionException e) {
            throw (RuntimeException) e.getCause();
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            long start = System.nanoTime();
            long result = new Fiber<>(() -> skynet(0, 1_000_000, 10)).start().get();
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            System.out.println((i + 1) + ": " + result + " (" + elapsed + " ms)");
        }
    }

    static final int RUNS = 4;
}

Published at DZone with permission of Ron Pressler, DZone MVB. See the original article here.
