
Go and Quasar: A Comparison of Style and Performance

Go provides a lightweight implementation of threads, known as goroutines. On the JVM, Quasar provides the same capability with fibers. We tested both against the Skynet concurrency microbenchmark.


A user recently made us aware of the Skynet benchmark, a microbenchmark for “extreme” multithreading (1M threads). We are generally wary of such microbenchmarks because they are often tailored to measure a specific strength of a particular platform without taking into account how relevant that strength is for real applications. For example, a platform with a 1000x faster implementation of sqrt would be hard-pressed to yield even a 0.01% improvement when running real applications. With threads the situation is a bit different: when many threads are active (say, over 10K), processing transactions in short bursts, the kernel’s thread-scheduling overhead might become onerous, and your application may then spend a significant portion of its time waiting for the kernel to schedule your code. Implementations of lightweight threads (AKA fibers), like those provided by Go, Erlang, and, on the JVM, Quasar (and Kilim), can reduce this overhead by two orders of magnitude. This may be the difference between your server application being able to handle 500 or 5,000 requests per second (some benchmarks can be found here and here).
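To make “lightweight” concrete, here is a minimal sketch of ours (not part of the benchmark) that spawns 100,000 Quasar fibers on a single JVM; it assumes the Quasar java agent is attached for bytecode instrumentation:

import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.Strand;
import java.util.ArrayList;
import java.util.List;

public class ManyFibers {
    public static void main(String[] args) throws Exception {
        // Run with the Quasar java agent (-javaagent:quasar-core-<version>.jar)
        // so that suspendable methods are instrumented.
        List<Fiber<Void>> fibers = new ArrayList<>();

        // 100,000 OS threads would strain most machines (roughly 1 MB of stack
        // each); fibers have tiny stacks and are scheduled in user space.
        for (int i = 0; i < 100_000; i++) {
            fibers.add(new Fiber<Void>(() -> {
                Strand.sleep(100); // suspends only this fiber, not a kernel thread
            }).start());
        }

        for (Fiber<Void> f : fibers)
            f.join();
    }
}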

However, once the threading overhead is reasonably low – say, less than 1% of the total time – differences in a particular implementation matter less and less: even eliminating the overhead entirely would improve performance by at most 1%. Because the JVM does not yet have built-in fibers, Quasar must implement them in a way that adds more overhead than platforms with native implementations. This is why, in a microbenchmark that tests scheduling overhead alone, a generally slow runtime like Erlang’s BEAM may outperform a very fast runtime like HotSpot, even though once there’s any actual workload, the JVM quickly makes up the difference and then some (and then a lot, really). To further confuse the picture, some classical scheduling benchmarks, like the ring benchmark, actually reward schedulers that are only good at single-core scheduling and penalize schedulers that are good at sharing load among many cores.

Skynet is more interesting, though: its threads fan out, creating child threads and synchronizing on them, rather than forming a crude ring. While this is still an “overhead only” benchmark, it at least measures not just the ability to block and unblock a thread but also the ability to make good use of all available processor cores, and that’s why we decided to give it a try by translating the Go implementation into Java + Quasar. This is the original Go code, taken from here:

package main

import "fmt"
import "time"

func skynet(c chan int, num int, size int, div int) {
    if size == 1 {
        c <- num
        return
    }

    rc := make(chan int)
    var sum int
    for i := 0; i < div; i++ {
        subNum := num + i*(size/div)
        go skynet(rc, subNum, size/div, div)
    }
    for i := 0; i < div; i++ {
        sum += <-rc
    }
    c <- sum
}

func main() {
    c := make(chan int)
    start := time.Now()
    go skynet(c, 0, 1000000, 10)
    result := <-c
    took := time.Since(start)
    fmt.Printf("Result: %d in %d ms.\n", result, took.Nanoseconds()/1e6)
}

And this is the Java code (using Quasar), translated from Go pretty much line by line (taken from here):

import co.paralleluniverse.fibers.*;
import co.paralleluniverse.strands.channels.Channel;
import static co.paralleluniverse.strands.channels.Channels.*;
import java.time.*;

public class Skynet {
    static void skynet(Channel<Long> c, long num, int size, int div) throws SuspendExecution, InterruptedException {
        if (size == 1) {
            c.send(num);
            return;
        }

        Channel<Long> rc = newChannel(BUFFER);
        long sum = 0;
        for (int i = 0; i < div; i++) {
            long subNum = num + i*(size/div);
            new Fiber(() -> skynet(rc, subNum, size/div, div)).start();
        }
        for (int i = 0; i < div; i++)
            sum += rc.receive();
        c.send(sum);
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            Instant start = Instant.now();

            Channel<Long> c = newChannel(BUFFER);
            new Fiber(() -> skynet(c, 0, 1_000_000, 10)).start();
            long result = c.receive();

            Duration took = Duration.between(start, Instant.now());
            System.out.printf("Result: %d in %d ms.\n", result, took.toMillis());
        }
    }

    static final int RUNS = 4;
    static final int BUFFER = 0; // 0 = unbuffered, > 0 = buffered, < 0 = unlimited
}

The first thing to notice is how similar the Java code is to the Go code. Quasar essentially imports the Go and Erlang programming models wholesale into Java, including channel selection from Go, as well as actor supervision, behaviors, and hot code reloading from Erlang.
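For instance, Go’s select statement has a direct counterpart in Quasar’s Selector. The following is a small illustration of ours, not part of the benchmark code; the method and channel names are made up:

import co.paralleluniverse.fibers.SuspendExecution;
import co.paralleluniverse.strands.channels.Channel;
import co.paralleluniverse.strands.channels.SelectAction;
import static co.paralleluniverse.strands.channels.Selector.*;

public class SelectExample {
    // Waits on two channels at once and returns whichever message arrives
    // first -- the equivalent of a two-case Go select statement.
    static String firstOf(Channel<String> ch1, Channel<String> ch2)
            throws SuspendExecution, InterruptedException {
        SelectAction<String> sa = select(receive(ch1), receive(ch2));
        return sa.message();
    }
}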

The initial benchmark results were less than stellar (we also uncovered a hidden bug in the process), but after profiling and making some straightforward improvements we got these average figures on my MacBook laptop (using go1.6.2 and java 1.8.0_40, and after dropping the first couple of Java runs required for JVM warmup) [1]:

Go, unbuffered channels:     350 ms
Go, buffered channels:       310 ms

Quasar, unbuffered channels: 900 ms
Quasar, buffered channels:   360 ms

There’s apparently a big difference depending on the kind of Quasar channel used: unbuffered channels introduce many more synchronization events (because every send must wait for a receive) and perform significantly worse than buffered channels, whereas in Go the difference is very small. Careful profiling uncovered that in the unbuffered case the bulk of the overhead is spent in the synchronization code, while in the buffered case it is mostly in Quasar’s internal implementation of continuations. We’ve found more room for improvement in the channel synchronization code and we’re confident that we can get even better results, although it’s not critically important: in real use cases, the current level of overhead introduced by Quasar is low enough that even a minor actual workload would drown it out completely.
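For reference, the only thing the BUFFER constant in the benchmark code toggles is the channel’s mailbox size, as in this small sketch of ours (the capacity of 10 is an arbitrary value for illustration):

import co.paralleluniverse.strands.channels.Channel;
import static co.paralleluniverse.strands.channels.Channels.newChannel;

public class ChannelKinds {
    public static void main(String[] args) {
        Channel<Long> rendezvous = newChannel(0);  // unbuffered: every send must wait for a receive
        Channel<Long> buffered   = newChannel(10); // sends block only once 10 messages are queued
        Channel<Long> unlimited  = newChannel(-1); // sends never block
    }
}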

One more thing: the above Java code is not how you’d normally write this program. Every Quasar fiber can return a result, and calling Fiber.get() blocks and waits for the fiber to return it (in fact, the Fiber class implements j.u.c.Future). The last code sample, which you’ll find below, is how you’d idiomatically write Skynet in Java with Quasar.

On my machine that code runs in ~300 ms – the same as or ahead of Go’s result – with a number of synchronization events similar to the buffered-channel case, because the fibers aren’t all contending on the same channel to write their results.

As expected, this overhead-only microbenchmark gives the advantage to the platforms that handle fibers natively, but we were surprised by how slight that advantage is, especially as there’s still room for improvement in Quasar’s channel implementation. I think this is yet another testament to the versatility of the JVM as a general-purpose, polyglot, high-performance platform.

Here’s the more idiomatic implementation of the Skynet benchmark in Quasar (code taken from here):

import co.paralleluniverse.fibers.*;
import java.util.concurrent.*;
import java.time.*;

public class Skynet {
    static long skynet(long num, int size, int div) throws SuspendExecution, InterruptedException {
        try {
            if (size == 1)
                return num;

            Fiber<Long>[] children = new Fiber[div];
            long sum = 0;
            for (int i = 0; i < div; i++) {
                long subNum = num + i*(size/div);
                children[i] = new Fiber<>(() -> skynet(subNum, size/div, div)).start();
            }
            for (Fiber<Long> c : children)
                sum += c.get();
            return sum;
        } catch (ExecutionException e) {
            throw (RuntimeException) e.getCause();
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            Instant start = Instant.now();

            long result = new Fiber<>(() -> skynet(0, 1_000_000, 10)).start().get();

            Duration took = Duration.between(start, Instant.now());
            System.out.printf("Result: %d in %d ms.\n", result, took.toMillis());
        }
    }

    static final int RUNS = 4;
}

Appendix (2016-05-14)

Unfortunately, like many benchmarks, Skynet is open to interpretation as to what it really tries to measure. My interpretation is that it is a benchmark of the ability to efficiently create and synchronize (i.e. schedule) threads. Some entries have interpreted it differently, and so get much better results. If the benchmark is intended to measure the performance of carrying out a particular parallel reduction, then a much faster Java implementation – not employing fibers or Quasar at all, but fork/join directly – is straightforward:

import java.util.concurrent.*;
import java.time.*;
import java.util.stream.*;

public class SkynetFJ {
    static class Node extends RecursiveAction {
        final int num, size, div;
        long sum;

        Node(int num, int size, int div) {
            this.num = num;
            this.size = size;
            this.div = div;
        }

        public void compute() {
            if (size == 1)
                sum = num;
            else {
                Node[] subs = new Node[div];
                for (int i = 0; i < div; i++)
                    subs[i] = new Node(num + i * (size / div), size / div, div);
                invokeAll(subs);
                for (int i = 0; i < div; i++)
                    sum += subs[i].sum;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < RUNS; i++) {
            Instant start = Instant.now();

            Node root = new Node(0, 1_000_000, 10);
            root.invoke();
            long result = root.sum;

            Duration took = Duration.between(start, Instant.now());
            System.out.printf("Result: %d in %d ms.\n", result, took.toMillis());
        }
    }

    static final int RUNS = 20;
}

This program runs about 10x faster than either the Go or the Quasar implementation on my machine (but, of course, I don’t think this is what the benchmark aims to measure). In fact, if the benchmark intends to measure the performance of just summing the numbers from 0 to one million in parallel, without regard to a specific fanout (of 10), we can go a lot faster still, even compared with the fork/join example above:

long result = LongStream.range(0, 1_000_000).parallel().sum();
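For completeness, here is that one-liner dropped into the same timing harness used by the other examples; this is our own sketch, and the class name SkynetStream is made up:

import java.time.*;
import java.util.stream.*;

public class SkynetStream {
    public static void main(String[] args) {
        for (int i = 0; i < RUNS; i++) {
            Instant start = Instant.now();

            // The same sum, expressed as a parallel reduction over a range:
            // no explicit tasks, fibers, or fanout at all.
            long result = LongStream.range(0, 1_000_000).parallel().sum();

            Duration took = Duration.between(start, Instant.now());
            System.out.printf("Result: %d in %d ms.\n", result, took.toMillis());
        }
    }

    static final int RUNS = 20;
}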

[1] When benchmarking JVM code you should really use JMH (and we usually do), but this benchmark wasn’t meant to be scientific, and we’ve had reasons to believe that the benchmark is reasonable even without JMH, so we allowed ourselves some sloppiness.
Topics: skynet, go, multi-threaded applications, quasar

Published at DZone with permission of Fabio Tudone, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
