Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

LockSupport.parkNanos() Under the Hood and the Curious Case of Parking (Part I)

DZone's Guide to

LockSupport.parkNanos() Under the Hood and the Curious Case of Parking (Part I)

Let's take a look under the hood.

· Java Zone ·
Free Resource

Get the Edge with a Professional Java IDE. 30-day free trial.

When a colleague of mine was running some experiments, he noticed LockSupport.parkNanos() would either return almost immediately or in roughly 50 microseconds steps. In other words, calling LockSupport.parkNanos(10000) would not return after 10 microseconds but roughly after 50 μs. LockSupport.parkNanos(55000) would not return after 55 μs but roughly after 100 μs, etc. The 50 μs step was present way too consistently to be a coincidence. I was curious about what was causing it, and because I love exploring how stuff works under the hood, I decided to have a closer look.

Reproducer in Java

The first step was easy: write a reproducer. I re-used code from my older and somewhat-related experiment and just added a new runner:

@Override
    public void run() {
        long startNanos = System.nanoTime();
        for (long i = 0; i < iterationCount; i++) {
            LockSupport.parkNanos(10_000);
        }
        long durationNanos = System.nanoTime() - startNanos;
        long durationMillis = NANOSECONDS.toMillis(durationNanos);
        System.out.println(iterationCount + " iterations in " + durationMillis + " ms.");

        long microsPerIteration = NANOSECONDS.toMicros(durationNanos) / iterationCount;
        System.out.println("This means each iteration took " + microsPerIteration + " microseconds");
    }


This was sufficient to reproduce the results my colleague observed in his JMH-based benchmark. On my Linux box, each iteration took just over 60 μs. When I increased the parking time to 55,000 ns, then each iteration took around 110 μs, and with the parking time, 110,000 ns each iteration took 150 μs. It also took 150 μs with parking time set to 105,000 ns. Clearly, the 50 μs increment was there.

The first step of my investigation was easy: just follow the LockSupport.parkNanos() implementation in my JDK. We can see it delegates to Unsafe.park():

public static void parkNanos(long nanos) {
        if (nanos > 0)
            U.park(false, nanos);
    }


Unfortunately, the Unsafe class does not tell much:

public native void park(boolean isAbsolute, long time);


It’s a native method, which means that we have to dive deeper. It turns out the implementation of the park() method that the JVM uses when running on Linux uses the POSIX Threads API to implement thread parking.

Reproducer in C

With this in mind, I tried to create a simple reproducer in C. The idea was to remove the JVM and its magic out of the equation and be as close to the metal as possible. I wrote my last non-trivial C code more than 15 years ago, so this turned out to be quite challenging. I learned many things, for example, I didn’t know glibc had a stub implementation of POSIX Threads API, and it does nothing! So, I was wondering why my pthread_cond_timedwait() returned immediately without waiting for the timeout. Fortunately, another colleague of mine pointed out that I forgot to link the pthreads library, hence my program used the empty implementation from glibc! Eventually, I succeeded; here is a C reproducer:

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

#define CLOCK_TYPE CLOCK_MONOTONIC

void printTime(struct timespec *ts) {
    printf("%lld.%.9ld\n", (long long)ts->tv_sec, ts->tv_nsec);
}

void addNanos(struct timespec *base, struct timespec *deadline, int deltaNanos) {
    deadline->tv_sec = base->tv_sec;
    deadline->tv_nsec = (base->tv_nsec) + deltaNanos;
    //todo: deal with nanos overflow
}

bool isDeadlineReached(struct timespec* deadline) {
    struct timespec currentTime;
    clock_gettime(CLOCK_TYPE, &currentTime);

    return currentTime.tv_sec >= deadline->tv_sec && currentTime.tv_nsec >= deadline->tv_nsec;
}

int initializeCond(pthread_cond_t *cond) {
    pthread_condattr_t cond_attr;
    if (pthread_condattr_init (&cond_attr)) {
        return -1;
    }
    if (pthread_condattr_setclock (&cond_attr, CLOCK_TYPE)) {
        return -1;
    }
    if (pthread_cond_init (cond, &cond_attr)) {
        return -1;
    }
    return 0;
}

int main() {
    int pid = getpid();
    printf("PID: %i\n", pid);
    sleep(10);
    printf("Sleeping finished\n");

    int deltaNanos = 2000;
    pthread_cond_t cond;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    if (initializeCond(&cond)) {
        return -1;
    }

    struct timespec start, end, deadline;
    if (pthread_mutex_lock(&lock)) {
        return -1;
    }
    clock_gettime(CLOCK_TYPE, &start);
    addNanos(&start, &deadline, deltaNanos);

    int iteration = 0;
    while (!isDeadlineReached(&deadline)) {
        pthread_cond_timedwait(&cond, &lock, &deadline);
        iteration++;;
    }

    clock_gettime(CLOCK_TYPE, &end);
    printf("Iteration count: %i\n", iteration);
    printTime(&start);
    printTime(&deadline);
    printTime(&end);

    //todo: ns overflow handling
    printf("Elapsed time: %ld\n", (end.tv_nsec- start.tv_nsec));
    return 0;
}


The code does very little: it acquires a mutex, and then, it waits on a condition with a timeout. Given there is no other thread to signal the condition, then the thread should unblock only after the timeout. POSIX API allows spurious wake-ups, but we can ignore them in this experiment. A long story short: even when the code uses a timeout of just 5,000 ns, it sleeps for almost 60 microseconds. This was the same behavior as I observed in Java! This is a strong indication that the JVM/JDK is not causing the 50 μs step and the root cause is deeper.

Going Deeper

While running my C program with strace, I noticed the pthread_cond_timedwait() was implemented as a futex syscall. Again, this is not very surprising:

$ 
strace -e futex ./sleepgranularity
PID: 26955
Sleeping finished
futex(0x7fff800b3ad8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=4084}) = -1 ETIMEDOUT (Connection timed out)


It’s interesting that the tv_nsec parameter is less than my configured timeout, which happens to be 5,000 ns. I assume it’s because pthread_cond_timedwait() uses absolute time as a timeout while futexes uses relative time and perhaps POSIX Threads impl also does a little bit of busy-spinning before handing over to the kernel. This is just speculation as I didn’t really check it in the source code.

Ok, so now we know LockSupport.park() uses futex syscall under the hood and, very likely, it’s our Linux kernel causing this mysterious 50 μs step. So, once again, we have to go deeper!

Dive Into the Kernel

The futex implementation in the kernel uses hrtimer for timeouts. Hrtimer stands for “High-Resolution Timer” and it is API inside kernel for … well, timers. Timeout enforcement is a textbook use-case for timers. You can read more about Linux timers in the kernel documentation. At this point, I wondered: is there any way to configure hrtimer granularity? It turns out there is! Since version 4.6 Linux allows us to configure “process slack value” via the /proc/ filesystem! The slack value is used to group timers, which are “close to each other” and minimize the number of wake-ups, and guess what? The default value happens to be 50 μs — the same as the step we observed!

We can start with the experiment in C as it has less moving parts than the Java experiment. If you read carefully the source code above, then you noticed it first prints its own PID, and then, it sleeps 10 seconds before calling the pthread_cond_timedwait(). The sleeping is there for a reason. It is to give us a chance to modify to slack value via the /proc/ filesystem. It’s actually dead simple. Let’s try# echo 200000 > /proc/<pid>/timerslack_ns. This increases the default slack time from 50,000 ns to 200,000 ns, and the pthread_cond_timedwait() should return much later than the usual 50-60 μs. Let’s run it!

PID: 32644 Sleeping finished
Iteration count: 1
21090.742608800
21090.742615800
21090.742895275
Elapsed time: 286475

You can see it took well over 200 μs! Let’s try it the other way around, decreasing this value should make pthread_cond_timedwait() return faster.

# echo 1000 > /proc/702/timerslack_ns

$ ./sleepgranularity PID: 702 Sleeping finished
Iteration count: 1
21311.855077826
21311.855079826
21311.855080608
Elapsed time: 2782

It returned in less than 3 μs.

Conclusion #1

We proved it’s possible to decrease the 50 μs granularity step by changing the timerslack_ns value!

Back to Java

Let’s try the same with Java. It turns out to be a bit more complicated, mostly because Java starts multiple threads and I’ve always been confused by the relations between threads and full-blown processes in Linux. Anyway, here is what worked for me:

  1. Start the Java experiment as java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 100000 1000
  2. Find the JVM process with jps. It gives me this output: 1387 testvdso-1.0-SNAPSHOT.jar
  3. Find the native thread id of the main Java thread with jstack 1387|grep main "main" #1 prio=5 os_prio=0 tid=0x00007fd6bc012000 nid=0x56c waiting on condition [0x00007fd6c5ada000]
  4. Now, we have the native thread ID = 0x56c. Let’s convert it from hexa: 0x56c = 1388 dec
  5. At this point, we can use the /proc/ filesystem: # echo 1000 > /proc/1388/timerslack_ns

Behold!

100000 iterations in 5563 ms. This means each iteration took 55 microseconds
100000 iterations in 5530 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 3737 ms. This means each iteration took 37 microseconds
100000 iterations in 525 ms. This means each iteration took 5 microseconds
100000 iterations in 527 ms. This means each iteration took 5 microseconds
100000 iterations in 521 ms. This means each iteration took 5 microseconds


Can you spot the line with the change? The latency went down from 55 microseconds to just 5 micros! Yay!

Conclusion #2

We can shrink the minimum usable thread parking time in Java with a bit of OS tweaking!

Caveats

The research above makes many many assumptions: about JDK parkNanos() implementation, POSIX Threads implementation, futexes, hrtimers, etc. Obviously, I assume Linux OS, and it’s quite different on other operating systems.

Summary

Even when using a high-level language such as Java, it’s good to understand what is going on at a lower level. Linux offers a great amount of flexibility and tunables, and it’s possible to use them to reduce minimum parking time.

Would I use this in production? Not really or, at least, definitely not without further investigation of all implications. The first thing I can think of is the increased CPU load when the step is too small. There might be other side-effects I am not aware of. However, it was tremendous fun to explore it, and I learned a lot about quite a few layers of my stack!

Stay tuned for the second part of this series where I investigate the parkNanos() behavior on Windows.

If you like to poke systems and learn about low-level details, then you can follow me on Twitter!

Get the Java IDE that understands code & makes developing enjoyable. Level up your code with IntelliJ IDEA. Download the free trial.

Topics:
java ,c ,tutorial ,reproducer ,parknanos ,linux ,jvm ,jdk ,unsafe

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}