DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
The Latest "Software Integration: The Intersection of APIs, Microservices, and Cloud-Based Systems" Trend Report
Get the report

Locks & Condition Variables - Latency Impact

Martin Thompson user avatar by
Martin Thompson
·
Nov. 07, 11 · Interview
Like (0)
Save
Tweet
Share
7.31K Views

Join the DZone community and get the full member experience.

Join For Free

In a previous article on Inter-Thread Latency I showed how it is possible to signal a state change between 2 threads with less than 50ns of latency.  To many developers, writing concurrent code using locks is a scary experience.  Writing concurrent code using lock-free algorithms, i.e. algorithms that rely on the use of memory barriers and an intimate understanding of the underlying memory models, can be totally terrifying.  To me lock-free / non-blocking algorithms are like playing with explosives or corrosive chemicals, if you do not understand what you are doing, or show the ultimate respect, then very bad things can, and most likely will, happen!

In this article, I'd like to illustrate the impact of using locks and the resulting latency they can impose on your designs.  I want to use a very similar algorithm to that used in my previous inter-thread latency article to illustrate the ping-pong effect of handing control back and forth between 2 threads.  In this case, rather than using a couple of volatile variables, I will employ a pair of condition variables to signal a state change so control can be passed back and forth.

The Code

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args)
        throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    public static final class SendRunner
        implements Runnable
    {
        @Override
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

            }
        }
    }

    public static final class EchoRunner
        implements Runnable
    {
        @Override
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}



Results

Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz

$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency
duration 41,649,616,343 (ns)
2,082 ns/op
480,196 ops/s

$ java LockedSignallingLatency
duration 73,789,456,491 (ns)
3,689 ns/op
271,041 ops/s




Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz

$ taskset -c 2,4 java LockedSignallingLatency
duration 47,209,549,484 (ns)
2,360 ns/op
423,643 ops/s

$ java LockedSignallingLatency
duration 336,168,489,093 (ns)
16,808 ns/op
59,493 ops/s



Observations

The above is a typical set of results I've seen in the middle of the range from multiple runs.  There are a couple of interesting observations I'd like to expand on. 

Firstly, the one-way latency to signal a change is pretty much the same as what is considered current state of the art for network hops between nodes via a switch.  It is possible to get ~1µs latency with InfiniBand and less than 5µs with 10GigE and user-space IP stacks. This is 3 orders of magnitude greater latency than what I illustrated in the previous article using just memory barriers to signal between threads.  This cost comes about because the kernel needs to get involved to arbitrate between the threads for the lock, and then manage the scheduling for the threads to awaken when the condition is signalled.

Secondly, the impact is clear when letting the OS choose what CPUs the threads get scheduled on rather than pinning them manually.  I've observed this same issue across many use cases whereby Linux, in default configuration for its scheduler, will greatly impact the performance of a low-latency system by scheduling threads on different cores resulting in cache pollution.   Windows by default seems to make a better job of this.

I recently had an interesting discussion with Cliff Click about using condition variables and their cost.  He pointed out a problem he was seeing.  If you look at the case where a sleeping thread gets signalled within the lock, it goes to run and then discovers it cannot get the lock because the signalling thread already has the lock, so it gets put back to sleep until the signalling thread releases the lock, thus causing more work than necessary.  Modern schedulers would benefit from being more aware of communication mechanisms between threads to have more efficient location and rescheduling logic.  As we go more concurrent and parallel our schedulers need to become more aware of IPC mechanisms.

Conclusion

When designing a low-latency system it is crucial to avoid the use of locks and condition variables for the main transaction flows.  Non-blocking or lock-free algorithms are key to achieving ultra-low latency but can be very difficult to prove correct.   I would not recommend designing lock-free algorithms for business logic but they can be very effectively employed for low-level infrastructure components.  The business logic is best run on single threads following the Single Writer Principle from my previous article. 

From http://mechanical-sympathy.blogspot.com/2011/11/locks-condition-variables-latency.html

Lock (computer science)

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 4 Best dApp Frameworks for First-Time Ethereum Developers
  • Documentation 101: How to Properly Document Your Cloud Infrastructure Project
  • Introduction Garbage Collection Java
  • Scaling Your Testing Efforts With Cloud-Based Testing Tools

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: