Challenge: The Race Condition in the TCP Stack

Ayende Rahien wants to know if you can spot the bug in the code. Think you're up for the challenge?

Occasionally, one of our tests would hang. Everything seemed hunky-dory, but it would just freeze and never complete. This was new code, and thus suspect, but an exhaustive review turned up nothing wrong. It took over two days of effort to narrow it down, but eventually we managed to point the finger directly at this line of code:

[Image: the line of code in question]

In certain cases, this line would simply not read anything on the server, even though the client most definitely sent the data. Now, given that we are using TCP, dropped packets and the like might be expected. But we are actually testing on the loopback device, which I expect to be reliable.

We spent a lot of time investigating this, ending up with a very high degree of certainty that the problem was somewhere in the TCP stack. Somehow, on the loopback device, we were losing packets. Not always, and not consistently, but we were absolutely losing packets, which led the server to wait indefinitely for a message the client had already sent.

Now, I’m as arrogant as the next developer, but even I don’t think that I found that big a bug in TCP. I’m pretty sure that if it were that broken, I would have known about it. Besides, TCP is supposed to retransmit lost packets, so even if packets were lost on the loopback device, we should have recovered from that.

Trying to figure out what was going on there sucked. It is hard to watch packets on the loopback device in Wireshark, and tracing just told me that a message was sent from the client to the server, but the server never got it.

But we persevered, and we ended up with a small reproduction of the issue. Here is the code; my comments are below:

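using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

class Program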
{
    private static TcpListener _listener;
    private static readonly byte[] ShortMessageBuffer = Encoding.UTF8.GetBytes("Hello\r\n");
    private static readonly byte[] LongMessageBuffer = Encoding.UTF8.GetBytes("Be my friend, please?\r\n");
    static readonly byte[] HeartbeatBuffer = Encoding.UTF8.GetBytes("\r\n");

    static void Main()
    {
        _listener = new TcpListener(IPAddress.Loopback, 9090);
        _listener.Start();
        for (int i = 0; i < 4; i++)
        {
            ListenToNewTcpConnection();
        }

        for (int i = 0; i < 1000; i++)
        {
            Console.Write("\r" + i + "    ");
            RunClient().Wait();
        }
    }

    private static async Task RunClient()
    {
        using (var tcpClient = new TcpClient())
        {
            await tcpClient.ConnectAsync("127.0.0.1", 9090);
            using (var networkStream = tcpClient.GetStream())
            {
                var streamReader = new StreamReader(networkStream);

                await networkStream.WriteAsync(ShortMessageBuffer, 0, ShortMessageBuffer.Length);
                await networkStream.WriteAsync(LongMessageBuffer, 0, LongMessageBuffer.Length);

                var reply = await streamReader.ReadLineAsync();
                if (reply != "ixqZU8RhEpaoJ6v4xHgE1w==")
                    throw new InvalidDataException("Bad server?");

                await networkStream.FlushAsync();

                do
                {
                    reply = await streamReader.ReadLineAsync();
                } while (string.IsNullOrEmpty(reply));
                Console.WriteLine(reply);
            }
        }
    }

    private static void ListenToNewTcpConnection()
    {
        Task.Run(async () =>
        {
            var tcpClient = await _listener.AcceptTcpClientAsync();
            ListenToNewTcpConnection();
            using(var networkStream = tcpClient.GetStream())
            {
                var streamReader = new StreamReader(networkStream);
                var greet = await streamReader.ReadLineAsync();
                var computeHash = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(greet));
                var replyBuffer = Encoding.UTF8.GetBytes(Convert.ToBase64String(computeHash) + "\r\n");
                await networkStream.WriteAsync(replyBuffer, 0, replyBuffer.Length);
                // now the connection is valid, we can start talking to each other
                await ProcessConnection(networkStream);
            }
        });
    }


    private static async Task ProcessConnection(NetworkStream stream)
    {
        using (var streamReader = new StreamReader(stream))
        using (var streamWriter = new StreamWriter(stream))
        {
            var readNextLine = streamReader.ReadLineAsync();
            while (true)
            {
                var resultingTask = await Task.WhenAny(readNextLine, Task.Delay(5000));
                if (resultingTask != readNextLine)
                {
                    // keep alive for the tcp connection
                    await stream.WriteAsync(HeartbeatBuffer, 0, HeartbeatBuffer.Length);
                    continue;
                }

                var msg = await readNextLine;
                readNextLine = streamReader.ReadLineAsync();
                msg = new string(msg.Reverse().ToArray());
                await streamWriter.WriteLineAsync(msg);
                await streamWriter.FlushAsync();
            }
        }
    }
}

This code is pretty simple. It starts a TCP server and listens on it, and then it reads from and writes to the client. Nothing much here, I think you’ll agree.

If you run it, however, it will mostly work, except that sometimes (anywhere between 10 and 500 runs on my machine) it will just hang. I’ll save you some time and let you know that there are no dropped packets; TCP is working properly in this case. But the code just doesn’t. What is frustrating is that it mostly works, and it takes a lot of work to actually get it to fail.
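
Because the failure shows up as a silent freeze, it can help to turn the hang into a visible error while hunting for it. The following is only a sketch and is not part of the original reproduction: `HangDetector` and `WithTimeout` are hypothetical names, and the timeout you pass is an arbitrary choice. It does not fix or explain the bug; it only reports that a run did not finish in time.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original repro.
static class HangDetector
{
    // Completes normally if `work` finishes within `timeout`;
    // otherwise throws TimeoutException. The abandoned task is
    // NOT cancelled, it is simply no longer waited on.
    public static async Task WithTimeout(Task work, TimeSpan timeout)
    {
        var finished = await Task.WhenAny(work, Task.Delay(timeout));
        if (finished != work)
            throw new TimeoutException("Operation did not complete within " + timeout);

        // Await the work itself so that any exception it threw is propagated.
        await work;
    }
}
```

In `Main`, the call `RunClient().Wait();` could then be written as `HangDetector.WithTimeout(RunClient(), TimeSpan.FromSeconds(30)).Wait();`, assuming 30 seconds is comfortably longer than a healthy run. A stuck iteration would then surface as an `AggregateException` wrapping a `TimeoutException`, with the iteration counter still on screen, instead of a frozen console.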

Can you spot the bug? I’ll continue discussion of this in my next post.

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.
