Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Reading UTF-8 Characters From An Infinite Byte Stream

DZone's Guide to

Reading UTF-8 Characters From An Infinite Byte Stream

· Integration Zone
Free Resource

Learn how API management supports better integration in Achieving Enterprise Agility with Microservices and API Management, brought to you in partnership with 3scale

I’ve been playing with the twitter streaming API today. In very simple terms, you make an HTTP request and then sit on the response stream reading objects off it. The stream is a stream of UTF-8 characters and each object is a JSON encoded data structure terminated by \r\n. Simple I thought, I’ll just create a StreamReader and set up a while loop on its Read method. Here’s my first attempt …

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read()
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

Unfortunately it didn’t work. The StreamReader maintains a small internal buffer so I wouldn’t see the \r\n combination that marked the end of a new tweet until the next tweet came along and flushed the buffer.

OK, so let’s just read each byte from the stream and convert them one-by-one into UTF-8 characters. This works fine when your tweets are all in English, but UTF-8 can have multi-byte characters; any Japanese tweets I tried to read failed.

Thanks to ‘Richard’ on Stack Overflow the answer turned out to be the Decoder class. It  buffers the bytes of incomplete UTF-8 characters, allowing you to keep stacking up bytes until they are complete. Here’s revised example that works great with Japanese tweets:

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];

while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] {(byte) byteAsInt}, 0, 1, nextChar, 0);
    if(charCount == 0) continue;

    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

Unleash the power of your APIs with future-proof API management - Create your account and start your free trial today, brought to you in partnership with 3scale.

Topics:

Published at DZone with permission of Mike Hadlow, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}