Refactoring C: Implementing Parsing
In C, implementing the parsing of messages is actually a very complex operation. Let's get started.
So far in this series, I've done a lot of work building the basic infrastructure for a trivial echo server with SSL. But the protocol I have in mind is a lot more complex. Let’s get started with actually implementing the parsing of messages.
To start with, we need to implement parsing of lines. In C, this is actually a decidedly non-trivial operation, because you need to read the data from the network into someplace and parse it. This area is rife with errors, so that is going to be fun.
Here is a simple raw message:
GET employees/1-A employees/2-B
Timeout: 30
Sequence: 293
Include: ReportsTo
The structure goes:
CMD args1 argN\r\n
And then header lines with:
Name: value\r\n
The final end of the message is \r\n\r\n.
To make things simple for myself, I’m going to define the maximum size of a message as 8KB (a common default limit for request headers in HTTP servers as well). Here is how I decided to represent it in memory:
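A rough sketch of that representation, reconstructed only from the fields that the parsing code further down touches (argv, argc, headers, headers_count, key, value). The exact field types and the value 8192 for MSG_SIZE are assumptions, and the real struct may carry more than this:

```c
#include <stddef.h>

/* Assumed value: the 8KB cap on a whole message. */
#define MSG_SIZE 8192

/* One "Name: value" header line. Both pointers are C strings. */
struct header {
    char* key;
    char* value;
};

/* A parsed message: the command line split into arguments,
 * followed by the headers. Field types are assumptions. */
struct cmd {
    char** argv;            /* argv[0] is the command, e.g. "GET" */
    int argc;               /* number of entries in argv */
    struct header* headers; /* array of headers_count entries */
    int headers_count;
};
```

For the sample message above, argc would be 3 (GET plus two document ids) and headers_count would be 3.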
The key here is that I want to minimize the amount of work and complexity that I need to do. That is why the entire message is limited to 8KB. I’m also simplifying how I’m going to be handling things from an API perspective. All the strings are actually C strings, null terminated, and I’m using the argv, argc convention for naming, just like in the main function.
This means that I can simply read from the network until I find a “\r\n\r\n” in there. Here is how I do this:
struct cmd* read_message(struct connection * c) {
    int rc, to_read, to_scan = 0;
    do
    {
        // first, need to check if we already
        // read the value from the network
        if (c->used_buffer > 0) {
            // only search the bytes between to_scan and the end of the used buffer
            char* final = strnstr(c->buffer + to_scan, "\r\n\r\n", c->used_buffer - to_scan);
            if (final != NULL) {
                struct cmd* cmd = parse_command(c, c->buffer, final - c->buffer + 2/*include one \r\n*/);
                // now move the rest of the buffer that doesn't belong to this command
                // adding 4 for the length of the msg separator (\r\n\r\n)
                c->used_buffer -= (final + 4) - c->buffer;
                memmove(c->buffer, final + 4, c->used_buffer);
                return cmd;
            }
            // the separator may straddle two reads, so resume the
            // search 3 bytes before the end of what we have so far
            to_scan = max(c->used_buffer - 3, 0);
        }
        to_read = MSG_SIZE - c->used_buffer;
        if (to_read == 0) {
            push_error(EINVAL, "Message size is too large, after 8KB, "
                "couldn't find \\r\\n\\r\\n separator, aborting connection.");
            return NULL;
        }
        rc = connection_read(c, c->buffer + c->used_buffer, to_read);
        if (rc <= 0)
            return NULL; // broken connection, probably
        c->used_buffer += rc;
    } while (1);
}
There is a bit of code here, but the gist of it is pretty simple. The problem is that I need to handle partial state. That is, a single message may come in two separate packets, or multiple messages may come in a single packet. I don’t have a way to control that, so I need to be careful about tracking past state. The connection has a buffer that is used to hold the state in memory, whose size is large enough to hold the largest possible message. I’m reading from the network to a buffer and then scanning to find the message separator.
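To see the partial-state problem in isolation, here is a minimal sketch that leaves the network out entirely. The names framer, framer_feed, and framer_next are hypothetical and not part of the project's code. The sketch uses memcmp rather than strnstr (a BSD extension that is not available everywhere), and it rescans from the start of the buffer on every call instead of remembering where to resume.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 8192

/* Hypothetical stand-in for the connection's buffer. */
struct framer {
    char buffer[BUF_SIZE];
    size_t used;
};

/* Append bytes as if they had just arrived from the network.
 * Returns 0 on success, -1 if they would not fit. */
static int framer_feed(struct framer* f, const char* data, size_t len) {
    if (len > BUF_SIZE - f->used)
        return -1;
    memcpy(f->buffer + f->used, data, len);
    f->used += len;
    return 0;
}

/* If the buffer holds a complete message, copy it (without the
 * \r\n\r\n terminator) into out as a C string, drop it from the
 * buffer, and return 1. Returns 0 if no complete message has been
 * buffered yet, and -1 if out is too small to hold it. */
static int framer_next(struct framer* f, char* out, size_t out_size) {
    for (size_t i = 0; i + 4 <= f->used; i++) {
        if (memcmp(f->buffer + i, "\r\n\r\n", 4) != 0)
            continue;
        if (i + 1 > out_size)
            return -1;
        memcpy(out, f->buffer, i);
        out[i] = '\0';
        /* shift whatever follows the terminator to the front */
        f->used -= i + 4;
        memmove(f->buffer, f->buffer + i + 4, f->used);
        return 1;
    }
    return 0;
}
```

Feeding it "GET a\r\nTime" yields nothing, because the terminator has not arrived. Feeding "out: 30\r\n\r\nPUT b\r\n\r\n" afterwards makes two consecutive calls to framer_next return one message each, even though the first message spanned two chunks and the second chunk carried parts of two messages.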
If I can’t find it, I record the last location where it could start, issue another network read, and search for \r\n\r\n again. Once it is found, the code calls the parse_command() function, which operates over the entire command in memory (which is much easier). With that done, my message parsing is actually quite easy, from a conceptual point of view, although I’ll admit that C makes it a bit long.
static struct cmd* parse_command(struct connection* c, char* buffer, size_t len) {
    char* line_ctx = NULL, *ws_ctx = NULL, *line, *arg;
    struct cmd* cmd = NULL;
    char** argv;
    struct header* headers;
    char* copy = malloc(len+1);
    if (copy == NULL) {
        push_error(ENOMEM, "Unable to allocate command memory");
        goto error_cleanup;
    }
    // now we need to have our own private copy of this
    memcpy(copy, buffer, len);
    copy[len] = 0; // ensure null terminator!
    cmd = calloc(1, sizeof(struct cmd));
    if (cmd == NULL) {
        push_error(ENOMEM, "Unable to allocate command memory");
        goto error_cleanup;
    }
    line = strtok_s(copy, "\r\n", &line_ctx);
    if (line == NULL) {
        push_error(EINVAL, "Unable to find \\r\\n in the provided buffer");
        goto error_cleanup;
    }
    arg = strtok_s(line, " ", &ws_ctx);
    if (arg == NULL) {
        push_error(EINVAL, "Invalid message command line: %s", line);
        goto error_cleanup;
    }
    do
    {
        // grow through a temporary so the old array is not lost if realloc fails
        argv = realloc(cmd->argv, sizeof(char*) * (cmd->argc + 1));
        if (argv == NULL) {
            push_error(ENOMEM, "Unable to allocate command memory");
            goto error_cleanup;
        }
        cmd->argv = argv;
        cmd->argv[cmd->argc++] = arg;
        arg = strtok_s(NULL, " ", &ws_ctx);
    } while (arg != NULL);
    while (1)
    {
        line = strtok_s(NULL, "\r\n", &line_ctx);
        if (line == NULL)
            break;
        // strtok_s returns the whole line when it has no ':' in it,
        // so check for the separator explicitly before splitting
        arg = strchr(line, ':') != NULL ? strtok_s(line, ":", &ws_ctx) : NULL;
        if (arg == NULL) {
            push_error(EINVAL, "Header line does not contain ':' separator: %s", line);
            goto error_cleanup;
        }
        while (*ws_ctx == ' ')
            ws_ctx++; // skip initial space
        headers = realloc(cmd->headers, sizeof(struct header) * (cmd->headers_count + 1));
        if (headers == NULL) {
            push_error(ENOMEM, "Unable to allocate command memory");
            goto error_cleanup;
        }
        cmd->headers = headers;
        cmd->headers[cmd->headers_count].key = arg;
        cmd->headers[cmd->headers_count].value = ws_ctx;
        cmd->headers_count++;
    }
    // note: argv and the headers point into copy, so copy has to stay alive as long as cmd does
    return cmd;
error_cleanup:
    if (copy != NULL)
        free(copy);
    if (cmd != NULL) {
        cmd_drop(cmd);
    }
    return NULL;
}
I’m copying the memory from the network buffer to my own location. This is important because the read_message() function will overwrite it in a bit, and it also allows me to modify the memory more easily, which is required for using strtok_s(). This basically allows me to tokenize the message into its component parts: first on a line-by-line basis, splitting the first line on spaces and then treating the remaining lines as header lines.
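The need for a private copy is easier to see in a small, standalone sketch. The helper name split_command_line is hypothetical, and it uses the standard strtok() rather than strtok_s(), which is fine here only because there is a single level of tokenizing. The original message is left untouched, while the copy has its separators overwritten with null terminators:

```c
#include <stdlib.h>
#include <string.h>

/* Split the first line of msg on spaces. The tokens stored in args
 * point into a heap copy, returned through copy_out, which the caller
 * frees once it is done with them. Returns the number of tokens
 * stored, or -1 if the copy could not be allocated. */
static int split_command_line(const char* msg, char** args, int max_args,
                              char** copy_out) {
    size_t len = strlen(msg);
    char* copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, msg, len + 1); /* includes the terminating '\0' */

    /* cut the copy at the end of the first line, if there is one */
    char* eol = strstr(copy, "\r\n");
    if (eol != NULL)
        *eol = '\0';

    int count = 0;
    /* strtok overwrites each separator inside copy with '\0' */
    for (char* arg = strtok(copy, " ");
         arg != NULL && count < max_args;
         arg = strtok(NULL, " ")) {
        args[count++] = arg;
    }
    *copy_out = copy;
    return count;
}
```

Because every token is just a pointer into the copy, nothing else is allocated per argument, but the copy has to outlive every use of the tokens.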
I added the ability to reply to a command, which means that we are almost done. You can see the current state of the code here.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.