
Fast Transaction Log: Linux

Ayende Rahien shares the results of a performance benchmark on Linux, specifically measuring how fast transactions can be written to the journal file.



We were doing some performance testing recently and got some odd results when running a particular benchmark on Linux, so we decided to investigate at a much deeper level.

We got an AWS machine (i2.2xlarge: 61 GB RAM, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS drives, running Ubuntu 14.04 with kernel version 3.13.0-74-generic) and ran the following code. It tests the scenarios below on a pre-allocated 1 GB file, “committing” 65,536 (64K) transactions with 16 KB of data in each. The idea is to test how fast we can write those transactions to the journal file, so we can consider them committed.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (a pretty common way to handle committing to disk in databases).
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than fsync in this scenario).
  • Using the O_DSYNC flag, asking the kernel to make sure that after every write, everything is synced to disk.
  • Using the O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using O_DIRECT | O_DSYNC to ensure that we don’t do any caching and actually force the disk to do its work.

The code is written in C, and it is pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to produce beautiful code. It is quite probable that I made some mistake in writing it, which is partly why I’m talking about this.

Here is the code, and the results of execution are below:

#define _GNU_SOURCE

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <malloc.h>
#include <sys/uio.h>
#include <sys/file.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/resource.h>



#define FILE_SIZE  (1024 * 1024 * 1024L)
#define CHUNK_SIZE (16 * 1024)


int simulate_journal_writes(int fd, char* desc, int use_fsync, int use_fdatasync)
{
        int rc;
        int pos;
        struct timeval start, end;
        long mtime, seconds, useconds;

        /* the buffer must be aligned for the O_DIRECT runs */
        void* buff = memalign(CHUNK_SIZE, CHUNK_SIZE);
        if(buff == NULL)
        {
                printf("\nFailed to allocate aligned buffer");
                exit(2);
        }
        int rand_fd = open("/dev/urandom", O_RDONLY);
        if(rand_fd == -1)
        {
                printf("\nFailed to open random %i", errno);
                exit(2);
        }
        rc = read(rand_fd, buff, CHUNK_SIZE);
        if(rc != CHUNK_SIZE)
        {
                printf("\nFailed to read random %i", errno);
                exit(2);  
        }
        close(rand_fd);


        gettimeofday(&start, NULL);

        for (pos = 0; pos < FILE_SIZE; pos += CHUNK_SIZE)
        {
                int write_rc = write(fd, buff, CHUNK_SIZE);
                if ( write_rc == -1 )
                {
                        printf("\ncan't write %i", errno);
                        exit(2);
                }
                if(use_fsync)
                {
                        write_rc = fsync(fd);
                        if ( write_rc == -1 )
                        {
                                printf("\ncan't fsync %i", errno);
                                exit(2);
                        }
                }
                if(use_fdatasync)
                {
                        write_rc = fdatasync(fd);
                        if ( write_rc == -1 )
                        {
                                printf("\ncan't fdatasync %i", errno);
                                exit(2);
                        }
                }
        }

        gettimeofday(&end, NULL);

        seconds  = end.tv_sec  - start.tv_sec;
        useconds = end.tv_usec - start.tv_usec;

        mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;

        printf("Elapsed time %s: %ld milliseconds\n", desc, mtime);

        free(buff);
        return 0;
}

int main()
{
        int fd;
        int rc;
        char* filename = "journal.log";

        unlink(filename);

        fd = open(filename,O_WRONLY | O_CREAT,S_IWUSR | S_IRUSR);
        if(fd == -1)
        {
                printf("Failed to open file %i", errno);
                return 1;
        }

        rc = fallocate(fd, 0, 0, FILE_SIZE);
        if(rc == -1)
        {
                printf("Failed to size file %i", errno);
                return 1;  
        }
        rc = close(fd);
        if(rc == -1)
        {
                printf("\nFailed to close file %i", errno);
                return 1;   
        }

        printf("\nStarting\n");

        fd = open(filename,O_WRONLY | O_CREAT , S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " buffered ", 0, 0 );
        close(fd);

        fd = open(filename,O_WRONLY | O_CREAT , S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " buffered + fdatasync ", 0, 1);
        close(fd);

        fd = open(filename,O_WRONLY | O_CREAT , S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " buffered + fsync ", 1, 0);
        close(fd);


        fd = open(filename,O_WRONLY | O_CREAT | O_DSYNC, S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " O_DSYNC ",0,0);
        close(fd);


        fd = open(filename,O_WRONLY | O_CREAT | O_DIRECT, S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " O_DIRECT ", 0, 0);
        close(fd);

        fd = open(filename,O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, S_IWUSR | S_IRUSR);
        simulate_journal_writes(fd, " O_DIRECT | O_DSYNC ", 0, 0);
        close(fd);

        return 0;
}

It was compiled and run using: gcc -O3 journal.c -o run && ./run

The results are quite interesting:

Method                  Time (ms)   Write cost (ms/tx)
Buffered                525         0.03
Buffered + fsync        72,116      1.10
Buffered + fdatasync    56,227      0.85
O_DSYNC                 48,668      0.74
O_DIRECT                47,065      0.71
O_DIRECT | O_DSYNC      47,877      0.73

The buffered call is useless for a journal, but important as a baseline to compare against. The rest of the options ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.

* The caveat here is the use of O_DIRECT: the docs (and Linus) seem to be pretty much against it, and there are few details on how it works with regards to instructing the disk to actually bypass all of its buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.

Of course, I had a big, long discussion of the numbers planned. And then I discovered that I was actually running this on the boot disk, and not one of the SSD drives. That was a facepalm of epic proportions, and I was only saved from it by the problematic numbers I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method                  Time (ms)   Write cost (ms/tx)
Buffered                522         0.03
Buffered + fsync        23,094      0.35
Buffered + fdatasync    23,022      0.35
O_DSYNC                 19,555      0.29
O_DIRECT                9,371       0.14
O_DIRECT | O_DSYNC      20,595      0.31

There is about a 10 percent difference between fsync and fdatasync when using the HDD, but barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to move a spindle.

Of more interest to us is that O_DSYNC is significantly faster on both the SSD and the HDD; about 15 percent can be saved by using it.

But I’m just drooling over the O_DIRECT write performance profile on the SSD. It is so good that, unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation on a write when it hits the physical disk, instead of the drive’s cache (which is why O_DIRECT alone is actually faster: we are basically writing directly to the disk’s cache).

The tests were run using ext4. I played with some mount options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.

In my next post, I’ll test the same on Windows.


Topics:
performance ,linux ,c programming language

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.

