Tips to Check and Improve Your Storage IO Performance with MongoDB

I’m speaking at MongoDB World this June. Come hear from me and other MongoDB experts! Use the code 25AntoineGirbal for 25% off tickets at http://world.mongodb.com!

In most applications, disk IO will typically end up being your main bottleneck once all the other, sillier bottlenecks (CPU, number of connections, etc.) have been worked out. And whether our competitors like it or not, write locks are rarely the bottleneck in a well-designed application :)

I’ve recently spent some time on a large application running in Microsoft Azure with Azure drives, and anything done outside of RAM was taking FOREVER. Is it MongoDB’s fault? Not really: I was running queries and aggregations against a working set larger than RAM, and the Azure drives only allowed about 100 IOs per second (on a virtual drive it doesn’t make a difference whether the IO is sequential or random). An aggregation over 100 million documents was thus expected to take 24 hours, and once you add sharding (which requires an expensive final stage to reaggregate the shards’ results) it was looking more like 2 days. And if it failed at any point, it had to be restarted from the beginning. Needless to say, this was unusable.

MongoDB can’t go past the physical limits of the disk; there is no magic, and this is true for any database. For that kind of heavy workload you want to make sure that your box is not a slug! There are a variety of tools to verify the speed of a disk (e.g. bonnie++), but one actually comes right with the MongoDB distribution: mongoperf.

Using Mongoperf

To use it, put a simple JSON configuration in a file, and run it as follows from a folder on the disk you wish to test:

$ cat ./mongoperf.conf
{ nThreads:1024, fileSizeMB:1000, mmf:false, r:true, w:true, syncDelay:60 }
$ mongoperf < ./mongoperf.conf
...
87 ops/sec 0 MB/sec
106 ops/sec 0 MB/sec
85 ops/sec 0 MB/sec
92 ops/sec 0 MB/sec

The options are fairly straightforward: you can pick the file size on disk, the number of threads doing operations, and whether the operations are reads or writes. One interesting option is mmf, which tells mongoperf to use memory-mapped files the same way MongoDB does. If turned on, read and write operations go through RAM (thus can be cached and buffered), and the dirty pages are eventually written to disk every syncDelay seconds. The mmf option gives you results that more closely mimic MongoDB but are harder to interpret, so for the purpose of testing the disk itself it is better to leave it off.
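For example, a variant of the config with mmf turned on might look like this (the file name and thread count here are my illustration, not part of the original test):

$ cat ./mongoperf-mmf.conf
{ nThreads:16, fileSizeMB:1000, mmf:true, r:true, w:true, syncDelay:60 }
$ mongoperf < ./mongoperf-mmf.conf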

Now, the mongoperf result shown earlier is for an Azure drive, at about 100 IOPS. With that kind of performance you can’t really build an application that reads beyond RAM or sustains writes beyond about 100 per second. Still, let’s verify with iostat, in parallel with running mongoperf, that the disk is indeed pegged:

$ iostat -x 2
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 0.13 15.89 0.00 83.86

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 42.50 89.00 340.00 372.00 5.41 1.60 12.49 7.58 99.70
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

As seen above, the utilization of drive sdc is indeed at 99.70%. For kicks, let’s run mongoperf on my own laptop, an older-generation MacBook with an SSD:

...
new thread, total running : 128
12527 ops/sec 48 MB/sec
12220 ops/sec 47 MB/sec
12558 ops/sec 49 MB/sec
...

That’s already much better: a single older SSD can bring it to 12k IOPS, which is pretty impressive. For reference, a single HDD should give you about 250 IOPS, and a solid RAID5 over 10 HDDs up to 3000 IOPS. Those precious IOPS are what we need to build high-volume, complex applications that make heavy use of indexing and aggregation.

Testing in EC2

Let’s use an EC2 i2 instance (newly released at the time of writing), which looks promising. The i2.2xlarge gives us 2x 800GB SSDs, which we can try to combine. Each disk is rated at 35k IOPS according to Amazon. We’re using the new Amazon Linux AMI (HVM), which packs a 3.10 Linux kernel. Let’s format a drive with EXT4 and mount it (after properly setting the line in fstab):

$ sudo mkfs.ext4 /dev/sdb
$ sudo vim /etc/fstab
$ grep sdb /etc/fstab
/dev/sdb /media/ephemeral0 auto defaults,noatime,nofail,comment=cloudconfig 0 2
$ sudo mount /media/ephemeral0
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 8123812 1439044 6584520 18% /
devtmpfs 31448496 64 31448432 1% /dev
tmpfs 31329260 0 31329260 0% /dev/shm
/dev/xvdb 769018760 70368 729861452 1% /media/ephemeral0

A few things to note: the mount should use noatime, which prevents extra unwanted IO for metadata updates. Also, the read-ahead should generally be set low (using blockdev), though it is not important here since we’ll be testing with direct IO.
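For reference, lowering the read-ahead with blockdev would look like the following; the value of 32 sectors (16KB) is a common starting point for random-IO workloads, not a setting taken from this run:

$ sudo blockdev --setra 32 /dev/xvdb
$ sudo blockdev --getra /dev/xvdb
32

Now let’s go ahead and test the drive: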

$ cat mongodb-linux-x86_64-2.6.0-rc2/mongoperf.conf
{
nThreads:1024,
fileSizeMB:1000,
mmf:false,
r:true,
w:true,
syncDelay:60
}
$ cd /media/ephemeral0/
$ sudo ~/mongodb-linux-x86_64-2.6.0-rc2/bin/mongoperf < ~/mongodb-linux-x86_64-2.6.0-rc2/mongoperf.conf
...
new thread, total running : 1
5474 ops/sec 21 MB/sec
new thread, total running : 2
7717 ops/sec 30 MB/sec
new thread, total running : 4
7740 ops/sec 30 MB/sec
new thread, total running : 8
7776 ops/sec 30 MB/sec
new thread, total running : 16
7548 ops/sec 29 MB/sec
new thread, total running : 32
7661 ops/sec 29 MB/sec
...
$ sudo iostat -x 2
avg-cpu: %user %nice %system %iowait %steal %idle
0.13 0.00 2.08 10.39 0.13 87.28
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdc 0.00 0.00 3777.50 7555.00 30220.00 30220.00 5.33 1.64 0.14 0.09 98.20

What’s interesting here is that we get about 5000 IOPS right off the bat with 1 thread, but then it hits a ceiling below 8000 IOPS even with more threads. Weird. Let’s try XFS on the second drive:

$ sudo yum install xfsprogs
$ sudo mkfs.xfs /dev/sdc
$ sudo mkdir /media/ephemeral1
$ sudo mount /dev/sdc /media/ephemeral1
$ cd /media/ephemeral1
$ sudo ~/mongodb-linux-x86_64-2.6.0-rc2/bin/mongoperf < ~/mongodb-linux-x86_64-2.6.0-rc2/mongoperf.conf
...
new thread, total running : 1
4801 ops/sec 18 MB/sec
new thread, total running : 2
7088 ops/sec 27 MB/sec
new thread, total running : 4
10104 ops/sec 39 MB/sec
...
new thread, total running : 256
33972 ops/sec 132 MB/sec
new thread, total running : 512
39590 ops/sec 154 MB/sec
new thread, total running : 1024
47873 ops/sec 187 MB/sec

It is much faster with XFS, maxing out around 47k IOPS with 1024 threads! XFS was supposed to be about 15% faster than EXT4 on SSD, but here it looks like a different ballgame: XFS really scales with the number of threads used. Impressive.

$ sudo iostat -x 2
avg-cpu: %user %nice %system %iowait %steal %idle
1.17 0.00 22.37 75.75 0.71 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 3.50 0.00 3.00 0.00 52.00 17.33 0.01 2.00 2.00 0.60
xvdb 10.00 2.50 24143.00 25074.50 193240.00 193336.50 7.85 136.90 2.78 0.02 100.20
xvdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Checking the result with iostat, it does show 100% utilization with 25k read ops and 25k write ops per second.

Setting up a RAID

Our server has 2 drives, so ideally we would get 2x the throughput by using RAID0 on them. The usual way of doing this is through mdadm.

$ sudo umount /dev/sd[b,c]
$ sudo mdadm --create /dev/md0 -c 256 --level 0 --raid-devices 2 /dev/sdb /dev/sdc
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
$ sudo mkfs.xfs /dev/md0
...
$ sudo vim /etc/fstab
$ grep md0 /etc/fstab
/dev/md0 /media/raid auto defaults,noatime,nofail,comment=cloudconfig 0 2
$ sudo mount /dev/md0
mount: mount point /media/raid does not exist
$ sudo mkdir /media/raid
$ sudo mount /dev/md0

Note here that I am forcing the chunk size to 256KB. If you don’t do that, the new 1.2 metadata format defaults to 512KB chunks, which is too large for XFS or the kernel buffers to match, and performance will suffer (that seems like a bug on the RAID side...).
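To double-check the chunk size the array actually ended up with, you can inspect it with mdadm; this verification step is my addition, not part of the original transcript:

$ sudo mdadm --detail /dev/md0 | grep 'Chunk Size'
Chunk Size : 256K

Now let’s see the performance: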

$ sudo ~/mongodb-linux-x86_64-2.6.0-rc2/bin/mongoperf < ~/mongodb-linux-x86_64-2.6.0-rc2/mongoperf.conf
new thread, total running : 1
3587 ops/sec 14 MB/sec
new thread, total running : 2
5436 ops/sec 21 MB/sec
new thread, total running : 4
6432 ops/sec 25 MB/sec
new thread, total running : 8
7151 ops/sec 27 MB/sec
new thread, total running : 16
9397 ops/sec 36 MB/sec
...
new thread, total running : 256
60636 ops/sec 236 MB/sec
new thread, total running : 512
77010 ops/sec 300 MB/sec
new thread, total running : 1024
89326 ops/sec 348 MB/sec

Those results are very good: we went from 47k IOPS with one drive to 89k IOPS with two, so the striping overhead appears to be minimal. An alternative is to use LVM instead of mdadm, which lets you do RAID0 and also includes very useful features like snapshotting (backups will be easier! See the example after the results below). Let’s see what performance LVM can give:

$ sudo umount /dev/md0
$ sudo mdadm --stop /dev/md0
mdadm: stopped /dev/md0
$ sudo pvcreate /dev/sd[b,c]
Physical volume "/dev/sdb" successfully created
Physical volume "/dev/sdc" successfully created
$ sudo vgcreate vg0 /dev/sd[b,c]
Volume group "vg0" successfully created
$ sudo lvcreate -i 8 -I 256 -n mongo -l 100%FREE vg0
Rounding size (381546 extents) down to stripe boundary size (381544 extents)
Number of stripes (8) must not exceed number of physical volumes (2)
$ sudo lvcreate -i 2 -I 256 -n mongo -l 100%FREE vg0
Logical volume "mongo" created
$ sudo mkfs.xfs /dev/vg0/mongo
$ sudo vim /etc/fstab
$ sudo grep vg0 /etc/fstab
/dev/vg0/mongo /media/raid auto defaults,noatime,nofail,comment=cloudconfig 0 2
$ sudo mount /dev/vg0/mongo
$ cd /media/raid/
$ sudo ~/mongodb-linux-x86_64-2.6.0-rc2/bin/mongoperf < ~/mongodb-linux-x86_64-2.6.0-rc2/mongoperf.conf
new thread, total running : 1
3491 ops/sec 13 MB/sec
new thread, total running : 2
5863 ops/sec 22 MB/sec
new thread, total running : 4
9051 ops/sec 35 MB/sec
new thread, total running : 8
14206 ops/sec 55 MB/sec
new thread, total running : 16
21367 ops/sec 83 MB/sec
...
new thread, total running : 256
59237 ops/sec 231 MB/sec
new thread, total running : 512
70155 ops/sec 274 MB/sec
new thread, total running : 1024
84789 ops/sec 331 MB/sec
...
$ sudo iostat -x 2
avg-cpu: %user %nice %system %iowait %steal %idle
1.97 0.00 41.47 55.50 0.99 0.07

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 10.50 8.00 21199.50 21806.50 169684.00 200100.00 8.60 125.75 2.91 0.02 97.60
xvdc 16.00 7.00 21376.00 21672.00 171156.00 199024.00 8.60 115.85 2.69 0.02 94.80
dm-0 0.00 0.00 42663.00 42943.00 341304.00 399212.00 8.65 496.08 5.73 0.01 100.00

The performance of LVM is very good here too, at about 85k IOPS. It is in line with software RAID (mdadm), and LVM has actually been reported to give better and more consistent performance in some cases.
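As a side note on the snapshotting mentioned earlier, creating one is a single command; the size and names below are illustrative, and since we allocated 100%FREE above, you would need to leave some extents unallocated in the volume group for the snapshot to have room:

$ sudo lvcreate --size 10G --snapshot --name mongo-snap /dev/vg0/mongo

You can then mount the snapshot and back up the data files at your leisure while the main volume keeps serving IO.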

Conclusion

To sum up:

  • Disk IO is the most important hardware metric, since it is the typical bottleneck. Without good disk IO, you can forget all those sexy high-volume applications with advanced indexing and aggregation.
  • Go for SSD if you can. An SSD will give you 100x the IO performance of an HDD.
  • Test the disk performance before you commit to a box setup. There are good tools like bonnie++ and mongoperf, and you might as well avoid surprises after a given setup has gone to production.
  • Use XFS, mount with noatime, and set a low read-ahead.
  • To compound the performance of several disks, set up a RAID0 with mdadm or LVM (but note that any disk dying will kill the array).
  • Make use of enough threads in the application. This works well for reads (you can have as many reading threads as you want), but writes are flushed by the kernel’s background pdflush process, which only uses between 2 and 8 threads. Depending on how well the OS implements this, the writing side may sadly not make full use of the SSD’s capacity (see the sketch after this list).
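As a sketch of that last point, the kernel’s flushing behavior can be nudged through the dirty-page thresholds; the values below are illustrative starting points, not settings validated by this benchmark:

$ sudo sysctl -w vm.dirty_background_ratio=5
$ sudo sysctl -w vm.dirty_ratio=10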

Published at DZone with permission of Antoine Girbal, DZone MVB.
