You might have seen my earlier attempts at testing MySQL as a Key Value Store (KVS), i.e. used like MongoDB, and I have had some comments on those. Above
all, many wanted to know the details of my test environment. To be
honest, this started off as a very simple thing, but it turned out
bigger than I thought, so I guess it's reasonable to give some
background. I apologize for not giving you this background before, but
as I said, my initial intention was to just run bog-standard MongoDB
versus MySQL and see what happened.
Here at Recorded Future, when I joined about 2 years ago, we used
a lot of MySQL, and it was causing us serious problems. We had a pretty
standard, school-book style relational schema, and performance was
really hurting us badly. Because of the amount and type of data we manage,
because we run on Amazon and because we are still developing and
changing, this schema didn't fit our needs at all.
Read scalability was not a good option either, as we are massaging so
much data that we had big issues just having the slaves keep up
with the master. Schema changes, as we developed our application, were also painful.
What we ended up doing was a redesign of a lot of things, ending up with
something that could use a KVS, as these are easier to scale using
sharding. This had some negative implications too: it was no longer
possible to go into MySQL and issue a query for data just like that. And
no, the mongo shell was no substitute for this, as not only were we now
using MongoDB, we had data spread over several MongoDB stores. On the
good side though, things were now much more performant and easier to
scale; write scalability with MongoDB is good and easy to use and
configure. Which is not to say that MongoDB solves all problems in the world.
So, having implemented MongoDB, having gotten used to it and even
starting to like it a fair bit, I also started seeing some other things.
Like, why can't MySQL be used for this? I mean, MySQL is a bit more
mature than MongoDB and should be better at handling, say, disk I/O. On
the other hand, MongoDB is newer. One thing I already knew was that
MongoDB really needs to be in-memory to a large extent; when it hits the
disk, performance goes downhill fast, so here MySQL is better. But when
we DO have enough RAM (and as we use several MongoDB datastores for
different purposes, and we are on Amazon, where you can get respectable
amounts of RAM, this is usually the case), what happens? If I let MySQL
use enough RAM that I get no disk I/O for reads at all, will it
be as fast as MongoDB? Or maybe even faster?
And among the MySQL storage engines, the one mostly targeted as a Key Value Store is NDB. How's that going to work?
So, to test all this, but to make sure I wasn't affected by Amazon's
ups-and-downs, I pulled one of our MongoDB datastores, some 105.000.000
rows of real-world data. Initially, my plan was to use the full MongoDB
datastore, but I had to exclude some fields, as these were VARCHARs and
putting them on disk with NDB was causing just too much disk I/O: NDB
stores disk data as fixed-length fields, so a UTF-8 VARCHAR(256)
occupies some 768 bytes. I ended up with a table schema like this:
CREATE TABLE `keyvalue` (
  `id` bigint(20) NOT NULL,
  `value1` int(11) DEFAULT NULL,
  `value2` double DEFAULT NULL,
  PRIMARY KEY (`id`)
)
In the case of MongoDB, the id column was used as _id, which is MongoDB's way of saying "PRIMARY KEY".
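As a quick sanity check on the fixed-length math above (assuming MySQL's classic utf8 charset, which reserves up to 3 bytes per character; utf8mb4 would be 4):

```python
# NDB's on-disk format reserves the worst case for a VARCHAR, so a UTF-8
# VARCHAR(256) must allow for 256 characters at 3 bytes each.
chars = 256
max_bytes_per_char = 3  # MySQL utf8 (not utf8mb4)
print(chars * max_bytes_per_char)  # 768 bytes reserved per row on disk
```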
Having done all that, I needed some tools for my test bench:
- Homebrew server with an AMD FX-8120 8-core CPU and 16 Gb RAM on an
ASUS M5A88-V Mobo. This is a neat server MoBo really, supporting SATA 6,
USB 3.0 and other niceties, but I use a Lite-On LNE100TX network card
instead of the built-in Realtek crap that many MoBos, in particular AMD
ones, seem to be equipped with.
- A bunch of disks; the disks where the database resides are an LVM stripe
of two Seagate Barracuda 7200.11 500 Gb disks using xfs. But this is
pretty insignificant, as there should be no disk I/O at all, or at least
extremely limited, when testing.
- Ubuntu 10.10 (I have some reasons for using this somewhat ancient version).
And yes, I have patched that 16 Gb memory issue in Linux with some AMD systems.
- MySQL 5.6.5 64-bit (for all MySQL tests except NDB)
- MySQL Cluster 7.2.6 64-bit (for MySQL NDB tests)
- MongoDB 2.0.5 64-bit
Having all that ready, I now had a slightly smaller table taking up some
4.2 Gb when exported as JSON. This data was imported into
MongoDB and into MySQL with a number of different storage engines, and I was
ready to do some testing. I use a test program that kicks off SQL queries
over a specific number T of threads, each query (with MongoDB this is not SQL
of course, but the test program is the same) containing a variable
part, which is the ID of the row to look up. The program starts by
reading a file with a number of random IDs; the IDs are spread over the
threads, and then each thread looks up its rows N times.
In this case I picked some numbers for my test: the number of threads
I use is 100, not an unreasonable number, but big enough to make any
concurrency issues visible, and the number of rounds is 10. So to be
clear, the 1.000.000 IDs are distributed over the threads, so each
thread processes 10.000 IDs, running them 10 times, which means I
do some 10.000.000 random reads in parallel. Not the most exciting
benchmark on the planet, I know, but it was interesting all the same, and
also to some extent a real-world test.
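The test program itself isn't shown in this post, but the scheme above can be sketched roughly like this, using Python's stdlib sqlite3 as a stand-in for the MySQL/MongoDB drivers (an assumption for illustration; the thread counts and table sizes are scaled way down):

```python
# Sketch of the benchmark client described above: distribute a shuffled
# list of IDs over T threads, each thread doing point lookups N times
# over its own database connection, then report rows read per second.
import os
import random
import sqlite3
import tempfile
import threading
import time

DB_PATH = os.path.join(tempfile.mkdtemp(), "keyvalue.db")
THREADS = 4       # the post uses 100
ROUNDS = 10       # N in the post
TOTAL_IDS = 1000  # the post uses 1.000.000

# Build a small keyvalue table to query against.
conn = sqlite3.connect(DB_PATH)
conn.execute("CREATE TABLE keyvalue (id INTEGER PRIMARY KEY, value1 INTEGER, value2 REAL)")
conn.executemany("INSERT INTO keyvalue VALUES (?, ?, ?)",
                 [(i, i % 100, i * 0.5) for i in range(TOTAL_IDS)])
conn.commit()
conn.close()

ids = list(range(TOTAL_IDS))
random.shuffle(ids)                            # the "file with random IDs"
slices = [ids[i::THREADS] for i in range(THREADS)]
reads_done = [0] * THREADS

def worker(tid, my_ids):
    db = sqlite3.connect(DB_PATH)              # one connection per thread
    cur = db.cursor()
    for _ in range(ROUNDS):
        for key in my_ids:
            cur.execute("SELECT value1, value2 FROM keyvalue WHERE id = ?", (key,))
            cur.fetchone()
            reads_done[tid] += 1
    db.close()

start = time.time()
threads = [threading.Thread(target=worker, args=(t, slices[t])) for t in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
total = sum(reads_done)
print(f"{total} reads in {elapsed:.2f}s -> {total / elapsed:.0f} rows/s")
```

With THREADS=100, ROUNDS=10 and 1.000.000 IDs, the totals match the 10.000.000 random reads described above.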
Before each test run, I did a few initial runs to ensure that the caches were warmed up. I also looked at vmstat and the like to ensure no disk I/O was taking place.
Current state of tests
I have so far tested MongoDB, MySQL Cluster and MySQL InnoDB and since
last I have also tested the MySQL MEMORY engine and MySQL with MyISAM.
So far the results are like this:
- MongoDB - 110.000 rows read per second
- MySQL Cluster - 32.000 rows read per second
- MySQL with InnoDB - 39.000 rows read per second
- MySQL with MEMORY / HEAP - 43.000 rows read per second
- MySQL with MyISAM - 28.000 rows read per second
The last two were very disappointing to me, and trust me, I want
them to be good. I really wanted MySQL to beat MongoDB, and I really
wanted to see the much lower overhead of MyISAM make it run
circles around InnoDB, but no.
As MyISAM only caches the keys itself, not the actual data, this was an
issue, but still, the MyISAM data file should be well cached by Linux
for me, and rightly so, there was no disk I/O. Could context switching
when going to Linux for data be an issue? I decided to try something
fun. I created a new unique index on the table, covering all columns,
which meant that if I used this covering index, all data would be
fulfilled from the index (but the index in itself would be bigger, of
course). This didn't help at all, quite the opposite. Playing with named
key caches and stuff didn't help either, nor did using mmap. Which is
not to say that these guys do not help to improve performance; all this
means is that when all data is already in memory, they don't seem to
help. I did figure out one thing that did work though, which was using
tcmalloc. Enabling tcmalloc when starting the MySQL server is
easy these days, as MySQL comes with the tcmalloc library included and
mysqld_safe has a command-line option for it. Real easy; this really
should be the default.
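To illustrate the covering-index experiment above, here is a minimal sketch using SQLite through Python's stdlib (an assumption for illustration; the post's test ran against MyISAM, but the principle is the same): with a unique index over every column, a keyed lookup can be answered from the index alone, never touching the table rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyvalue (id INTEGER, value1 INTEGER, value2 REAL)")
# A unique index over all columns: any SELECT of these columns keyed on id
# can be satisfied entirely from the index, at the cost of a bigger index.
conn.execute("CREATE UNIQUE INDEX cover_idx ON keyvalue (id, value1, value2)")
conn.execute("INSERT INTO keyvalue VALUES (1, 42, 0.5)")

# EXPLAIN QUERY PLAN shows whether the lookup is served from the index alone.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value1, value2 FROM keyvalue WHERE id = 1"
).fetchall()
print(plan)
```

SQLite reports the search as "USING COVERING INDEX" in the plan, which is exactly the situation the experiment set up, and, as noted above, it still didn't buy anything once all data was already in memory.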
With this in place, I managed to get:
- MySQL with MyISAM - 37.000 rows read per second
I'll try using tcmalloc with the other engines also, and see what happens.
In conclusion so far
Don't take me wrong here; I'm not out to prove that MySQL or MongoDB
sucks, nor am I out to do some exact benchmarking. What I am out to do
is to see if the claim holds that an RDBMS can perform as well as a standard
Key Value Store in an environment optimized for a Key Value Store (i.e.
all data in memory). And such an environment isn't that specialized,
actually: databases of a few tens of gigabytes are really common and
can easily fit in memory. And although I do prefer the stricter
data model of an RDBMS to the pretty much schema-free design of a NoSQL
alternative such as MongoDB (MongoDB really has some reliance on schema,
but in essence it's schema-free), I have to admit that sometimes you
need more performance, and distributing the load over multiple servers
is a lot easier when you don't have to worry about global transactions
or global consistency. Frankly, in many cases you can live without
those, to an extent.
What remains to be tried is MySQL with NDB using multiple mysqld
servers, InnoDB using tcmalloc, Tarantool, and the MySQL HANDLER
interface (actually, I have already tested the latter, but I am not sure
I got the tests right).
Who apologizes for the omission of the test setup before publishing the first results