Degenerate Performance Scenario for LMDB
I was working peacefully through some stuff with Voron when I ran into some really bad performance in a pretty important scenario. I decided that this was probably something I was doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I was doing wrong.
Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <windows.h>
#include <rpc.h>      /* UuidCreate / UUID, link with rpcrt4.lib */
#include "lmdb.h"

int main(int argc, char *argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;
    MDB_val key, data;
    MDB_stat mst;
    MDB_cursor *cursor;
    char sval[32];
    clock_t start = clock(), end;

    srandom(time(NULL));

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024 * 1024 * 768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "e:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;

        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);

        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);        /* random, unordered keys */
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }

        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (int)(end - start));
    fgetc(stdin);

    mdb_env_close(env);
    return 0;
}
As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a key of 16 bytes and a value of 32 bytes. Now, you might note that this is probably the worst case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long time to narrow down what was actually going on. At first I was convinced that the problem was with Windows' implementation of memory mapped files, but eventually I realized that "select ain't broken".
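To put rough numbers on that fragmentation (a back-of-the-envelope estimate, assuming LMDB's default 4,096-byte page size and a few bytes of per-entry node overhead): a 16-byte key plus a 32-byte value comes to roughly 60 bytes per leaf entry, so a full leaf holds around 65-70 entries, and random-key splits tend to leave leaves only partly full. That puts 10,000 entries at something like 150-250 leaf pages, which means the 100 random keys in a single transaction will mostly land on different leaves, and almost every one of them forces a separate leaf page to be rewritten at commit time.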
Here is the Process Monitor trace of the first few transactions. The highlighted calls are FlushBuffersFile, which is how a call to FlushFileBuffers shows up in the trace, and which indicates a commit.
As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…
And as more time goes by, we get to the following (there are more rows at the top that I had to remove so I could take the snapshot)…
In fact, I put the data in Excel and got:
And those are some really bad numbers, I think you'll agree. After a very short period of time, the cost of committing a transaction goes up to over 50 seeks and about 350 KB of writes.
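Some quick arithmetic on those figures (again just a rough sanity check, assuming 4,096-byte pages and a typical 8-10 ms average seek time for a consumer hard drive): 350 KB per commit is roughly 90 dirty pages, which fits the estimate above of a transaction's 100 random keys dirtying most of 100 separate leaves plus their parents; and 50+ seeks alone work out to something like half a second per commit spent just moving the disk head, before any data is transferred.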
Let us compare that to what happens when we are actually writing sequential data, using UuidCreateSequential instead of UuidCreate (the one-line change is shown below). The test completes in half the time, and the actual statistics are also very interesting.
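Concretely, the only change to the test above is the call that generates the key; assuming the same Windows RPC UUID API used in the original listing, the inner loop becomes:

        for (j = 0; j < 100; j++)
        {
            UuidCreateSequential(&id);   /* sequential keys instead of UuidCreate(&id) */
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }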
We are writing a lot less, but much more importantly, we are doing far fewer seeks. For reference, here is the final transaction in the sequential write scenario:
I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.
Now, the interesting question here is: what exactly is going on? I think that a large part of this is the way LMDB allocates pages. It uses copy-on-write, which means that after a transaction is done, the pages it replaced become free. That means they can be reclaimed, and the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.
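To get an intuition for how reclaiming freed pages scatters a commit's writes, here is a toy simulation; it is emphatically not LMDB's actual allocator, just a sketch of the trade-off. Each transaction dirties 100 random leaf pages, and the copy-on-write copy of each dirty page either goes to a page reclaimed from the freelist or is appended at the end of the file:

/* Toy model of copy-on-write page allocation. NOT LMDB's real allocator;
 * it only illustrates why reusing freed pages scatters a commit's writes,
 * while append-only allocation keeps them contiguous at the cost of file growth. */
#include <stdio.h>
#include <stdlib.h>

#define LEAVES   150   /* leaf pages in the tree (~10,000 small entries) */
#define PER_TXN  100   /* leaf pages dirtied per transaction (random keys) */
#define TXNS     100

static void simulate(int reuse_freed)
{
    int page_of[LEAVES];                      /* physical page of each leaf */
    int *freelist = malloc(sizeof(int) * (LEAVES + PER_TXN * TXNS));
    int free_count = 0, next_page = LEAVES, t, i;

    for (i = 0; i < LEAVES; i++)
        page_of[i] = i;                       /* initially laid out in order */

    for (t = 0; t < TXNS; t++)
    {
        int freed[PER_TXN], lo = 1 << 30, hi = -1;

        for (i = 0; i < PER_TXN; i++)
        {
            int leaf = rand() % LEAVES;       /* random key -> random leaf */
            freed[i] = page_of[leaf];         /* old copy becomes free */
            page_of[leaf] = (reuse_freed && free_count > 0)
                                ? freelist[--free_count]  /* reclaimed page */
                                : next_page++;            /* grow the file */
            if (page_of[leaf] < lo) lo = page_of[leaf];
            if (page_of[leaf] > hi) hi = page_of[leaf];
        }

        for (i = 0; i < PER_TXN; i++)         /* freed pages become reusable */
            freelist[free_count++] = freed[i];

        if (t % 25 == 0 || t == TXNS - 1)
            printf("  txn %3d: writes span pages %5d..%5d, file is %5d pages\n",
                   t, lo, hi, next_page);
    }
    free(freelist);
}

int main(void)
{
    printf("reusing freed pages (copy-on-write reclamation):\n");
    simulate(1);
    printf("append-only allocation (never reuse):\n");
    simulate(0);
    return 0;
}

With reuse enabled the file stays at a few hundred pages, but every commit's writes span practically all of it; the append-only run writes 100 contiguous pages per commit but grows the file past 10,000 pages. That is essentially the disk-space-versus-seeks trade-off described above.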
That means that, in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when and how we give out those pages, to make sure that we aren't generating a really bad situation for the database when commit time comes.
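Just to make "more conservative" concrete, here is one possible knob, sketched purely as an illustration on the toy model above (not a description of any real allocator): keep appending to the end of the file and only start handing out reclaimed pages once a sizeable backlog of them has built up, so commits keep writing to a clustered region for longer.

/* Hypothetical variant of the allocation line in simulate() above:
 * only dip into the freelist once it holds a large backlog; otherwise
 * keep this commit's writes clustered at the end of the file. */
#define FREE_BACKLOG 2000

            page_of[leaf] = (reuse_freed && free_count > FREE_BACKLOG)
                                ? freelist[--free_count]
                                : next_page++;

The trade-off is the same one described above, just shifted: the file grows faster, but each commit's writes stay close together.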
Published at DZone with permission of Oren Eini, DZone MVB.