Rooting Out Redundancy in Neo4j
Intro
So, for the last 2 months we've been working diligently, trying to create the 1.5 release of Neo4j.
While on the surface it may look like little has changed, under the hood a huge amount of work has gone into a far more stable and usable HA implementation and into rewriting the property storage layer to use far less disk space, while maintaining all its features and providing a speed boost at the same time. In this post I will deal exclusively with the latter.
Departing from the old model: A new way to store things
So far, the properties were stored on disk in a doubly linked list,
where each of its nodes contained some necessary
administrative/structural overhead and the actual property data. More
specifically, the layout was:
byte 0      : 4 high bits of previous pointer, inUse flag
byte 1      : unused
byte 2      : 4 high bits of next pointer
bytes 3-4   : property type
bytes 5-8   : property index
bytes 9-12  : previous pointer, 32 low bits
bytes 13-16 : next pointer, 32 low bits
bytes 17-24 : property data
The last 8 bytes were where the value was stored - enough to accommodate all primitive values, a short string, or a pointer into the dynamic store, where a dynamic record chain would hold a long string, an array of primitives, or a String[].
There is some waste here, partly because the full 8 bytes are only used in the (rare) cases of storing doubles and longs or short strings, but mostly because these pointers are repeated for each property, making the structural overhead really felt. On the flip side, the ShortString optimization was a great success, proving the value of inlining more property types. So we decided to highlight the good parts and lowlight the bad, ending up with a PropertyRecord structure that is no longer equivalent to one property but instead acts as a container for a variable number of variable-length properties. The current layout is:
byte 0      : 4 high bits of previous pointer, 4 high bits of next pointer
bytes 1-4   : previous property record
bytes 5-8   : next property record
bytes 9-40  : payload
Yes, that is correct: no inUse flag. The reason lies in the payload structure. First, let's call the four 8-byte blocks in the payload simply blocks, to have a convenient name for them. Each of these blocks is used in various ways, depending on the property data type. Starting off, every property needs to store its property index and property type. These are common and always present, with the property index taking up the first 3 bytes of the block and the type taking up the 4 high bits of the 4th byte. After that comes the property value. If it is a primitive that fits in 4 bytes, then the 4 low bits of the 4th byte are skipped and the remaining 4 bytes of the block store the value, and we are done. When storing a pointer into the DynamicStore for non-short strings and for arrays, the 36 bits required find a home in the second half of the 4th byte and the 4 low-order bytes. This means that each PropertyRecord can store up to 4 such properties - a huge saving in space.
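To make the packing concrete, here is a minimal Java sketch of how a single 8-byte block could be assembled; the bit positions, class name and method names are illustrative assumptions for this post, not the actual Neo4j implementation.

// Illustrative sketch only: the exact bit positions are assumptions chosen
// for clarity and are not guaranteed to match Neo4j's on-disk layout.
public final class PropertyBlockSketch {

    // A property that fits in 4 bytes (e.g. an int) packed into one block:
    //   24 bits : property index | 4 bits : type | 4 bits : skipped | 32 bits : value
    static long packIntProperty(int propertyIndex, int type, int value) {
        long block = 0L;
        block |= (propertyIndex & 0xFFFFFFL) << 40; // first 3 bytes: property index
        block |= (type & 0xFL) << 36;               // high nibble of the 4th byte: type
        block |= (value & 0xFFFFFFFFL);             // last 4 bytes: the value itself
        return block;
    }

    // A 36-bit pointer into the dynamic store spills into the low nibble of
    // the 4th byte plus the last 4 bytes.
    static long packDynamicPointer(int propertyIndex, int type, long pointer) {
        long block = 0L;
        block |= (propertyIndex & 0xFFFFFFL) << 40;
        block |= (type & 0xFL) << 36;
        block |= (pointer & 0xFFFFFFFFFL);          // 36 bits of pointer
        return block;
    }

    // Reading the type back; a type of 0 marks an unused block (see below).
    static int typeOf(long block) {
        return (int) ((block >>> 36) & 0xFL);
    }
}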
For longs and doubles, which require a full 8 bytes, the trailing 4 1/2 bytes are skipped and instead the next block is used as a whole to store the value. This leads to some waste, but it is still more efficient than the previous method, and it is a relatively rare use case.
What remains are ShortStrings and the brand new ShortArray. Since we saved all that space and all those I/O calls with ShortString, why not expand on the idea? We now have LongerShortString, which is like ShortString but on crack. It operates on the same principle - it scans a string, sees if it falls within an encoding, encodes it, and stores a header with the length and the encoding table id followed by the actual data, encoded in longs that take up blocks right after the property info. If it does not fit in the maximum of 3 1/2 blocks of a property record, it is instead encoded as UTF-8 and stored in the DynamicStringStore. A similar idea is applied to arrays. When passed a primitive array, we first determine the minimum number of bits required to store its values, effectively shaving off all the leading zeros we can while keeping all array members the same size. This means that if we are asked to store new int[] {1,2,3,4,5}, the entries will take up not 32 but 3 bits each. A boolean[], for example, costs 1 bit per entry. Obviously, mixing in even a single negative value immediately forces the maximum number of bits per entry. So, to store an array we first determine this number, and then the header becomes:
4 bits, an enum value identifying the primitive type
6 bits, the length of the array
6 bits, the number of bits per item
and then follow the "bit shaved" array entries. The same algorithm is used for dynamic arrays as well, but the length is actually stored in the length field of the dynamic record (as usual), not in the ShortArray header - we just keep how many bits of the last byte are used. That, along with the bits-per-entry number, is enough to reconstruct the value. Of course, in this case as well, if the array does not fit in the PropertyRecord even after this "compression", it is stored in the DynamicArrayStore as usual, though now in its bit-shaved form as a byte[], meaning fewer DynamicRecords are used and so less waste. This comes at the price of reconstructing the array when reading it in, but the reduced I/O more than makes up for it. A more exact description of the new ShortString, including all the ShortString classes and size limits, as well as the new ShortArray, is available in the manual.
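As a rough illustration of the bit-shaving step, the following Java sketch computes the bits-per-entry number for an int[]; the class and helper names are assumptions for this example, not the real ShortArray code.

// Illustrative sketch of choosing the number of bits per array entry.
public final class BitShaveSketch {

    // Minimum number of bits needed to represent every value in the array.
    // A single negative value forces the full 32 bits, since its sign bit is
    // the highest bit.
    static int bitsPerEntry(int[] values) {
        int bits = 1; // at least one bit, even for an all-zero array
        for (int v : values) {
            if (v < 0) {
                return 32;
            }
            bits = Math.max(bits, 32 - Integer.numberOfLeadingZeros(v));
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(bitsPerEntry(new int[] {1, 2, 3, 4, 5})); // 3 bits per entry
        System.out.println(bitsPerEntry(new int[] {1, 2, -3}));      // 32 bits per entry
    }
}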
What about the mystery of the missing inUse flag? Well, that is a combination of two things. One is that the blocks are marked individually as in use or not, since the API allows a property to be deleted, and a property is now no longer a record but a collection of blocks. So we folded that into the property type, with 0 signifying not in use. The second is that the blocks are written out defragmented on disk, meaning that if we delete the middle one of 3 properties in a record (set its type to deleted), then only the remaining two will be written. This leads to a simple way of marking "no more properties in this record": writing a 0 as the 4th byte of the first unused block (the implementation just writes a whole zero long). A corollary of this is that a property record whose first block has a 4th byte of 0 is actually not in use.
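In terms of the illustrative layout sketched earlier, scanning a record's payload until that marker might look like the following; again an assumption-laden sketch, not the actual implementation.

// Walk a record's 4 payload blocks and stop at the first block whose type
// nibble (the 4 high bits of the 4th byte) is 0 - the "no more properties"
// marker. Bit positions follow the illustrative layout used above.
public final class PropertyRecordScanSketch {

    static int countUsedBlocks(long[] payloadBlocks) {
        int used = 0;
        for (long block : payloadBlocks) {
            int type = (int) ((block >>> 36) & 0xFL);
            if (type == 0) {
                break; // a zeroed type means the rest of the record is unused
            }
            used++;
        }
        return used;
    }
}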
Code walkthrough
I was going to outline the changes and additions at the source code level here, but this post is getting too long. Besides, given the above, the code is straightforward to follow. If you have any questions, suggestions, or would like to small talk about the implementation, drop by our mailing list.
Just a tweaking note here - the logic of when and how allocation of blocks happens, and the defragmentation strategy, is held in WriteTransaction. Go ahead and experiment with what best suits your use case - feedback on these code paths will be greeted with much rejoicing!
Out with the old, in with the new: Migrating the store
Unlike the extended address space (4+ billion records) change a while ago, this store upgrade cannot happen in place over an old database. We need to do a true migration, meaning recreating the store from scratch and replacing your existing data files with the new ones. This process is extremely safe: it never writes to your existing data files, it is crash resistant (so if it fails mid-way nothing bad happens), and it keeps a backup of your data (under upgrade-backup/ in the database directory). However, better safe than sorry, so it is considered good practice to keep an independent backup of your data.
The store migration process is relatively straightforward - it goes over the node and relationship stores, copying them over as they are, and, for each primitive, it reads in the property chains, transforms them into the new format, and stores them. That has the side benefit of compacting the property store, skipping over deleted entries, so you should notice a significant reduction in disk usage if you happen to delete lots of properties and do not restart often.
All the migration code is bundled in the kernel source in the package org.neo4j.kernel.impl.storemigration and can be run both as a standalone tool and as part of normal startup - so whether you use the server scripts or just the kernel library, simply set the config option "allow_store_upgrade"="true" and you are set to go.
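For example, when embedding the kernel this could look roughly like the sketch below; the exact constructor and configuration mechanics may differ in your version, so treat it as an illustration rather than the definitive recipe (server users would set the same option in the server's database configuration file instead).

import java.util.HashMap;
import java.util.Map;

import org.neo4j.kernel.EmbeddedGraphDatabase;

public class StoreUpgradeExample {
    public static void main(String[] args) {
        // Allow the kernel to migrate an older store on startup.
        Map<String, String> config = new HashMap<String, String>();
        config.put("allow_store_upgrade", "true");

        EmbeddedGraphDatabase db = new EmbeddedGraphDatabase("path/to/db", config);
        try {
            // ... use the database as usual ...
        } finally {
            db.shutdown();
        }
    }
}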
Onwards and upwards
There is more stuff in this release than can fit in a blog post. Long discussions in the community have ended up providing inspiration for substantial changes which not only add robustness to the current offering but pave the way for more exciting features to come. So, maybe "Boden Bord" is not filled to the brim with obvious new features, but rest assured, we are in for a wild next year.
Thank you for making Neo4j what it is.