Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Reviewing Noise Search Engine (Part 2)

DZone's Guide to

Reviewing Noise Search Engine (Part 2)

Take a look at a continuing review of the Noise search engine's project code. In particular, see how the shredder works.

· Big Data Zone
Free Resource

Free O'Reilly eBook: Learn how to architect always-on apps that scale. Brought to you by Mesosphere DC/OS–the premier platform for containers and big data.

In my previous post, I started going over the Noise project code. We got to the indexer, but there is still a lot to do. As a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

The indexer code handed off a lot of work to the shredder, so it is a good thing that this is my next target for review. At a glance, it looks like Noise is using rustc_serialize to do the actual JSON parsing. I’m guessing it makes sense at this point, but given that RavenDB 4.0 had rolled its own JSON parser and given that Noise is accepting the data as strings, I’m going to assume that JSON parsing costs are going to be high on any profiler output.

The first thing to see in the file is the shredder struct:

image

The existing_key_value_to_delete is very interesting since it implies that this is how deletes and updates are handled.

It looks like this is working really closely with the key builder. The first method on the file is:

image

There are a whole bunch of stuff like that, with roughly the same pattern (add_bool_null_entriesadd_stemmed_entries, etc). Because of the close cooperation of those two, I’m going to be reading them in parallel. The KeyBuilder struct looks like this:

image

At this point, I’m guessing that what is going on is that the shredder goes over the JSON document, and add the property path to the KeyBuilder. But given add_number_entries, I don’t know see how that would work. We are adding the document sequence number to the end of the key path, and store the value of the property in the value. That is… strange. The KeyBuilder data contains the property paths and the current position in the array (possibly nested), so that much is clear, its usage, not so much at this stage.

Because this is confusing, I’ll start from the shred function and go from there to see what it is exactly that this is going on.

image

This takes the string, passes it to the JSON parser, and starts iterating over it. OK, going over the code, that seems easier to follow now. This parses the JSON one token at a time. When it encounters a property, it adds that to the KeyBuilder. So the KeyBuilder is basically the path into the current value in the document that we are parsing.

This bugs me, though:

image

I’m following what is going on, and it makes perfect sense. But all those special key prefixes that are scattered are not documented anywhere that I could see, and as I mentioned previously, single-letter prefixes are hard on readability and I don’t think they add much, given that RocksDB is using prefix compression anyway.

There is already a PR to fix this, which is likely to be applied by the time you are reading this blog post, by the way. This change the prefixes to be constants, improving code readability.

A lot of the functionality seems to be inside maybe_add_value, see the call sites:

image

It looks like all the numbers are recorded as floats (given that JSON use double precision IEEE floating point, I guess that make sense). You can see the different prefixes that are given for the values.

The tests at the bottom of the file complete the picture, and if I understand this properly, given the following JSON:

{ "name":"John", "age":31, "city":"New York", “hobbies”: [“reading”, “dogs”] }

The shredder is going to output the following entries, given a document sequence of 555:

  • W.age!f31#555, 
  • W.hoobies$!dogs#555,1 
  • W.hobbies$!reading#555,0 
  • W.name!john#555, 

I think that the code calls this key path, and the general format is (white space added):

W <property path> ! <value> # <document sequence> , <array path>

A critical observation is that an array is denoted as a $ in the property path, and there are as many indexes in the array path as there are $ signs in the property path. I’m currently ignoring the full-text search aspect of things since it is not important for our purposes. With the understanding of what is being generated, let’s move to seeing what is actually going on.

Let us go back to merge_existing_doc and see how that works. This is called after we shredded a document, so we have done all of the interesting work with regards to parsing the document and building the tokens, splitting the data, etc. At this point, we check if we already have this document and if so, we load all the existing values and compare them to the current version. The nice thing is that we get to optimize. If a certain value didn’t change, we don’t need to do anything at all. On the other hand, if a value was there and is no longer in the current version, we need to delete it. That is effectively how updates are handled.

With deletes, we do the same, but in reverse, we reconstruct the document, find all the indexed values, and then we remove it. Finally, we have add_all_to_batch, which takes all the in-memory work that was done and apply that to the RocksDB batch, pretty much finishing all the work. I don’t like that some methods are holding into the state (shredded_key_values/existing_key_value_to_delete) then flush it and some are manipulating the RocksDB batch directly, it seems like there should be either or kind of approach.

Coming back to the issue that I had with the compaction routine, we have this code, which adds a number to the entries:

image

Now, if I’m reading things properly, if the value is zero (not uncommon by any means), that will result in eight bytes of zeros being written as the value. Combine this with this:

image

And now I think that this will cause that value to be removed during compaction, resulting in data corruption — no idea if this is actually the case since I’m just going over the code and not running/testing it thoroughly.

It turns out that I’m not reading this properly (pun intended). I missed the negation on the first line of compaction_filter. The C and K values are the only things that will be removed during compaction, not the other way around, so no data corruption, just remember that there is a reason I use == false instead of ! becauseI keep missing them.

Easily deploy & scale your data pipelines in clicks. Run Spark, Kafka, Cassandra + more on shared infrastructure and blow away your data silos. Learn how with Mesosphere DC/OS.

Topics:
big data ,search engine ,shredder

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}