DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. Data
  4. Reviewing FASTER: The Hash Structure

Reviewing FASTER: The Hash Structure

In this article on the review of FASTER, let's take a look at the hash structure and find out how FASTER is handling GET and PUT functions.

Oren Eini user avatar by
Oren Eini
·
Sep. 08, 18 · Review
Like (2)
Save
Tweet
Share
4.63K Views

Join the DZone community and get the full member experience.

Join For Free

Continuing my stroll through the FASTER codebase, I want to try to get to the core stuff, not just the periphery. This is a bit hard because there is a lot of code here and it seems to be hard to find where the real action is happening. I decided to find how FASTER is handling GET and PUT operations. There is something called InternalHashTable inside FasterKv, which seems promising, so I’m going to follow up on that. Interestingly enough, it shows up as:

image

So there is some funky stuff here to consider, but we’ll deal with that later. For now, I want to understand what it is doing with the InternalHashTable. Inside that class, there is the notion of buckets:

image

And a bucket is:

image

This starts to get interesting for me, so let’s dig deeper into HashBucketEntry, where they use the same union trick I talked about in my previous posts, this allows us to easily define an atomic version of it with very little effort.

image

There is also the overflow entry definition, which looks like this:

image

I've got to say, I’m really liking this approach for handling data packing as well as multi-threading concerns. It also plays very well with the actual cache line architecture of modern CPUs.

There is also the KeyHash struct, whose heart is this:

image

So far, this lines us very neatly to how the FASTER paper talks about things. Given a KeyHash, here is how you get its bucket:

image

This simply index into the buckets_ array by taking the (size_ –1) bits from the hash itself. That tells me that the FASTER structure is very sensitive to the choice of hash function that will be used. This SO answer has some amazing detail on analysis of the output of hash functions with different outputs, which can give you some idea about why this matters so much. This post is an in-depth discussion of this, as well of why the choice of hash function matters. This is enough for me to stop everything and try to find what kind of hash function is actually being used here.

The user gets to define their own hash function, and if they do so badly (for example, use the identity function since they have an 8 bytes value and they need an 8 bytes hash) that is going to be fun. The function that they seem to be using in their examples is this one:

image

A bit of searching on the interweb didn’t narrow it down to something specific; it may be something that they wrote specifically for that. Given that the paper doesn’t mention this, it doesn’t seem to be something special.

Given that 40343 is a prime, it seems like a pretty common pattern of multiple by a prime with each 16 bits portion of the key. The idea is that the multiplication by prime will spread the bits around. No idea how high the quality of this hash function is since the actual analysis would take at least a few hours. At least at a glance, it doesn’t seem to be awful, but I do wonder whatever something like FNV-1. In fact, this is very similar, but with different prime and offset and with addition instead of XOR.

The actual InternalHashTable class doesn’t actually do something, there are a few functions around checkpointing and recovery, but they aren’t very interesting at this point. I’m going back to actually working with the hash table and look at how reads work. They end up in the InternalRead method, which does the majority of the work inside the FindEntry function. The first thing that happens there is:

image

The first line is there to handle growing the hash table on the fly and will be ignored for now. As you can see, this basically just gives me the relevant bucket. Here is the next stage: once we have a bucket, we need to find the relevant entry that matches what we are looking for. Here is the code:

image

You can see that there is a bunch of stuff going on here. First, we run over the known entries in the bucket, trying to find an entry with the same tag. You can see the tentative usage, which is used to sync up between concurrent readers and writers. There may be more items in the bucket than we have space for, so there is also the concept of overflow. This is basically a linked list of 7 items at a time and likely to be pretty fast (frequently already in the higher tiers of the cache for commonly accessed data).

Now that we have the entry, let’s see what is done with it. I’m currently reading through the InternalRead code. Just finding the matching hash doesn’t really help us, we may have hash collisions, this is handled here:

image

This is the first time that we actually run into the hybrid log (hlog here). But the basic idea is pretty obvious. Either the keys match or we have a pointer to the previous entry. I’m not seeing any handling of complete mismatch, though. I’m pretty sure that this is a bug (the behavior is different in the C# version).

This is enough for now. From here, the InternalRead code is starting to do things with the hlog, state machine, etc. I’m going to handle that in a separate post.

Database PRIME (PLC) POST (HTTP) Papers (software) Overflow (software) Data structure Cache (computing)

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Build an Automated Testing Pipeline With GitLab CI/CD and Selenium Grid
  • Seamless Integration of Azure Functions With SQL Server: A Developer's Perspective
  • Specification by Example Is Not a Test Framework
  • 19 Most Common OpenSSL Commands for 2023

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: