Reviewing Resin (Part 3)

We're returning to UpsertTransaction and zeroing in on its main parts. We'll also take a deep dive into the files it produces.

By Oren Eini · Jul. 15, 2017 · Review


Be sure to check out Part 1 and Part 2 first! In the last post, I started looking at UpsertTransaction but got sidetracked into the utility functions. Let's get back to it. The key parts of UpsertTransaction are:

[Image from the original post: UpsertTransaction's key fields]

Let's see what they are. DocumentStream is the source of the documents that will be written in this transaction. Its job is to get the documents to be indexed, to give them a unique ID if they don’t already have one, and to hash them.
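
Based on that description alone, a minimal sketch of the contract might look like this. Everything here, from the type names to the FNV-1a stand-in hash, is my assumption rather than Resin's actual code:

```csharp
// Hypothetical sketch of DocumentStream's responsibilities; the names and
// the hash function are my stand-ins, not Resin's actual implementation.
using System;
using System.Collections.Generic;
using System.Text;

public class Document
{
    public string Id;                                  // assigned below if missing
    public ulong Hash;                                 // content hash, used for updates later
    public Dictionary<string, string> Fields = new Dictionary<string, string>();
}

public class DocumentStream
{
    private readonly IEnumerable<Document> _source;

    public DocumentStream(IEnumerable<Document> source) => _source = source;

    public IEnumerable<Document> ReadSource()
    {
        foreach (var doc in _source)
        {
            if (string.IsNullOrEmpty(doc.Id))
                doc.Id = Guid.NewGuid().ToString("N"); // unique ID if none was given

            doc.Hash = ComputeHash(doc);               // hash the document content
            yield return doc;
        }
    }

    // FNV-1a 64-bit: a non-cryptographic hash, used here purely as a stand-in.
    private static ulong ComputeHash(Document doc)
    {
        ulong hash = 14695981039346656037UL;
        foreach (var kvp in doc.Fields)
        {
            foreach (byte b in Encoding.UTF8.GetBytes(kvp.Key + "\0" + kvp.Value + "\0"))
            {
                hash ^= b;
                hash *= 1099511628211UL;
            }
        }
        return hash;
    }
}
```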

I’m not sure yet what the point of that is, but we have this:

[Image from the original post: the document hashing code]

...which sounds bad. The likelihood of an accidental collision is small, but this isn't a cryptographic hash, so it is likely very easy to break deliberately. For example, look at what happened to MurmurHash.
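
To make that concern concrete, here is a generic birthday-style collision search against a 32-bit non-cryptographic hash (FNV-1a here, purely as an example; this is not Resin's hash). It finds two distinct inputs with the same hash after roughly 77,000 attempts on average, which is instant on any machine:

```csharp
// Generic demonstration: finding a collision for a 32-bit non-crypto hash
// by brute force. By the birthday bound, ~2^16 (about 77k) random inputs
// suffice on average against a 32-bit output space.
using System;
using System.Collections.Generic;

class CollisionDemo
{
    static uint Fnv1a32(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 16777619;
        }
        return hash;
    }

    static void Main()
    {
        var seen = new Dictionary<uint, string>();
        for (int i = 0; ; i++)
        {
            string candidate = "doc-" + i;
            uint h = Fnv1a32(candidate);
            if (seen.TryGetValue(h, out var other))
            {
                Console.WriteLine($"Collision: \"{other}\" and \"{candidate}\" both hash to {h}");
                return;
            }
            seen[h] = candidate;
        }
    }
}
```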

I think that this is later used to handle some partitioning in the trie, but I’m not sure yet. We’ll look at _storeWriter later. Let's see what UpsertTransaction does: it builds a trie, then pushes each of the documents from the stream through the trie. The code is doing a lot of allocations, but I’m going to stop harping on that from now on.

The trie is called for each term and for each document with the following information:

[Image from the original post: the per-term, per-document information passed to the trie]

The code isn’t actually using a tuple; I just collapsed a few classes to make it clear what the input is.

This is what will eventually allow the trie to do lookups of a term and find the matching document, I’m assuming.

That method is going to start a new task for that particular field name if it is new, and push the new list of words for that field into the work queue for that task. The bad thing here is that we are talking about a blocking task, so if you have a lot of fields, you are going to spawn off a lot of threads — one per field name.
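
Reconstructed as code, that pattern looks roughly like the following. The type and member names are my guesses; the point is the one-blocking-worker-per-field-name shape (thread-safety around the dictionary is omitted for brevity):

```csharp
// Sketch of the one-task-per-field pattern described above (my reconstruction,
// not Resin's actual code). Each new field name gets its own long-running
// worker that drains a blocking queue; many fields means many blocked threads.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class TrieBuilder
{
    private readonly Dictionary<string, BlockingCollection<(string Token, int DocumentId)>> _queues
        = new Dictionary<string, BlockingCollection<(string Token, int DocumentId)>>();
    private readonly List<Task> _workers = new List<Task>();

    public void Add(string field, string token, int documentId)
    {
        if (!_queues.TryGetValue(field, out var queue))
        {
            queue = new BlockingCollection<(string Token, int DocumentId)>();
            _queues[field] = queue;
            // One long-running worker per field name.
            _workers.Add(Task.Factory.StartNew(
                () => BuildTrieFor(field, queue),
                TaskCreationOptions.LongRunning));
        }
        queue.Add((token, documentId));
    }

    private void BuildTrieFor(string field, BlockingCollection<(string Token, int DocumentId)> queue)
    {
        foreach (var (token, docId) in queue.GetConsumingEnumerable())
        {
            // Insert token -> docId into this field's trie (omitted).
        }
    }
}
```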

What I know now is that we are going to have a trie per field, and it is likely that (based on the design decisions made so far) a trie isn’t a small thing.

Next, UpsertTransaction needs to write the document. This is done by taking the document we are processing and turning it into a dictionary of short to string. I’m not sure how it is supposed to handle multiple values for the same field, but I’ll ignore that for now. That dictionary is then saved into a file, and its position and length are returned.
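
A sketch of that write, under my assumptions about the record layout (the file listing later shows the dictionary as Dictionary<short, Field>; I use string values here to keep it short):

```csharp
// Sketch of writing one serialized document and returning its address
// (my reconstruction; Resin's actual serialization differs in the details).
using System.Collections.Generic;
using System.IO;

public static class DocumentWriter
{
    // Returns (Position, Size): where the document starts in the .rdoc
    // file and how many bytes it occupies.
    public static (long Position, int Size) Write(
        Stream rdocStream, Dictionary<short, string> fields)
    {
        long position = rdocStream.Position;
        using (var writer = new BinaryWriter(rdocStream, System.Text.Encoding.UTF8, leaveOpen: true))
        {
            writer.Write(fields.Count);
            foreach (var kvp in fields)
            {
                writer.Write(kvp.Key);    // field id (short)
                writer.Write(kvp.Value);  // field value (length-prefixed string)
            }
        }
        int size = (int)(rdocStream.Position - position);
        return (position, size);
    }
}
```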

I know that I said that I won’t talk about performance, but I looked at the serialization code and saw that it is using compression, like this. It is done on a field-by-field basis, whereas you could probably benefit from compressing all the fields together.

[Image from the original post: the per-field compression code]

Those are a lot of allocations. And then we go a bit deeper:

[Image from the original post: the MemoryStream allocation and the ToArray call]

First, we have the allocation of the memory stream, then the ToArray call. And that happens per field and per document. Actually, if we go up, we’ll see:

[Image from the original post: the calling code, one level up, allocating again]

So it is allocations all the way down.
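
Reconstructing the criticized pattern (this is my illustration, not Resin's exact code): a fresh MemoryStream, a fresh compression stream, and a full ToArray copy for every field of every document, instead of reusing buffers or compressing the document as a whole:

```csharp
// Roughly the allocation pattern being criticized (my reconstruction):
// a new MemoryStream, a new GZipStream, and a ToArray copy per field,
// per document.
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FieldCompressor
{
    public static byte[] Compress(string fieldValue)
    {
        using (var output = new MemoryStream())                // allocation per field
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
            {
                var bytes = Encoding.UTF8.GetBytes(fieldValue); // another allocation
                gzip.Write(bytes, 0, bytes.Length);
            }
            return output.ToArray();                           // full copy, per field
        }
    }
}
```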

Okay, let us focus on what is going on in terms of files:

  • "write.lock": This one is pretty obvious
  • *.da: Stands for document address and holds a series of document addresses ((long Position,int Size)). I assume that this is using the same sort as something else — not sure yet. 
  • *.rdoc: Documents are stored here. It contains the actual serialized data for the documents (the  Dictionary<short, Field>). This is the target for the addresses that are held by the *.da files.
  • *.pk: Holds document hashes and olds a list of document pk hash and a flag saying if it is deleted, I’m assuming. From context, it looks like the hash is a way to update documents across transactions.
  • *.kix: Key index; a text file holding the names of all the fields across the entire transaction.
  • *.pos: Posting file; this one holds the tries that were built during the transaction. This is basically just List<(int DocumentId, int Count)>, but I’m not sure how they are used just yet. It looks like this is how Resin is able to get the total term frequency per document. It looks like this is also sorted.
  • *.tri: The trie files that actually contain the specific values for a particular field. The name pattern is {indexVersion}-{fieldName}.tri. That means that your field names are limited to valid file names, by the way.
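
If the *.da records really are a bare sequence of (long Position, int Size) pairs, as described above, then each record is 12 bytes and the address of document N is a single seek away. The fixed on-disk layout is my assumption:

```csharp
// Sketch: if each *.da record is a fixed 12 bytes (long Position + int Size),
// the address of document N lives at byte offset N * 12. The exact layout
// is my assumption, not something shown in the post.
using System.IO;

public static class DocumentAddressFile
{
    private const int RecordSize = sizeof(long) + sizeof(int); // 12 bytes

    public static (long Position, int Size) Read(Stream daStream, int documentId)
    {
        daStream.Seek((long)documentId * RecordSize, SeekOrigin.Begin);
        using (var reader = new BinaryReader(daStream, System.Text.Encoding.UTF8, leaveOpen: true))
        {
            return (reader.ReadInt64(), reader.ReadInt32());
        }
    }
}
```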

The last part of UpsertTransaction is the commit, which essentially boils down to this:

[Image from the original post: the commit code]

I think that this was a very insightful read; I have a much better understanding of how Resin actually works. I’m going to speculate wildly, and then use my next post to check further into that.

Let us say that we want to search for all users who live in New York City. We can do that by opening the 636348272149533175-City.tri file. The 636348272149533175 is the index version, by the way.

Using the trie, we search for the value "New York City". The trie actually gives us a (long Position, int Size) address into the 636348272149533175.pos file, which holds the postings. From that, we get an array of (int DocumentId, int Count) entries for the documents that matched that particular value.

If we want to retrieve those documents, we can use the 636348272149533175.da file, which holds the addresses of the documents. Again, this is effectively an array of (long Position, int Size) entries that we can index into using the DocumentId. Each entry points to the location in the 636348272149533175.rdoc file that holds the actual document data.
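
Putting the whole walkthrough together as one sketch. The on-disk layouts and the trie lookup shape are my reconstruction from the description above, not Resin's actual formats (the trie lookup itself is abstracted away as a delegate):

```csharp
// End-to-end sketch of the read path described above (my reconstruction):
// trie -> postings (.pos) -> document address (.da) -> document data (.rdoc).
using System;
using System.Collections.Generic;
using System.IO;

public static class ReadPath
{
    // trieLookup: term -> (Position, Size) into the .pos file, as resolved
    // from the {indexVersion}-{fieldName}.tri file.
    public static List<Dictionary<short, string>> Search(
        string indexVersion,
        Func<string, (long Position, int Size)> trieLookup,
        string term)
    {
        var results = new List<Dictionary<short, string>>();
        var postingAddress = trieLookup(term);

        using (var pos = File.OpenRead(indexVersion + ".pos"))
        using (var da = File.OpenRead(indexVersion + ".da"))
        using (var rdoc = File.OpenRead(indexVersion + ".rdoc"))
        using (var posReader = new BinaryReader(pos))
        using (var daReader = new BinaryReader(da))
        using (var rdocReader = new BinaryReader(rdoc))
        {
            // The postings are an array of (int DocumentId, int Count).
            pos.Seek(postingAddress.Position, SeekOrigin.Begin);
            int entries = postingAddress.Size / (sizeof(int) * 2);
            for (int i = 0; i < entries; i++)
            {
                int docId = posReader.ReadInt32();
                int count = posReader.ReadInt32(); // term frequency in that doc

                // DocumentId indexes into the fixed-width .da records,
                // which point at the document's bytes in the .rdoc file.
                da.Seek((long)docId * 12, SeekOrigin.Begin);
                long docPosition = daReader.ReadInt64();
                int docSize = daReader.ReadInt32();

                rdoc.Seek(docPosition, SeekOrigin.Begin);
                var fields = new Dictionary<short, string>();
                int fieldCount = rdocReader.ReadInt32();
                for (int f = 0; f < fieldCount; f++)
                    fields[rdocReader.ReadInt16()] = rdocReader.ReadString();
                results.Add(fields);
            }
        }
        return results;
    }
}
```

Each step is a single seek plus a read, which is why fixed-size address records matter: they turn a DocumentId into a file offset with plain arithmetic.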

I’m not sure yet what the point of *.pk and *.kix is, but I’m sure that in the next post, we’ll figure it out.


Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
