DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Databases
  4. The Guts 'n' Glory of Database Internals: Searching Information and File Format

The Guts 'n' Glory of Database Internals: Searching Information and File Format

Here, we go over how something as simple as adding a value to a database file can become a nightmare — and how to handle it.

Oren Eini user avatar by
Oren Eini
·
Jun. 14, 16 · Tutorial
Like (6)
Save
Tweet
Share
2.57K Views

Join the DZone community and get the full member experience.

Join For Free

[Part of a continuing series designed to refresh you on the issues between using file-based data and the database.]

In my previous post, I showed how we can use a simple text-based format to store records, and what it means for reading/writing those records. Searching in such a system means that you have quite a bit of work to do because you need to scan the whole file to do so.

That isn't going to really work for us. So, we need to introduce indexes. An index is actually a really simple idea. Given that we have the original file, we'll have the following index for username:

Image title

Basically, the first column in the index is the value of the username, and those values are sorted, so searching the index has an O(logN) complexity associated with it. Once we have done that, we have the record number, and we can use that to find the position in the file and read the whole record.

All in all, this is pretty basic, right? So why am I dedicating a whole post to this topic?

Well, the issue with indexes would be so simple if we were planning on only doing them once. But we want to update them, and that lead to certain issues. While the index above shows only a few records, the actual data size we are talking about here is a million records. That gives us a total index size of about 16MB. What happens if we need to add a new username? In this case, a die hard fan whose username is ‘baseball’ ?

Well, in order to maintain the order, we’ll have to put it between the first two entries. Which will require us to move the rest of the data by that same amount. Effectively, in order to add a single entry to the index, we'll have to write 16MB.

In other words, because files don't allow us to dynamically add/remove data from the middle of the file without a lot of expensive I/O, we can't really use flat files for this. It isn't a surprise, but I wanted to start from the bare minimum and walk us up in the complexity hierarchy as we discover why we need those things.

So, we need to find a file format that will allow us to update things without having to shuffle the entire file every time we do that. Looking at the literature, I can compare the flat file approach to having a sorted array, and it has the same problems. The typical solution for that is to use a binary search tree. In this way, we always have the root of the tree at the beginning of the file and use offsets to jump around in the file according to where we need to go.

The code for searching in this kind of file looks like this:

image

Note that this is with all error handling removed, and while it gives us what we want, this solution has several issues.

To start with, if we want to have good performance, we'll need to balance the tree on inserts/deletes, and that can be complex. Not only that, but it also means that the number of file operations that we'll actually perform is pretty high here. This isn't good because files reside on (very slow) hardware, so we probably don't want to use that.

Instead, we need to find something that will allow us to do a lot fewer seeks in the file (which takes about 10-20 ms on a standard HDD), and can handle concurrent work better (won't be fighting over the disk head position so much). We also need to make sure that the amount of work that we have to do when we modify something is minimal. In the case of a balanced tree, the amount of work to balance it can be extremely high, and the number of places you’ll have to jump around (thus, seek) is incredible.

In the next post, we'll discuss persistent algorithm options here, and what we should use to store the data.

Database

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Utilize OpenAI API to Extract Information From PDF Files
  • Ultimate Guide to FaceIO
  • AWS Cloud Migration: Best Practices and Pitfalls to Avoid
  • Building a Scalable Search Architecture

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: