DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Big Data Search, Part 3: Binary Search of Textual Data

Big Data Search, Part 3: Binary Search of Textual Data

Oren Eini user avatar by
Oren Eini
·
Jan. 22, 14 · Interview
Like (0)
Save
Tweet
Share
5.46K Views

Join the DZone community and get the full member experience.

Join For Free

The index I created for the exercise is just a text file, sorted by the indexed key. When doing a search by a human, that makes it very easy to work with. Much easier than trying to work with a binary file, it also helps debugging.

However, it does make it running a binary search on the data a bit harder. Mostly because there isn’t a nice way to say “give me the #th line”. Instead, I wrote the following:

 public void SetPositionToLineAt(long position)
 {
     // now we need to go back until we either get to the start of the file
     // or find a \n character
     const int bufferSize = 128;
     _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);
  
     var charCount = _encoding.GetMaxCharCount(bufferSize);
     if (charCount > _charBuf.Length)
         _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];
  
     while (true)
     {
         _input.Position = position - (position < bufferSize ? 0 : bufferSize);
         var read = ReadToBuffer(bufferSize);
         var buffer = _buffer.GetBuffer();
         var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
         for (int i = chars - 1; i >= 0; i--)
         {
             if (_charBuf[i] == '\n')
             {
                 _input.Position = position - (bufferSize - i) + 1;
                 return;
             }
         }
         position -= bufferSize;
         if (position < 0)
         {
             _input.Position = 0;
             return;
         }
     }
 }

This code starts at an arbitrary byte position, and go backward until it find the new line character ‘\n’. This give me the ability to go to a rough location and get the line oriented input.

Once I have that, the rest is pretty easy. Here is the binary search:

 while (lo <= hi)
 {
     position = (lo + hi) / 2;
     _reader.SetPositionToLineAt(position);
  
     bool? result;
     do
     {
         result = _reader.ReadOneLine();
     } while (result == null); // skip empty lines
  
     if (result == false)
         yield break; // couldn't find anything
  
     var entry = _reader.Current.Values[0];
     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
 
     if (match == 0)
     {
         break;
     }
     if (match > 0)
         lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
     else
         hi = position - 1;
 }
  
 if (match != 0)
 {
     // no match
     yield break;
 }

The idea is that this positions us on the location of the index that has an entry with a value that is equal to what we are searched on.

We then write the following to actually get the data from the actual data file:

 // we have a match, now we need to return all the matches
 _reader.SetPositionToLineAt(position);
  
 while(true)
 {
     bool? result;
     do
     {
         result = _reader.ReadOneLine();
     } while (result == null); // skip empty lines
  
     if(result == false)
         yield break; // end of file
  
     var entry = _reader.Current.Values[0];
     match = Utils.CompareArraySegments(expectedIndexEntry, entry);
     if (match != 0)
         yield break; // out of the valid range we need
  
     _buffer.SetLength(0);
     _data.Position = Utils.ToInt64(_reader.Current.Values[1]);
  
     while (true)
     {
         var b = _data.ReadByte();
         if (b == -1)
             break;
         if (b == '\n')
         {
             break;
         }
         _buffer.WriteByte((byte)b);
     }
  
     yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
 }

As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that.

We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.


Big data

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Beginners’ Guide to Run a Linux Server Securely
  • AWS Cloud Migration: Best Practices and Pitfalls to Avoid
  • Core Machine Learning Metrics
  • Using the PostgreSQL Pager With MariaDB Xpand

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: