DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Efficiently Processing Billions of Rows Daily With Presto
  • The Big Data Architecture Blueprint: Core Storage, Integration, and Governance Patterns
  • Is the Data Warehouse Dead? 3 Patterns From Enterprise Architecture That Answer This Question
  • Optimizing Databricks Spark Pipelines Using Declarative Patterns

Trending

  • Architecting an Embedded Efficiency Layer: A Platform Deep Dive into Day-Two Operational Tuning
  • AI-Driven Integration in Large-Scale Agile Environments
  • MuleSoft MCP and A2A in Production: What 17 Recipes Reveal
  • Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Big Data Search, Part 6: Sorting Randomness

Big Data Search, Part 6: Sorting Randomness

By 
Oren Eini user avatar
Oren Eini
·
Feb. 05, 14 · Interview
Likes (0)
Comment
Save
Tweet
Share
9.0K Views

Join the DZone community and get the full member experience.

Join For Free

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4 GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

 public static IEnumerable<uint> RandomNumbers()
 {
     const long count = 1024 *  1024 * 1024L  * 1;
     var random = new MyRand();
     for (long i = 0; i < count; i++)
     {
         if (i % 1024 == 0)
         {
             yield return 1;
             continue;
         }
         var result = random.NextUInt();
         while (result == 1)
         {
             result = random.NextUInt();
         }
         yield return result;
     }
 }
  
 /// <summary>
 /// Based on Marsaglia, George. (2003). Xorshift RNGs.
 ///  http://www.jstatsoft.org/v08/i14/paper
 /// </summary>
 public class MyRand
 {
     const uint Y = 842502087, Z = 3579807591, W = 273326509;
     uint _x, _y, _z, _w;
  
     public MyRand()
     {
         _y = Y;
         _z = Z;
         _w = W;
  
         _x = 1337;
     }
  
     public uint NextUInt()
     {
         uint t = _x ^ (_x << 11);
         _x = _y; _y = _z; _z = _w;
         return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
     }
 }

I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.

Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what is it worth, you can find the code for this here.



Big data Sorting

Published at DZone with permission of Oren Eini. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Efficiently Processing Billions of Rows Daily With Presto
  • The Big Data Architecture Blueprint: Core Storage, Integration, and Governance Patterns
  • Is the Data Warehouse Dead? 3 Patterns From Enterprise Architecture That Answer This Question
  • Optimizing Databricks Spark Pipelines Using Declarative Patterns

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook