DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Curious about the future of data-driven systems? Join our Data Engineering roundtable and learn how to build scalable data platforms.

Data Engineering: The industry has come a long way from organizing unstructured data to adopting today's modern data pipelines. See how.

Threat Detection: Learn core practices for managing security risks and vulnerabilities in your organization — don't regret those threats!

Managing API integrations: Assess your use case and needs — plus learn patterns for the design, build, and maintenance of your integrations.

Related

  • A Guide to Building Data Intelligence Systems: Strategic Practices to Building Robust, Ethical, and AI-Driven Data Structures
  • Leveraging Event-Driven Data Mesh Architecture With AWS for Modern Data Challenges
  • Automate Private Azure Databricks Unity Catalog Creation
  • Integrating Data Engineering Into Artificial Intelligence

Trending

  • Ditch Your Local Setup: Develop Apps in the Cloud With Project IDX
  • Automate Web Portal Deployment in Minutes Using GitHub Actions
  • Using SingleStore and WebAssembly for Sentiment Analysis of Stack Overflow Comments
  • Showing Long Animation Frames in Your DevTools
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Big Data Search, Part 6: Sorting Randomness

Big Data Search, Part 6: Sorting Randomness

By 
Oren Eini user avatar
Oren Eini
·
Feb. 05, 14 · Interview
Likes (0)
Comment
Save
Tweet
Share
8.7K Views

Join the DZone community and get the full member experience.

Join For Free

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4 GB of random numbers, let us find how many times we have the number 1.

Because I wanted to ensure a consistent answer, I wrote:

 public static IEnumerable<uint> RandomNumbers()
 {
     const long count = 1024 *  1024 * 1024L  * 1;
     var random = new MyRand();
     for (long i = 0; i < count; i++)
     {
         if (i % 1024 == 0)
         {
             yield return 1;
             continue;
         }
         var result = random.NextUInt();
         while (result == 1)
         {
             result = random.NextUInt();
         }
         yield return result;
     }
 }
  
 /// <summary>
 /// Based on Marsaglia, George. (2003). Xorshift RNGs.
 ///  http://www.jstatsoft.org/v08/i14/paper
 /// </summary>
 public class MyRand
 {
     const uint Y = 842502087, Z = 3579807591, W = 273326509;
     uint _x, _y, _z, _w;
  
     public MyRand()
     {
         _y = Y;
         _z = Z;
         _w = W;
  
         _x = 1337;
     }
  
     public uint NextUInt()
     {
         uint t = _x ^ (_x << 11);
         _x = _y; _y = _z; _z = _w;
         return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
     }
 }

I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them.

Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what is it worth, you can find the code for this here.



Big data Sorting

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • A Guide to Building Data Intelligence Systems: Strategic Practices to Building Robust, Ethical, and AI-Driven Data Structures
  • Leveraging Event-Driven Data Mesh Architecture With AWS for Modern Data Challenges
  • Automate Private Azure Databricks Unity Catalog Creation
  • Integrating Data Engineering Into Artificial Intelligence

Partner Resources


Comments

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: