Big Data Search, Part 5: Sorting Optimizations

By Oren Eini · Jan. 27, 14 · Interview
I have mentioned several times that the entire point of this exercise was just to see how this works, not to build anything production-worthy. Still, it is interesting to see how we could do better here.

In no particular order, there are at least several things we could do to significantly reduce the time it takes to sort. Right now we define 2 indexes on top of a 1GB file, and that takes under 1 minute to complete. Extrapolated linearly, that gives us a runtime of about 10 days over a 15TB file.
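The 10-day figure can be checked with a few lines of arithmetic. This is only a rough sketch: the function name is made up for illustration, and it assumes the cost scales linearly with file size (1 minute per GB, 1,000 GB per TB), which is optimistic for a sort that also has to merge its intermediate files.

```python
def extrapolate_days(minutes_per_gb: float, total_tb: float, gb_per_tb: int = 1000) -> float:
    """Linearly extrapolate a per-GB runtime to a larger file, in days."""
    total_minutes = minutes_per_gb * total_tb * gb_per_tb
    return total_minutes / (60 * 24)  # 1,440 minutes in a day


# 1 minute per GB over 15TB is 15,000 minutes.
print(round(extrapolate_days(1.0, 15), 1))  # 10.4
```

Since the measured time was under a minute, the real number would be somewhat lower, but it stays in the same ballpark.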

Well, one of the reasons for this performance is that we execute everything serially, that is, one index after another. But we have two completely isolated indexes, and there is no reason why we can't parallelize the work between them.
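A minimal sketch of building independent indexes in parallel might look like the following. The record layout (a list of dicts) and the names `build_index` and `build_indexes_in_parallel` are assumptions made for illustration, not the code from this series.

```python
from concurrent.futures import ThreadPoolExecutor


def build_index(records, key):
    """Return (key value, record position) pairs sorted by key value."""
    return sorted((rec[key], pos) for pos, rec in enumerate(records))


def build_indexes_in_parallel(records, keys):
    """Build one index per key; the indexes share no state, so each gets its own worker."""
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = {key: pool.submit(build_index, records, key) for key in keys}
        return {key: future.result() for key, future in futures.items()}


# Hypothetical sample data.
records = [
    {"name": "carol", "email": "a@example.com"},
    {"name": "alice", "email": "c@example.com"},
    {"name": "bob", "email": "b@example.com"},
]
indexes = build_indexes_in_parallel(records, ["name", "email"])
print(indexes["name"])  # [('alice', 1), ('bob', 2), ('carol', 0)]
```

In CPython the global interpreter lock keeps pure in-memory sorting on threads from using several cores, so a process pool would be the likelier choice there; the structure of the work split stays the same.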

For that matter, we buffer in memory up to a certain point, then we sort, then we buffer some more, and so on. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a new buffer while the previous one is being sorted.
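One way to sketch that hand-off, assuming a single background sorter thread and a hypothetical `sorted_runs` helper (neither is taken from the original code):

```python
from concurrent.futures import ThreadPoolExecutor


def sorted_runs(items, buffer_size):
    """Fill fixed-size buffers and sort each one on a worker thread while filling continues."""
    pending = []
    with ThreadPoolExecutor(max_workers=1) as sorter:
        buffer = []
        for item in items:
            buffer.append(item)
            if len(buffer) >= buffer_size:
                # Hand the full buffer to the sorter and keep parsing into a fresh one.
                pending.append(sorter.submit(sorted, buffer))
                buffer = []
        if buffer:
            # Sort whatever is left in the final, partially filled buffer.
            pending.append(sorter.submit(sorted, buffer))
        return [future.result() for future in pending]


print(sorted_runs([5, 3, 1, 4, 2], buffer_size=2))  # [[3, 5], [1, 4], [2]]
```

This sketch keeps every sorted run in memory and puts no limit on how many buffers can be waiting for the sorter. A real version would write each run to an intermediary file and cap the number of outstanding buffers.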

We wrote to intermediary files, but we wrote to those using plain file I/O. Yet it is usually a lot more costly to write raw data to disk than to compress it and then write the smaller result. We are writing sorted data, so it is probably going to compress pretty well.
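To get a feel for how well sorted data compresses, here is a small sketch that writes the same sorted run twice, once as plain text and once through gzip, and compares the file sizes. The helper names, the line-per-entry file format, and the choice of gzip are all assumptions for illustration.

```python
import gzip
import os
import tempfile


def write_run(path, sorted_lines, compress=True):
    """Write one sorted run, one entry per line, and return the file size in bytes."""
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as f:
        for line in sorted_lines:
            f.write(line + "\n")
    return os.path.getsize(path)


def read_run(path, compress=True):
    """Read a run back as a list of entries."""
    opener = gzip.open if compress else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def compare_sizes(sorted_lines):
    """Return (plain size, compressed size) for the same sorted run."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = write_run(os.path.join(tmp, "run.txt"), sorted_lines, compress=False)
        packed = write_run(os.path.join(tmp, "run.txt.gz"), sorted_lines, compress=True)
    return plain, packed


# Hypothetical sorted keys; neighbours share long prefixes, which helps compression.
lines = sorted(f"user{i:06d}@example.com" for i in range(1000))
plain_size, packed_size = compare_sizes(lines)
print(plain_size, packed_size)
```

Whether this wins overall depends on how CPU time for compression trades against disk throughput on the machine in question, so it is worth measuring rather than assuming.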

Those are the things that pop to mind. Can you think of additional options?


Big data Sorting

Published at DZone with permission of Oren Eini. See the original article here.
