Big Data Search, Part 1

By Oren Eini · Jan. 17, 14

I got tired of the old questions that we were asking candidates, so I decided to add a new one. This one is usually something that we’ll give the candidates to do at home, at their leisure. Let us imagine the following file:

 "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"

 "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"

 "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"

 "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"

 "Lenna","Paprocki","Feltz Printing Service","639 Main St","Anchorage","Anchorage","AK",99501,"907-385-4412","907-921-2010","lpaprocki@hotmail.com",http://www.feltzprintingservice.com

As you can see, this is a pretty trivial CSV file. However, let’s assume that it is a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file: we need to be able to query by email, or for all the people within a particular zip code. Because of the size, the solution can be composed of two parts, a prepare part (which can run for as long as it needs to) and a query-answering part. The maximum time to answer any query must be under 30 seconds. A rough sketch of one possible approach follows the requirement list below.

  • You can assume that the file never changes, and that once the prepare part is done, it will never need to be run again.
  • The answer to a query is the full CSV row.
  • You can assume a single machine with a 100 TB disk, 16 GB of RAM, and 8 CPU cores.
  • The solution cannot use any existing databases.
  • The solution needs to include an explanation of the various options that were available and why this specific solution was chosen.
  • After the prepare phase is done, the solution has to take up less than 30 TB of disk space (including the original file).
  • The solution should be easy to apply to different CSV files.
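
To make the problem concrete, here is a minimal sketch of one obvious direction, in C#: hash-partition the rows into bucket files, one set of buckets per queryable column, so that answering a query only ever requires scanning a single bucket. All of the names here (CsvIndexer, PrepareIndex, BucketCount, and so on) are illustrative, and the bucket count and the timing arithmetic are rough assumptions, not measurements.

using System.Collections.Generic;
using System.IO;
using System.Text;

class CsvIndexer
{
    // 15 TB spread over 4,096 buckets is roughly 3.7 GB per bucket, which a
    // sequential scan should cover inside the 30-second query budget
    // (rough arithmetic, assuming ~150-200 MB/sec from disk).
    const int BucketCount = 4096;

    // Stable FNV-1a hash, so bucket assignment is identical across runs
    // (string.GetHashCode() is randomized per process in modern .NET).
    static int BucketFor(string key)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (char c in key) { hash ^= c; hash *= 16777619u; }
            return (int)(hash % BucketCount);
        }
    }

    // Minimal quote-aware field splitter: handles embedded commas such as
    // "Benton, John B Jr" (it ignores escaped quotes, which the sample lacks).
    static List<string> SplitCsv(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == ',' && !inQuotes) { fields.Add(current.ToString()); current.Clear(); }
            else current.Append(c);
        }
        fields.Add(current.ToString());
        return fields;
    }

    // Prepare phase: stream the big file once and append every row to the
    // bucket file that its key hashes to. Memory stays bounded by the writers'
    // small buffers, regardless of how large the input is.
    public static void PrepareIndex(string csvPath, string indexDir, int keyColumn)
    {
        Directory.CreateDirectory(indexDir);
        // Writers are opened lazily; a real implementation would also keep an
        // eye on the open-file-handle limit, or partition in multiple passes.
        var writers = new StreamWriter[BucketCount];

        using var reader = new StreamReader(csvPath);
        reader.ReadLine(); // skip the header row
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var fields = SplitCsv(line);
            if (fields.Count <= keyColumn) continue;

            int bucket = BucketFor(fields[keyColumn]);
            writers[bucket] ??= new StreamWriter(Path.Combine(indexDir, $"bucket-{bucket:D4}.csv"));
            writers[bucket].WriteLine(line); // store the full row
        }

        foreach (var w in writers)
            w?.Dispose();
    }
}

One obvious catch: storing the full row once per indexed column means the email index and the zip index would each weigh roughly as much as the original 15 TB file, which blows the 30 TB budget. A tighter variant stores only the key and the row’s byte offset into the original file in each bucket, and seeks back into the original file to return the row; that is exactly the kind of trade-off the explanation requirement above is meant to surface.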

I decided that it wouldn’t be fair to ask candidates to do something like that without doing it myself. Mostly because the fact that I have a good idea about how to do something doesn’t mean that I understand the actual implementation issues that might pop up.

I actually gave myself a somewhat harder task: do the above-mentioned task, but without access to any library other than the BCL, and with a minimal amount of memory usage. The entire thing took less than a day, and it solves the problem quite a bit more efficiently than I had anticipated.
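
For a sense of what the answer phase can look like under those constraints, here is the query half of the same hypothetical bucket scheme sketched above; it needs nothing beyond the BCL’s stream APIs and reads one bucket lazily, so memory stays small. It reuses BucketFor and SplitCsv from the earlier sketch and belongs in the same illustrative CsvIndexer class.

    // Answer phase: hash the requested key, open only that one bucket file,
    // and scan it for matching rows. Only one line is held in memory at a time.
    public static IEnumerable<string> Query(string indexDir, int keyColumn, string key)
    {
        string path = Path.Combine(indexDir, $"bucket-{BucketFor(key):D4}.csv");
        if (!File.Exists(path))
            yield break; // no row anywhere hashes to this bucket

        foreach (string line in File.ReadLines(path))
        {
            var fields = SplitCsv(line);
            if (fields.Count > keyColumn && fields[keyColumn] == key)
                yield return line; // the answer is the full CSV row
        }
    }

For example, assuming the zip index was built with PrepareIndex("people.csv", "zip-index", keyColumn: 7), everyone in zip code 70116 would come back from Query("zip-index", 7, "70116").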

But I’ll discuss the details of my own solution in the next post.



Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

