Big Data Search, Part 1

By Oren Eini · Jan. 17, 14 · Interview


I got tired of the old questions that we were asking candidates, so I decided to add a new one. This one is usually something that we’ll give the candidates to do at home, at their leisure. Let us imagine the following file:

 "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"

 "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"

 "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"

 "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"

 "Lenna","Paprocki","Feltz Printing Service","639 Main St","Anchorage","Anchorage","AK",99501,"907-385-4412","907-921-2010","lpaprocki@hotmail.com",http://www.feltzprintingservice.com

As you can see, this is a pretty trivial CSV file. However, let us assume that it is a small sample of a CSV file that is 15 TB in size. The requirement is to be able to query that file: we need to be able to look up a row by email, or find all the people within a particular zip code. Because of the size, the solution is composed of two parts: a prepare part (which can run for as long as needed) and a query part. The maximum time to answer any query must be under 30 seconds.
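To see why the 30-second budget rules out the naive approach, here is a quick back-of-the-envelope check; the throughput figure is my own assumption, not something from the problem statement:

```python
# Rough feasibility check: even at an optimistic sequential read speed,
# scanning the whole 15 TB file per query takes hours, not seconds, so
# some kind of prebuilt index is unavoidable.
FILE_SIZE_TB = 15
READ_MB_PER_SEC = 500                    # assumed sequential disk throughput
bytes_total = FILE_SIZE_TB * 1024 ** 4
scan_seconds = bytes_total / (READ_MB_PER_SEC * 1024 ** 2)
print(f"naive full scan: ~{scan_seconds / 3600:.1f} hours per query")
```

At roughly 8.7 hours per scan under these assumptions, any acceptable solution has to do its heavy lifting in the prepare phase.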

  • You can assume that the file never changes, and that once the prepare part is done, it will never need to be run again.
  • The answer to a query is the full CSV row.
  • You can assume a single machine with a 100 TB disk, 16 GB of RAM and 8 CPU cores.
  • The solution cannot use any existing databases.
  • The solution needs to include an explanation of the various options that were available and why this specific solution was chosen.
  • After the prepare phase is done, the solution has to take up less than 30 TB of disk space (including the original file).
  • The solution should be easy to apply to different CSV files.
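For what it's worth, one classic shape for a solution is a single prepare pass that records the byte offset of every row per key value, so that answering a query becomes a handful of seeks. This is my own illustration in Python, not the author's implementation (which, per the text below, was built against the .NET BCL), and for the real 15 TB case the in-memory dict would have to become an on-disk structure:

```python
import csv

def prepare(csv_path, key_column):
    """One linear pass over the file: map each value of key_column to the
    byte offsets of the rows that contain it.  Storing offsets rather than
    row copies is what keeps the index small relative to the data.
    (Assumes no embedded newlines inside quoted fields.)"""
    index = {}
    with open(csv_path, "rb") as f:
        header = next(csv.reader([f.readline().decode("utf-8")]))
        key_pos = header.index(key_column)
        offset = f.tell()
        for raw in iter(f.readline, b""):
            row = next(csv.reader([raw.decode("utf-8")]))
            index.setdefault(row[key_pos], []).append(offset)
            offset = f.tell()
    return index

def query(csv_path, index, key):
    """Answer a query by seeking straight to each matching row: the cost is
    proportional to the number of matches, not to the file size, which is
    how the 30-second budget stays reachable."""
    with open(csv_path, "rb") as f:
        for off in index.get(key, []):
            f.seek(off)
            yield f.readline().decode("utf-8").rstrip("\r\n")
```

Making this work within 16 GB of RAM for a 15 TB input is the interesting part: the dict would be replaced by sorted runs on disk merged into one searchable file, one index per queryable column.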

I decided that it wouldn’t be fair to ask candidates to do something like that without doing it myself. Mostly because having a good idea about how to do something doesn’t mean that I understand the actual implementation issues that might pop up.

I actually gave myself a somewhat harder task: do the task described above, but without access to any library other than the BCL, and with a minimal amount of memory usage. The entire thing took less than a day, and it solves the problem quite a bit more efficiently than I anticipated.

But I’ll discuss the details of this in my next post.



Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
