DZone

Thomas Spicer

Founder at Openbridge

Boston, US

Joined Sep 2017

https://www.openbridge.com

Stats

Reputation: 404
Pageviews: 199.6K
Articles: 4
Comments: 3

Articles

How to Transition From Excel Reports to Business Intelligence Tools
While Excel is great, it can't do everything data analysts need. Read on to learn how to convert your Excel data to tools like Power BI and Tableau.
Updated May 28, 2019
· 17,324 Views · 3 Likes
Apache Parquet vs. CSV Files
When you only pay for the queries that you run, or resources like CPU and storage, it is important to look at optimizing the data those systems rely on.
Updated May 28, 2019
· 125,369 Views · 13 Likes
The Definitive 10-Point Scorecard Before Choosing a BI Solution
There are a ton of choices these days! In this overview, we put business intelligence platforms through a 10-point checklist.
January 19, 2018
· 7,242 Views · 3 Likes
8 Tips for Configuring Adobe Analytics Clickstream Data Feeds
With some planning and awareness, you'll quickly be on your way to harnessing your clickstream data to discover hidden trends, behaviors, and preferences!
November 2, 2017
· 8,520 Views · 2 Likes

Comments

Apache Parquet vs. CSV Files

May 29, 2019 · Thomas Spicer

In general, yes, Parquet will outperform CSV. Why? First, Parquet is a binary format optimized for read operations. It also stores metadata about the file, including the schema and column types. With a CSV, your tooling first needs to infer the data structure and types. The more complex the data in the CSV, the greater the performance cost. Imagine a CSV with 100 million rows where the last row has a different data type for "date" than all the prior rows. Inferring a CSV schema has a cost.

Maybe you have done some pre-processing on the CSV to clean it up? Maybe you store a schema somewhere else that needs to be read in as part of the query?

Ultimately, performance will often depend on the tools used. However, as a general rule, Parquet should exhibit better read performance in most situations.
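To make the inference cost concrete, here is a minimal sketch using only the Python standard library. The function name `infer_column_type`, the sample data, and the `YYYY-MM-DD` date format are assumptions for illustration, not how any particular query engine does it. The point it tries to show is that a CSV reader cannot trust a column's type until it has looked at every row, whereas a Parquet reader can take the type from the file's metadata.

```python
import csv
import io
from datetime import datetime


def infer_column_type(csv_text: str, column: str) -> tuple[str, int]:
    """Infer whether a CSV column is a date column (hypothetical helper).

    Returns the inferred type and the number of rows that were scanned.
    The column counts as "date" only if every value parses as YYYY-MM-DD.
    """
    inferred = "date"
    rows_scanned = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows_scanned += 1
        try:
            datetime.strptime(row[column], "%Y-%m-%d")
        except ValueError:
            # A single non-date value downgrades the whole column.
            inferred = "string"
    return inferred, rows_scanned


# Tiny stand-in for the "100 million rows, last row differs" case.
sample = "date,amount\n2019-05-27,10\n2019-05-28,12\nnot-a-date,7\n"
print(infer_column_type(sample, "date"))  # ('string', 3)
```

With three rows the scan is trivial, but the same pass over 100 million rows has to happen before the column type is known, unless a schema is supplied from somewhere else.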

Apache Parquet vs. CSV Files

Jun 08, 2018 · Thomas Spicer

Hard to answer without understanding what those files represent. I doubt leaving all the files as is would be ideal, since a query against CSV files might need to scan N files, row by row. Pretty intense. If the data shares a set of common schemas, you would be better off running a process to aggregate the files so you can optimize for the types of query operations you run. For example, if you aggregate by date, convert to Parquet, and then query, you will find your queries complete faster and cost less.
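A rough sketch of that aggregate-then-convert step might look like the following. It assumes every file shares one schema with a `date` column. The helper names (`group_rows_by_date`, `write_parquet_per_date`) and the `date=<value>.parquet` file naming are hypothetical. The grouping uses only the standard library, and the Parquet step assumes the third-party `pyarrow` package is installed.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_date(csv_texts, date_column="date"):
    """Merge rows from many CSV inputs and bucket them by date value."""
    grouped = defaultdict(list)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            grouped[row[date_column]].append(row)
    return dict(grouped)


def write_parquet_per_date(grouped, out_dir):
    """Write one Parquet file per date (assumes pyarrow is available)."""
    import os

    import pyarrow as pa
    import pyarrow.parquet as pq

    os.makedirs(out_dir, exist_ok=True)
    for date, rows in grouped.items():
        # Values are still strings here; casting to real types is left out.
        table = pa.Table.from_pylist(rows)
        pq.write_table(table, os.path.join(out_dir, f"date={date}.parquet"))


# Two small CSV "files" standing in for the N files in the question.
files = [
    "date,amount\n2018-06-01,5\n2018-06-02,7\n",
    "date,amount\n2018-06-01,3\n",
]
grouped = group_rows_by_date(files)
print({date: len(rows) for date, rows in grouped.items()})
# {'2018-06-01': 2, '2018-06-02': 1}
```

A query that filters on a single date then only needs to open the matching file instead of scanning all N CSVs row by row.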

The Definitive 10-Point Scorecard Before Choosing a BI Solution

Jan 23, 2018 · Thomas Spicer

You are most welcome!
