Stats
| Reputation | 404 |
| Pageviews | 199.6K |
| Articles | 4 |
| Comments | 3 |
Comments
May 29, 2019 · Thomas Spicer
In general, yes, Parquet will outperform CSV. Why? First, Parquet is a binary, column-oriented format optimized for read operations, and it carries metadata about the file, including the schema. With a CSV, your tooling first needs to infer the data structure and types. The more complex the data in the CSV, the greater the performance cost. Imagine a CSV with 100 million rows where the last row has a different data type for "date" than all the prior rows. Inferring a CSV schema has a cost.
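A minimal sketch of why that inference is costly: a CSV reader cannot trust a column's type until it has scanned every row, because any row may break the pattern. The helper below is hypothetical and stdlib-only, recognizing only a simple `YYYY-MM-DD` shape for illustration.

```python
import csv
import io

def infer_column_type(rows, column):
    """Scan every row; downgrade the inferred type on the first mismatch."""
    inferred = "date"
    for row in rows:
        value = row[column]
        # A value like "2019-05-29" keeps the "date" guess alive;
        # anything else forces a downgrade to plain "string".
        parts = value.split("-")
        if not (len(parts) == 3 and all(p.isdigit() for p in parts)):
            inferred = "string"  # one bad row is enough
    return inferred

# 28 clean rows followed by one malformed "date" value in the last row.
data = "date,amount\n" + "\n".join(
    f"2019-05-{d:02d},1.0" for d in range(1, 29)
) + "\nnot-a-date,1.0\n"

rows = list(csv.DictReader(io.StringIO(data)))
print(infer_column_type(rows, "date"))  # → string
```

The point is that the full scan happens on every read of the CSV, whereas Parquet pays for the schema once, at write time, and stores it in the file footer.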
Maybe you have done some pre-processing on the CSV to clean it up? Or maybe you store a schema somewhere else that needs to be read in as part of the query? Either way, that is extra work Parquet avoids by embedding the schema in the file itself.
Ultimately, performance will often depend on the tools used. However, as a general rule Parquet should exhibit better read performance in most situations.
Jun 08, 2018 · Thomas Spicer
Hard to answer without knowing what those files represent. I doubt leaving all the files as-is would be ideal, since a query over CSV might need to scan N files, row by row. Pretty intense. If the data shares a set of common schemas, you would be better off running a process to aggregate the files so you can optimize for your query patterns. For example, if you aggregate by date and convert to Parquet before querying, you will find your queries complete faster and cost less.
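The aggregation step could be sketched roughly as below: merge rows from many small CSV payloads into one bucket per date, so each bucket can then be written out as a single Parquet file (e.g. with pyarrow). The function name and the `date` field are illustrative assumptions, not anything from the original comment.

```python
import csv
import io
from collections import defaultdict

def aggregate_by_date(csv_texts, date_field="date"):
    """Merge rows from many CSV payloads into one bucket per date value."""
    buckets = defaultdict(list)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            buckets[row[date_field]].append(row)
    # Each bucket would then become one Parquet file, e.g. via
    # pyarrow.parquet.write_table (conversion step omitted here).
    return dict(buckets)

# Two small "files" covering an overlapping date.
files = [
    "date,clicks\n2018-06-01,3\n2018-06-02,5\n",
    "date,clicks\n2018-06-01,7\n",
]
buckets = aggregate_by_date(files)
print(sorted(buckets))             # → ['2018-06-01', '2018-06-02']
print(len(buckets["2018-06-01"]))  # → 2
```

Partitioning by date like this means a date-filtered query touches only the files for the dates it needs, instead of scanning all N inputs.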
Jan 23, 2018 · Thomas Spicer
You are most welcome!