Where Should I Store Hadoop Data?


Zone Leader Tim Spann looks at several different ways to store Hadoop data, including HBase, Avro, Parquet, and Kudu.


Where do I store my Hadoop data? HBase? Avro? Parquet? Kudu? 


If you have the space, keeping a raw copy of the data in its original format is worthwhile. Beyond that, you probably want a file format that can be accessed from any framework or CLI tool. Avro and Parquet both work well for that; Parquet in particular has broad ecosystem support.


Kudu is newer, but it is getting a big push from Cloudera and, at this point, looks to be faster than Parquet.

Apache Spark supports Parquet very well and performs great with it. Parquet seems to be the best file format for general use: it is supported everywhere, it is fast, and it is a smart file format. Its columnar layout is a good match for most Spark workloads.

Parquet files written from Spark also support several compression codecs: uncompressed, Snappy, gzip, and LZO. GZIP and Snappy are my preferred choices.

For a quick guide to using Parquet files with Apache Spark, read this.

IBM has a great article with 5 reasons to use Parquet with Spark SQL.


