Where Should I Store Hadoop Data?

Zone Leader Tim Spann looks at several different ways to store Hadoop data, including HBase, Avro, Parquet, and Kudu.


Where do I store my Hadoop data? HBase? Avro? Parquet? Kudu? 

If you have the space, keeping a raw copy of the data in its original format is great. Beyond that, you will want a file format that can be accessed from any framework or command-line tool; Avro and Parquet both work well here. Parquet has broad ecosystem support and handles most workloads well.

Kudu is new, but it is getting a big push from Cloudera and looks to be faster than Parquet at this point.

Apache Spark supports Parquet very well and performs great with it. Parquet seems to be the best file format for general usage: it is supported everywhere, it is fast, and its columnar layout is a good match for most Spark use cases.

Parquet files written from Spark also support several compression codecs: uncompressed, Snappy, gzip, and LZO. Gzip and Snappy are my preferred codecs.
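As a sketch, the default codec can be set cluster-wide via Spark's standard `spark.sql.parquet.compression.codec` property, for example in `spark-defaults.conf`:

```
# spark-defaults.conf: make Snappy the default codec for Parquet output
spark.sql.parquet.compression.codec  snappy
```

An individual write can override this with `df.write.option("compression", "gzip").parquet(path)`.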

For a quick guide to using Parquet files with Apache Spark, read this.

IBM has a great article with 5 reasons to use Parquet with Spark SQL.

parquet, impala, avro, hbase, hadoop, hdfs

Opinions expressed by DZone contributors are their own.

