Where Should I Store Hadoop Data?

Zone Leader Tim Spann looks at several different ways to store Hadoop data, including HBase, Avro, Parquet, and Kudu.

Where do I store my Hadoop data? HBase? Avro? Parquet? Kudu? 

If you have the space, a raw copy of the data in its original format is great to have. Next, you probably want a file format that can be accessed from any framework or CLI. Avro and Parquet both work really well for that. Parquet has a ton of ecosystem support and handles most use cases well.
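
To make the format choice concrete, here is a minimal PySpark sketch that writes the same data out as both Avro and Parquet. The paths, the sample input, and the spark-avro package coordinates are illustrative assumptions, not details from this post:

```python
from pyspark.sql import SparkSession

# Assumes the external spark-avro module is available, e.g.:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
spark = SparkSession.builder.appName("format-demo").getOrCreate()

# Hypothetical raw copy kept in its original format (CSV here)
raw = spark.read.option("header", "true").csv("/data/raw/events.csv")

# Row-oriented Avro: good for write-heavy, whole-record access
raw.write.format("avro").save("/data/avro/events")

# Columnar Parquet: good for analytic scans over a few columns
raw.write.parquet("/data/parquet/events")
```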

Kudu is new, but it is getting a big push from Cloudera, and it looks to be faster than Parquet at this point.
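
As a rough sketch of what accessing Kudu from Spark looks like, assuming the kudu-spark connector is on the classpath; the master address and table name below are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes kudu-spark is available, e.g.:
#   spark-submit --packages org.apache.kudu:kudu-spark2_2.11:1.10.0 ...
spark = SparkSession.builder.appName("kudu-demo").getOrCreate()

events = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master.example.com:7051")  # placeholder
          .option("kudu.table", "impala::default.events")         # placeholder
          .load())
events.show()
```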

Apache Spark supports Parquet very well and performs great with it. Parquet seems to be the best file format for general use: it is supported everywhere, it is fast, and it is a smart file format. Its columnar layout is a great match for most Spark use cases.
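
A minimal PySpark sketch of that usage; the path and column names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

# Placeholder path; point this at your own Parquet data
df = spark.read.parquet("/data/parquet/events")

# Because Parquet is columnar, this scan reads only the two
# referenced columns and can skip row groups via their statistics.
df.select("user_id", "event_type") \
  .where(df["event_type"] == "click") \
  .show()
```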

Parquet files written from Spark also support a few different compression codecs: uncompressed, Snappy, gzip, and LZO. Gzip and Snappy are my preferred codecs.
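
The codec can be set for the whole session or per write; a sketch with placeholder paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-demo").getOrCreate()
df = spark.read.parquet("/data/parquet/events")  # placeholder path

# Session-wide default codec for Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Or pick a codec for a single write
df.write.option("compression", "gzip").parquet("/data/parquet/events_gz")
```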

For a quick guide to using Parquet files with Apache Spark, read this.

IBM has a great article with 5 reasons to use Parquet with Spark SQL.
