Where Should I Store Hadoop Data?

Zone Leader Tim Spann looks at several different ways to store Hadoop data, including HBase, Avro, Parquet, and Kudu.

Where do I store my Hadoop data? HBase? Avro? Parquet? Kudu? 

If you have the space, keep a raw copy of the data in its original format. Next, you probably want a file format that any framework or CLI can read. Avro and Parquet both work really well for that. Parquet in particular has a ton of support and works really well for most use cases.

Kudu is new, but it is getting a big push from Cloudera and, at this point, looks to be faster than Parquet.

Apache Spark supports Parquet very well, and it performs great. Parquet seems to be the best file format for general use: it is supported almost everywhere, it is very fast, and it is a smart file format. Its columnar layout is a great match for most Spark use cases.
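To see why a columnar layout suits analytic queries, here is a minimal sketch in plain Python. It is not how Parquet is implemented; it only contrasts how many bytes a query has to scan when it needs a single field. The sample records, field names, and helper functions are all made up for illustration.

```python
import json

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user_id": i, "country": "US" if i % 2 else "DE", "amount": i * 1.5}
    for i in range(1000)
]


def to_row_layout(records):
    """Serialize record by record, roughly as a row-oriented file would."""
    return [json.dumps(record).encode("utf-8") for record in records]


def to_column_layout(records):
    """Group values by field, roughly as a columnar file would."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {
        name: json.dumps(values).encode("utf-8")
        for name, values in columns.items()
    }


def bytes_scanned_row(row_blobs):
    """A row layout has to read every full record, even for one field."""
    return sum(len(blob) for blob in row_blobs)


def bytes_scanned_column(column_blobs, wanted):
    """A column layout only reads the requested columns."""
    return sum(len(column_blobs[name]) for name in wanted)


row_blobs = to_row_layout(rows)
column_blobs = to_column_layout(rows)

# Cost of answering "what is the total amount?" under each layout.
row_cost = bytes_scanned_row(row_blobs)
col_cost = bytes_scanned_column(column_blobs, ["amount"])
print(f"row layout scans {row_cost} bytes, column layout scans {col_cost} bytes")
```

A typical Spark job selects and aggregates a handful of columns out of a wide table, which is the case where reading only those columns pays off.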

Spark can also write Parquet files with a few different compression codecs: uncompressed, snappy, gzip, and lzo. Gzip and Snappy are my preferred codecs.
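As a rough sketch of how a codec can be chosen per write from PySpark, the helper below sets the `compression` option on a DataFrame's writer. The function name `write_parquet`, the codec tuple, and the paths in the usage comment are hypothetical. The tuple only mirrors the codecs listed above, and whether a given codec (lzo in particular) actually works depends on your Spark version and installed libraries.

```python
# Codecs named in this article; your Spark build may accept more or fewer.
SUPPORTED_CODECS = ("uncompressed", "snappy", "gzip", "lzo")


def write_parquet(df, path, codec="snappy"):
    """Write a Spark DataFrame to Parquet with an explicit compression codec.

    `df` is expected to be a pyspark.sql.DataFrame; `path` is the output
    directory. Returns the normalized codec name that was used.
    """
    codec = codec.lower()
    if codec not in SUPPORTED_CODECS:
        raise ValueError(
            f"unsupported codec {codec!r}; expected one of {SUPPORTED_CODECS}"
        )
    # The "compression" writer option selects the Parquet codec for this write.
    # mode("overwrite") replaces any existing output at `path`.
    df.write.mode("overwrite").option("compression", codec).parquet(path)
    return codec


# Usage, assuming an existing SparkSession named `spark` (paths are made up):
#   df = spark.read.json("/data/raw/events.json")
#   write_parquet(df, "/data/parquet/events", codec="gzip")
```

Setting the codec per write keeps the choice visible next to the data it applies to. In general, Snappy trades a larger file for faster reads and writes, while gzip compresses smaller at a higher CPU cost.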

For a quick guide to using Parquet files with Apache Spark, read this.

IBM has a great article with 5 reasons to use Parquet with Spark SQL.

Topics: parquet, impala, avro, hbase, hadoop, hdfs
