Where Should I Store Hadoop Data?
Zone Leader Tim Spann looks at several different ways to store Hadoop data, including HBase, Avro, Parquet, and Kudu.
Where do I store my Hadoop data? HBase? Avro? Parquet? Kudu?
If you have the space, a raw copy of the data in its original format is great to have. Next, you probably want a file format that can be accessed from any framework or CLI. Avro and Parquet both work really well for that, and Parquet in particular has broad tooling support and performs well for most workloads.
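As a rough sketch of what that looks like from Spark, the PySpark snippet below writes the same small DataFrame to both Avro and Parquet. The session name, output paths, and the spark-avro package coordinates are illustrative assumptions, not something from the original article.

```python
from pyspark.sql import SparkSession

# Hypothetical local session; writing Avro requires the spark-avro
# package on the classpath, e.g.
#   --packages org.apache.spark:spark-avro_2.12:3.5.0
# (version is illustrative).
spark = SparkSession.builder.appName("hadoop-formats-demo").getOrCreate()

# A tiny stand-in DataFrame for data you would normally load from HDFS.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# Write query-friendly copies alongside your raw original.
df.write.mode("overwrite").format("avro").save("/tmp/demo_avro")
df.write.mode("overwrite").parquet("/tmp/demo_parquet")
```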
Kudu (a storage engine rather than a file format) is new, but it is getting a big push from Cloudera and looks to be faster than Parquet at this point.
Apache Spark supports Parquet very well and performs great with it. Parquet seems to be the best file format for general use: it is supported nearly everywhere, it is fast, and it is a smart format. Its columnar layout is a great match for most Spark use cases.
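To make the columnar point concrete, here is a minimal sketch that reads the Parquet copy from the snippet above (the path and column names are the same illustrative ones) and touches only one column, which is exactly the access pattern Parquet is built for:

```python
# Read the Parquet copy back. Selecting one column lets Spark's
# columnar reader skip the other columns entirely (column pruning),
# and the filter can be pushed down to row-group min/max statistics.
people = spark.read.parquet("/tmp/demo_parquet")
people.select("label").where(people["id"] > 1).show()
```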
Parquet files written from Spark also support a few different compression codecs: uncompressed, Snappy, gzip, and LZO. Gzip and Snappy are my preferred choices.
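Assuming the same illustrative DataFrame as above, here is a quick sketch of both ways to pick a codec: per write via an option, or session-wide via a Spark SQL config.

```python
# Per-write codec choice...
df.write.mode("overwrite") \
    .option("compression", "gzip") \
    .parquet("/tmp/demo_gzip")

# ...or a session-wide default applied to all Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df.write.mode("overwrite").parquet("/tmp/demo_snappy")
```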
For a quick guide to using Parquet files with Apache Spark, read this.
IBM has a great article with 5 reasons to use Parquet with Spark SQL.