This week, Apache Hadoop 2.3.0 was released. There are a lot of bug fixes and small changes in this one - you can read about them all in Apache's release notes - but the folks at the Cloudera blog highlight one big change: in-memory caching for HDFS. Cloudera describes the feature as follows:
HDFS caching lets users explicitly cache certain files or directories in HDFS. DataNodes will then cache the corresponding blocks in off-heap memory through the use of mlock. Once cached, Hadoop applications can query the locations of cached blocks and place their tasks for memory-locality. Finally, when memory-local, applications can use the new zero-copy read API to read cached data with no additional overhead. Preliminary benchmarks show that optimized applications can achieve read throughput on the order of gigabytes per second.
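In practice, the "explicitly cache certain files or directories" part is driven by the `hdfs cacheadmin` command-line tool that ships with 2.3.0. A minimal sketch of the workflow is below; the pool name and HDFS path are placeholders, and the commands assume you are running against a live cluster with caching enabled on the DataNodes:

```shell
# Create a cache pool to group related directives (pool name is hypothetical)
hdfs cacheadmin -addPool analytics-pool

# Ask DataNodes to pin this directory's blocks in off-heap memory (path is hypothetical)
hdfs cacheadmin -addDirective -path /data/hot-tables -pool analytics-pool

# Inspect which directives exist and how much data is actually cached
hdfs cacheadmin -listDirectives
```

Note that each DataNode will only lock memory up to its configured `dfs.datanode.max.locked.memory`, and the OS memlock limit (`ulimit -l`) for the DataNode user has to be at least that large for `mlock` to succeed.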
Another big feature, according to Arun Murthy at Hortonworks, is support for a heterogeneous storage hierarchy in HDFS. According to Murthy:
With support for heterogeneous storage classes in HDFS, we now can take advantage of different storage types on the same Hadoop clusters. Hence, we can now make better cost/benefit tradeoffs with different storage media such as commodity disks, enterprise-grade disks, SSDs, Memory etc.
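Concretely, this work (tracked as HDFS-2832) moves HDFS toward letting operators tag each DataNode storage directory with the type of media backing it, so the NameNode can eventually make placement decisions per storage type. A rough sketch of what that tagging looks like in the 2.x line's `hdfs-site.xml` is below; the local paths are hypothetical, 2.3.0 delivers only the first phase of this work, and the richer storage-policy features land in later 2.x releases:

```xml
<!-- hdfs-site.xml sketch: tag each DataNode directory with its media type.
     Paths are hypothetical; full storage-policy support arrives after 2.3.0. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/grid/0/dn,[SSD]/grid/ssd0/dn</value>
</property>
```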
So, be sure to take a look. Hortonworks' announcement post also includes a look ahead toward 2.4.0, in case 2.3.0 just isn't enough.