The world’s top authorities on Apache Hadoop convene at Hadoop Summit San Jose, and one of the top questions they will address is the future and direction of Hadoop. Sanjay Radia, Founder and Architect at Hortonworks, led the track, which selected 13 sessions on this topic. I asked Sanjay what he hoped these sessions would cover:
“Hadoop continues to drive innovation at a rapid pace and the next generation of Hadoop is being built today. This track showcases new developments in core Hadoop and closely related technologies. Attendees will hear about key projects, such as HDFS and YARN, projects in incubation and the industry initiatives driving innovation in and around the Hadoop platform. Attendees will interact with technical leads, committers, and expert users who are actively driving the roadmaps, key features, and advanced technology research around what is coming next for the Hadoop ecosystem.”
I asked Sanjay which three sessions he would put in the can’t-miss category for someone pressed for time. It took some arm twisting, but these are the top three sessions he would recommend:
Apache Hive 2.0: SQL, Speed, Scale
Speaker: Alan Gates from Hortonworks
Apache Hive is the most commonly used SQL interface for Hadoop. One of its most frequent uses is in data warehousing applications. To meet customer warehousing requirements, it is important that it scale to petabytes of data, provide the SQL that users need, and perform in interactive time. The Hive community is working towards a 2.0 release of Hive that includes significant new features and performance improvements. These include:
* Adding LLAP, a daemon layer that enables sub-second response time.
* Adding HBase as an option to store Hive’s metadata, resulting in faster metadata access and reduced query planning time.
* Improving Hive’s support for ingesting data at high speed from streaming inputs such as Apache Flume and Apache Storm.
* Improving and expanding Hive’s support for managing changing data in a transactionally consistent way by adding the SQL MERGE command.
* Laying the groundwork through Apache Calcite to enable Hive to use multiple storage engines (e.g. HBase).
This talk will cover the use cases these changes enable and the architectural changes being made in Hive as part of building these features, and will share performance test results showing how these improvements are speeding up Hive.
A Multi-Colored YARN: Apps and First-Class Support for Services
Speaker: Vinod Kumar Vavilapalli from Hortonworks
Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive (Hive, Tez, Spark) and real-time processing (Storm). These applications can all co-exist on YARN and share a single data center in a cost-effective manner, with the platform worrying about resource management, isolation, and multi-tenancy. There are more use cases that deserve the same set of powerful platform features. In this talk, we’ll discuss a new suite of use cases that the YARN community is working towards: services. YARN as a technology has always had the right foundations to support a wide variety of applications and services. Support for bringing existing and new services to YARN deserves a fresh look, though. With this attention on making services simple and first-class, we will walk through how Apache Hadoop YARN is morphing to support services well out of the box through various platform-level efforts. Businesses also increasingly care less about the infrastructure and more about how to drive end-to-end use cases. In this context, we will also discuss APIs, tooling, and how the new multi-colored YARN story empowers the developer community.
Evolving HDFS to a Generalized Distributed Storage Subsystem
Speakers: Sanjay Radia and Jitendra Pandey from Hortonworks
We are evolving HDFS into a distributed storage system that will support not just a distributed file system, but other storage services as well. We plan to evolve the Datanodes’ fault-tolerant block storage layer into a generalized subsystem over which to build storage services such as HDFS and an object store. We introduce the abstraction of a storage container that is replicated for reliability. The first two container types are Block-Container and Object-Container. A Block-Container is a collection of HDFS blocks replicated as a unit. It will allow block scalability with low block-report overhead while allowing co-location of related files. An Object-Container holds a very large number of typically much smaller objects and is targeted towards an object-store service (like S3). We also plan on more structured storage containers, such as LSM-trees, to support HBase in a more first-class way. Our approach has several benefits. It allows the Datanode’s physical storage to be shared across different storage services without fragmentation. A storage container also isolates implementation and client protocols, allowing each container type to evolve independently. Further, container implementations can share common features such as replication, location service, and overall management of containers and their storage, including functions like decommissioning.
Hope to see you at the sessions, but you need to register to attend Hadoop Summit San Jose.