Hadoop Summit San Jose is here once again, and with it comes a reminder of the power of the open source community and the tremendous innovation that continues to occur within the Apache Hadoop ecosystem. At Hortonworks, we get to engage with this vibrant, creative, and talented group of engineers all year round, but events like Hadoop Summit give us the opportunity to showcase for everyone some of the compelling technical innovations developed over the past year.
As Hortonworks prepares to deliver HDP 2.5, we have assembled a subset of the technical innovations developed thus far and produced tutorials to allow you to explore these capabilities today. You can find a link to the HDP 2.5 Technical Preview here.
Among the highlights is the progress that’s been made in the areas of:
- Integration of Governance and Security
- Ease of Use for Apache Spark and Agile Analytic Development
- Streamlined Operations for Apache HBase
Integration of Governance and Security
Hortonworks created the Data Governance Initiative in 2015 to address the need for an open source governance solution that manages organizational requirements for data classification, a centralized policy engine, data lineage, security, and data lifecycle management.
Apache Atlas was launched as a result of this initiative, and its first set of capabilities, addressing data lineage for Hive along with a foundation for metadata tagging, arrived with HDP 2.3 in July 2015. Hortonworks and its community partners have continued to deliver on the original vision of the Data Governance Initiative, and you can explore the integration of Apache Atlas with Apache Ranger to see how governance and security are related. The tutorial provided walks through an example of tagging data in Atlas and building a security policy in Ranger.
By integrating Atlas with Ranger, enterprises can now implement dynamic, classification-based security policies in addition to role-based security. Ranger’s centralized platform empowers data administrators to define security policies based on Atlas metadata tags or attributes and apply them in real time to an entire hierarchy of assets, including databases, tables, and columns, thereby preventing violations before they occur. Ranger also allows location-based, time-based, and other dynamic policies to be defined.
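The idea behind classification-based policy can be sketched in a few lines. The toy model below is illustrative only, not Ranger’s actual policy engine or API: the tag name `PII`, the roles, and the `check_access` helper are all invented for this example. It shows how a decision can hinge on an asset’s metadata tags plus a dynamic condition such as time of day, rather than on role alone.

```python
from datetime import time

# Illustrative toy model of tag-based (attribute-based) access control.
# In a real deployment, tags live in Atlas and policies are defined in Ranger.

# Tags attached to assets in the hierarchy: database -> table -> column.
asset_tags = {
    "sales_db.customers.ssn": {"PII"},
    "sales_db.customers.name": set(),
}

def check_access(role, asset, now):
    """Decide access from classification tags plus a dynamic time condition."""
    tags = asset_tags.get(asset, set())
    if "PII" in tags and role != "compliance":
        return False                      # classification-based denial
    if not (time(9) <= now <= time(17)):
        return False                      # time-based (dynamic) condition
    return role in {"analyst", "compliance"}

print(check_access("analyst", "sales_db.customers.name", time(10)))  # True
print(check_access("analyst", "sales_db.customers.ssn", time(10)))   # False
```

The point of the tag-based approach is that retagging a column in one place changes the effective policy everywhere, without editing per-asset rules.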
Ease of Use for Apache Spark and Agile Analytic Development
The Apache Spark community continues an extremely rapid pace of innovation, and adoption of Spark by Hortonworks customers has been aggressive. With each new release of Spark, our customers are leveraging the latest capabilities for data science and more. But one thing we have consistently heard is that while Spark is a wonderful engine, working with it could benefit from additional tooling for visualizing and exploring insights, along with better integration with all the places data is stored within the Hadoop ecosystem.
To address the ease of use of Spark through visual tools, Hortonworks began working with the team from NFLabs within the Apache Zeppelin community in 2015. Zeppelin addresses use cases like data exploration, data discovery, and interactive code snippets while providing built-in visualization. We believe that Zeppelin has the potential to become a modern data science studio, and it is particularly powerful in the context of Spark. Our work within the community has centered on enterprise readiness, with a particular focus on security. Zeppelin now runs on a secure cluster and has basic authentication and authorization capabilities that allow it to be used within the enterprise.
To address making Spark fit better within the Hadoop ecosystem, we looked at customers’ need to leverage ORC files with Spark. The ORC file format continues to grow in popularity, and Apache ORC became a top-level project just over a year ago. Customers using Hive in conjunction with ORC wanted easy access via Spark to the data stored in those files. Similarly, customers using HBase wished to access their HBase data from Spark more easily as well.
The tutorials provided take you on a guided tour of Spark itself before exposing you to the powerful, visually rich experience delivered through Zeppelin. Additional tutorials let you explore using Hive with ORC from Spark, as well as HBase from Spark.
Streamlined Operations for Apache HBase
Speaking of Apache HBase, that community continues to push forward on feature innovation while also looking at ways to streamline operational support for HBase.
Two key areas to highlight are the improved dashboarding and visualization of metrics for HBase within Ambari. Ambari 2.2.2 delivered a set of pre-built dashboards, including one for HBase, that can display metrics filtered by time, component, and context, offering greater flexibility and granularity. HBase itself has always been heavily instrumented, but visualizing those metrics and putting them into context has been a challenge. Recent work has further enhanced the metrics by adding table- and user-level statistics as well, and operators can leverage this additional depth with the latest releases of Ambari.
Another area that has received additional focus for HBase is backup and restore. While this may not seem like the sexiest capability, it is absolutely important for customers who rely on HBase for real-time applications to be able to back up and restore their mission-critical data. While a full backup capability has existed previously, the current work within the community has focused on an incremental backup and restore capability that we think HBase operators will love.
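To see why incremental backup matters, consider this minimal sketch. It is a toy model, not HBase’s actual mechanism (which works against WALs and HFiles); the function names and the dictionary-based "table" are invented for illustration. A full backup copies everything; each incremental copies only what changed since the previous image, and a restore replays the baseline plus the deltas in order.

```python
# Illustrative toy model of full vs. incremental backup and restore.
# All names are invented; HBase's real implementation operates on WALs/HFiles.

def full_backup(store):
    # Copy every cell; this is the baseline image.
    return {"type": "full", "image": dict(store)}

def incremental_backup(store, since_image):
    # Copy only cells that differ from the previous backup image.
    delta = {k: v for k, v in store.items() if since_image.get(k) != v}
    return {"type": "incremental", "image": delta}

def restore(full, incrementals):
    # Replay the baseline, then apply each incremental delta in order.
    state = dict(full["image"])
    for inc in incrementals:
        state.update(inc["image"])
    return state

table = {"row1": "a", "row2": "b"}
b0 = full_backup(table)
table["row2"] = "b2"            # mutations after the full backup...
table["row3"] = "c"
b1 = incremental_backup(table, b0["image"])
print(restore(b0, [b1]) == table)  # True, having moved only the two changed rows
```

A real implementation must also handle deletes, compactions, and point-in-time restore; the sketch only captures why incrementals move far less data than repeated full backups.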
The tutorial provided allows you to explore the basics of Apache HBase and Apache Phoenix along with the new incremental backup and restore capability for HBase.
We are excited to share these latest innovations with you as we continue to test and complete our work on HDP 2.5. Hortonworks would like to thank everyone within the Apache community for all of their efforts and contributions. The pace of innovation continues to be truly astounding and we are grateful for the opportunity to connect, collaborate, and engage with such a unique group of dedicated professionals.