
Using Hive to Perform Advanced Analytics in Hadoop


Using Hive to store your data is the first part of a potentially powerful workflow. This sample setup, built with the Alpine Data tool Chorus, visualizes the flow and its operators.


Hadoop data warehouses have continued to gain popularity — with solutions such as Hive, Impala, and HAWQ now frequently deployed at customer sites. Access to these warehouses is typically tightly controlled using Ranger or Sentry — ensuring comprehensive data security. Due to the ease with which data can be governed in Hive, an increasing number of IT departments are locating all of their Hadoop data there and requiring data scientists to interact with their data only via Hive.

Data residing in these warehouses is typically accessed via SQL. This is great for ETL and reporting, but it can be limiting when data scientists wish to apply advanced analytics, to train a random forest, for example. Fortunately, Chorus provides functionality that allows users to seamlessly run advanced analytics on Hive data sets. To deliver this functionality, Chorus transparently leverages MapReduce and Spark, as shown below:

In this flow:

  1. The initial data resides in a Hive table.

  2. The data can be optionally manipulated using HQL queries.

  3. The data is split into test and train components using a basic MapReduce operation.

  4. A logistic regression model is trained on the training split, using Spark.

  5. The model's accuracy is evaluated on the test split of the data.

  6. All interim results can be optionally retained as Hive tables.
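
Chorus generates and runs these operations for you, but it can help to see roughly what the equivalent looks like by hand. The sketch below uses PySpark against the Hive metastore; the table `default.customers`, its columns, and the `churned` label are hypothetical, and the train/test split uses Spark's `randomSplit` rather than the MapReduce job Chorus employs:

```python
# Minimal PySpark sketch of the six-step flow above (names are placeholders).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hive support lets Spark read and write tables in the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-advanced-analytics")
         .enableHiveSupport()
         .getOrCreate())

# Steps 1-2: read the initial Hive table, optionally reshaping it with HQL.
df = spark.sql("""
    SELECT age, income, num_orders, churned
    FROM default.customers
    WHERE income IS NOT NULL
""")

# Step 3: split the data into train and test components.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Step 4: assemble a feature vector and train a logistic regression model.
assembler = VectorAssembler(
    inputCols=["age", "income", "num_orders"], outputCol="features")
model = LogisticRegression(
    featuresCol="features", labelCol="churned").fit(assembler.transform(train))

# Step 5: evaluate the model on the held-out test split.
predictions = model.transform(assembler.transform(test))
auc = BinaryClassificationEvaluator(
    labelCol="churned", metricName="areaUnderROC").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")

# Step 6: optionally persist interim results back as Hive tables.
train.write.mode("overwrite").saveAsTable("default.customers_train")
test.write.mode("overwrite").saveAsTable("default.customers_test")
```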

Hive functionality in Chorus provides data scientists with an intuitive visual interface for performing advanced analytics without having to write complex SQL queries or even a single line of code. It is also fully compatible with secured Hadoop clusters, whether protected by Kerberos (with support for impersonation), Ranger, or Sentry, ensuring that IT policies and data governance are enforced.
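
Chorus handles this security integration transparently. For a rough sense of what connecting to a Kerberos-secured HiveServer2 involves outside of Chorus, here is a sketch using the PyHive library; the hostname and query are placeholders, and a valid Kerberos ticket (for example, obtained via `kinit`) is assumed to already exist:

```python
# Hypothetical PyHive connection to a Kerberos-secured HiveServer2.
from pyhive import hive

conn = hive.connect(
    host="hiveserver2.example.com",  # placeholder host
    port=10000,
    database="default",
    auth="KERBEROS",
    kerberos_service_name="hive",    # matches the hive/_HOST@REALM principal
)

cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM customers")  # placeholder query
print(cursor.fetchall())
```

With HiveServer2 impersonation enabled (`hive.server2.enable.doAs=true`), queries run as the authenticated end user, so Ranger or Sentry policies are evaluated against that user rather than against a shared service account.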


