
Using Hive to Perform Advanced Analytics in Hadoop


Using Hive to store your data is the first part of a potentially powerful workflow. This sample setup, assisted by the Alpine Data tool, visualizes the flow and its operators.


Hadoop data warehouses have continued to gain popularity — with solutions such as Hive, Impala, and HAWQ now frequently deployed at customer sites. Access to these warehouses is typically tightly controlled using Ranger or Sentry — ensuring comprehensive data security. Due to the ease with which data can be governed in Hive, an increasing number of IT departments are locating all of their Hadoop data there and requiring data scientists to interact with their data only via Hive.

Data residing in these warehouses is typically accessed via SQL. This is great for ETL and reporting, but it can be limiting when data scientists wish to apply advanced analytics, such as training a random forest. Fortunately, Chorus provides functionality that allows users to seamlessly run advanced analytics on Hive data sets. To deliver this functionality, Chorus transparently leverages MapReduce and Spark, as shown in the flow below (a hand-written PySpark equivalent follows the list):

In this flow:

  1. The initial data resides in a Hive table.

  2. The data can be optionally manipulated using HQL queries.

  3. The data is split into test and train components using a basic MapReduce operation.

  4. A logistic regression model is trained on the training split, using Spark.

  5. The model's accuracy is evaluated on the test split of the data.

  6. All interim results can be optionally retained as Hive tables.
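
To make the six steps concrete, here is a minimal sketch of what the same flow looks like when written by hand in PySpark. Chorus generates equivalent MapReduce and Spark work transparently, so this is only the manual analogue; the table name, column names, and split ratio are hypothetical, chosen purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = (SparkSession.builder
         .appName("hive-advanced-analytics")
         .enableHiveSupport()   # read from and write to Hive tables directly
         .getOrCreate())

# 1-2. The initial data resides in a Hive table, optionally shaped with HQL.
#      Table and columns below are hypothetical.
df = spark.sql("""
    SELECT age, income, clicked      -- 'clicked' assumed to be a 0/1 label
    FROM analytics.customer_events
    WHERE clicked IS NOT NULL
""")

# 3. Split into train and test sets. Chorus uses a basic MapReduce job for
#    this; randomSplit is the idiomatic Spark equivalent.
train, test = df.randomSplit([0.8, 0.2], seed=42)

# 4. Train a logistic regression model on the training split using Spark.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="clicked")
model = lr.fit(assembler.transform(train))

# 5. Evaluate the model on the test split (area under ROC here).
predictions = model.transform(assembler.transform(test))
auc = BinaryClassificationEvaluator(labelCol="clicked").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")

# 6. Optionally retain interim results as Hive tables.
train.write.mode("overwrite").saveAsTable("analytics.events_train")
test.write.mode("overwrite").saveAsTable("analytics.events_test")
```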

Hive functionality in Chorus gives data scientists an intuitive visual interface for performing advanced analytics without writing complex SQL queries or a single line of code. It is also fully compatible with secured Hadoop clusters, including those protected by Kerberos (with support for impersonation), Ranger, and Sentry, ensuring that IT policies and data governance are enforced.
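
For a sense of what such a secured connection involves when made by hand, the sketch below opens a session against a Kerberos-protected HiveServer2 using PyHive. The host and service name are assumptions; a valid Kerberos ticket (obtained via kinit) and the sasl/thrift-sasl packages are prerequisites. Chorus handles this handshake for you.

```python
from pyhive import hive

# Hypothetical host; port 10000 is the HiveServer2 default.
conn = hive.Connection(
    host="hiveserver2.example.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",  # service component of the HS2 principal
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
```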

