
Using Hive to Perform Advanced Analytics in Hadoop

Storing your data in Hive is the first step in a potentially powerful analytics workflow. This sample setup, built with the Alpine Data tool Chorus, visualizes the flow and its operators.

Lawrence Spracklen · Dec. 13, 16 · Big Data Zone · Tutorial

Hadoop data warehouses have continued to gain popularity, with solutions such as Hive, Impala, and HAWQ now frequently deployed at customer sites. Access to these warehouses is typically tightly controlled using Ranger or Sentry, ensuring comprehensive data security. Because data is easy to govern in Hive, an increasing number of IT departments are consolidating all of their Hadoop data there and requiring data scientists to interact with it only via Hive.

Data residing in these warehouses is typically accessed via SQL. This works well for ETL and reporting, but it can be limiting when data scientists wish to apply advanced analytics, for example, to train a random forest. Fortunately, Chorus provides functionality that allows users to seamlessly run advanced analytics on Hive data sets, transparently using MapReduce and Spark to perform the heavy lifting.

The workflow proceeds as follows (a PySpark sketch of the equivalent operations appears after the list):

  1. The initial data resides in a Hive table.

  2. The data can be optionally manipulated using HQL queries.

  3. The data is split into test and train components using a basic MapReduce operation.

  4. A logistic regression model is trained on the training split, using Spark.

  5. The model accuracy is evaluated using the test split of the data.

  6. All interim results can be optionally retained as Hive tables.
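In Chorus, these steps are assembled visually and the corresponding MapReduce and Spark jobs are generated behind the scenes. As a rough illustration of what that generated work amounts to, below is a minimal PySpark sketch of the same flow. The table name (default.customers), column names, and 80/20 split ratio are hypothetical, and the train/test split is shown with Spark's randomSplit rather than the MapReduce job Chorus generates:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Steps 1-2: read the Hive table, optionally shaping it with an HQL query.
    # Assumes churned is already a numeric 0/1 column.
    df = spark.sql("SELECT age, income, churned AS label FROM default.customers")

    # Spark ML expects the features in a single vector column.
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
    data = assembler.transform(df)

    # Step 3: split the data into train and test sets.
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    # Step 4: train a logistic regression model on the training split.
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

    # Step 5: evaluate the model on the test split (area under the ROC curve).
    predictions = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print(f"Test AUC: {auc:.3f}")

    # Step 6: optionally persist interim results back to Hive.
    (predictions
        .select("label", "prediction")
        .write.mode("overwrite")
        .saveAsTable("default.customer_predictions"))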

Hive functionality in Chorus provides data scientists with an intuitive visual interface for performing advanced analytics without having to write complex SQL queries or even a single line of code. It is also fully compatible with secured Hadoop clusters, including those protected by Kerberos (with support for impersonation), Ranger, and Sentry, ensuring that IT policies and data governance are enforced.
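For context, running the PySpark sketch above against a Kerberized cluster typically requires authenticating with a principal and keytab. The snippet below is only a hedged illustration: the principal and keytab path are placeholders, the spark.kerberos.* keys are the Spark 3.x names (earlier releases used spark.yarn.principal and spark.yarn.keytab), and in practice these values are usually passed to spark-submit; Chorus users never configure this by hand:

    from pyspark.sql import SparkSession

    # Placeholders: substitute a principal and keytab issued by your KDC.
    # These must be set before the SparkContext starts; they are more
    # commonly supplied as spark-submit flags (--principal / --keytab).
    spark = (SparkSession.builder
             .enableHiveSupport()
             .config("spark.kerberos.principal", "alice@EXAMPLE.COM")
             .config("spark.kerberos.keytab", "/etc/security/keytabs/alice.keytab")
             .getOrCreate())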

Topics: Data Science, Hadoop, Analytics

Published at DZone with permission of Lawrence Spracklen, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
