AIR: Data Intelligence in Qubole
AIR: Data Intelligence in Qubole
Learn about products built on an infrastructure that collects data from Hive, Spark, Presto, and Hadoop clusters to provide Alerts, Recommendations, and Insights (AIR).
Join the DZone community and get the full member experience.Join For Free
The proliferation of the cloud for Big Data workloads has provided many benefits and has also opened up new areas of optimization. But data teams often do not have the right information and as a result, there are several missed opportunities to:
Reduce resources consumed
Reduce resource cost
To solve this problem, we built an intelligence service that collects data across the stack, analyzes and provides insights and recommendations to all types of personas: data consumers (analysts, data scientists, data engineers), and data service providers (admins) in an organization.
At the Data Platforms 2017 Conference, we announced the early access of data discovery and data model intelligence. The same service — AIR (Alerts, Insights, Recommendations) — is now generally available.
Data Discovery for Data Consumers
Analysts spend most of their time finding the right data for the business problem they are trying to solve. The key to increasing their productivity lies in increasing the agility and democratizing analytics by making it truly self-service. While there are many ways to provide this, data discovery helps analysts get the information about their data model intelligently and intuitively through:
- Autosuggestion and completion:
- Usage-based ranking: Collects usage metrics from all users and sorts suggestions intelligently based on top usage.
- Context-aware suggestions: Knows when to suggest which object (tables, columns, joins, and filters).
- Search and discovery:
- Search for all available column names (filter all tables which contain a column) or table names (filter all tables which has the search word and browse through the available columns).
- Top-down analysis provides insights into how other users query a particular table making it easy to use tables that are unfamiliar.
Usage insights provide insights into the most used columns, tables, joins, filters, and the most frequent users enabling analysts to be more context-aware to learn from other expert users and to ramp up on new data sets more efficiently.
Statistics provide a high-level summary of a particular table or column.
Data preview provides sample data for a particular table or column giving the analyst an overview of the data that they are going to query.
Data profile provides a summary of the data itself, such as for column cardinality, number of rows with unique values for a particular column, and number of nulls or zeros for a particular column.
Data Model Intelligence for Data Service Providers
Data admins, on the other hand, face a different set of problems. In addition to productivity, cost and performance of the infrastructure play a big role. One important factor is data model design. Relevant questions on data model design are:
Do I have the right data to make intelligent decisions?
Am I optimizing the data model efficiently? What is the right strategy given my specific setup?
In order to solve this problem, we built data model usage intelligence, which provides insights (for "All Tables" as well as "Hot Tables") and recommendations to help the data services team identify opportunities for optimizing the data model (partition, sorting, changing the data format, etc.).
AIR infrastructure extracts insights and recommendations based on usage metrics such as the following.
Data Model Insights
Most frequently used tables, columns, and partitions.
Top users of specific tables, columns, and partitions. These users may be the subject matter experts on the data in these tables.
Distribution of I/O or compute resource consumed across users, tables, and commands.
Qubole Data Service (QDS) can report on all tables, or hot tables specifically.
Summarizes total commands and average execution time.
Shows a graph of the volume of commands per day.
Shows a graph of total execution time in seconds per day.
Shows a histogram of the number of tables that were joined together.
Shows a leaderboard of the top users in terms of query count, with other data such as total commands submitted by each user, total errors, and total execution time (for all the queries submitted by this user).
This view provides:
- A pie chart showing the percentage of commands using hot tables that:
- Use as a filter the column on which the table is partitioned
- Use as a filter a column on which a table is not partitioned
- Do not use any filter
- A high percentage in the first category indicates that the table is correctly partitioned; a high percentage in the second category may mean that the table needs to be re-partitioned.
- A pie chart showing the percentage of commands using hot tables that are in:
- Columnar format (ORC, Avro, Parquet)
- Non-columnar format
- This highlights the overall efficiency of the data models and helps to determine if the data models have to be converted into a columnar format to improve query performance and reduce the overall data scans.
A graph showing the above analysis broken out by individual hot tables so that admins can decide which table to prioritize for repartitioning.
A table showing the most used hot tables and their relevant data formats.
A table showing the most-used columns for each hot table in ascending order.
A table showing the top used Join groups and the count of how many times have they been used in queries.
These insights drive recommendations to improve usability, performance, and cost, such as the following.
Data Model Recommendations
Identify the right partition column (based on top used predicate) for a specific table.
Identify tables that are not in columnar format and recommend the right data format (ORC, Parquet).
Recommend the right list of column(based on top used predicate) based on which a particular table can be sorted.
But this is barely the tip of the iceberg. What else is possible with this intelligence infrastructure? Can we expand this other to other areas of the QDS technology stack? Can this help analyze the workloads and answer the below questions?
Can the user reduce the cost for the same performance? Cost can be saved by using smaller machines or smaller clusters.
At what cost can a workload run faster?
Is my AWS Spot Node policy effective?
Will an heterogenous cluster help me?
We’re committed to building data-driven products that help organizations increase the productivity of data team members and performance while reducing TCO of big data initiatives. In this blog post, we’ve introduced two such new products that are built on an infrastructure that collects data from Hive, Spark, Presto, and Hadoop clusters to provide Alerts, Recommendations, and Insights (AIR).
Published at DZone with permission of Balaji Mohanam , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.