Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Drilling Data With Apache Drill

DZone's Guide to

Drilling Data With Apache Drill

Learn about configuring and connecting different data sources and querying data from these data sources with Apache Drill.

Free Resource

Read why times series is the fastest growing database category.

Apache Drill is an open-source, low-latency SQL on Hadoop query engine for larger datasets. The latest version of Apache Drill is 1.10 with CTTAS, Web console, and JDBC connection. It can be integrated with several data sources such as CSV, JSON, TSV, PSV, Avro, and Parquet and can be operated using a single query. 

In this blog, let's discuss configuring and connecting different data sources and querying data from these data sources with Apache Drill. To know more, refer to our previous blog post on drilling data with Apache Drill.

Use Case

Persist data files in different data sources such as MySQL, HDFS, and Hive, query them on-the-fly, export query output in different file formats such as CSV, JSON, and TSV, and load the result sets into HDFS location.

Data Description

To demonstrate Apache Drill’s capability, financial stock data downloaded in parts in different formats such as JSON, CSV, and TSV is loaded into different data sources. The files used are as follows:

  • energy_overview.tsvMySQL
  • energy_technical.json: HDFS
  • stock_market_exchange.csv: Hive

Prerequisites

Install Apache Drill 1.10.0 in Distributed Mode on Linux.

wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz
tar xf apache-drill-1.10.0.tar.gz

Note: Drill requires JDK 1.7 and above.

Synopsis

  • Configure and connect different data sources such as MySQL, HDFS, and Hive.
  • Query the data sources on the fly.
  • Export query output in different file formats such as CSV, JSON, and TSV.
  • Store the output file in HDFS location.

Configuring Different Data Sources

Let's look at RDBMS storage plugins, Hive storage plugins, and file storage plugins.

RDBMS Storage Plugins

Apache Drill is tested with MySQL's mysql-connector-java-5.1.37-bin.jar driver.

To connect Apache Drill with MySQL, configure the RDBMS storage plugin in the Apache Drill console as shown in the below diagram:

select

Hive Storage Plugins

To connect Apache Drill with Hive, enable the existing Hive plugin and update the configuration in the Apache Drill console as shown in the below diagram:select

File Storage Plugin

To connect Apache Drill with HDFS, replace file:/// with hdfs://hostname:port/ in the DFS storage plugin as shown in the below diagram:select

Using Apache Drill in Distributed Mode

Start a Drillbit on each node in a cluster to use Apache Drill in the distributed mode.

Starting Drill

To start Apache Drill, use the below command:

drillbit.sh start

Stopping Drill

To stop Apache Drill, use the below command:

drillbit.sh stop

Verifying Drill Setup

To verify Apache Drill setup, use the below command:

cd apache-drill-1.10.0
bin/sqlline –u jdbc:drill:zk=local
!connect jdbc:drill:zk=local:2181
show databases;

select

To check Drillbits running on the cluster, use the below command:

0: jdbc:drill:zk=<zk1host>:<port> SELECT * FROM sys.drillbits;

select

Querying Data Sources With Apache Drill

Apache Drill’s self-describing data exploration behavior allows users to query different data files from diverse data stores.

Simple Join

It is used to join all three files and retrieve the name, ticker, MarketCap, sector, sub-sector, volume, and moving averages.

select b.Symbol,b.Status,b.Name,b.MarketCap,a.ATR,a.SMA20,c.sector,c.subsector 
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` c 
where a.Ticker = b.Symbol and a.Ticker = c.ticker limit 10;

select

Total Volume Based on Subsector

select sum(b.volume) as total_volume,a.subsector 
from hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol group by a.subsector limit 5;

select

Minimum Volume for Tickers

select a.ticker as ticker,min(b.volume) as min_volume 
from hive.stock_market.`stock_market_exchange` b, 
mysqldb.energy_data.`energy_overview` a 
where a.ticker = b.Symbol group by a.ticker 
order by a.ticker desc limit 10;

select

Subsector With ATR > 2

select b.subsector,a.ATR 
from dfs.`/user/tsldp/energy_technical/energy_technical.json`a,
mysqldb.energy_data.`energy_overview` b 
where a.ticker = b.ticker 
group by b.subsector,a.ATR having cast(a.ATR as float) > 2;

select

Tickers With w52high Between -20 and -100 With Their Last Price

select symbol,lastprice,change,changepercent 
from hive.stock_market.`stock_market_exchange` 
where symbol in 
(select ticker
 from dfs.`/user/tsldp/energy_technical/energy_technical.json`
 where cast(W52High as float) < -20 and cast(W52High as float) > -100 
order by ticker desc limit 10);

The query can also be executed in the Apache Drill web console as shown below:select

The query output is as follows:

select

Loading Output Files in HDFS

To load the result sets as files in HDFS, configure workspace in DFS storage plugin by changing the configuration as shown in the below diagram:

select

The configured workspaces are shown in a database list as shown in the below diagram:select

Exporting Output Files

The result sets of the join queries are exported into CSV file format by using csvOut workspace and JSON file format by using jsonOut workspace in the HDFS storage plugin.

Switch to csvOut Workspace

To switch to csvOut workspace, use the below command:

use dfs.csvOut';

Change File Storage Format to CSV

To change file storage format to CSV, use the below command:

alter session set `store.format`='csv';

Exporting Output File Into CSV Format

create table total_resultset as 
select a.ticker as ticker,sum(b.volume) as total_volume,a.subsector 
from hive.stock_market.`stock_market_exchange` b,
mysqldb.energy_data.`energy_overview` a 
where a.ticker = b.Symbol 
group by a.ticker,a.subsector;

select

The output files are exported and stored in the location: /user/tsldp/csv_output/total_resulset/total_volume.csv.

select

And that's it!

Learn how to get 20x more performance than Elastic by moving to a Time Series database.

Topics:
apache drill ,data queries ,mysql ,persistent data ,database ,tutorial

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}