Drilling Data With Apache Drill
Learn about configuring and connecting different data sources and querying data from these data sources with Apache Drill.
Apache Drill is an open-source, low-latency SQL-on-Hadoop query engine for large datasets. The latest version of Apache Drill is 1.10, which adds features such as CTTAS (CREATE TEMPORARY TABLE AS), an improved Web console, and JDBC connectivity. It can be integrated with several data sources and file formats such as CSV, JSON, TSV, PSV, Avro, and Parquet, all of which can be combined in a single query.
In this blog, let's discuss configuring and connecting different data sources and querying data from these data sources with Apache Drill. To know more, refer to our previous blog post on drilling data with Apache Drill.
In this use case, we persist data files in different data sources such as MySQL, HDFS, and Hive, query them on the fly, export the query output in different file formats such as CSV, JSON, and TSV, and load the result sets into an HDFS location.
To demonstrate Apache Drill's capability, financial stock data was downloaded in parts in different formats (JSON, CSV, and TSV) and loaded into the different data sources.
Install Apache Drill 1.10.0 in Distributed Mode on Linux
wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz
tar xf apache-drill-1.10.0.tar.gz
Note: Drill requires JDK 1.7 and above.
- Configure and connect different data sources such as MySQL, HDFS, and Hive.
- Query the data sources on the fly.
- Export query output in different file formats such as CSV, JSON, and TSV.
- Store the output file in HDFS location.
Configuring Different Data Sources
Let's look at RDBMS storage plugins, Hive storage plugins, and file storage plugins.
RDBMS Storage Plugins
Apache Drill connects to MySQL through MySQL's JDBC driver (mysql-connector-java); copy the driver JAR into Drill's jars/3rdparty directory on each node before enabling the plugin.
To connect Apache Drill with MySQL, configure the RDBMS storage plugin in the Apache Drill web console.
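A minimal JDBC storage plugin configuration for MySQL looks like the following sketch. The hostname, port, and credentials are placeholders for your environment; saving the plugin under the name mysqldb lets tables be addressed as mysqldb.energy_data.`energy_overview`, as in the queries below.

```json
{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306",
  "username": "root",
  "password": "secret",
  "enabled": true
}
```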
Hive Storage Plugins
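To query Hive tables such as stock_market.`stock_market_exchange`, enable the Hive storage plugin and point it at the Hive metastore. A minimal sketch, assuming a local metastore and NameNode (both URIs are placeholders for your cluster):

```json
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:9000/",
    "hive.metastore.sasl.enabled": "false"
  }
}
```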
File Storage Plugin
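The dfs file storage plugin maps file system paths and formats so that files can be queried directly, as in dfs.`/user/tsldp/energy_technical/energy_technical.json`. A minimal sketch, assuming HDFS is reachable at a placeholder NameNode address:

```json
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:9000/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": { "type": "json" },
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "tsv": { "type": "text", "extensions": ["tsv"], "delimiter": "\t" }
  }
}
```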
Using Apache Drill in Distributed Mode
Start a Drillbit on each node in a cluster to use Apache Drill in the distributed mode.
To start Apache Drill, use the below command on each node:

bin/drillbit.sh start
To stop Apache Drill, use the below command:

bin/drillbit.sh stop
Verifying Drill Setup
To verify Apache Drill setup, use the below command:
cd apache-drill-1.10.0
bin/sqlline -u jdbc:drill:zk=local
!connect jdbc:drill:zk=local:2181
show databases;
To check Drillbits running on the cluster, use the below command:
0: jdbc:drill:zk=<zk1host>:<port> SELECT * FROM sys.drillbits;
Querying Data Sources With Apache Drill
Apache Drill’s self-describing data exploration behavior allows users to query different data files from diverse data stores.
The following query joins all three data sources and retrieves the symbol, status, name, MarketCap, ATR, SMA20 moving average, sector, and sub-sector.
select b.Symbol, b.Status, b.Name, b.MarketCap, a.ATR, a.SMA20, c.sector, c.subsector
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
     hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` c
where a.Ticker = b.Symbol and a.Ticker = c.ticker
limit 10;
Total Volume Based on Subsector
select sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.subsector
limit 5;
Minimum Volume for Tickers
select a.ticker as ticker, min(b.volume) as min_volume
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker
order by a.ticker desc
limit 10;
Subsector With ATR > 2
select b.subsector, a.ATR
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
     mysqldb.energy_data.`energy_overview` b
where a.Ticker = b.ticker
group by b.subsector, a.ATR
having cast(a.ATR as float) > 2;
Tickers With W52High Between -20 and -100, With Their Last Price
select symbol, lastprice, change, changepercent
from hive.stock_market.`stock_market_exchange`
where symbol in (
  select ticker
  from dfs.`/user/tsldp/energy_technical/energy_technical.json`
  where cast(W52High as float) < -20 and cast(W52High as float) > -100
  order by ticker desc
  limit 10
);
Loading Output Files in HDFS
To load the result sets as files in HDFS, configure a writable workspace in the DFS storage plugin in the Drill web console.
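For example, csvOut and jsonOut workspaces can be added to the workspaces section of the dfs plugin. The csv_output location matches the output path used later in this article; the json_output location is an assumed counterpart for JSON exports:

```json
"workspaces": {
  "csvOut": {
    "location": "/user/tsldp/csv_output",
    "writable": true,
    "defaultInputFormat": null
  },
  "jsonOut": {
    "location": "/user/tsldp/json_output",
    "writable": true,
    "defaultInputFormat": null
  }
}
```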
Exporting Output Files
The result sets of the join queries are exported into CSV file format by using the csvOut workspace, and into JSON file format by using the jsonOut workspace in the HDFS storage plugin.
Switch to csvOut Workspace
To switch to the csvOut workspace (assuming it is defined under the dfs plugin), use the below command:

use dfs.csvOut;
Change File Storage Format to CSV
To change file storage format to CSV, use the below command:
alter session set `store.format`='csv';
Exporting Output File Into CSV Format
create table total_resultset as
select a.ticker as ticker, sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker, a.subsector;
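Exporting the same result set as JSON through the jsonOut workspace follows the same pattern; a sketch, assuming jsonOut is defined as a writable workspace under the dfs plugin:

```sql
-- switch to the writable JSON workspace
use dfs.jsonOut;

-- change the CTAS output format for this session
alter session set `store.format`='json';

-- re-run the aggregation as a CTAS; files land in the jsonOut location
create table total_resultset_json as
select a.ticker as ticker, sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker, a.subsector;
```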
The output files are exported and stored under the location /user/tsldp/csv_output/total_resultset/.
And that's it!
Published at DZone with permission of Rathnadevi Manivannan.