Drilling Data With Apache Drill
Learn about configuring and connecting different data sources and querying data from these data sources with Apache Drill.
Apache Drill is an open-source, low-latency SQL-on-Hadoop query engine for large datasets. The latest version of Apache Drill is 1.10, which adds features such as CTTAS (CREATE TEMPORARY TABLE AS), an improved Web console, and JDBC connectivity. It can be integrated with several data sources and file formats such as CSV, JSON, TSV, PSV, Avro, and Parquet, all of which can be combined in a single query.
In this blog, let's discuss configuring and connecting different data sources and querying data from these data sources with Apache Drill. To know more, refer to our previous blog post on drilling data with Apache Drill.
In this use case, we persist data files in different data sources such as MySQL, HDFS, and Hive, query them on the fly, export the query output in different file formats such as CSV, JSON, and TSV, and load the result sets into an HDFS location.
To demonstrate Apache Drill's capability, financial stock data was downloaded in parts in different formats (JSON, CSV, and TSV) and loaded into the different data sources.
Install Apache Drill 1.10.0 in Distributed Mode on Linux
wget http://apache.mirrors.hoobly.com/drill/drill-1.10.0/apache-drill-1.10.0.tar.gz
tar xf apache-drill-1.10.0.tar.gz
Note: Drill requires JDK 1.7 and above.
- Configure and connect different data sources such as MySQL, HDFS, and Hive.
- Query the data sources on the fly.
- Export query output in different file formats such as CSV, JSON, and TSV.
- Store the output file in HDFS location.
Configuring Different Data Sources
Let's look at RDBMS storage plugins, Hive storage plugins, and file storage plugins.
RDBMS Storage Plugins
Apache Drill connects to MySQL through MySQL's JDBC driver (mysql-connector-java); copy the driver JAR into Drill's jars/3rdparty directory on each node before enabling the plugin.
To connect Apache Drill with MySQL, configure the RDBMS storage plugin in the Apache Drill web console.
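A minimal JDBC storage plugin configuration for MySQL looks like the following sketch. The hostname, port, and credentials are placeholders for your environment; saving the plugin under the name mysqldb lets tables be addressed as mysqldb.energy_data.`energy_overview`, as in the queries below.

```json
{
  "type": "jdbc",
  "driver": "com.mysql.jdbc.Driver",
  "url": "jdbc:mysql://localhost:3306",
  "username": "root",
  "password": "secret",
  "enabled": true
}
```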
Hive Storage Plugins
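To query Hive tables such as stock_market.`stock_market_exchange`, enable the Hive storage plugin and point it at the Hive metastore. A minimal sketch, assuming a local metastore and NameNode (both URIs are placeholders for your cluster):

```json
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://localhost:9083",
    "fs.default.name": "hdfs://localhost:9000/",
    "hive.metastore.sasl.enabled": "false"
  }
}
```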
File Storage Plugin
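The dfs file storage plugin maps file system paths and formats so that files can be queried directly, as in dfs.`/user/tsldp/energy_technical/energy_technical.json`. A minimal sketch, assuming HDFS is reachable at a placeholder NameNode address:

```json
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:9000/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": { "type": "json" },
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," },
    "tsv": { "type": "text", "extensions": ["tsv"], "delimiter": "\t" }
  }
}
```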
Using Apache Drill in Distributed Mode
Start a Drillbit on each node in a cluster to use Apache Drill in the distributed mode.
To start Apache Drill, use the below command on each node:

bin/drillbit.sh start
To stop Apache Drill, use the below command:

bin/drillbit.sh stop
Verifying Drill Setup
To verify Apache Drill setup, use the below command:
cd apache-drill-1.10.0
bin/sqlline -u jdbc:drill:zk=local
!connect jdbc:drill:zk=local:2181
show databases;
To check Drillbits running on the cluster, use the below command:
0: jdbc:drill:zk=<zk1host>:<port> SELECT * FROM sys.drillbits;
Querying Data Sources With Apache Drill
Apache Drill’s self-describing data exploration behavior allows users to query different data files from diverse data stores.
The following query joins all three data sources and retrieves the symbol, status, name, MarketCap, ATR, SMA20 moving average, sector, and sub-sector.
select b.Symbol, b.Status, b.Name, b.MarketCap, a.ATR, a.SMA20, c.sector, c.subsector
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
     hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` c
where a.Ticker = b.Symbol and a.Ticker = c.ticker
limit 10;
Total Volume Based on Subsector
select sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.subsector
limit 5;
Minimum Volume for Tickers
select a.ticker as ticker, min(b.volume) as min_volume
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker
order by a.ticker desc
limit 10;
Subsector With ATR > 2
select b.subsector, a.ATR
from dfs.`/user/tsldp/energy_technical/energy_technical.json` a,
     mysqldb.energy_data.`energy_overview` b
where a.Ticker = b.ticker
group by b.subsector, a.ATR
having cast(a.ATR as float) > 2;
Tickers With W52High Between -20 and -100, With Their Last Price
select symbol, lastprice, change, changepercent
from hive.stock_market.`stock_market_exchange`
where symbol in (
  select ticker
  from dfs.`/user/tsldp/energy_technical/energy_technical.json`
  where cast(W52High as float) < -20 and cast(W52High as float) > -100
  order by ticker desc
  limit 10
);
Loading Output Files in HDFS
To load the result sets as files in HDFS, configure a writable workspace in the DFS storage plugin in the Drill web console.
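For example, csvOut and jsonOut workspaces can be added to the workspaces section of the dfs plugin. The csv_output location matches the output path used later in this article; the json_output location is an assumed counterpart for JSON exports:

```json
"workspaces": {
  "csvOut": {
    "location": "/user/tsldp/csv_output",
    "writable": true,
    "defaultInputFormat": null
  },
  "jsonOut": {
    "location": "/user/tsldp/json_output",
    "writable": true,
    "defaultInputFormat": null
  }
}
```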
Exporting Output Files
The result sets of the join queries are exported into CSV file format by using the csvOut workspace, and into JSON file format by using the jsonOut workspace in the HDFS storage plugin.
Switch to csvOut Workspace
To switch to the csvOut workspace (assuming it is defined under the dfs plugin), use the below command:

use dfs.csvOut;
Change File Storage Format to CSV
To change file storage format to CSV, use the below command:
alter session set `store.format`='csv';
Exporting Output File Into CSV Format
create table total_resultset as
select a.ticker as ticker, sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker, a.subsector;
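Exporting the same result set as JSON through the jsonOut workspace follows the same pattern; a sketch, assuming jsonOut is defined as a writable workspace under the dfs plugin:

```sql
-- switch to the writable JSON workspace
use dfs.jsonOut;

-- change the CTAS output format for this session
alter session set `store.format`='json';

-- re-run the aggregation as a CTAS; files land in the jsonOut location
create table total_resultset_json as
select a.ticker as ticker, sum(b.volume) as total_volume, a.subsector
from hive.stock_market.`stock_market_exchange` b,
     mysqldb.energy_data.`energy_overview` a
where a.ticker = b.Symbol
group by a.ticker, a.subsector;
```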
The output files are exported and stored under the location /user/tsldp/csv_output/total_resultset/.
And that's it!
Published at DZone with permission of Rathnadevi Manivannan.