Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Apache Drill: A Standalone Quickstart

DZone's Guide to

Apache Drill: A Standalone Quickstart

In this quick tutorial, you will get a breakdown of how to get Apache Drill up and running on your machine and getting started with data analysis.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Image title

Use Case

The requirements:

  1. Query Ngnix access logs with 3rd party tool using JDBC

  2. Quick as possible

  3. Nginx logs saved in two formats (.log, .gz)

  4. Using cloud forbidden

  5. See #2

  6. See #5

Apache Drill: Quickstart in 5 Minutes

Access https://drill.apache.org/, and download and run.

Image title

(or access from here: https://drill.apache.org/download/)

After extracting files, access from the terminal:

Image title

Just run the following command:

./bin/drill-embedded

Image title

Now you can access some UI : http://localhost:8047/

Apache Drill web UI

Wohho now it's UP AND RUNNING!

Let's Test

Best way to test is to create sample data... let's create simple csv.

header1,header2,header3
1,2,3
2,3,3

So, I created the file test.csv.

Image title

Remember the ./bin/drill-embeddedcommand?

Let's run inside:  select * from dfs.root.`/home/yehuda/Projects/Analyze/dzone-is-

awesome/test.csv`; 

Image title

Yeah! It looks better... let's improve it some more.

Go to the Apache drill website and click on "Storage" from the top menu.

Image title

Click "update" on dfs.

Image title

We will edit in the JSON "formats" section in the CSV subsection and write this code:

"csv": {
  "type": "text",
  "extensions": [
    "csv"
  ],
  "extractHeader": true,
  "delimiter": ","
},

Click "Update," and re-run SQL:

Image title

Wow! It works, I can even query headers!

Clean Up and Do Real Work

Let's clean up/remove the test.csv.

Small research.

Our files:

Image title

We have: 

  • log - text

  • log.1 - text

  • gz - gzip archive

What's inside:

Image title

Standard Nginx access logs - textual with space separator and strings are quoted.

We will add nested to the "csv", "log" section.

"csv": { 
      ... snip ...
    },
"log": {
      "type": "text",
      "extensions": [
        "log",
        "gz",
        "1"
      ],
      "delimiter": " "
    },

Let's save this, and run this query: 

select * from dfs.root.`/home/yehuda/Projects/Analyze/dzone-is-awesome/access.log
*` limit 5;

Image title

Data Preparation

Let's make it more useful

use dfs.tmp; -- Writebale schema, allows me to create view

-- Create view and casting data
create view dfs.tmp.access_logs_view as SELECT 
`columns`[0] AS `IP`, 
TO_TIMESTAMP(LTRIM(columns[3], '['), 'dd/MMM/yyyy:HH:mm:ss') AS `log_date`, 
SUBSTR(columns[5], 1, STRPOS(columns[5], ' ')) AS `req`, 
CONVERT_TO(TRIM(BOTH ' ' FROM SUBSTR(columns[5], STRPOS(columns[5], ' '))), 'UTF8') AS `url`, 
columns[5] AS `furl`, CAST(columns[6] AS INTEGER) AS `status_code`, 
CAST(columns[7] AS INTEGER) AS `blangth`, 
columns[8] AS `refference`, 
columns[9] AS useragent
FROM dfs.root.`/home/yehuda/Projects/Analyze/dzone-is-awesome/access.log*`;

Image title

select * 
from dfs.tmp.access_logs_view
where log_date > '2018-04-08 06:40:00'
and log_date < '2018-04-08 07:00:00'
LIMIT 5 

Image title

Wow, awesome. Let's connect it to JDBC.... to jdbc:drill:drillbit=localhost 

Image title

And let's start doing some analytics!

Image title

I achieved this goal in less than 5 minutes.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
apache drill ,log analysis ,log analytics ,big data

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}