Refcard #223

SQL Syntax for Apache Drill

Using SQL for the SQL-on-Everything Engine

Written by

Rick Van Der Lans Analyst, R20/Consultancy

Acquaints you with the agility and flexibility of Apache Drill and enables you to easily query non-relational datastores, including Hadoop. Discover the benefits of data agility and gain faster insights into BI and analytics through advanced SQL queries on massive datasets.

Brought to you by MapR
Section 1

Introduction

This Refcard describes version 1.2 of Apache Drill released in October 2015. Apache Drill is an open-source SQL-on-Everything engine. It allows SQL queries to be executed on any kind of data source, ranging from simple CSV files to advanced SQL and NoSQL database servers.

Drill has three distinguishing features:

  • Drill can access flat, relational data structures as well as data sources with non-relational structures, such as arrays, hierarchies, maps, nested tables, and complex data types. Besides being able to access these data sources, Drill can run queries that join data across multiple data sources, including non-relational and relational ones.
  • Drill allows for schema-less access of data. In other words, it doesn’t need access to the schema definition of the data source. It doesn’t need to know the structure of the tables in advance, nor does it need statistical data. It goes straight for the data. The schema of the query result is therefore not known in advance. It’s built up and derived when data comes back from the data source. During the processing of all the data, the schema of the query result is continuously adapted. For example, the schema of a Hadoop file (or any data source) doesn’t have to be documented in Apache Hive to make it accessible for Drill.
  • Unlike most SQL database servers, Drill doesn't have its own database. It has been designed and optimized to access other data sources. There is rarely a need to copy data from a data source to Hadoop to make it accessible by Drill.

Apache Drill's Supported Plugins for Data Access

  • CSV and TSV files
  • Hadoop files with Parquet and AVRO file formats, including Amazon S3
  • NoSQL databases, such as MongoDB and Apache HBase
  • SQL database servers, such as MySQL and Oracle, through an ODBC/JDBC interface
  • Files with JSON or BSON data structures
  • SQL-on-Hadoop engines, such as Apache Hive and Impala

Note: This Refcard uses several example files (CSV and JSON files) to demonstrate several of Drill's capabilities.

Section 2

Installing and Starting Up Drill

The following webpage contains detailed descriptions on how to install Drill on various platforms: https://drill.apache.org/docs/install-drill-introduction/.

After installation of the software, Drill can be started by using the command-line tool SQLLine:

sqlline.bat -u "jdbc:drill:"

Depending on the platform, the result looks something like this:

DRILL_ARGS - " -u jdbc:drill:"
Calculating Drill classpath...
oct 26, 2015 9:33:46 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.2.0
"json ain't no thang"
0: jdbc:drill:>

Drill is now ready to query data. To check if it really works, use the following query:

VALUES(CURRENT_DATE);

Result:

+---------------+
| CURRENT_DATE  |
+---------------+
| 2015-10-28    |
+---------------+

To stop Drill, enter:

!quit
Section 3

First SQL Queries

Create a simple text file with the following text and call it HelloWorld.csv:

HelloWorld!

Query the file:

SELECT * FROM dfs.`\MyDirectory\HelloWorld.csv`;

Result:

+------------------+
|     columns      |
+------------------+
| ["HelloWorld!"]  |
+------------------+

The unique thing about this query is its use of the FROM clause. Instead of a table name, it contains a reference to the file to be accessed. The term dfs represents one of the supported storage plugins; this particular plugin indicates that a file in the local file system is accessed. The storage plugin is followed by a file specification containing the correct directory and the file name. Note that the file specification must be enclosed in backticks, not single quotes. The file contains only one line with data, so only one row is returned. Because no column names are specified, the column name columns is used.
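
Each CSV row is returned as an array called columns, and individual values can be addressed by index (a sketch; the file created above contains only one value per line):

SELECT columns[0] AS greeting
FROM   dfs.`\MyDirectory\HelloWorld.csv`;

Result:

+--------------+
|   greeting   |
+--------------+
| HelloWorld!  |
+--------------+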

Create a file with the following content and call it "Employees.json":

{ "number"  : "6",
  "name"    : "Manzarek",
  "initials": "R",
  "street"  : "Haseltine Lane",
  "town"    : "Phoenix" }
{ "number"  : "8",
  "name"    : "Young",
  "initials": "N",
  "street"  : "Brownstreet",
  "mobile"  : "1234567" }
{ "number"  : "15",
  "name"    : "Metheny",
  "initials": "M",
  "province": "South"
}

Query the file:

SELECT * FROM dfs.`\MyDirectory\Employees.json`;

The JSON data structure is turned into a flat table, and for the missing values in specific columns, the null value is presented. Result:

+---------+-----------+-----------+-----------------+----------+----------+-----------+
| number  |   name    | initials  |     street      |   town   |  mobile  | province  |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| 6       | Manzarek  | R         | Haseltine Lane  | Phoenix  | null     | null      |
| 8       | Young     | N         | Brownstreet     | null     | 1234567  | null      |
| 15      | Metheny   | M         | null            | null     | null     | South     |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
Section 4

Querying Nested Data Structures

Many data sources, such as MongoDB, Hadoop with AVRO, and JSON files, contain nested data structures. In relational terminology they would be called columns within columns. To address these nested columns, a specific syntax is introduced. Several examples are used to illustrate this syntax.

Create a file with the following content and call it "EmployeesNested.json":

{ "employee" : {
   "number"  : "6",
   "name"    : { "lastname": "Manzarek",
                 "initials": "R" },
   "address" : { "street"  : "Haseltine Lane",
                 "houseno" : "80",
                 "postcode": "1234KK",
                 "town"    : "Stratford" } }
}
{ "employee" : {
   "number"  : "8",
   "name"    : { "lastname": "Young",
                 "initials": "N" },
   "address" : { "street"  : "Brownstreet",
                 "houseno" : "80",
                 "province": "ZH",
                 "town"    : "Boston" } }
}
{ "employee" : {
   "number"  : "15",
   "name"    : { "lastname": "Metheny",
                 "initials": "M",
                 "code"    : "45" } }
}

Query the file:

SELECT emp.employee.number AS enr, emp.employee.name.lastname AS lastname
FROM   dfs.`\MyDirectory\EmployeesNested.json` AS emp;

Result:

+------+-----------+
| enr  | lastname  |
+------+-----------+
| 6    | Manzarek  |
| 8    | Young     |
| 15   | Metheny   |
+------+-----------+

To refer to the employee numbers the specification emp.employee.number is used. Emp refers to the table name specified in the FROM clause. Specifying this table name is required when dealing with nested tables, otherwise Drill may confuse columns with tables and therefore won’t run the query. Next, employee refers to the first level in the JSON document and number to the next level. To get the lastname of an employee, employee must be specified, followed by name and lastname. Assigning column names to these two result columns is not required, but does make the result easier to read.
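
Deeper nesting levels are addressed in the same way. For example, the following sketch retrieves a column from the nested address structure (employee 15 has no address, so a null value is returned):

SELECT emp.employee.number AS enr, emp.employee.address.town AS town
FROM   dfs.`\MyDirectory\EmployeesNested.json` AS emp;

Result:

+------+------------+
| enr  |    town    |
+------+------------+
| 6    | Stratford  |
| 8    | Boston     |
| 15   | null       |
+------+------------+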

Section 5

Querying Arrays

Many data sources contain arrays (sometimes called repeating groups). Drill supports the flatten function to transform these arrays into flat, relational data structures. Create a file with the following content (each employee can work for several projects) and call it EmployeesArrays.json:

{ "number"  : "8",
  "projects": [ "ACP3", "FGTR" ]
}
{ "number"  : "15",
  "projects": [ "ACP3", "HHGT", "X456" ]
}

Query the file:

SELECT * FROM dfs.`\MyDirectory\EmployeesArrays.json`;

Result:

+---------+-------------------------+
| number  |        projects         |
+---------+-------------------------+
| 8       | ["ACP3","FGTR"]         |
| 15      | ["ACP3","HHGT","X456"]  |
+---------+-------------------------+

In each row (so for each employee), the projects column contains a set of project values. To see them as separate values, use the flatten function:

SELECT FLATTEN(projects) AS project, number AS enr
FROM   dfs.`\MyDirectory\EmployeesArrays.json`;

In the result each row contains a separate project value and the number of the employee to which the project value belongs:

+----------+------+
| project  | enr  |
+----------+------+
| ACP3     | 8    |
| FGTR     | 8    |
| ACP3     | 15   |
| HHGT     | 15   |
| X456     | 15   |
+----------+------+

Besides flatten, Drill supports two additional functions to work with the contents of arrays: repeated_count (counts the number of values in an array) and repeated_contains (searches for a specific value in an array):

SELECT number AS enr, REPEATED_COUNT(projects) AS NumberOfProjects,
       REPEATED_CONTAINS(projects,'FGTR') AS ContainsFGTR
FROM   dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+------+-------------------+---------------+
| enr  | NumberOfProjects  | ContainsFGTR  |
+------+-------------------+---------------+
| 8    | 2                 | true          |
| 15   | 3                 | false         |
+------+-------------------+---------------+

Individual values can be retrieved from an array:

SELECT number, projects[0], projects[1], projects[2]
FROM   dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+---------+---------+---------+---------+
| number  | EXPR$1  | EXPR$2  | EXPR$3  |
+---------+---------+---------+---------+
| 8       | ACP3    | FGTR    | null    |
| 15      | ACP3    | HHGT    | X456    |
+---------+---------+---------+---------+
Section 6

Querying Maps With Data

The kvgen (Key-Value Generation) function transforms maps that contain keys instead of a fixed schema into arrays of key-value pairs. Create a file with the following content and call it "CarOwners.json". This file contains, for each car owner, the number of cars owned per manufacturer. The element cars doesn't have a schema, but instead contains a set of keys (car manufacturers), and each key has a value (the number of cars of that particular manufacturer):

{ "owner" : "1",
  "cars"  : { "Ford" : 3,
              "BMW"  : 2,
              "Ferrari" : 1 }
}
{ "owner" : "2",
  "cars"  : { "BMW" : 4,
              "GM"  : 5 }
}

Query the file:

SELECT KVGEN(cars) AS cars FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+---------------------------------------------------------------------------------+
|                                      cars                                       |
+---------------------------------------------------------------------------------+
| [{"key":"Ford","value":3},{"key":"BMW","value":2},{"key":"Ferrari","value":1}]  |
| [{"key":"BMW","value":4},{"key":"GM","value":5}]                                |
+---------------------------------------------------------------------------------+

The effect of the kvgen function is that the car data inside the cars element is transformed into an array in which each element is a combination of a key and a value. The result of this function can subsequently be processed by the flatten function to turn each key-value combination into a separate row:

SELECT owner, FLATTEN(KVGEN(cars)) AS cars FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+--------+------------------------------+
| owner  |             cars             |
+--------+------------------------------+
| 1      | {"key":"Ford","value":3}     |
| 1      | {"key":"BMW","value":2}      |
| 1      | {"key":"Ferrari","value":1}  |
| 2      | {"key":"BMW","value":4}      |
| 2      | {"key":"GM","value":5}       |
+--------+------------------------------+
Section 7

Combining Multiple Functions in Queries

Apache Drill supports all the query features to be expected from a SQL product. The next example shows how the special functions can be combined with more traditional joins and window functions. Create a file with the following content and call it "EmployeesProjects.json". Each line in this file indicates how many hours an employee has worked on a project on a specific day.

{"enr": "8", "project":"ACP3", "date":"2015-10-01", "hours":"4"}
{"enr": "8", "project":"ACP3", "date":"2015-10-04", "hours":"5"}
{"enr": "8", "project":"FGTR", "date":"2015-10-02", "hours":"2"}
{"enr":"15", "project":"ACP3", "date":"2015-10-01", "hours":"7"}
{"enr":"15", "project":"ACP3", "date":"2015-10-03", "hours":"5"}
{"enr":"15", "project":"HHGT", "date":"2015-10-01", "hours":"4"}
{"enr":"15", "project":"HHGT", "date":"2015-10-05", "hours":"2"}
{"enr":"15", "project":"HHGT", "date":"2015-10-07", "hours":"8"}
{"enr":"15", "project":"X456", "date":"2015-10-01", "hours":"6"}

Join this file with the EmployeesNested file:

SELECT CAST(proj.`date` AS DATE) AS pdate, emp.employee.number AS enr,
       emp.employee.name.lastname AS ename, proj.project, proj.hours,
       SUM(CAST(proj.hours AS INTEGER))
          OVER(PARTITION BY proj.`date`, emp.employee.number) AS sum_hours
FROM   dfs.`\MyDirectory\EmployeesNested.json` AS emp LEFT OUTER JOIN
       dfs.`\MyDirectory\EmployeesProjects.json` AS proj
       ON CAST(emp.employee.number AS INTEGER) = CAST(proj.enr AS INTEGER)
ORDER  BY 1, 2;

Result:

+-------------+------+-----------+----------+--------+------------+
|    pdate    | enr  |   ename   | project  | hours  | sum_hours  |
+-------------+------+-----------+----------+--------+------------+
| 2015-10-01  | 15   | Metheny   | ACP3     | 7      | 17         |
| 2015-10-01  | 15   | Metheny   | HHGT     | 4      | 17         |
| 2015-10-01  | 15   | Metheny   | X456     | 6      | 17         |
| 2015-10-01  | 8    | Young     | ACP3     | 4      | 4          |
| 2015-10-02  | 8    | Young     | FGTR     | 2      | 2          |
| 2015-10-03  | 15   | Metheny   | ACP3     | 5      | 5          |
| 2015-10-04  | 8    | Young     | ACP3     | 5      | 5          |
| 2015-10-05  | 15   | Metheny   | HHGT     | 2      | 2          |
| 2015-10-07  | 15   | Metheny   | HHGT     | 8      | 8          |
| null        | 6    | Manzarek  | null     | null   | null       |
+-------------+------+-----------+----------+--------+------------+
Section 8

Definitions of the SQL Statements for Querying Data Sources

This section contains the definitions of the SQL statements supported by Drill related to querying.

Query Statement

Definition

SELECT

<select statement> ::=
   [ WITH <table name> [ ( <column name> [ , <column name> ]... ) ]
       AS ( <table expression> ) ]
   <table expression>

<table expression> ::=
   SELECT <select clause>
   FROM <from clause>
   [ WHERE <boolean expression> ]
   [ GROUP BY <expression> [ , <expression> ]... ]
   [ HAVING <boolean expression> ]
   [ ORDER BY <expression> [ ASC | DESC ] [ , <expression> [ ASC | DESC ] ]... ]
   [ LIMIT { <count> | ALL } ]
   [ OFFSET <number> { ROW | ROWS } ]

<from clause> ::=
   <table reference> [ , <table reference> ]...


<table reference> ::=
   { <table specification>  |
     <join specification>   |
     ( <table expression> ) |
     VALUES ( <expression list> ) [ , ( <expression list> ) ]... }
   [ [ AS ] <alias name> [ ( <column name> [ , <column name> ]... ) ] ]

<table specification> ::=
   <table name> | <storage plugin> . `<workspace>`

<join specification> ::=
   <table reference> <join type> <table reference> [ <join condition> ]

<join condition> ::= ON <boolean expression>

<join type> ::=
   [INNER] JOIN | LEFT [OUTER] JOIN | RIGHT [OUTER] JOIN | FULL [OUTER] JOIN

<boolean expression> ::=
   <boolean expression> { AND | OR } <boolean expression> |
   NOT <boolean expression> |
   <case expression> |
   <expression> { < | > | <= | >= | = | <> } <expression> |
   <expression> [ NOT ] BETWEEN <expression> AND <expression> |
   <expression> IN ( <expression> [ , <expression> ]... ) |
   <expression> { LIKE | ILIKE | NOT LIKE | SIMILAR TO | NOT SIMILAR TO }
           <expression> |
   <expression> { IS [ NOT ] NULL | IS [ NOT ] FALSE | IS [ NOT ] TRUE } |
   <expression> { IN | ANY | ALL } ( <table expression> ) |
   EXISTS ( <table expression> ) |
   <expression> || <expression>

<case expression> ::=
CASE
    WHEN <boolean expression> THEN <expression>
  [ WHEN <boolean expression> THEN <expression> ]...
  [ ELSE <expression> ]
END
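
A case expression can be tried out against the Employees.json file from Section 3. The following sketch substitutes a default value for missing towns:

SELECT name,
       CASE WHEN town IS NULL THEN 'unknown' ELSE town END AS town
FROM   dfs.`\MyDirectory\Employees.json`;

Result:

+-----------+----------+
|   name    |   town   |
+-----------+----------+
| Manzarek  | Phoenix  |
| Young     | unknown  |
| Metheny   | unknown  |
+-----------+----------+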

VALUES

<values statement> ::=
   VALUES ( <expression list> ) [ , ( <expression list> ) ]...

<expression list> ::=
   <expression> [ , <expression> ]...

EXPLAIN

EXPLAIN PLAN
   [ INCLUDING ALL ATTRIBUTES]
   [ WITH IMPLEMENTATION | WITHOUT IMPLEMENTATION ]
   FOR <select statement>

Section 9

Definitions of Data Definition Statements

This section contains the definitions of the data definition statements supported by Drill.

SQL Statement

Definition

CREATE TABLE

CREATE TABLE <table name> [ ( <column list> ) ]
   [ PARTITION BY ( <column name> [ , ... ] ) ]
   AS <select statement>

CREATE VIEW

CREATE [OR REPLACE] VIEW [ <workspace> . ] <view name>
   [ ( <column name> [ , ... ] ) ]
   AS <select statement>

DROP TABLE

DROP TABLE [ <workspace> . ] <table name>

DROP VIEW

DROP VIEW [ <workspace> . ] <view name>

DESCRIBE

DESCRIBE [ <workspace> . ] { <table name> | <view name> }

SHOW DATABASES

SHOW DATABASES

SHOW SCHEMAS

SHOW SCHEMAS

Shows all the schemas that can be used.

SHOW FILES

SHOW FILES
   [ FROM <filesystem> . <directory name> | IN <filesystem> . <directory name> ]

SHOW TABLES

SHOW TABLES

Section 10

Definitions of SQL Statements for Working with Schemas and Sessions

This section contains the definitions of the statements supported by Drill for working with schemas and sessions.

SQL Statement

Definition

USE

USE <schema name>

ALTER SESSION

ALTER SESSION SET `<option name>` = <value>

ALTER SYSTEM

ALTER SYSTEM SET `<option name>` =  <value>

Section 11

Data Types

The following data types are supported by Drill and can be used when converting the data types of values.

Data type

Description

Example

BIGINT

8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

9223372036854775807.

BINARY

Variable-length byte string

B@e6d9eb7.

BOOLEAN

True or false

true.

DATE

Years, months, and days in YYYY-MM-DD format since 4713 BC

2015-12-30.

DECIMAL(p,s), DEC(p,s), or NUMERIC(p,s)

38-digit precision number; precision is p and scale is s

DECIMAL(6,2) is 1234.56: 4 digits before and 2 digits after the decimal point.

FLOAT

4-byte floating point number

0.456.

DOUBLE or DOUBLE PRECISION

8-byte floating point number, precision-scalable

0.456.

INTEGER or INT

4-byte signed integer in the range -2,147,483,648 to 2,147,483,647

2147483646.

INTERVAL

A day-time or year-month interval

'1 10:20:30.123' (day-time) or '1-2' year to month (year-month).

Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.

SMALLINT

2-byte signed integer in the range -32,768 to 32,767

32000.

This data type is not supported in version 1.2.

TIME

24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss

22:55:55.23.

TIMESTAMP

JDBC timestamp in year, month, date, hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS

2015-12-30 22:55:55.23.

CHARACTER VARYING, CHARACTER, CHAR, or VARCHAR

UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647.

CHAR(30) casts data to a 30-character string maximum.
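
Values can be converted to these data types with the CAST function (described in Section 12). A quick sketch using the VALUES statement:

VALUES(CAST('2015-12-30' AS DATE));

Result:

+-------------+
|   EXPR$0    |
+-------------+
| 2015-12-30  |
+-------------+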

Section 12

SQL Scalar Functions

This section contains descriptions of all the scalar functions supported by Drill that can be used in any expression.

The processing logic of all the scalar functions can easily be tested by using the VALUES statement. For example, with the following statement the CBRT function can be tested:

VALUES(CBRT(64));

Result:

+---------+
| EXPR$0  |
+---------+
| 4.0     |
+---------+ 

Numeric function

Data type of result

Definition

ABS(x)

Data type of x

Returns the absolute value of the input argument x.

CBRT(x)

FLOAT8

Returns the cubic root of x.

CEIL(x) and CEILING(x)

Data type of x

Returns the smallest integer not less than x.

DEGREES(x)

FLOAT8

Converts x radians to degrees.

E()

FLOAT8

Returns 2.718281828459045.

EXP(x)

FLOAT8

Returns e (Euler's number) to the power of x.

FLOOR(x)

Data type of x

Returns the largest integer not greater than x.

LOG(x)

FLOAT8

Returns the natural log (base e) of x.

LOG(x, y)

FLOAT8

Returns the log of y with base x.

LOG10(x)

FLOAT8

Returns the common log of x.

LSHIFT(x, y)

Data type of x

Shifts the binary x by y times to the left.

MOD(x, y)

FLOAT8

Returns the remainder of x divided by y.

NEGATIVE(x)

Data type of x

Returns x as a negative number.

PI

FLOAT8

Returns pi.

POW(x, y)

FLOAT8

Returns the value of x to the y power.

RADIANS(x)

FLOAT8

Converts x degrees to radians.

RAND

FLOAT8

Returns a random number from 0-1.

ROUND(x)

Data type of x

Rounds to the nearest integer.

ROUND(x, y)

DECIMAL

Rounds x to y decimal places.

RSHIFT(x, y)

Data type of x

Shifts the binary x by y times to the right.

SIGN(x)

INT

Returns the sign of x.

SQRT(x)

Data type of x

Returns the square root of x.

TRUNC(x [ , y ] )

Data type of x

Truncates x to y decimal places. Y is optional. Default is 1.

TRUNC(x, y)

DECIMAL

Truncates x to y decimal places.

Date/time function

Data type of result

Definition

AGE(x [, y ] )

INTERVALDAY or INTERVALYEAR

Returns interval between two timestamps or subtracts a timestamp from midnight of the current date.

CURRENT_DATE

DATE

Returns the current date.

CURRENT_TIME

TIME

Returns the current time.

CURRENT_TIMESTAMP

TIMESTAMP

Returns the current timestamp.

DATE_ADD(x,y)

DATE, TIMESTAMP

Returns the sum of a date/time and a number of days/hours, or of a date/time and a date/time interval. X must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be added. In other situations an interval must be used. Use the CAST function to create an interval.
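
Like the scalar functions in Section 12, DATE_ADD can be tested directly with the VALUES statement. A sketch that adds 2 days to a date literal:

VALUES(DATE_ADD(DATE '2015-10-01', 2));

Result:

+-------------+
|   EXPR$0    |
+-------------+
| 2015-10-03  |
+-------------+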

DATE_PART(x,y)

DOUBLE

Extracts a time unit from the value of a time, date, or timestamp expression (y). Allowed time units for x are second, minute, hour, day, month, and year. A time unit must be specified between single quotes. Y represents a date, time, or timestamp.

DATE_SUB(x,y)

DATE, TIMESTAMP

Subtracts an interval (y) from a date or timestamp expression (x). X must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be subtracted. In other situations an interval must be used. Use the CAST function to create an interval.

EXTRACT(x FROM y)

DOUBLE


Extracts a time unit from a date or timestamp expression (y). X indicates which time unit to extract. This must be one of the following values: SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR. The time unit must not be specified between quotes.
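
EXTRACT can likewise be tested with the VALUES statement; note that, in contrast with DATE_PART, the time unit is specified without quotes:

VALUES(EXTRACT(YEAR FROM DATE '2015-10-28'));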

LOCALTIME

TIME

Returns the local current time.

LOCALTIMESTAMP

TIMESTAMP

Returns the local current timestamp.

NOW()

TIMESTAMP

Returns the current timestamp.

TIMEOFDAY()

VARCHAR

Returns the current timestamp for the UTC time zone.

UNIX_TIMESTAMP( [ x] )

BIGINT

If no x is specified, it returns the number of seconds since the UNIX epoch (January 1, 1970 at 00:00:00). If x is specified, it must be a timestamp, and the number of seconds between the UNIX epoch and the timestamp x is returned.

String function

Data type of result

Definition

BYTE_SUBSTR(x,y [, z ] )

BINARY or VARCHAR

Returns, in binary format, the substring of x that starts at position y, optionally with a length of z.

CHAR_LENGTH(x)

INTEGER

Returns the length of the alphanumeric argument x.

CONCAT(x,y)

VARCHAR

Combines the two alphanumeric values x and y. Has the same effect as the || operator.

INITCAP(x)

VARCHAR

Returns x with the first letter of each word capitalized.

LENGTH(x)

INTEGER

Returns the length in bytes of the alphanumeric value x.

LOWER(x)

VARCHAR

Converts all upper-case letters of x to lower-case letters.

LPAD(x,y [ , z ] )

VARCHAR

The value of x is padded at the front (the left-hand side) with the value of z until the total length equals y. If y is smaller than the length of x, x is truncated on the left side. If no z is specified, blanks are used.
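
For example, padding a string on the left with zeros (a quick sketch using the VALUES statement):

VALUES(LPAD('42', 5, '0'));

Result:

+---------+
| EXPR$0  |
+---------+
| 00042   |
+---------+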

LTRIM(x)

VARCHAR

Removes all blanks that appear at the beginning of x.

POSITION( x IN y)

INTEGER

Returns the start position of the string x in the string y.

REGEXP_REPLACE(x,y,z)

VARCHAR

Substitutes new text for substrings that match Java regular expression patterns. In the string x, each match of the regular expression y is replaced by z.

RPAD(x,y [ , z ] )

VARCHAR

The value of x is padded at the end (the right-hand side) with the value of z until the total length equals y. If y is smaller than the length of x, x is truncated on the right side. If no z is specified, blanks are used.

RTRIM(x)

VARCHAR

Removes all blanks from the end of the value of x.

STRPOS(x,y)

INTEGER

Returns the start position of the string y in the string x.

SUBSTR(x,y [ , z ] )

VARCHAR

Extracts a substring of x, starting at position y, optionally with a length of z characters.

TRIM(x)

VARCHAR

Removes all blanks from the start and from the end of x. Blanks in the middle are not removed.

UPPER(x)

VARCHAR

Converts all lower-case letters of x to upper-case letters.

Data type conversion function

Data type of result

Definition

CAST(x AS y)

Data type of y

Converts the data type of x to y. Y must be one of the supported data types; see Section 11.

CONVERT_TO(x,y)

Data type of y

Converts x to binary data of the Drill internal type y, based on the little or big endian encoding of the data.

CONVERT_FROM(x,y)

Data type of y

Converts binary data x back from the Drill internal type y, based on the little or big endian encoding of the data.

Null handling function

Data type of result

Definition

COALESCE(x, y [ , y ]... )

Data type of y

Returns the first non-null argument in the argument list.

NULLIF(x,y )

Data type of x

Returns the value of x if x and y are not equal, and returns a null value if x and y are equal.

Section 13

SQL Aggregate and Window Functions

This section contains descriptions of all the aggregate and window functions supported by Drill that can be used in any expression.

Aggregate function

Data type of result

Definition

AVG(x)

DECIMAL for any integer-type argument, DOUBLE for a floating-point argument, otherwise the data type of x.

Calculates the average of all values in x.

COUNT(*)

BIGINT

Counts the number of rows.

COUNT([DISTINCT] x)

BIGINT


Counts the number of non-null values in x, and if DISTINCT is specified the number of different non-null values in x.

MAX(x)

Data type of x.

Determines the maximum value of all non-null values in x.

MIN(x)

Data type of x.

Determines the minimum value of all non-null values in x.

SUM(x)

BIGINT for SMALLINT or INTEGER arguments, DECIMAL for BIGINT arguments, DOUBLE for floating-point arguments, otherwise the same as data type of x.

Calculates the sum of all values in x.

The syntax definition for all the window functions in the table below is as follows:

<window_function_name> ( [ ALL ] <expression> )
   OVER ( [ PARTITION BY <expression list> ] [ ORDER BY <order list> ] )
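
As an illustration, the following sketch applies RANK to the EmployeesProjects.json file from Section 7 (assuming it is stored in \MyDirectory), ranking the working days of each employee by hours worked:

SELECT enr, project, `date`,
       RANK() OVER (PARTITION BY enr ORDER BY CAST(hours AS INTEGER) DESC) AS hours_rank
FROM   dfs.`\MyDirectory\EmployeesProjects.json`;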

Window function

Data type of result

Definition

AVG(x) OVER (y)

DECIMAL for any integer-type argument, DOUBLE for a floating-point argument, otherwise the same as the argument data type

Returns the average value for the input expression values. It works with numeric values and ignores null values.

COUNT( { * | x } ) OVER (y)

BIGINT

Counts the number of input rows. COUNT(*) counts all rows in the target table, whether or not they include nulls. COUNT(x) counts the number of rows with non-null values in a specific column or expression.

MAX(x) OVER (y)

Data type of x

Determines the maximum value of all non-null values in x.

MIN(x) OVER (y)

Data type of x

Determines the minimum value of all non-null values in x.

SUM(x) OVER (y)

BIGINT for SMALLINT or INTEGER arguments, DECIMAL for BIGINT arguments, DOUBLE for floating-point arguments, otherwise the data type of x

Calculates the sum of all values in x.

CUME_DIST() OVER (y)

DOUBLE PRECISION

Calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition).

DENSE_RANK() OVER (y)

BIGINT

Determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.

NTILE(x) OVER(y)

INTEGER

Divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.

PERCENT_RANK() OVER(y)

DOUBLE PRECISION

Calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.

RANK() OVER(y)

BIGINT

Determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.

ROW_NUMBER() OVER(y)

BIGINT

Determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers non-deterministically.

LAG(x) OVER (y)

Data type of x

Returns the value for the row before the current row in a partition. If no row exists, null is returned.

LEAD(x) OVER (y)

Data type of x

Returns the value for the row after the current row in a partition. If no row exists, null is returned.

FIRST_VALUE(x) OVER (y)

Data type of x

Returns the value of x with respect to the first row in the window frame.

LAST_VALUE(x) OVER (y)

Data type of x

Returns the value of x with respect to the last row in the window frame.

Section 14

SQL Nested Data Functions

This section contains descriptions of the functions supported by Drill for manipulating arrays and nested data.

Nested data function

Data type of result

Definition

FLATTEN(x)

Data type of the elements of x

Separates the elements in a repeated field x into individual rows; see Section 5.

KVGEN(x)

VARCHAR

Transforms a map x with a set of keys into an array of key-value pairs; see Section 6.

REPEATED_COUNT(x)

INTEGER

Counts the number of values in x. X must be an array. See also Section 5.

REPEATED_CONTAINS(x,y)

BOOLEAN

Determines if the value y appears in the array x. Y may contain the following regular expression wildcards: asterisk (*), period (.), question mark (?), square bracketed ranges [a-z], square bracketed characters [ch], and negated square bracketed ranges or characters [!ch]. See also Section 5.
