SQL Syntax for Apache Drill

Section 1

Introduction

This Refcard describes version 1.2 of Apache Drill released in October 2015. Apache Drill is an open-source SQL-on-Everything engine. It allows SQL queries to be executed on any kind of data source, ranging from simple CSV files to advanced SQL and NoSQL database servers.

Drill has three distinguishing features:

Drill can access flat, relational data structures as well as data sources with non-relational structures, such as arrays, hierarchies, maps, nested tables, and complex data types. Besides being able to access these data sources, Drill can run queries that join data across multiple data sources, including non-relational and relational ones.
Drill allows for schema-less access of data. In other words, it doesn’t need access to the schema definition of the data source. It doesn’t need to know the structure of the tables in advance, nor does it need statistical data. It goes straight for the data. The schema of the query result is therefore not known in advance. It’s built up and derived when data comes back from the data source. During the processing of all the data, the schema of the query result is continuously adapted. For example, the schema of a Hadoop file (or any data source) doesn’t have to be documented in Apache Hive to make it accessible for Drill.
As with most SQL database servers, Drill doesn’t have its own database. It has been designed and optimized to access other data sources. There is rarely ever a need to copy data from a data source to Hadoop to make it accessible by Drill.

Apache Drill's Supported Plugins for Data Access

CSV and TSV files
Hadoop files with Parquet and AVRO file formats, including Amazon S3
NoSQL databases, such as MongoDB and Apache HBase
SQL database servers, such as MySQL and Oracle, through an ODBC/JDBC interface
Files with JSON or BSON data structures
SQL-on-Hadoop engines, such as Apache Hive and Impala

Note: This Refcard uses several example files (CSV and JSON files) to demonstrate several of Drill's capabilities.

Section 2

Installing and Starting Up Drill

The following webpage contains detailed descriptions on how to install Drill on various platforms: https://drill.apache.org/docs/install-drill-introduction/.

After installation of the software, Drill can be started by using the command-line tool SQLLine:

sqlline.bat -u "jdbc:drill:"

Depending on the platform, the result looks something like this:

DRILL_ARGS - " -u jdbc:drill:"
Calculating Drill classpath...
oct 26, 2015 9:33:46 AM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.2.0
"json ain't no thang"
0: jdbc:drill:>

Drill is now ready to query data. To check if it really works, use the following query:

VALUES(CURRENT_DATE);

Result:

+---------------+
| CURRENT_DATE  |
+---------------+
| 2015-10-28    |
+---------------+

To stop Drill enter:

!quit

Section 3

First SQL Queries

Create a simple text file with the following text and call it HelloWorld.csv:

HelloWorld!

Query the file:

SELECT * FROM dfs.`\MyDirectory\HelloWorld.csv`;

Result:

+------------------+
|     columns      |
+------------------+
| ["HelloWorld!"]  |
+------------------+

The unique thing about this query is its use of the FROM clause. Instead of a table name, it contains a reference to the file to be accessed. The term dfs represents one of the supported storage plugins. This particular plugin indicates that a file in the local file system is accessed. The storage plugin is followed by a file specification containing the correct directory and the file name. Note that the file specification must be specified in between backticks and not single quotes. The file contains only one line with data, so only one row is returned. Because no column names are specified, the column name columns is used.

Create a file with the following content and call it "Employees.json":

{ "number"  : "6",
  "name"    : "Manzarek",
  "initials": "R",
  "street"  : "Haseltine Lane",
  "town"    : "Phoenix" }
{ "number"  : "8",
  "name"    : "Young",
  "initials": "N",
  "street"  : "Brownstreet",
  "mobile"  : "1234567" }
{ "number"  : "15",
  "name"    : "Metheny",
  "initials": "M",
  "province": "South"
}

Query the file:

SELECT * FROM dfs.`\MyDirectory\Employees.json`;

The JSON data structure is turned into a flat table, and for the missing values in specific columns, the null value is presented. Result:

+---------+-----------+-----------+-----------------+----------+----------+-----------+
| number  |   name    | initials  |     street      |   town   |  mobile  | province  |
+---------+-----------+-----------+-----------------+----------+----------+-----------+
| 6       | Manzarek  | R         | Haseltine Lane  | Phoenix  | null     | null      |
| 8       | Young     | N         | Brownstreet     | null     | 1234567  | null      |
| 15      | Metheny   | M         | null            | null     | null     | South     |
+---------+-----------+-----------+-----------------+----------+----------+-----------+

Section 4

Querying Nested Data Structures

Many data sources, such as MongoDB, Hadoop with AVRO, and JSON files, contain nested data structures. In relational terminology they would be called columns within columns. To address these nested columns, a specific syntax is introduced. Several examples are used to illustrate this syntax.

Create a file with the following content and call it "EmployeesNested.json":

{ "employee" : {
   "number"  : "6",
   "name"    : { "lastname": "Manzarek",
                 "initials": "R" },
   "address" : { "street"  : "Haseltine Lane",
                 "houseno" : "80",
                 "postcode": "1234KK",
                 "town"    : "Stratford" } }
}
{ "employee" : {
   "number"  : "8",
   "name"    : { "lastname": "Young",
                 "initials": "N" },
   "address" : { "street"  : "Brownstreet",
                 "houseno" : "80",
                 "province": "ZH",
                 "town"    : "Boston" } }
}
{ "employee" : {
   "number"  : "15",
   "name"    : { "lastname": "Metheny",
                 "initials": "M",
                 "code"    : "45" } }
}

Query the file:

SELECT emp.employee.number AS enr, emp.employee.name.lastname AS lastname
FROM   dfs.`\MyDirectory\EmployeesNested.json` AS emp;

Result:

+------+-----------+
| enr  | lastname  |
+------+-----------+
| 6    | Manzarek  |
| 8    | Young     |
| 15   | Metheny   |
+------+-----------+

To refer to the employee numbers the specification emp.employee.number is used. Emp refers to the table name specified in the FROM clause. Specifying this table name is required when dealing with nested tables, otherwise Drill may confuse columns with tables and therefore won’t run the query. Next, employee refers to the first level in the JSON document and number to the next level. To get the lastname of an employee, employee must be specified, followed by name and lastname. Assigning column names to these two result columns is not required, but does make the result easier to read.

Section 5

Querying Arrays

Many data sources contain arrays (sometimes called repeating groups). Drill supports the flatten function to transform these arrays into flat, relational data structures. Create a file with the following content (each employee can work for several projects) and call it EmployeesArrays.json:

{ "number"  : "8",
  "projects": [ "ACP3", "FGTR" ]
}
{ "number"  : "15",
  "projects": [ "ACP3", "HHGT", "X456" ]
}

Query the file:

SELECT * FROM dfs.`\MyDirectory\EmployeesArrays.json`;

Result:

+---------+-------------------------+
| number  |        projects         |
+---------+-------------------------+
| 8       | ["ACP3","FGTR"]         |
| 15      | ["ACP3","HHGT","X456"]  |
+---------+-------------------------+

In each row (so for each employee), the projects column contains a set of project values. To see them as separate values, use the flatten function:

SELECT FLATTEN(projects) AS project, enr
FROM   dfs.`\MyDirectory\EnployeesArrays.json`;

In the result each row contains a separate project value and the number of the employee to which the project value belongs:

+----------+------+
| project  | enr  |
+----------+------+
| ACP3     | 8    |
| FGTR     | 8    |
| ACP3     | 15   |
| HHGT     | 15   |
| X456     | 15   |
+----------+------+

Besides flatten, Drill supports two additional functions to work with the contents of arrays: repeated_count (counts the number of values in an array) and repeated_contains (searches for a specific value in an array):

SELECT number AS enr, REPEATED_COUNTS(projects) AS NumberOfProjects,
       REPEATED_CONTAINS(projects,'FGTR') AS ContainsFGTR
FROM   dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+------+-------------------+---------------+
| enr  | NumberOfProjects  | ContainsFGTR  |
+------+-------------------+---------------+
| 8    | 2                 | true          |
| 15   | 3                 | false         |
+------+-------------------+---------------+

Individual values can be retrieved from an array:

SELECT number, projects[0], projects[1], projects[2]
FROM   dfs.`\MyDirectory\EmployeesArrays.json` AS emp;

Result:

+---------+---------+---------+---------+
| number  | EXPR$1  | EXPR$2  | EXPR$3  |
+---------+---------+---------+---------+
| 8       | ACP3    | FGTR    | null    |
| 15      | ACP3    | HHGT    | X456    |
+---------+---------+---------+---------+

Section 6

Querying Maps With Data

The kvgen function (Key-Value Generation) transforms query maps that contain keys instead of a schema to arrays. Create a file with the following content and call it "CarOwners.json". This file contains the number of cars that each car owner owns by each manufacturer. The element cars doesn’t have a schema, but instead contains a set of keys (car manufacturer) and each key has a value (number of cars of that particular manufacturer):

{ "owner" : "1",
  "cars"  : { "Ford" : 3,
              "BMW"  : 2,
              "Ferrari" : 1 }
}
{ "owner" : "2",
  "cars"  : { "BMW" : 4,
              "GM"  : 5 }
}

Query the file:

SELECT KVGEN(cars) AS cars FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+---------------------------------------------------------------------------------+
|                                      cars                                       |
+---------------------------------------------------------------------------------+
| [{"key":"Ford","value":3},{"key":"BMW","value":2},{"key":"Ferrari","value":1}]  |
| [{"key":"BMW","value":4},{"key":"GM","value":5}]                                |
+---------------------------------------------------------------------------------+

The effect of the kvgen function is that the car data inside the cars element is transformed to an array with two elements: key and value. The result of this function can subsequently be processed by the flatten function to turn each element combination into a separate row:

SELECT owner, FLATTEN(KVGEN(cars)) AS cars FROM dfs.`\MyDirectory\CarOwners.json`;

Result:

+--------+------------------------------+
| owner  |             cars             |
+--------+------------------------------+
| 1      | {"key":"Ford","value":3}     |
| 1      | {"key":"BMW","value":2}      |
| 1      | {"key":"Ferrari","value":1}  |
| 2      | {"key":"BMW","value":4}      |
| 2      | {"key":"GM","value":5}       |
+--------+------------------------------+

Section 7

Combining Multiple Functions in Queries

Apache Drill supports all the query features to be expected from a SQL product. The next example shows how the special functions can be combined with more traditional joins and window functions. Create a file with the following content and call it "EmployeesProjects.json". Each line in this file indicates how many hours an employee has worked on a project on a specific day.

{"enr": "8", "project":"ACP3", "date":"2015-10-01", "hours":"4"}
{"enr": "8", "project":"ACP3", "date":"2015-10-04", "hours":"5"}
{"enr": "8", "project":"FGTR", "date":"2015-10-02", "hours":"2"}
{"enr":"15", "project":"ACP3", "date":"2015-10-01", "hours":"7"}
{"enr":"15", "project":"ACP3", "date":"2015-10-03", "hours":"5"}
{"enr":"15", "project":"HHGT", "date":"2015-10-01", "hours":"4"}
{"enr":"15", "project":"HHGT", "date":"2015-10-05", "hours":"2"}
{"enr":"15", "project":"HHGT", "date":"2015-10-07", "hours":"8"}
{"enr":"15", "project":"X456", "date":"2015-10-01", "hours":"6"}

Join this file with the EmployeesNested file:

SELECT CAST(proj.`date` AS DATE) AS pdate, emp.employee.number AS enr,
       emp.employee.name.lastname AS ename, proj.project, proj.hours,
       SUM(CAST(proj.hours AS INTEGER))
          OVER(PARTITION BY proj.`date`, emp.employee.number) AS sum_hours
FROM   dfs.`\Prive\ARTIKEL\DZone\Examples\EmployeesNested.json` AS emp LEFT OUTER JOIN
       dfs.`\Prive\ARTIKEL\DZone\Examples\EmployeesProjects.json` AS proj
       ON CAST(emp.employee.number AS INTEGER) = CAST(proj.enr AS INTEGER)
ORDER  BY 1, 2;

Result:

+-------------+------+-----------+----------+--------+------------+
|    pdate    | enr  |   ename   | project  | hours  | sum_hours  |
+-------------+------+-----------+----------+--------+------------+
| 2015-10-01  | 15   | Metheny   | ACP3     | 7      | 17         |
| 2015-10-01  | 15   | Metheny   | HHGT     | 4      | 17         |
| 2015-10-01  | 15   | Metheny   | X456     | 6      | 17         |
| 2015-10-01  | 8    | Young     | ACP3     | 4      | 4          |
| 2015-10-02  | 8    | Young     | FGTR     | 2      | 2          |
| 2015-10-03  | 15   | Metheny   | ACP3     | 5      | 5          |
| 2015-10-04  | 8    | Young     | ACP3     | 5      | 5          |
| 2015-10-05  | 15   | Metheny   | HHGT     | 2      | 2          |
| 2015-10-07  | 15   | Metheny   | HHGT     | 8      | 8          |
| null        | 6    | Manzarek  | null     | null   | null       |
+-------------+------+-----------+----------+--------+------------+

Section 8

Definitions of the SQL Statements for Querying Data Sources

This section contains the definitions of the SQL statements supported by Drill related to querying.

Query Statement	Definition
SELECT	<select statement> ::= [ WITH <table name> [ ( <column name> [ , <column name> ]... ) ] AS ( <table expression> ) ] <table expression> <table expression> ::= SELECT <select clause> FROM <from clause> [ WHERE <boolean expression> ] [ GROUP BY <expression> [ , <expression> ] ] [ HAVING <boolean expression> ] [ ORDER BY <clause> ] [ LIMIT { <count> \| ALL } ] [ OFFSET <number> { ROW \| ROWS } ] <from clause> ::= <table reference> [ , <table reference> ]... <table reference> ::= { <table specification> \| <join specification> \| ( <table expression> ) \| VALUES ( <expression list>) [ , ( <expression list> ) ]... [ [ AS ] <alias name> [ ( <column name> [ , <column name> ]... ) ] ] <table specification> ::= <table name> \| <storage plugin> . `<workspace>` <join specification> ::= <table reference> <join type> <table reference> [ <join condition> ] <join condition: ::= ON <condition> <join type> ::= [INNER] JOIN \| LEFT [OUTER] JOIN \| RIGHT [OUTER] JOIN \| FULL [OUTER] JOIN <boolean expression> ::= <boolean expression { AND \| OR } <boolean expression> \| NOT <boolean expression> \| <case expression> \| <expression> { < \| > \| <= \| >= \| = \| <> } <expression> \| <expression> [ NOT ] BETWEEN <expression> and <expression> \| <expression> IN ( <expression> [ , <expression> ]... ) \| <expression> { LIKE \| ILIKE \| NOT LIKE \| SIMILAR TO \| NOT SIMILAR TO } <expression> \| <expression> ( IS [ NOT ] NULL \| IS [ NOT ] FALSE \| IS [ NOT ] TRUE } \| <expression> { IN \| ANY \| ALL } ( <table expression> ) \| EXISTS ( <table expression> ) \| <expression> \|\| <expression> <case expression> ::= CASE WHEN <expression> [ , <expression> [ ... ] ] THEN <statements> [ WHEN <expression> [ , <expression> [ ... ] ] THEN <statements> ]... [ ELSE <statements> ] END
VALUES	`<values statement> ::= VALUES ( <expression list>) [ , ( <expression list> ) ]... <expression list> ::= <expression> [ , <expression> ]...`
EXPLAIN	`EXPLAIN PLAN [ INCLUDING ALL ATTRIBUTES] [ WITH IMPLEMENTATION \| WITHOUT IMPLEMENTATION ] FOR <select statement>`

Section 9

Definitions of Data Definition Statements

This section contains the definitions of the data definition statements supported by Drill.

SQL Statement	Definition
CREATE TABLE	`CREATE TABLE name [ ( <column list> ) ] [ PARTITION BY ( <column_name> [ , ... ] ) ] AS <select statement>`
CREATE VIEW	`CREATE [OR REPLACE] VIEW [ <workspace> . ] <view name> [ ( <column name> [ , ... ] ) ] AS <select statement>`
DROP TABLE	`DROP TABLE [ <workspace> . ] <table name>`
DROP VIEW	`DROP VIEW [ <workspace> . ] <view name>`
DESCRIBE	`DESCRIBE [ <workspace> . ] { <table name> \| <view name> }`
SHOW DATABASES	`SHOW DATABASES`
SHOW SCHEMAS	`SHOW SCHEMAS` Shows all the schemas that can be used.
SHOW FILES	`SHOW FILES [ FROM <filesystem> . <director name> \| IN <filesystem> . <directory name> ]`
SHOW TABLES	`SHOW TABLES`

Section 10

Definitions of SQL Statements for Working with Schemas and Sessions

This section contains the definitions of the statements supported by Drill for working with schemas and sessions.

SQL Statement	Definition
USE	`USE <schema name>`
ALTER SESSION	ALTER SESSION SET `<option name>` = <value>
ALTER SYSTEM	ALTER SYSTEM SET `<option name>` = <value>

Section 11

Data Types

The following data types are supported by Drill and can be used when converting the data types of values.

Data types	Description
BIGINT	8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 9223372036854775807.
BINARY	Variable-length byte string B@e6d9eb7.
BOOLEAN	True or false true.
DATE	Years, months, and days in YYYY-MM-DD format since 4713 BC 2015-12-30.
DECIMAL(p,s), DEC(p,s), or NUMERIC(p,s)*	38-digit precision number, precision is p, and scale is s DECIMAL(6,2) is 1234.56, 4 digits before and 2 digits after the decimal point.
FLOAT	4-byte floating point number 0.456.
DOUBLE or DOUBLE PRECISION	8-byte floating point number, precision-scalable 0.456.
INTEGER or INT	4-byte signed integer in the range -2,147,483,648 to 2,147,483,647 2147483646.
INTERVAL	A day-time or year-month interval '1 10:20:30.123' (day-time) or '1-2' year to month (year-month). Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR.
SMALLINT	2-byte signed integer in the range -32,768 to 32,767 32000. This data type is not supported in version 1.2.
TIME	24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss 22:55:55.23.
TIMESTAMP	JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS 2015-12-30 22:55:55.23.
CHARACTER VARYING, CHARACTER, CHAR, or VARCHAR	UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647. CHAR(30) casts data to a 30-character string maximum.

Section 12

SQL Scalar Functions

This section contains descriptions of all the scalar functions supported by Drill that can be used in any expression.

The processing logic of all the scalar functions can easily be tested by using the VALUES statement. For example, with the following statement the CBRT function can be tested:

VALUES(CBRT(64));

Result:

+---------+
| EXPR$0  |
+---------+
| 4.0     |
+---------+

Numeric function	Data type of result	Definition
ABS(x)	Data type of x	Returns the absolute value of the input argument x.
CBRT(x)	FLOAT8	Returns the cubic root of x.
CEIL(x) and CEILING(x)	Data type of x	Returns the smallest integer not less than x.
DEGREES(x)	FLOAT8	Converts x radians to degrees.
E()	FLOAT8	Returns 2.718281828459045.
EXP(x)	FLOAT8	Returns e (Euler's number) to the power of x.
FLOOR(x)	Data type of x	Returns the largest integer not greater than x.
LOG(x)	FLOAT8	Returns the natural log (base e) of x.
LOG(x, y)	FLOAT8	Returns log base x to the y power.
LOG10(x)	FLOAT8	Returns the common log of x.
LSHIFT(x, y)	Data type of x	Shifts the binary x by y times to the left.
MOD(x, y)	FLOAT8	Returns the remainder of x divided by y.
NEGATIVE(x)	Data type of x	Returns x as a negative number.
PI	FLOAT8	Returns pi.
POW(x, y)	FLOAT8	Returns the value of x to the y power.
RADIANS	FLOAT8	Converts x degrees to radians.
RAND	FLOAT8	Returns a random number from 0-1.
ROUND(x)	Data type of x	Rounds to the nearest integer.
ROUND(x, y)	DECIMAL	Rounds x to y decimal places.
RSHIFT(x, y)	Data type of x	Shifts the binary x by y times to the right.
SIGN(x)	INT	Returns the sign of x.
SQRT(x)	Data type of x	Returns the square root of x.
TRUNC(x [ , y ] )	Data type of x	Truncates x to y decimal places. Y is optional. Default is 1.
TRUNC(x, y)	DECIMAL	Truncates x to y decimal places.

Date/time function	Data type of result	Definition
AGE(x [, y ] )	INTERVALDAY or INTERVALYEAR	Returns interval between two timestamps or subtracts a timestamp from midnight of the current date.
CURRENT_DATE	DATE	Returns the current date.
CURRENT_TIME	TIME	Returns the current time.
CURRENT_TIMESTAMP	TIMESTAMP	Returns the current timestamp.
DATE_ADD(x,y)	DATE, TIMESTAMP	Returns the sum of the sum of a date/time and a number of days/hours, or of a date/time and date/time interval. X must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be added. In other situations an interval must be used. Use the CAST function to create an interval.
DATE_PART(x,y)	DOUBLE	Extracts a time unit from the value of a time, date, or timestamp expression (y). Allowed time units for x are second, minute, hour, day, month, and year. A time unit must be specified between single quotes. Y represents a date, time, or timestamp.
DATE_SUB(x,y)	DATE, TIMESTAMP	Subtracts an interval (y) from a date or timestamp expression (x). X must be a date, time, or timestamp expression, and y must be an integer or an interval expression. If y is an integer then x must be a date and y represents the number of days to be subtracted. In other situations an interval must be used. Use the CAST function to create an interval.
EXTRACT(x FROM y)	DOUBLE	Extracts a time unit from a date or timestamp expression (y). X indicates which time unit to extract. This must be one of the following values: SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR. The time unit must not be specified between quotes.
LOCALTIME	TIME	Returns the local current time.
LOCALTIMESTAMP	TIMESTAMP	Returns the local current timestamp.
NOW()	TIMESTAMP	Returns the current timestamp.
TIMEOFDAY()	VARCHAR	Returns the current timestamp for the UTC time zone.
UNIX_TIMESTAMP( [ x] )	BIGINT	If no x is specified, it returns the number of seconds since the UNIX epoch (January 1, 1970 at 00:00:00). If x is specified it must be a timestamp and then the number of seconds since the UNIX epoch and the timestamp x is returned.

String function	Data type of result	Definition
BYTE_SUBSTR(x,y [, z ] )	BINARY or VARCHAR	Returns in binary format a substring y of the string x.
CHAR_LENGTH(x)	INTEGER	Returns the length of the alphanumeric argument x.
CONCAT(x,y)	VARCHAR	Combines the two alphanumeric values x and y. Has the same effect as the \|\| operator.
INITCAP(x)	VARCHAR	Returns x in which the first character is capitalized.
LENGTH(x)	INTEGER	Returns the length in bytes of the alphanumeric value x.
LOWER(x)	VARCHAR	Converts all upper-case letters of x to lower-case letters.
LPAD(x,y [ , z ] )	VARCHAR	The value of x is filled in the front (the left-hand side) with the value of z just until the total length of the value is equal y’s length. If the maximum length is smaller than that of x, x is shortened on the left side. If no z is specified, blanks are used.
LTRIM(x)	VARCHAR	Removes all blanks that appear at the beginning of x.
POSITION( x IN y)	INTEGER	Returns the start position of the string x in the string y.
REGEXP_REPLACE(x,y,x)	VARCHAR	Substitutes new text for substrings that match Java regular expression patterns. In the string x, y is replaced by z. Y is the regular expression.
RPAD(x,y,z)	VARCHAR	The value of x is filled in the front (the right-hand side) with the value of z just until the total length of the value is equal to that of y. If the maximum length is smaller than that of x, x is shortened on the right side.
RTRIM(x)	VARCHAR	Removes all blanks from the end of the value of x.
STRPOS(x,y)	INTEGER	Returns the start position of the string y in the string x.
SUBSTR(x,y,z)	VARCHAR	Extracts characters from position 1 - x of x an optional y times.
TRIM(x)	VARCHAR	Removes all blanks from the start and from the end of x. Blanks in the middle are not removed.
UPPER(x)	VARCHAR	Converts all lower-case letters of x to upper-case letters.

Data type conversion function	Data type of result	Definition
CAST(x AS y)	Data type of y	Converts the data type of x to y. Y must be one of the supported data types; see Section 12.
CONVERT_TO(x,y)	Data type of y	Converts binary data (x) to Drill internal types (y) based on the little or big endian encoding of the data.
CONVERT_FROM(x,y)	Data type of y	Converts binary data (x) from Drill internal types (y) based on the little or big endian encoding of the data.

Null handling function	Data type of result	Definition
COALESCE(x, y [ , y ]... )	Data type of y	Returns the first non-null argument in the list of y’s.
NULLIF(x,y )	Data type of x	Returns the value of the x if x and y are not equal, and returns a null value if x and y are equal.

Section 13

SQL Aggregate and Window Functions

This section contains descriptions of all the aggregate and window functions supported by Drill that can be used in any expression.

Aggregate function	Data type of result	Definition
AVG(x)	DECIMAL for any integer-type argument, DOUBLE for a floating-point argument, otherwise the data type of x.	Calculates the weighted average of all values in x.
COUNT(*)	BIGINT	Counts the number of rows.
COUNT([DISTINCT] x)	BIGINT	Counts the number of non-null values in x, and if DISTINCT is specified the number of different non-null values in x.
MAX(x)	Data type of x.	Determines the maximum value of all non-null values in x.
MIN(x)	Data type of x.	Determines the minimum value of all non-null values in x.
SUM(x)	BIGINT for SMALLINT or INTEGER arguments, DECIMAL for BIGINT arguments, DOUBLE for floating-point arguments, otherwise the same as data type of x.	Calculates the sum of all values in x.

The syntax definition for all the window functions in the table below is as follows:

<window_function_name> ( [ ALL ] <expression> )
   OVER ( [ PARTITION BY <expression list> ] [ ORDER BY <order list> ] )

Window function	Data type of result	Definition
AVG(x) OVER (y)	DECIMAL for any integer-type argument, DOUBLE for a floating-point argument, otherwise the same as the argument data type	Returns the average value for the input expression values. It works with numeric values and ignores null values.
COUNT(x) OVER (y)	BIGINT	Counts the number of input rows. COUNT(*) counts all of the rows in the target table if they do or do not include nulls. COUNT(expression) computes the number of rows with non-NULL values in a specific column or expression.
COUNT( { * \| x } ) OVER (y)	BIGINT	Counts the number of input rows. COUNT(*) counts all of the rows in the target table if they do or do not include nulls. COUNT(x) computes the number of rows with non-NULL values in a specific column or expression.
MAX(x) OVER (y)	Data type of x	Determines the maximum value of all non-null values in x.
MIN(x) OVER (y)	Data type of x	Determines the minimum value of all non-null values in x.
SUM(x) OVER (y)	BIGINT for SMALLINT or INTEGER arguments, DECIMAL for BIGINT arguments, DOUBLE for floating-point arguments, otherwise the data type of x	Calculates the sum of all values in x.
CUME_DIST() OVER (y)	DOUBLE PRECISION	Calculates the relative rank of the current row within a window partition: (number of rows preceding or peer with current row) / (total rows in the window partition).
DENSE_RANK()OVER (y)	BIGINT	Determines the rank of a value in a group of values based on the ORDER BY expression and the OVER clause. Each value is ranked within its partition. Rows with equal values receive the same rank. There are no gaps in the sequence of ranked values if two or more rows have the same rank.
NTILE(x) OVER(y)	INTEGER	Divides the rows for each window partition, as equally as possible, into a specified number of ranked groups. The NTILE window function requires the ORDER BY clause in the OVER clause.
PERCENT_RANK() OVER(y)	DOUBLE PRECISION	Calculates the percent rank of the current row using the following formula: (x - 1) / (number of rows in window partition - 1) where x is the rank of the current row.
RANK() OVER(y)	BIGINT	Determines the rank of a value in a group of values. The ORDER BY expression in the OVER clause determines the value. Each value is ranked within its partition. Rows with equal values for the ranking criteria receive the same rank. Drill adds the number of tied rows to the tied rank to calculate the next rank and thus the ranks might not be consecutive numbers. For example, if two rows are ranked 1, the next rank is 3. The DENSE_RANK window function differs in that no gaps exist if two or more rows tie.
ROW_NUMBER() OVER(y)	BIGINT	Determines the ordinal number of the current row within its partition. The ORDER BY expression in the OVER clause determines the number. Each value is ordered within its partition. Rows with equal values for the ORDER BY expressions receive different row numbers non-deterministically.
LAG(x) OVER (y)	Data type of x	Returns the value for the row before the current row in a partition. If no row exists, null is returned.
LEAD(x) OVER (y)	Data type of x	Returns the value for the row after the current row in a partition. If no row exists, null is returned.
FIRST_VALUE(x) OVER (y)	Data type of x	Returns the value of x with respect to the first row in the window frame.
LAST_VALUE(x) OVER (y)	Data type of x	Returns the value of x with respect to the last row in the window frame.

Section 14

SQL Nested Data Functions

This section contains descriptions of the functions supported by Drill for manipulating arrays and nested data.

Nested data function	Data type of result	Definition
FLATTEN(x)	----	Separates the elements in a repeated field x into individual rows; see Section 6.
KVGEN(x)	VARCHAR	Returns a list of the keys that exist in x; see Section 7.
REPEATED_COUNT(x)	INTEGER	Counts the number of values in x. X must be an array. See also Section 6.
REPEATED_CONTAINS(x,y)	BOOLEAN	Determines if the value y appears in the array x. Y may contain the following regular expression wildcards: asterisk (*), period (.), question mark (?), square bracketed ranges [a-z], square bracketed characters [ch], and negated square bracketed ranges or characters [!ch]. See also Section 6.

SQL Syntax for Apache Drill

Using SQL for the SQL-on-Everything Engine

Introduction

Apache Drill's Supported Plugins for Data Access

Installing and Starting Up Drill

First SQL Queries

Querying Nested Data Structures

Querying Arrays

Querying Maps With Data

Combining Multiple Functions in Queries

Definitions of the SQL Statements for Querying Data Sources

Definitions of Data Definition Statements

Definitions of SQL Statements for Working with Schemas and Sessions

Data Types

SQL Scalar Functions

SQL Aggregate and Window Functions

SQL Nested Data Functions

{{ parent.title || parent.header.title}}

{{ parent.linkDescription }}

SQL Syntax for Apache Drill

Using SQL for the SQL-on-Everything Engine

Introduction

Apache Drill's Supported Plugins for Data Access

Installing and Starting Up Drill

First SQL Queries

Querying Nested Data Structures

Querying Arrays

Querying Maps With Data

Combining Multiple Functions in Queries

Definitions of the SQL Statements for Querying Data Sources

Definitions of Data Definition Statements

Definitions of SQL Statements for Working with Schemas and Sessions

Data Types

SQL Scalar Functions

SQL Aggregate and Window Functions

SQL Nested Data Functions

Like This Refcard? Read More From DZone

{{ parent.title || parent.header.title}}

{{ parent.linkDescription }}