This Refcard describes version 1.2 of Apache Drill, released in October 2015. Apache Drill is an open-source SQL-on-Everything engine. It allows SQL queries to be executed on any kind of data source, ranging from simple CSV files to advanced SQL and NoSQL database servers.
Drill has three distinguishing features:
- Drill can access flat, relational data structures as well as data sources with non-relational structures, such as arrays, hierarchies, maps, nested tables, and complex data types. Besides being able to access these data sources, Drill can run queries that join data across multiple data sources, including non-relational and relational ones.
- Drill allows for schema-less access to data. In other words, it doesn’t need access to the schema definition of the data source. It doesn’t need to know the structure of the tables in advance, nor does it need statistical data. It goes straight for the data. The schema of the query result is therefore not known in advance. It is derived as data comes back from the data source, and it is continuously adapted while the data is processed. For example, the schema of a Hadoop file (or any data source) doesn’t have to be documented in Apache Hive to make it accessible to Drill.
- Unlike most SQL database servers, Drill doesn’t have its own database. It has been designed and optimized to access other data sources. There is rarely a need to copy data from a data source to Hadoop to make it accessible to Drill.
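The schema-less behavior described above can be illustrated with a small sketch. This is plain Python, not Drill code, and it does not reflect Drill's internals: the `RECORDS` sample data and the `type_name` and `discover_schema` helpers are hypothetical names introduced only for illustration. The point is that no schema is declared up front; the result schema grows as records with new fields arrive.

```python
import json

# Hypothetical records, standing in for rows coming back from a data source.
RECORDS = [
    '{"id": 1, "name": "Ann"}',
    '{"id": 2, "name": "Bob", "tags": ["a", "b"]}',
    '{"id": 3, "address": {"city": "Oslo"}}',
]


def type_name(value):
    """Map a parsed JSON value to a coarse type label."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "map"
    return "null"


def discover_schema(lines):
    """Yield a snapshot of the schema after each record is read.

    A column is added the first time it appears; nothing is known in advance.
    """
    schema = {}
    for line in lines:
        record = json.loads(line)
        for column, value in record.items():
            schema.setdefault(column, type_name(value))
        yield dict(schema)


snapshots = list(discover_schema(RECORDS))
for number, snapshot in enumerate(snapshots, start=1):
    print(number, snapshot)
```

Each snapshot is at least as wide as the previous one: the second record adds an array column and the third adds a map column, mirroring how a result schema is adapted while data is still being read.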
Apache Drill's Supported Plugins for Data Access
- CSV and TSV files
- Hadoop files in Parquet and Avro formats, including files stored on Amazon S3
- NoSQL databases, such as MongoDB and Apache HBase
- SQL database servers, such as MySQL and Oracle, through an ODBC/JDBC interface
- Files with JSON or BSON data structures
- SQL-on-Hadoop engines, such as Apache Hive and Impala
Note: This Refcard uses a set of example CSV and JSON files to demonstrate several of Drill's capabilities.
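To make the idea of joining flat and nested data concrete, here is a minimal sketch in plain Python. It is not Drill code, and the data is made up: `CUSTOMERS_CSV`, `ORDERS_JSON`, and `join_customers_orders` are hypothetical stand-ins, not the Refcard's actual example files. It shows, in miniature, what a query has to do when it joins a flat CSV source with a JSON source that contains nested arrays: flatten the nested items and match them to the flat rows on a key.

```python
import csv
import io
import json

# Hypothetical sample data; not the Refcard's actual example files.
CUSTOMERS_CSV = "id,name\n1,Ann\n2,Bob\n3,Cy\n"
ORDERS_JSON = """
[
  {"customer_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
  {"customer_id": 2, "items": [{"sku": "C", "qty": 5}]},
  {"customer_id": 9, "items": [{"sku": "D", "qty": 1}]}
]
"""


def join_customers_orders(customers_csv, orders_json):
    """Inner-join flat CSV rows with nested JSON orders on the customer id."""
    # CSV values are always strings, so the lookup key is the id as text.
    customers = {
        row["id"]: row["name"]
        for row in csv.DictReader(io.StringIO(customers_csv))
    }
    rows = []
    for order in json.loads(orders_json):
        name = customers.get(str(order["customer_id"]))
        if name is None:
            continue  # inner join: skip orders with no matching customer
        # Flatten the nested array: one output row per order item.
        for item in order["items"]:
            rows.append((name, item["sku"], item["qty"]))
    return rows


joined = join_customers_orders(CUSTOMERS_CSV, ORDERS_JSON)
for row in joined:
    print(row)
```

With Drill, the same kind of result would come from a single SQL query over both sources rather than from hand-written code like this.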