Apache Drill is very efficient and fast, until you try to use it on a single huge file (a few GB, say) or attempt to query a complex data structure with nested data. That is exactly what I am trying to do right now: querying large segments of data with a dynamic structure and a nested schema.
I can construct a Parquet data source from a nested array, as below:
create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Here I am giving the indices of the array.
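For context, the relevant part of camic.json follows the standard GeoJSON layout for a geometry object; a minimal sketch (the coordinate values here are made up for illustration) could look like this:

```json
{
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [101.0, 0.0],
        [102.0, 0.0],
        [102.0, 1.0],
        [101.0, 0.0]
      ]
    ]
  }
}
```

Note that `coordinates` is a three-dimensional array: an array of rings, each ring an array of positions, each position an array of numbers.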
Then I can query the data efficiently. For example,
select * from dfs.tmp.camic;
However, giving the indices won't work for my needs, as I don't need just the first element. Rather, I need all of the elements of a large and dynamic array, representing the coordinates of the GeoJSON.
$ create table dfs.tmp.camic as ( select camic.geometry.coordinates as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);

Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]
(java.lang.UnsupportedOperationException) Unsupported type LIST
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.getType():225
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():187
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
    org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():744
(state=,code=0)
Here, I am trying to query a multi-dimensional array, which is not straightforward.
(I set the error messages to be verbose using SET `exec.errors.verbose` = true;)
The commonly suggested options to query multi-dimensional arrays are:
1. Using the array indexes in the select query: this is impractical, as I do not know how many elements the coordinates of this GeoJSON will have. It may be millions, or as few as 3.
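To illustrate why index-based selection does not scale, every scalar would have to be projected by its explicit indices; a sketch (the column aliases are my own) of what that quickly becomes:

```sql
-- Index down to scalars so the Parquet writer never sees a LIST type.
create table dfs.tmp.camic as (
  select
    camic.geometry.coordinates[0][0][0] as x_0,
    camic.geometry.coordinates[0][0][1] as y_0,
    camic.geometry.coordinates[0][1][0] as x_1,
    camic.geometry.coordinates[0][1][1] as y_1
    -- ...and so on, for an unknown number of points
  from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic
);
```

With potentially millions of points per geometry, enumerating columns like this is clearly not viable.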
2. The FLATTEN keyword: I am using Drill on top of MongoDB, and have found an interesting case where Drill executes certain queries faster in a distributed execution than Mongo alone does. Using FLATTEN basically kills all the performance benefits I otherwise have with Drill. FLATTEN is simply too expensive an operation at the scale of my data (around 48 GB, though I can split it into files of a few GB each).
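For the record, flattening a multi-dimensional array means nesting FLATTEN once per dimension; a sketch of what that query would look like in my case:

```sql
-- One FLATTEN per array dimension; each level multiplies the row count,
-- which is what makes this so expensive at scale.
select flatten(flatten(camic.geometry.coordinates)) as coordinate
from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic;
```

Each FLATTEN expands one level of nesting into rows, so a file with millions of coordinate positions explodes into millions of rows before any useful work is done.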
This is a known limitation of Drill. However, it significantly reduces Drill's usability, as the proposed workarounds are either impractical or inefficient.