Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Apache Drill and the Lack of Support for Nested Arrays

DZone's Guide to

Apache Drill and the Lack of Support for Nested Arrays

Nested Arrays and multidimensional arrays do not work well in Apache Drill; other alternatives are available.

Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

Apache Drill is very efficient and fast, till you try to use it with a huge chunk of one file (such as a few GB) or if you attempt to query a complex data structure with nested data. Now, this is what I am trying to do right now — attempting to query large segments of data with a dynamic structure and nested schema.

I may construct a parquet data source from a nested array, as below,  

create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);

Here I am giving the indices of the array. 

Then I can query the data efficiently. For example,  

select * from dfs.tmp.camic;

However, giving the indices won't work as I need, as I don't just need the first element. Rather I need the entire elements - in a large and dynamic array, representing the coordinates of geojson.

$ create table dfs.tmp.camic as ( select camic.geometry.coordinates[0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
Fragment 0:0
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]

  (java.lang.UnsupportedOperationException) Unsupported type LIST
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.getType():225
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():187
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
    org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():744 (state=,code=0)

Here, I am trying to query a multi-dimensional array, which is not straight-forward.

(I set the error messages to be verbose using  SET `exec.errors.verbose` = true;
 above).

The commonly suggested options to query multi-dimensional arrays are:

  1. Using the array indexes in the select query: This is impractical. I do not know how many elements I would have in this geojson — the coordinates. It may be millions or as low as 3.

  2. Flatten keyword: I am using Drill on top of Mongo — and finding an interesting case where Drill outperforms certain queries in a distributed execution than just using Mongo. Using Flatten basically kills all the performance benefits I have with Drill otherwise. Flatten is just plain expensive operation for the scale of my data (around 48 GB. But I can split them into a few GB each).

This is a known limitation of Drill. However, this significantly reduces its usability, as the proposed workarounds are either impractical or inefficient.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
schema ,large ,array ,elements ,drill ,dynamic ,apache drill

Published at DZone with permission of Pradeeban Kathiravelu, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}