
Apache Drill and the Lack of Support for Nested Arrays

Nested and multidimensional arrays do not work well in Apache Drill, and the commonly suggested workarounds are either impractical or inefficient.


Apache Drill is very efficient and fast until you use it on a single huge file (a few GB, say) or try to query a complex data structure with nested data. That is exactly what I am attempting right now: querying large segments of data with a dynamic structure and a nested schema.

I can construct a Parquet data source from a nested array, as shown below:

create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);

Here, I am hard-coding the indices of the array.
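
To make the indexing concrete, here is a minimal, hypothetical sketch of the kind of nested GeoJSON structure involved (the values are made up; the real camic.json is far larger):

{
  "geometry": {
    "type": "LineString",
    "coordinates": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
  }
}

With this structure, coordinates[0][0] picks out the scalar 0.1, the first component of the first coordinate pair, which Parquet can store without complaint.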

Then, I can query the data efficiently. For example:

select * from dfs.tmp.camic;
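
With the indices fixed, each row of the new table holds a single scalar value. Against the hypothetical document sketched above, the result would look something like this (hypothetical output):

+------------------+
| geo_coordinates  |
+------------------+
| 0.1              |
+------------------+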

However, hard-coding the indices won't work for my use case, as I don't need just the first element. Rather, I need all of the elements of a large, dynamic array representing the coordinates of a GeoJSON geometry:

$ create table dfs.tmp.camic as ( select camic.geometry.coordinates[0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
Fragment 0:0
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]

  (java.lang.UnsupportedOperationException) Unsupported type LIST
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.getType():225
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():187
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
    org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():744 (state=,code=0)

Here, I am trying to query a multidimensional array, which is not straightforward.

(I had set the error messages to be verbose using SET `exec.errors.verbose` = true; before running the query above.)

The commonly suggested workarounds for querying multidimensional arrays are:

  1. Using the array indices in the SELECT query: This is impractical. I do not know how many coordinate elements a given GeoJSON document will have; it may be as few as 3 or as many as millions.

  2. The FLATTEN keyword: I am using Drill on top of MongoDB, and I have found an interesting case where Drill's distributed execution outperforms plain Mongo for certain queries. Using FLATTEN basically kills all of the performance benefits I otherwise get from Drill; it is simply too expensive an operation at the scale of my data (around 48 GB, though I can split it into files of a few GB each). A sketch of this workaround follows the list.
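
For completeness, here is a sketch of what the FLATTEN workaround looks like; FLATTEN is a standard Drill function, but the query below is illustrative, reusing the camic.json path from above:

select flatten(camic.geometry.coordinates) as geo_coordinate from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic;

Each element of the coordinates array becomes its own row, so a document with millions of coordinates explodes into millions of rows. That row explosion is exactly what makes FLATTEN so expensive at this scale.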

This is a known limitation of Drill. However, it significantly reduces Drill's usability, as the proposed workarounds are either impractical or inefficient.

