
Apache Drill and the Lack of Support for Nested Arrays


Nested and multidimensional arrays do not work well in Apache Drill; other alternatives are available.


Apache Drill is very efficient and fast, until you try to use it with a single large file (a few GB, for example) or attempt to query a complex data structure with nested data. That is exactly what I am trying to do right now: query large segments of data with a dynamic structure and a nested schema.

I can construct a Parquet data source from a nested array, as below:

create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);

Here I am giving the indices of the array. 

Then I can query the data efficiently. For example:

select * from dfs.tmp.camic;

However, giving the indices does not do what I need, because I do not want just the first element. I need all the elements of a large, dynamic array that represents the coordinates of a GeoJSON document.

$ create table dfs.tmp.camic as ( select camic.geometry.coordinates[0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
Fragment 0:0
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]

  (java.lang.UnsupportedOperationException) Unsupported type LIST
    java.lang.Thread.run():744 (state=,code=0)

Here, I am trying to query a multi-dimensional array, which is not straightforward.

(I set the error messages to be verbose using SET `exec.errors.verbose` = true;)
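The difference between the two queries comes down to how many array levels remain after indexing. The following is a minimal Python sketch, not Drill code, that assumes the document holds a GeoJSON-style Polygon (an array of rings, each ring an array of [x, y] points). The sample record and the `nesting_depth` helper are hypothetical and exist only for illustration; the real layout of camic.json may differ.

```python
# Hypothetical GeoJSON-style record; the actual camic.json layout is an assumption.
record = {
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 0.0]],
        ],
    }
}


def nesting_depth(value):
    """Count how many list levels wrap the scalars in value."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth


coords = record["geometry"]["coordinates"]
first_point = coords[0][0]  # what coordinates[0][0] selects: one [x, y] point
first_ring = coords[0]      # what coordinates[0] selects: a list of points

print(nesting_depth(first_point))  # 1: a flat array of numbers
print(nesting_depth(first_ring))   # 2: an array of arrays
```

Under this assumption, `coordinates[0][0]` leaves a flat array of numbers, while `coordinates[0]` is still an array of arrays, which matches the case where the table creation failed with "Unsupported type LIST".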

The commonly suggested options to query multi-dimensional arrays are:

  1. Using the array indexes in the select query: This is impractical. I do not know how many coordinate elements a given GeoJSON document will contain. There may be millions, or as few as 3.

  2. Flatten keyword: I am using Drill on top of Mongo, and I have found an interesting case where Drill, in a distributed execution, outperforms Mongo alone on certain queries. Using Flatten essentially kills the performance benefits I otherwise get from Drill. Flatten is simply an expensive operation at the scale of my data (around 48 GB, though I can split it into files of a few GB each).

This is a known limitation of Drill. However, this significantly reduces its usability, as the proposed workarounds are either impractical or inefficient.
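To make the cost of the Flatten option more concrete, here is a small Python sketch that mimics the documented behavior of Drill's FLATTEN (one output row per array element, with the other columns repeated). It is not Drill's implementation; the `flatten_rows` helper and the sample rows are hypothetical.

```python
def flatten_rows(rows, column):
    """Emit one output row per element of row[column], copying the other columns."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # repeat every other column
            new_row[column] = element  # replace the array with a single element
            out.append(new_row)
    return out


# Hypothetical input: two documents, each holding one ring of four points.
rows = [
    {"id": 1, "ring": [[0, 0], [4, 0], [4, 3], [0, 0]]},
    {"id": 2, "ring": [[1, 1], [2, 1], [2, 2], [1, 1]]},
]

flat = flatten_rows(rows, "ring")
print(len(rows), "->", len(flat))  # 2 -> 8
```

The output row count grows with the total number of array elements rather than the number of documents, so arrays with millions of coordinates multiply the volume of data that later query stages must process.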


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
