Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Messing With the Data Schema to Make it Work With Drill

DZone's Guide to

Messing With the Data Schema to Make it Work With Drill

Knowing what schemas work with Apache Drill is difficult but important for performance and functionality. Some things from standard SQL/RDBMS are not supported, make sure you read the manuals. Map works!

Free Resource

Access NoSQL and Big Data through SQL using standard drivers (ODBC, JDBC, ADO.NET). Free Download 

Knowing what schemas work with Apache Drill is difficult but important for performance and functionality.  Some things from standard SQL/RDBMS are not supported, make sure you read the manuals.  Map works!

I must warn that this is not practical — you may not have the access or capacity to modify the schema of the data you want to query in the first place. Unless the data bulk was taken as a dump and queried a million times, there is no performance benefit in doing the below attempt. But for research purposes, why not? 

I had this multi-dimensional array in my data that was impossible to query with Drill due to its complex data schema. Before the readers pointing me out that it is indeed possible to query complex arrays, what I mean is, it is impossible to query with the same performance level, as we need to use flatten function which ruins the performance and output format. On the other hand, knowing the exact indices of arrays is impractical too.

I have this multi-dimensional array that makes it impossible for me to proceed:

"coordinates":[[ [.. , ..] , [.. , ..] , [.. , ..] ]]

Error message was:

Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST

Fragment 0:0

[Error Id: 5cc520ff-9594-4b9b-998d-20bf8569981b on llovizna:31010] (state=,code=0)

I had to transition the above into the below structure to make it work without any Drill functionality such as flatten.

"coordinates" : [ { "x" : .., "y" : .. }, { "x" : .., "y" : .. }, { "x" : .., "y" : .. } ]

0: jdbc:drill:zk=local> create table dfs.tmp.camic as select * from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/head2.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (1,392 seconds)

What I did essentially was to make the multi-dimensional array into a map.

0: jdbc:drill:zk=local> select * from dfs.tmp.camic;
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
| _id | type | parent_id | randval | creation_date | object_type | x | y | normalized | bbox | geometry | footprint | properties | provenance | submit_date |
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
| {"$oid":"56a784647b7b51c562"} | Feature | self | 0.3712421875 | 2026-11-16 01:17:13.101 | nucleus | 0.049646965 | 0.435353796 | true | [0.042729646965,0.85353796,0.8105608,0.7145562075] | {"type":"Polygon","coordinates":[{"x":0.04795445442,"y":0.87641187789917},{"x":0.0427805229,"y":0.87187789917}]} | 17.0 | {"scalar_features":[{"ns":"http://u24.bi.rk.eu/v1","nv":[{"name":"Hty","value":242.50489875},{"name":"ty","value":25.0},{"name":"Hty","value":-12.11},{"name":"ee","value":2.11}]}]} | {"image":{"case_id":"TC-2-00-01-01-T2","subject_id":"TC-02-000"},"analysis":{"execution_id":"ta-test","study_id":"tdma:::tue-jan-6-19:17:13-est-2011","source":"computer","computation":"segmentation"},"data_loader":"1.3"} | 2016-01-16 01:17:13.102 |
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
1 row selected (1,31 seconds)

Now, this works.

The fastest databases need the fastest drivers - learn how you can leverage CData Drivers for high performance NoSQL & Big Data Access.

Topics:
schema ,apache drill ,big data ,database

Published at DZone with permission of Pradeeban Kathiravelu, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}