Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

How to Use OSS LOCATION in Data Lake Analytics

DZone's Guide to

How to Use OSS LOCATION in Data Lake Analytics

Due to the size of data lakes, working with them can be tricky. In this post, we examine how to use a few handy tools for data lake analysis.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Alibaba Cloud Data Lake Analytics (DLA) enables you to directly query and analyze data stored in Object Storage Service (OSS) and Table Store instances by using standard SQL statements.

Before querying data, you must create a table in DLA based on the format and content of the data file. This tutorial takes a file stored in an OSS instance as an example to explain how to specify the location of a table.

OSS Location

In DLA, the syntax of a table creation statement is as follows:

CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
    [COMMENT table_comment]
    [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    [ROW FORMAT row_format]
    [STORE AS file_format]
        | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
    LOCATION oss_path

LOCATION can be set to the data file itself or to the directory where the data file is located.

If LOCATION Is Set to a Data File

When creating a table, you can directly associate the table with the data file. In this case, LOCATION must be set to the absolute path of the data file in the OSS instance.

For example:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv';

If LOCATION Is Set to a Directory

If LOCATION is set to a directory in the OSS instance, all the files under this directory are the data files of the table.

When creating a table, you can set recursive.directories to determine whether or not to traverse all data files under this directory. If recursive.directories is set to the default value or is not explicitly specified, DLA resolves that recursive.directories is false, that is, the directory is not traversed.

For example, the directory structure in the OSS instance is as follows:

2018-07-05 11:16:11 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv
2018-07-05 11:15:57 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/nation.csv
2018-07-05 11:16:17 1752.00B Standard oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv

The table creation statement is as follows:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
'oss://my-bucket/datasets/test/test_create/create_table/csv'
TBLPROPERTIES ('recursive.directories' = 'false');

If recursive.directories is set to false, the data file of this table is the nation.csv file under LOCATION:

oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv

The table creation statement is as follows:

CREATE EXTERNAL TABLE loc_file_csv(
    N_NATIONKEY INT,
    N_NAME STRING,
    N_REGIONKEY INT,
    N_COMMENT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
'oss://my-bucket/datasets/test/test_create/create_table/csv'
TBLPROPERTIES ('recursive.directories' = 'true');

If recursive.directories is set to true, the data files of this table are all the .csv files under LOCATION and all its subdirectories:

oss://my-bucket/datasets/test/test_create/create_table/csv/nation.csv
oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/nation.csv
oss://my-bucket/datasets/test/test_create/create_table/csv/dir1/dir2/nation.csv

Additional Notes

  1. If LOCATION is a directory, DLA determines by default that formats of all data files under this directory are the same, and parses files based on SERDE specified in the table creation statement. If the formats of data files under this directory are different, DLA encounters a parsing failure. Consequently, data in the table is incorrect and the query result is affected. For example, if STORED AS TEXTFILE is specified in the table creation statement, but both .csv and .orc files are available under the directory, no error is reported during table creation and query. However, you may see data with garbled characters or incorrect data when running the SELECT statement.
  2. Currently, DLA does not support .csv files with headers. If the first line of a file is the header, you must manually process data. Otherwise, DLA identifies the header as a data record.

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
alibaba cloud ,big data ,storage ,serverless ,data lake

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}