Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Apache Drill vs. Amazon Athena: A Comparison on Data Partitioning

DZone's Guide to

Apache Drill vs. Amazon Athena: A Comparison on Data Partitioning

In this article, we use SQL to run various commands to test which of these two data partitioning platforms will work best for you.

Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

Big data exploration in almost all fields has led to the development of multiple big data technologies such as Hadoop (Hive, HDFS, Pig, HBase), NoSQL databases (MongoDB), and so on for accessing, exploring, and reporting huge volume of data. Amazon Athena, a serverless, interactive query service, is used to easily analyze big data using standard SQL in Amazon S3. Apache Drill, a schema-free, low-latency SQL query engine, enables self-service data exploration on big data.

In this blog, let us compare data partitioning in Apache Drill and AWS Athena and the distinct features of both.

Dataset Description

A sample dataset, containing census data of a particular country in the USA, is used in this use case. For sample dataset, consider Reference section.

Partitioning Data

In this section, let us discuss data partitioning based on male and female fertility rate in a predefined age group in Apache Drill and Athena.

Partitioning Data in Apache Drill

To perform data partition in Drill, perform the following:

  • Change data storage format to Parquet using the following command:
ALTER SESSION SET `store.format`='parquet';

select

  • Create a table and partition the data using the following command:

CREATE TABLE dfs.`csvOut`.AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION(country_code, country_name, 
`year`, fertility_rate_15_19, fertility_rate_20_24, fertility_rate_25_29, fertility_rate_30_34, fertility_rate_35_39, 
fertility_rate_40_44, fertility_rate_45_49, total_fertility_rate, gross_reproduction_rate, sex_ratio_at_birth,gender) 
PARTITION BY (gender)
AS
SELECT columns[0] as country_code,columns[1] as country_name,columns[2] as `year`,columns[3] as 
fertility_rate_15_19,columns[4] as fertility_rate_20_24,columns[5] as fertility_rate_25_29,columns[6] as 
fertility_rate_30_34,columns[7] as fertility_rate_35_39,columns[8] as fertility_rate_40_44,columns[9] as 
fertility_rate_45_49,columns[10] as total_fertility_rate,columns[11] as gross_reproduction_rate,columns[12] as 
sex_ratio_at_birth,columns[13] as gender FROM 
dfs.`/user/tsldp/drillathena/age_specific_fertility_rates_gender.csv`;

The table created is as shown below:select

The time taken to create a table is as shown below:select

You can check the data loaded into the database using the following command:

select * from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` ;

The time taken to select the required data in a table is as shown below:select

  • Get total count of male and female fertility data using the following command:

select count(*),gender from dfs.`csvOut`.`AGE_FERTILITY_RATES_GENDER_PARQUET_PARTITION` group by gender;

The count of males and females in a country is shown below:select

The file size after partitioning data using Apache drill is as shown below:select

Partitioning Data in Athena

Athena uses Hive data partitioning and provides improved query performance by reducing the amount of data scanned.

In Athena, data partitioning can be done in two separate ways:

  • With already partitioned data stored on Amazon S3 and accessed on Athena.
  • With unpartitioned data.

In both methods, specify the partitioned column in a create statement.

To perform data partitioning in Athena, perform the following:

  • Create a table using the below query:
create external table sampledb.age_fertility_rates_gender_parq_part(
country_code string,
country_name string,
year string,
fertility_rate_15_19 decimal(10,5),
fertility_rate_20_24 decimal(10,5),
fertility_rate_25_29 decimal(10,5),
fertility_rate_30_34 decimal(10,5),
fertility_rate_35_39 decimal(10,5),
fertility_rate_40_44 decimal(10,5),
fertility_rate_45_49 decimal(10,5),
total_fertility_rate decimal(10,5),
gross_reproduction_rate decimal(10,5),
sex_ratio_at_birth decimal(10,5))
PARTITIONED BY (gender string)
stored as parquet
LOCATION 's3://cps3bucket/data_gender_parquet/';

 select

  • Add partitions to the catalog by using the below command:
lMSCK REPAIR TABLE age_fertility.age_fertility_rates_gender_parq_part;

Image title

  • Check partitioned data using the below query:
select * from age_fertility.age_fertility_rates_gender_parq_part;

select

Data Partition Comparison Between Apache Drill and Amazon Athena

selectThe time taken to perform create a partition and select partition is as follows:

Distinct Features of Drill and Athena

select

References

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:
apache drill ,athena ,big data ,data exploration ,partition

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}