AWS Partition Projections: Enhancing Athena Query Performance

This article provides an analysis of performance improvements in AWS Athena queries using the partition projections feature.

By Mandar Khoje · Nov. 15, 23 · Analysis

In today's data-driven landscape, organizations are increasingly turning to robust solutions like AWS Data Lake to centralize vast amounts of structured and unstructured data. AWS Data Lake, a scalable and secure repository, allows businesses to store data in its native format, facilitating diverse analytics and machine learning tasks. One of the popular tools to query this vast reservoir of information is Amazon Athena, a serverless, interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. However, as the volume of data grows exponentially, performance challenges can emerge. Large datasets, complex queries, and suboptimal table structures can lead to increased query times and costs, potentially undermining the very benefits that these solutions promise. This article delves specifically into the details of how to harness the power of partition projections to address these performance challenges.

Before diving into the advanced concept of partition projections in Athena, it's essential to grasp the foundational idea of partitions, especially in the context of a data lake.

What Are Partitions in AWS Data Lake?

In the realm of data storage and retrieval, a partition refers to a division of a table's data based on the values of one or more columns. Think of it as organizing a vast bookshelf (your data) into different sections (partitions) based on genres (column values). By doing so, when you're looking for a specific type of book (data), you only need to search in the relevant section (partition) rather than the entire bookshelf.

In a data lake, partitions are typically directories that contain data files. Each directory corresponds to a specific value or range of values from the partitioning column(s).
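
For instance, a table partitioned by a date column is commonly laid out as one S3 prefix per value; the bucket and paths below are purely illustrative:

s3://example-bucket/events/dt=2022-01-01/part-0000.parquet
s3://example-bucket/events/dt=2022-01-02/part-0000.parquet
s3://example-bucket/events/dt=2022-01-03/part-0000.parquet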

Why Are Partitions Important?

  • Efficiency: Without partitions, querying vast datasets would involve scanning every single file, which is both time-consuming and costly. With partitions, only the relevant directories are scanned, significantly reducing the amount of data processed (see the query sketch after this list).
  • Cost Savings: In cloud environments like AWS, where you pay for the amount of data scanned, partitions can lead to substantial cost reductions.
  • Scalability: As data grows, so does the importance of partitions. They ensure that even as your data lake swells with more data, retrieval times remain manageable.
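
As a minimal sketch of the efficiency point, assume a table partitioned by a dt column (the database, table, and column names here are illustrative): a filter on the partition column lets Athena skip every directory outside the requested range.

SQL
 
-- Only the January 2022 partitions are read; all other directories are skipped
SELECT count(*)
FROM example_db.events
WHERE dt BETWEEN '2022-01-01' AND '2022-01-31'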

Challenges With Partitions

While partitions offer numerous benefits, they aren't without challenges:

  • Maintenance: As new data comes in, new partitions might need to be created and registered, and existing ones might need updates (a sketch of this registration follows the list).
  • Optimal Partitioning: Too few partitions can mean you're still scanning a lot of unnecessary data. Conversely, too many partitions can lead to a large number of small files, which can also degrade performance.
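
To make the maintenance burden concrete, here is a sketch of how partitions are typically registered in Athena when projections are not used; the table and locations are illustrative, not taken from the article's example:

SQL
 
-- Register a single new partition explicitly
ALTER TABLE example_db.events ADD IF NOT EXISTS
  PARTITION (dt = '2022-02-01')
  LOCATION 's3://example-bucket/events/dt=2022-02-01/'

-- Or rescan a Hive-style layout for any partitions Athena does not yet know about
MSCK REPAIR TABLE example_db.events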

With this foundational understanding of partitions in a data lake, we can now delve deeper into the concept of partition projections in Athena and how they aim to address some of these challenges.

What Are Partition Projections?

Partition pruning is a technique where only the relevant metadata, specific to a query, is selected, eliminating unnecessary data. This method often makes queries run faster. Athena employs this strategy for all tables that have partitioned columns. In a typical scenario, when Athena processes queries, it first communicates with the AWS Glue Data Catalog by making a GetPartitions request, after which it performs partition pruning. However, if a table has an extensive set of partitions, this call can slow things down. 

To avoid this expensive operation on a highly partitioned table, AWS has introduced the technique of partition projections. With partition projection, Athena doesn't need to make the GetPartitions call. Instead, the configuration provided in partition projection equips Athena with all it needs to create the partitions on its own.


Benefits of Partition Projections

  • Improved Query Performance: By reducing the amount of data scanned, queries run faster and more efficiently.
  • Reduced Costs: With Athena, you pay for the data you scan. By scanning less data, costs are minimized.
  • Simplified Data Management: Virtual partitions eliminate the need for continuous partition maintenance tasks, such as adding new partitions when new data arrives.

Setting Up Partition Projections

To utilize partition projections:

1. Define Projection Types: Athena supports several projection types, including `integer`, `enum`, `date`, and `injected`. Each type serves a specific use case, like generating a range of integers or dates.

2. Specify Projection Configuration: This involves defining the rules and patterns for your projections. For instance, for a date projection, you'd specify the start date, end date, and the date format.

3. Modify Table Properties: Once projections are defined, modify your table properties in Athena to use these projections.
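
As a hedged sketch of step 3, the projection properties can also be attached to an existing table with ALTER TABLE ... SET TBLPROPERTIES; the table name and date range below are assumptions for illustration only.

SQL
 
-- Enable a date projection on an existing table (illustrative names and range)
ALTER TABLE example_db.events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2022-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS'
)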

An Example Use-Case

Let us take an example where our data is stored in the data lake and partitioned by customer_id and dt. The data is stored in Parquet, a columnar file format.

s3://my-bucket/data/customer_id/yyyy-MM-dd/*.parquet

In our example, let us have data for one year (365 days) and 100 customers. This results in 365 × 100 = 36,500 partitions in the data lake.

Let us benchmark the queries on this table with and without partition projections enabled. 

Let us get the count of all the records for the entire year for five customers. 

Query

SQL
 
SELECT count(*) FROM "analytic_test"."customer_events"
WHERE dt >= '2022-01-01'
  AND customer_id IN ('Customer_001', 'Customer_002', 'Customer_003', 'Customer_004', 'Customer_005')


Without Partition Projection

Without partition projections enabled, the total query runtime is 7.3 seconds. Out of that, it spends 78% in planning and 20% executing the query. 

Query Results

Planning: 78% (5.6 seconds)
Execution: 20% (1.46 seconds)
Total runtime: 7.3 seconds

With Partition Projections

Now, let us enable partition projection for this table. Take a look at the table properties prefixed with "projection." in the DDL below. In this example, we have two partition columns, dt and customer_id: for dt we will use a date type projection, and for customer_id we will use an enum type projection. For enum types, you can build an automation job that updates the table property whenever new values arrive (a sketch of such an update follows the DDL).

SQL
 
CREATE EXTERNAL TABLE `customer_events`(
  `event_id` bigint COMMENT '', 
  `event_text` string COMMENT '')
PARTITIONED BY ( 
  `customer_id` string COMMENT '', 
  `dt` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://my-bucket/data/events.customer_events'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='SNAPPY', 
  'transient_lastDdlTime'='1698737433',
  'projection.enabled'='true',
  'projection.dt.type'='date',
  'projection.dt.range'='NOW-1YEARS,NOW',
  'projection.dt.format'='yyyy-MM-dd',
  'projection.dt.interval'='1',
  'projection.dt.interval.unit'='DAYS',
  'projection.customer_id.type'='enum',
  'projection.customer_id.values'='Customer_001,Customer_002,Customer_003,Customer_004,Customer_005,Customer_006,Customer_007,Customer_008,Customer_009,Customer_010,Customer_011,Customer_012,Customer_013,Customer_014,Customer_015,Customer_016,Customer_017,Customer_018,Customer_019,Customer_020,Customer_021,Customer_022,Customer_023,Customer_024,Customer_025,Customer_026,Customer_027,Customer_028,Customer_029,Customer_030,Customer_031,Customer_032,Customer_033,Customer_034,Customer_035,Customer_036,Customer_037,Customer_038,Customer_039,Customer_040,Customer_041,Customer_042,Customer_043,Customer_044,Customer_045,Customer_046,Customer_047,Customer_048,Customer_049,Customer_050,Customer_051,Customer_052,Customer_053,Customer_054,Customer_055,Customer_056,Customer_057,Customer_058,Customer_059,Customer_060,Customer_061,Customer_062,Customer_063,Customer_064,Customer_065,Customer_066,Customer_067,Customer_068,Customer_069,Customer_070,Customer_071,Customer_072,Customer_073,Customer_074,Customer_075,Customer_076,Customer_077,Customer_078,Customer_079,Customer_080,Customer_081,Customer_082,Customer_083,Customer_084,Customer_085,Customer_086,Customer_087,Customer_088,Customer_089,Customer_090,Customer_091,Customer_092,Customer_093,Customer_094,Customer_095,Customer_096,Customer_097,Customer_098,Customer_099,Customer_100')
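
As a rough sketch of that automation, the enum list can be refreshed in place with an ALTER TABLE statement. The statement below is an assumption for illustration, and Customer_101 is a hypothetical new value; note that SET TBLPROPERTIES overwrites the property, so the full, current list of customer IDs must be supplied each time (shortened here for readability).

SQL
 
-- Overwrite the enum list with the updated set of customer IDs (truncated for readability)
ALTER TABLE customer_events SET TBLPROPERTIES (
  'projection.customer_id.values' = 'Customer_001,Customer_002,Customer_003,Customer_101'
)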


Query Results

Planning: 1.69 seconds
Execution: 0.6 seconds
Total runtime: roughly 2.3 seconds

Results

We can see a roughly 70% improvement in query performance. This is because Athena avoids the remote call to AWS Glue to fetch the partitions; with partition projection, it can compute the partition values on its own.

Limitations and Considerations

While powerful, partition projections do not solve every problem:

  • Complex Setups: Setting up projections requires a deep understanding of your data and the patterns it follows.
  • Not Always the Best Fit: For datasets that don't follow predictable patterns or have irregular updates, traditional partitioning might be more suitable.

Conclusion

AWS's introduction of Partition Projections in Athena is a testament to their commitment to improving user experience and efficiency. By leveraging this feature, organizations can achieve faster query performance with minimal configuration changes. As with all tools, understanding its strengths and limitations is key to harnessing its full potential.


Opinions expressed by DZone contributors are their own.
