Formulating a Robust Strategy for Storage in Amazon Relational Database Service PostgreSQL Deployments

This article explores understanding the key components of RDS storage, optimizing these storage factors, and capping storage growth by utilizing retention periods.

By Vivek Singh · January 9, 2025 · Tutorial

If you use the managed database services offered by Amazon Relational Database Service (RDS), you may wonder how to strategize your database storage sizing. Strategizing database storage includes understanding the key components of RDS storage, optimizing those components, and capping storage growth by using retention periods.

AWS RDS offers managed database services for Oracle, MySQL, PostgreSQL, and SQL Server. These managed services include automated backups, single-click upgrades, replication and high availability, and disaster recovery solutions. Under the hood, all these RDS databases use Amazon Elastic Block Store (EBS) volumes for storage. This post discusses the storage components, automation-driven steps to optimize them, and the retention mechanisms that keep storage growth in check.

Components of RDS PostgreSQL Storage

AWS RDS PostgreSQL uses Amazon EBS volumes for storage, so the size and performance of these volumes are governed by EBS limitations and characteristics. RDS PostgreSQL offers various configuration options for high availability and disaster recovery, including Multi-AZ instances and Multi-AZ DB clusters. For simplicity, we'll discuss Single-AZ RDS PostgreSQL EBS volumes.

Before jumping into the RDS PostgreSQL EBS contents, let's look at community PostgreSQL's storage components. Once we understand the community PostgreSQL data directory layout, the RDS storage components are easy to follow. Looking at a PostgreSQL 15 $PGDATA directory, we find the following files and directories:

[Figure: Files and directories in the PostgreSQL 15 $PGDATA directory]

Most of the storage is consumed by the base, pg_wal, and log directories. The base directory contains all relational data, such as tables, indexes, and sequences. The pg_wal directory contains write-ahead log files. PostgreSQL records any modifications to data files in transactional logs called write-ahead log (WAL) files. The log directory contains database log files, which log database activities. The following diagram shows major directories with typical relative sizes in the PGDATA directory:

[Figure: Major PGDATA directories with typical relative sizes]
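
If you want a rough view of these consumers from inside the database (useful on RDS, where the filesystem isn't directly accessible), a minimal sketch along these lines can help. It assumes PostgreSQL 10+ and a role with pg_monitor membership for the pg_ls_waldir()/pg_ls_logdir() calls; pg_database_size() approximates only the current database's share of the base directory:

SQL
 
-- Approximate the current database's share of the base directory
SELECT pg_size_pretty(pg_database_size(current_database())) AS base_size;

-- Total size of WAL files in pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();

-- Total size of server log files in the log directory
SELECT pg_size_pretty(sum(size)) AS log_size FROM pg_ls_logdir();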

It's worth noting that in community PostgreSQL, you can create tablespaces to host objects on different storage drives. That option is not supported in RDS PostgreSQL; all data is stored in EBS volumes. With that in mind, below are the major storage components of a Single-AZ RDS instance and strategies to manage storage growth: understanding retention options, automating the purging of old data, and proactively monitoring storage growth.

Database Log Files

All database activities are logged in RDS PostgreSQL log files. A common use case for these log files is to identify issues with database workloads, query failures, login failures, deadlocks, and fatal server errors. The size of the log files is governed by some PostgreSQL logging parameters and the RDS log file retention setting.

Large log files can consume most of your RDS storage and cause production outages if the consumed storage reaches 100% of the provisioned storage. It's critical to review what you are logging and whether the logged information is needed for the business. 

The most important parameters that dictate the size of log files are log_statement, log_min_duration_statement, log_connections, and log_disconnections. Many fintech companies, which are required to log all user and application activity, set the most verbose option, log_statement = 'all'. This is the easiest way to bloat the log files and invite storage issues if consumption is not monitored. pgAudit can be a smarter way of logging user activity, since it lets you specify which classes of activity to log, such as READ, WRITE, DDL, or FUNCTION.
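
As a hedged sketch of that pgAudit approach (the database name appdb is illustrative, and on RDS the extension must first be added to shared_preload_libraries via the parameter group):

SQL
 
-- Enable the extension once it is preloaded
CREATE EXTENSION pgaudit;

-- Audit only write and DDL activity instead of logging every statement
ALTER DATABASE appdb SET pgaudit.log = 'write, ddl';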

Below is a diagram that shows the typical verbosity based on the logging parameter settings in PostgreSQL:

[Figure: Typical logging verbosity by parameter setting (Red = High, Orange = Medium, Green = Low)]

A good practice for controlling the storage used by log files is setting a log retention period. The RDS parameter rds.log_retention_period sets the retention period for log files.

For example, the default setting of 4320 minutes (three days) purges log files three days after their creation. Many users set it to a lower value, such as one day, and configure an automated job, such as a Lambda function, to back up the log files to S3 as soon as they are created. Later, you can automate pgBadger to analyze these log files and send reports at a set frequency. The following AWS CLI command can be used to find the total size of log files in an RDS database:

Plain Text
 
aws rds describe-db-log-files --db-instance-identifier <rds_identifier> | grep "Size" | grep -o '[0-9]*' | awk '{n += $1}; END{print n}'


If the total size of log files is over 25% of the total RDS used storage, it's time to build a strategy for reviewing the logging parameters and retention settings. 

Database Objects

Database objects include regular database relations such as tables, indexes, and sequences. Below are the top reasons PostgreSQL database sizes keep growing in an uncontrolled manner:

Table Bloat

When PostgreSQL deletes a row, it keeps the old version of the tuple for MVCC (multi-version concurrency control) purposes. This way, writers don't interrupt readers, and readers don't interrupt writers. Cumulatively, these dead tuples are called bloat. 

In PostgreSQL, an UPDATE is effectively a DELETE plus an INSERT, so both DELETE and UPDATE operations create bloat, in tables as well as indexes. PostgreSQL runs a native daemon called autovacuum that cleans up this bloat and makes the space available for subsequent inserts. Several common conditions prevent autovacuum from completing its cycle, so a manual review of its parameters is important: exclusive locks on very active tables blocking autovacuum jobs, long-running queries pausing it, and autovacuum_max_workers set too low to cover all bloated tables. The most accurate way to measure bloat in your database is the pgstattuple extension.
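
Before reaching for pgstattuple, a cheap first check is the statistics collector's dead-tuple counters; a minimal sketch:

SQL
 
-- Tables with the most dead tuples and their last autovacuum time
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;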

As an alternative to pgstattuple, you can use the well-known estimation query from the PostgreSQL wiki for a loose approximation of table bloat. The image below shows how to use pgstattuple to find the number of dead tuples. In this example, the pgbench_accounts table has 200,000,000 rows in total and 3,263,668 dead rows, i.e., 1.44% bloat. Anything over 10% should be diagnosed, and DBAs should find the reason for the excessive bloat. The 10% figure comes from the default value of the autovacuum_vacuum_scale_factor parameter in RDS PostgreSQL, which tells autovacuum to run once dead tuples exceed that fraction of the table.

[Figure: Measuring table bloat with pgstattuple]
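
In query form, that check looks roughly like the following (pgbench_accounts as in the example above):

SQL
 
-- pgstattuple scans the whole table, so run it off-peak on large tables
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT tuple_count, dead_tuple_count, dead_tuple_percent
FROM pgstattuple('pgbench_accounts');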

Controlling bloat not only contains overall table size but also improves workload performance and saves I/O operations.

Time Series Data

I have encountered many customers hosting time-series data in PostgreSQL, and PostgreSQL offers great features for it. Still, it's important to purge old data by defining a row retention period, and PostgreSQL doesn't natively support a TTL (time to live) for table rows.

A trigger and a function can emulate TTL, but the most suitable approach for handling time-series data growth is table partitioning. Partitioning not only makes purging easier, since dropping old partitions replaces bulk deletes, but also makes maintenance jobs such as autovacuum more effective on large tables. pg_cron is a suitable extension for purging old data at a set frequency. The command below schedules a job at midnight every day that deletes rows older than 10 days (shown here against pg_cron's own job_run_details history table; point the DELETE at your own tables as needed):

SQL
 
SELECT cron.schedule('0 0 * * *', $$DELETE
    FROM cron.job_run_details
    WHERE end_time < now() - interval '10 days'$$);
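
For the partitioning route, a minimal sketch looks like this (table and partition names are hypothetical):

SQL
 
-- Hypothetical time-series table partitioned by month
CREATE TABLE metrics (
    recorded_at timestamptz NOT NULL,
    reading     numeric
) PARTITION BY RANGE (recorded_at);

CREATE TABLE metrics_2025_01 PARTITION OF metrics
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Retention becomes a cheap metadata operation instead of a bulk DELETE
-- (assuming the old partition exists)
DROP TABLE metrics_2024_12;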


Storing Unwanted Data

I have met many customers storing database logs and application logs in relational databases. The following query can be used to find the least-accessed tables; the last line filters for tables from which fewer than 10 tuples have been read.

SQL
 
SELECT
    relname,
    schemaname
FROM
    pg_stat_user_tables
WHERE
    (coalesce(idx_tup_fetch, 0) + coalesce(seq_tup_read, 0)) < 10;


A smarter approach is to store application logs in cheaper storage options such as S3. Regular reviews help ensure you keep only the working dataset in your relational database.

Temporary Files

Data from large hash and sort operations that cannot fit in work_mem is written to temporary files, and uncontrolled generation of this spill data can easily occupy most of the RDS PostgreSQL storage. Enable the log_temp_files parameter to log temporary file activity, and raise work_mem where justified. The log entries produced by log_temp_files identify the queries, and thus the tables, generating most of the temporary data.
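
To gauge how much temporary data a database has been generating overall, the cumulative counters in pg_stat_database are a quick check:

SQL
 
-- Temporary files created and bytes written since the last stats reset
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database();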

One of the best practices is to set work_mem at the session level rather than globally. In the EXPLAIN ANALYZE output, look for the pattern "Sort Method: external merge Disk: 47224kB"; this shows the query spilled roughly 50MB of temporary data to disk. In the query's session, use SET LOCAL work_mem = '55MB'; inside a transaction to optimize the query and reduce temporary data.
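
Wrapped in a transaction (which SET LOCAL requires), the pattern looks like this; the query itself is illustrative:

SQL
 
BEGIN;
-- Applies only to this transaction; the server-wide setting is untouched
SET LOCAL work_mem = '55MB';
EXPLAIN ANALYZE
SELECT aid FROM pgbench_accounts ORDER BY abalance;
COMMIT;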

Conclusion 

In this post, we explored strategies for storage capacity management in Amazon Relational Database Service PostgreSQL deployments. Keeping a close eye on the top storage contributors helps optimize storage consumption, reduces overall operational costs, and improves application performance. If you have any questions or suggestions about this post, leave a comment.

