8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery

PostgreSQL offers robust features like failover slot synchronization and precise WAL control to help recover from data corruption.

By Dr. Ibrar Ahmed · Jun. 04, 25 · Tutorial


When disaster strikes, whether a natural event or a technical failure, the impact on your network, database, and end users can include data corruption. Data corruption, whether sparked by hardware failures like dying disks or faulty RAM, software glitches such as operating system bugs, or human missteps like accidental overwrites, is a terrifying prospect for any administrator. Yet it’s not a death sentence.

Your PostgreSQL database is typically a dependable cornerstone of your operations. Still, when disaster strikes, it can swiftly morph into an inaccessible liability, bringing applications to a grinding halt and leaving critical data at risk. PostgreSQL 17 arms you with an enhanced arsenal to tackle this challenge head-on, offering built-in tools like pg_amcheck for pinpointing corruption, improved failover slot synchronization to keep replication intact during recovery, and finer-grained Write-Ahead Logging (WAL) control for precise restoration. In this blog, we'll dive deep into the disaster management process, equipping you with real-world commands and expected outputs to diagnose corruption accurately and recover effectively, whether you’re restoring from a robust backup or salvaging scraps from a crippled cluster with no safety net. With the right approach, you can turn panic into a plan and restore order to your database.

Step 1: Detecting Corruption in PostgreSQL

Corruption usually doesn’t introduce itself politely; it sneaks in through failed queries, panicked logs, or startup errors. Identifying corruption is the first step towards resolving it.

Check the PostgreSQL Log Files

Start by inspecting the log files, typically found in /var/log/postgresql/ or $PGDATA/pg_log. Each log entry's header indicates its severity level; for example:

Shell
 
ERROR:  could not read block 128 in file "base/16384/12571": read only 0 of 8192 bytes  
PANIC:  invalid page in block 42 of relation base/16384/16728  
LOG:    startup process (PID 1234) was terminated by signal 6: Aborted  
FATAL:  the database system is in recovery mode


Severity levels indicate:

  • ERROR: Read failure, possibly a disk-level issue.
  • PANIC: Serious corruption; PostgreSQL crashed to prevent further damage.
  • FATAL: The server is trying to recover from an unclean shutdown.


Best Practice: Excessive logging can waste storage space, but PostgreSQL's fine-grained logging controls let you choose which events are recorded so the logs carry the information you need to manage your system. Review the Postgres documentation and confirm that your logging settings fit your workload.
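
As a hedged example, settings like the following in postgresql.conf (the values here are illustrative, not recommendations) keep the server log informative without overwhelming storage:

Shell
 
# Illustrative logging settings; adjust for your environment
logging_collector = on               # write server messages to log files
log_directory = 'log'                # relative to the data directory
log_min_messages = warning           # record WARNING and above
log_line_prefix = '%m [%p] %u@%d '   # timestamp, PID, user@database
log_checkpoints = on                 # useful context after crashes and recovery
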

pg_amcheck

pg_amcheck is a powerful core utility included with PostgreSQL 17. It allows you to verify the physical integrity of heap tables and index structures without locking them, making it safe to run against live environments. It helps detect data corruption caused by storage failures or file-level inconsistencies by scanning data blocks and index pages for structural errors. To invoke pg_amcheck, use the command:

Shell
 
pg_amcheck -d mydb --all


If your database is healthy, pg_amcheck returns:

Shell
 
No failures detected in database "mydb"


If pg_amcheck detects corruption:

Shell
 
heap table "public.orders": block 128 is not valid  
btree index "public.orders_idx": block 22 contains invalid tuple


Best Practice: Include the --heapallindexed or --rootdescend flags when you invoke pg_amcheck for deeper validation.
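
For example, a deeper (and slower) pass over the same database might look like this, reusing the mydb name from the earlier examples:

Shell
 
pg_amcheck -d mydb --all --heapallindexed --rootdescend
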

Optional: Verify Your Checksums

If you specified the --data-checksums flag when you initialized your PostgreSQL cluster (when running initdb), you can use the pg_checksums tool to detect low-level, file-based corruption across data blocks. pg_checksums provides an integrity safeguard, allowing PostgreSQL to verify whether the data on the disk has been altered unexpectedly due to bit rot, disk failures, or faulty hardware. 

pg_checksums must be run while the PostgreSQL server is stopped and will only work if checksums were enabled during cluster initialization.  Running pg_checksums is especially important after unclean shutdowns, system crashes, or if you suspect disk issues.

A clean report indicates that your data blocks are intact, while checksum failures identify specific block-level corruption in table or index files. SQL queries can map these file identifiers (like base/16384/12571) to table names.
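
As a rough sketch, you can resolve a path like base/16384/12571 from inside the affected database (assumed here to be mydb, the database whose OID is 16384); pg_filenode_relation() takes the tablespace OID (0 for the default tablespace) and the filenode:

Shell
 
# Map the filenode 12571 from the earlier error back to a relation name
psql -d mydb -c "SELECT pg_filenode_relation(0, 12571);"
# Or search pg_class directly for the same filenode
psql -d mydb -c "SELECT oid::regclass FROM pg_class WHERE relfilenode = 12571;"
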

The tool doesn’t fix anything; it simply reports which blocks are damaged, allowing you to take the appropriate steps to recover (e.g., restore from backup, isolate affected tables, or investigate hardware issues). Always consider enabling checksums in production environments for better observability and earlier corruption detection.

Shell
 
sudo systemctl stop postgresql
pg_checksums -c -D /var/lib/pgsql/data
Checksum verification failed in file "base/16384/12571", block 128


Best Practice: Enable checksum verification when you initialize each new cluster. To enable data checksums, include the --data-checksums option when you invoke initdb:

Shell
 
initdb --data-checksums -D /var/lib/pgsql/data  

Step 2: Stop PostgreSQL Immediately

When you find data corruption, you should prevent further damage by halting the service:

Shell
 
sudo systemctl stop postgresql
sudo systemctl status postgresql
  postgresql.service - PostgreSQL RDBMS  
     Active: inactive (dead)


Stopping the service prevents PostgreSQL from continuing to write WAL and data files, which could otherwise worsen the damage and data loss.
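
If you want extra assurance that nothing is still running against the data directory, a quick status check (using the same data directory path as the examples below) is enough:

Shell
 
pg_ctl status -D /var/lib/pgsql/data
pg_ctl: no server running
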

Step 3: Restore from a Known Good Backup

pgBackRest is a robust and efficient backup and restore solution for PostgreSQL that supports full, differential, and incremental backups, with compression, encryption, parallel processing, and offsite storage (for example, S3). pgBackRest is designed to handle large-scale environments with high performance and minimal impact on the database server. pgBackRest also simplifies disaster recovery by offering automated restore processes, archive management, and point-in-time recovery (PITR) capabilities. 
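
Before restoring, it is worth confirming what the repository actually holds; assuming your stanza is named main, as in the restore example below, pgBackRest can list the available full, differential, and incremental backups:

Shell
 
pgbackrest --stanza=main info
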

Clean and Restore the Cluster with pgBackRest

Before you restore, take a backup of the corrupted (old) data directory:

Shell
 
cp -rf /var/lib/pgsql/data /var/lib/pgsql/data_backup


After confirming that the backup is saved, wipe the old data directory:

Shell
 
rm -rf /var/lib/pgsql/data/*


Then, restore from your last known good backup:

Shell
 
pgbackrest --stanza=main restore --db-path=/var/lib/pgsql/data
INFO: restore command begin  
INFO: restored file base/16384/12571 (16MB, 50%) checksum verified  
INFO: restore command end: completed successfully


Then, correct ownership: after restoring your database, ensure the data directory is owned by the PostgreSQL user:

Shell
 
chown -R postgres:postgres /var/lib/pgsql/data


Step 4: Use Point-in-Time Recovery (PITR)

A backup strategy that supports point-in-time recovery lets you stop the restore just before the corruption occurred.
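
If you are unsure exactly when the corruption appeared, pg_waldump can help you choose a recovery_target_time by showing the commit timestamps recorded in your archived WAL; the segment name below is purely illustrative:

Shell
 
# List transaction commit records (and their timestamps) in an archived WAL segment
pg_waldump /mnt/backup/wal/000000010000000000000005 | grep COMMIT
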

Configure Recovery

Add the following settings to your postgresql.conf file:

Shell
 
restore_command = 'cp /mnt/backup/wal/%f %p'
recovery_target_time = '2025-03-25 13:59:00 UTC'


Create the recovery trigger:

Shell
 
touch /var/lib/pgsql/data/recovery.signal


When you start PostgreSQL, you can watch the server recover to the point in time that you specified in the recovery_target_time parameter:

Shell
 
sudo systemctl start postgresql
LOG: starting point-in-time recovery to "2025-03-25 13:59:00 UTC"  
LOG: restored log file "000000010000000000000005" from archive  
LOG: consistent recovery state reached  
LOG: recovery stopping before commit of transaction 123  
LOG: database system is ready to accept connections


Best Practice: Using a backup strategy that supports point-in-time recovery allows you to return to a clean state, just before corruption.

Step 5: Salvage What You Can

If you don’t have a backup but some tables still work, you can use pg_dump and other Postgres tools to extract what you can.

First, use pg_dump to save the definitions of any readable tables and their data:

Shell
 
pg_dump -t customers mydb > customers.sql


To confirm that the table is fully readable, you can also run a quick row count:

SQL
 
SELECT count(*) FROM customers;


Then, create a new cluster:

Shell
 
initdb -D /var/lib/pgsql/new_data --data-checksums
pg_ctl -D /var/lib/pgsql/new_data -l logfile start


Then, restore the salvaged data into your new cluster:

Shell
 
createdb -h /var/lib/pgsql/new_data newdb
psql -d newdb < customers.sql


Best Practice: Maintain a dependable backup strategy for any data that you can't afford to lose.  In a crisis, you can use these steps to restore salvaged data, but the restoration may not be complete, and you will still need to manually review and recreate schema objects that may have been damaged. These steps will leave you with a partial recovery in a clean environment. 
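
If the damaged cluster will still accept connections, role and tablespace definitions can often be salvaged the same way; this is a sketch under that assumption, reusing the socket directory from the earlier createdb example:

Shell
 
pg_dumpall --globals-only > globals.sql                      # from the damaged cluster, if it still starts
psql -h /var/lib/pgsql/new_data -d postgres -f globals.sql   # replay roles and tablespaces into the new cluster
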

Step 6: Use pg_resetwal as the Last Resort

pg_resetwal is a low-level PostgreSQL utility that forcibly resets a database cluster's write-ahead log (WAL); it is typically a last resort when the server cannot start due to missing or corrupted WAL files. This tool should be used cautiously, as it bypasses normal crash recovery and may lead to data inconsistency or loss of recent transactions. It is only safe to run when you are sure the data files are in a consistent state or when you’re attempting to salvage partial data from a failed system. 

Only use this tool if all else fails. It resets WAL records, risking transaction loss and corruption.

Shell
 
pg_resetwal -f /var/lib/pgsql/data
Write-ahead log reset
sudo systemctl start postgresql
LOG: database system is ready to accept connections

Note: Data added since the last checkpoint may be lost; you should proceed only after consulting experts.
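
Before committing to a full reset, a dry run shows the values pg_resetwal would write without modifying anything, which is a useful sanity check:

Shell
 
pg_resetwal -n /var/lib/pgsql/data
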

Step 7: Prevent Future Corruption

Don’t let this happen again. PostgreSQL 17 gives you excellent tools to stay protected. In summary, the best practices that can help you recover from a disaster are:

Enable checksums when you initialize your cluster.

Shell
 
initdb --data-checksums -D /var/lib/pgsql/data


Automate backups with pgBackRest.

Shell
 
pgbackrest --stanza=main --type=full backup  
pgbackrest --stanza=main --type=incr backup


Run regular integrity checks with pg_amcheck.

Shell
 
pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +%F).log


Create a simple cron job to run pg_amcheck with the command:

Shell
 
0 2 * * * pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +\%F).log 2>&1


Step 8: Embrace High Availability and WAL Archiving

If you have configured replication for high availability and maintain standby nodes, you can promote a replica if the primary fails:

Shell
 
pg_ctl promote -D /var/lib/pgsql/standby
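
After promotion, you can confirm that the node has left recovery; pg_is_in_recovery() should now return false:

Shell
 
psql -c "SELECT pg_is_in_recovery();"
 pg_is_in_recovery 
-------------------
 f
(1 row)
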


Ensure that you have configured WAL Archiving for PITR; in your postgresql.conf file, set:

Shell
 
archive_mode = on 
archive_command = 'cp %p /mnt/wal_archive/%f'
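
To verify that archiving is keeping up, the pg_stat_archiver view reports the last archived WAL segment and any failures:

Shell
 
psql -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"
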

Conclusion

Disaster recovery in PostgreSQL demands quick action and careful planning, and PostgreSQL 17 significantly strengthens your ability to respond. You can handle even the most critical failures with integrated tools like pg_amcheck for live corruption detection, pgBackRest for reliable backups and PITR, and pg_resetwal for last-resort recovery. Whether restoring from a clean backup, recovering to a point just before the disaster, or salvaging data from a damaged cluster, this post walks you through every step with actual commands and practical advice. 

Remember that recovery doesn’t start when something breaks; it begins with preparation. Make it a goal to turn your PostgreSQL database into a resilient, self-defending system by enabling data checksums, automating backups, monitoring for corruption, and setting up high availability with WAL archiving. Disaster may strike, but with the right tools and approach, recovery is well within reach.


Published at DZone with permission of Dr. Ibrar Ahmed. See the original article here.

Opinions expressed by DZone contributors are their own.
