8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery

PostgreSQL offers robust features like failover slot synchronization and precise WAL control to help recover from data corruption.

By Dr. Ibrar Ahmed · Jun. 04, 25 · Tutorial


When disaster strikes, whether a natural event or a technical failure, the impact on your network, database, and end users can include data corruption. Data corruption, whether sparked by hardware failures like dying disks or faulty RAM, software glitches such as operating system bugs, or human missteps like accidental overwrites, is a terrifying prospect for any administrator. Yet it’s not a death sentence.

Your PostgreSQL database is typically a dependable cornerstone of your operations. Still, when disaster strikes, it can swiftly morph into an inaccessible liability, bringing applications to a grinding halt and leaving critical data at risk. PostgreSQL 17 arms you with an enhanced arsenal to tackle this challenge head-on, offering built-in tools like pg_amcheck for pinpointing corruption, improved failover slot synchronization to keep replication intact during recovery, and finer-grained Write-Ahead Logging (WAL) control for precise restoration. In this blog, we'll dive deep into the disaster management process, equipping you with real-world commands and expected outputs to diagnose corruption accurately and recover effectively, whether you’re restoring from a robust backup or salvaging scraps from a crippled cluster with no safety net. With the right approach, you can turn panic into a plan and restore order to your database.

Step 1: Detecting Corruption in PostgreSQL

Corruption usually doesn’t introduce itself politely; it sneaks in through failed queries, panicked logs, or startup errors. Identifying corruption is the first step towards resolving it.

Check the PostgreSQL Log Files

Start by inspecting the log files, typically found in /var/log/postgresql/ or $PGDATA/pg_log. Each log entry's header indicates its severity level; for example:

Shell
 
ERROR:  could not read block 128 in file "base/16384/12571": read only 0 of 8192 bytes  
PANIC:  invalid page in block 42 of relation base/16384/16728  
LOG:    startup process (PID 1234) was terminated by signal 6: Aborted  
FATAL:  the database system is in recovery mode


Severity levels indicate:

  • ERROR: Read failure, possibly a disk-level issue.
  • PANIC: Serious corruption; PostgreSQL crashed to prevent further damage.
  • FATAL: The server is trying to recover from an unclean shutdown.


Best Practice: Excessive logging can waste storage space, but PostgreSQL's fine-grained logging controls let you choose which events are recorded so the logs carry the information you need to manage your system. Review the Postgres documentation and confirm that your logging settings fit your workload.
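
As a hedged example, settings like the following in postgresql.conf (the values here are illustrative, not recommendations) keep the server log informative without overwhelming storage:

Shell
 
# Illustrative logging settings; adjust for your environment
logging_collector = on               # write server messages to log files
log_directory = 'log'                # relative to the data directory
log_min_messages = warning           # record WARNING and above
log_line_prefix = '%m [%p] %u@%d '   # timestamp, PID, user@database
log_checkpoints = on                 # useful context after crashes and recovery
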

pg_amcheck

pg_amcheck is a powerful core utility included with PostgreSQL 17. It allows you to verify the physical integrity of heap tables and index structures without locking them, making it safe to run against live environments. It helps detect data corruption caused by storage failures or file-level inconsistencies by scanning data blocks and index pages for structural errors. To invoke pg_amcheck, use the command:

Shell
 
pg_amcheck -d mydb --all


If your database is healthy, pg_amcheck returns:

Shell
 
No failures detected in database "mydb"


If pg_amcheck detects corruption:

Shell
 
heap table "public.orders": block 128 is not valid  
btree index "public.orders_idx": block 22 contains invalid tuple


Best Practice: Include the --heapallindexed or --rootdescend flags when you invoke pg_amcheck for deeper validation.
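
For example, a deeper (and slower) pass over the same database might look like this, reusing the mydb name from the earlier examples:

Shell
 
pg_amcheck -d mydb --all --heapallindexed --rootdescend
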

Optional: Verify Your Checksums

If you specified the --data-checksums flag when you initialized your PostgreSQL cluster (when running initdb), you can use the pg_checksums tool to detect low-level, file-based corruption across data blocks. pg_checksums provides an integrity safeguard, allowing PostgreSQL to verify whether the data on the disk has been altered unexpectedly due to bit rot, disk failures, or faulty hardware. 

pg_checksums must be run while the PostgreSQL server is stopped and will only work if checksums were enabled during cluster initialization.  Running pg_checksums is especially important after unclean shutdowns, system crashes, or if you suspect disk issues.

A clean report indicates that your data blocks are intact, while checksum failures identify specific block-level corruption in table or index files. SQL queries can map these file identifiers (like base/16384/12571) to table names.
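
As a rough sketch, you can resolve a path like base/16384/12571 from inside the affected database (assumed here to be mydb, the database whose OID is 16384); pg_filenode_relation() takes the tablespace OID (0 for the default tablespace) and the filenode:

Shell
 
# Map the filenode 12571 from the earlier error back to a relation name
psql -d mydb -c "SELECT pg_filenode_relation(0, 12571);"
# Or search pg_class directly for the same filenode
psql -d mydb -c "SELECT oid::regclass FROM pg_class WHERE relfilenode = 12571;"
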

The tool doesn’t fix anything; it simply reports which blocks are damaged, allowing you to take the appropriate steps to recover (e.g., restore from backup, isolate affected tables, or investigate hardware issues). Always consider enabling checksums in production environments for better observability and earlier corruption detection.

Shell
 
sudo systemctl stop postgresql
pg_checksums -c -D /var/lib/pgsql/data
Checksum verification failed in file "base/16384/12571", block 128


Best Practice: Enable checksum verification when you initialize each new cluster. To enable data checksums, include the --data-checksums option when you invoke initdb:

Shell
 
initdb --data-checksums -D /var/lib/pgsql/data  

Step 2: Stop PostgreSQL Immediately

When you find data corruption, you should prevent further damage by halting the service:

Shell
 
sudo systemctl stop postgresql
sudo systemctl status postgresql
  postgresql.service - PostgreSQL RDBMS  
     Active: inactive (dead)


Stopping the service prevents PostgreSQL from continuing to write WAL and data files, which could otherwise worsen the damage and data loss.
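
If you want extra assurance that nothing is still running against the data directory, a quick status check (using the same data directory path as the examples below) is enough:

Shell
 
pg_ctl status -D /var/lib/pgsql/data
pg_ctl: no server running
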

Step 3: Restore from a Known Good Backup

pgBackRest is a robust and efficient backup and restore solution for PostgreSQL that supports full, differential, and incremental backups, with compression, encryption, parallel processing, and offsite storage (for example, S3). pgBackRest is designed to handle large-scale environments with high performance and minimal impact on the database server. pgBackRest also simplifies disaster recovery by offering automated restore processes, archive management, and point-in-time recovery (PITR) capabilities. 
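
Before restoring, it is worth confirming what the repository actually holds; assuming your stanza is named main, as in the restore example below, pgBackRest can list the available full, differential, and incremental backups:

Shell
 
pgbackrest --stanza=main info
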

Clean and Restore the Cluster with pgBackRest

Before you restore, take a backup of the corrupted (old) data directory:

Shell
 
cp -rf /var/lib/pgsql/data /var/lib/pgsql/data_backup


After confirming that the backup is saved, wipe the old data directory:

Shell
 
rm -rf /var/lib/pgsql/data/*


Then, restore from your last known good backup:

Shell
 
pgbackrest --stanza=main restore --db-path=/var/lib/pgsql/data
INFO: restore command begin  
INFO: restored file base/16384/12571 (16MB, 50%) checksum verified  
INFO: restore command end: completed successfully


Then, correct ownership: after restoring your database, ensure the data directory is owned by the PostgreSQL user:

Shell
 
chown -R postgres:postgres /var/lib/pgsql/data


Step 4: Use Point-in-Time Recovery (PITR)

A backup strategy that supports point-in-time recovery lets you stop the restore just before the corruption occurred.
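
If you are unsure exactly when the corruption appeared, pg_waldump can help you choose a recovery_target_time by showing the commit timestamps recorded in your archived WAL; the segment name below is purely illustrative:

Shell
 
# List transaction commit records (and their timestamps) in an archived WAL segment
pg_waldump /mnt/backup/wal/000000010000000000000005 | grep COMMIT
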

Configure Recovery

Add the following settings to your postgresql.conf file:

Shell
 
restore_command = 'cp /mnt/backup/wal/%f %p'
recovery_target_time = '2025-03-25 13:59:00 UTC'


Create the recovery trigger:

Shell
 
touch /var/lib/pgsql/data/recovery.signal


When you start PostgreSQL, you can watch the server recover to the point in time that you specified in the recovery_target_time parameter:

Shell
 
sudo systemctl start postgresql
LOG: starting point-in-time recovery to "2025-03-25 13:59:00 UTC"  
LOG: restored log file "000000010000000000000005" from archive  
LOG: consistent recovery state reached  
LOG: recovery stopping before commit of transaction 123  
LOG: database system is ready to accept connections


Best Practice: Using a backup strategy that supports point-in-time recovery allows you to return to a clean state, just before corruption.

Step 5: Salvage What You Can

If you don’t have a backup but some tables still work, you can use pg_dump and other Postgres tools to extract what you can.

First, use pg_dump to save the definitions of any readable tables and their data:

Shell
 
pg_dump -t customers mydb > customers.sql


To confirm that the table is fully readable, you can also run a quick row count:

SQL
 
SELECT count(*) FROM customers;


Then, create a new cluster:

Shell
 
initdb -D /var/lib/pgsql/new_data --data-checksums
pg_ctl -D /var/lib/pgsql/new_data -l logfile start


Then, restore the salvaged data into your new cluster:

Shell
 
createdb -h /var/lib/pgsql/new_data newdb
psql -d newdb < customers.sql


Best Practice: Maintain a dependable backup strategy for any data that you can't afford to lose.  In a crisis, you can use these steps to restore salvaged data, but the restoration may not be complete, and you will still need to manually review and recreate schema objects that may have been damaged. These steps will leave you with a partial recovery in a clean environment. 
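
If the damaged cluster will still accept connections, role and tablespace definitions can often be salvaged the same way; this is a sketch under that assumption, reusing the socket directory from the earlier createdb example:

Shell
 
pg_dumpall --globals-only > globals.sql                      # from the damaged cluster, if it still starts
psql -h /var/lib/pgsql/new_data -d postgres -f globals.sql   # replay roles and tablespaces into the new cluster
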

Step 6: Use pg_resetwal as the Last Resort

pg_resetwal is a low-level PostgreSQL utility that forcibly resets a database cluster's write-ahead log (WAL); it is typically a last resort when the server cannot start due to missing or corrupted WAL files. This tool should be used cautiously, as it bypasses normal crash recovery and may lead to data inconsistency or loss of recent transactions. It is only safe to run when you are sure the data files are in a consistent state or when you’re attempting to salvage partial data from a failed system. 

Only use this tool if all else fails. It resets WAL records, risking transaction loss and corruption.

Shell
 
pg_resetwal -f /var/lib/pgsql/data
Write-ahead log reset
sudo systemctl start postgresql
LOG: database system is ready to accept connections

Note: Data added since the last checkpoint may be lost; you should proceed only after consulting experts.
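
Before committing to a full reset, a dry run shows the values pg_resetwal would write without modifying anything, which is a useful sanity check:

Shell
 
pg_resetwal -n /var/lib/pgsql/data
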

Step 7: Prevent Future Corruption

Don’t let this happen again. PostgreSQL 17 gives you excellent tools to stay protected. In summary, the best practices that can help you recover from a disaster are:

Enable checksums when you initialize your cluster.

Shell
 
initdb --data-checksums -D /var/lib/pgsql/data


Automate backups with pgBackRest.

Shell
 
pgbackrest --stanza=main --type=full backup  
pgbackrest --stanza=main --type=incr backup


Run regular integrity checks with pg_amcheck.

Shell
 
pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +%F).log


Create a simple cron job to run pg_amcheck with the command:

Shell
 
0 2 * * * pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +\%F).log 2>&1


Step 8: Embrace High Availability and WAL Archiving

If you have configured replication for high availability and maintain standby nodes, you can promote a replica if the primary fails:

Shell
 
pg_ctl promote -D /var/lib/pgsql/standby
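
After promotion, you can confirm that the node has left recovery; pg_is_in_recovery() should now return false:

Shell
 
psql -c "SELECT pg_is_in_recovery();"
 pg_is_in_recovery 
-------------------
 f
(1 row)
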


Ensure that you have configured WAL Archiving for PITR; in your postgresql.conf file, set:

Shell
 
archive_mode = on 
archive_command = 'cp %p /mnt/wal_archive/%f'
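
To verify that archiving is keeping up, the pg_stat_archiver view reports the last archived WAL segment and any failures:

Shell
 
psql -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"
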

Conclusion

Disaster recovery in PostgreSQL demands quick action and careful planning, and PostgreSQL 17 significantly strengthens your ability to respond. You can handle even the most critical failures with integrated tools like pg_amcheck for live corruption detection, pgBackRest for reliable backups and PITR, and pg_resetwal for last-resort recovery. Whether restoring from a clean backup, recovering to a point just before the disaster, or salvaging data from a damaged cluster, this post walks you through every step with actual commands and practical advice. 

Remember that recovery doesn’t start when something breaks; it begins with preparation. Make it a goal to turn your PostgreSQL database into a resilient, self-defending system by enabling data checksums, automating backups, monitoring for corruption, and setting up high availability with WAL archiving. Disaster may strike, but with the right tools and approach, recovery is well within reach.


Published at DZone with permission of Dr. Ibrar Ahmed. See the original article here.

Opinions expressed by DZone contributors are their own.
