Skip to main content
友田 陽大
PostgreSQL operations & reliability
PostgreSQL
アーキテクチャ設計

PostgreSQL backup & PITR in practice (pg_dump / continuous archiving / WAL / Point-in-Time Recovery, v18-ready)

A practical guide to designing PostgreSQL backup and recovery at production quality. Faithful to the official documentation, it explains: choosing among the three methods of logical (pg_dump), physical, and continuous archiving; real pg_dump/pg_restore commands; configuring PITR (recovery to any point in time) with WAL archiving and pg_basebackup and the recovery procedure; PostgreSQL 17's incremental backup; the importance of recovery drills; and managed automatic PITR.

Published
Reading time
9 min read
Author
友田 陽大
Share

The most important thing in a discussion of backup isn't the details of the commands. It's "when it matters, can it really be restored?" A dump you thought was being taken was corrupt, no one knew the recovery procedure, restoring took a whole day — these are all caused by not testing recovery.

This article is a guide for designing PostgreSQL backup and recovery on the premise that "it can be restored." It explains, faithful to the official documentation, from choosing among the three methods to configuring PITR (recovery to any point in time) and the actual recovery procedure. A deep dive on §2 of the production-operations guide — the first area to nail down in operations.

Rules for this article: backup methods, commands, the PITR spec, and PostgreSQL 18/17 features are based on the PostgreSQL 18 official documentation (as of June 2026). The terms RPO/RTO are used as general operations-design vocabulary (not terms from the official docs). Managed (RDS, etc.) automation is noted explicitly as a vendor feature.


1. First, think in metrics: RPO and RTO

Before method selection, articulate the requirements with two metrics.

  • RPO (Recovery Point Objective): how much data loss you tolerate at failure. "At most 15 minutes," "zero," etc.
  • RTO (Recovery Time Objective): how fast you need to recover from a failure. "Within 1 hour," etc.

These two decide the method. With "nightly dump only," RPO is at most 24 hours (you lose everything since the last dump). With "can't lose even a minute," you need continuous archiving or replication. Turn the requirements into numbers before choosing a method — that's the order.


2. The three backup methods

PostgreSQL backup is officially three methods.

MethodWhat it can doWhen it suitsLimit
Logical (pg_dump)Consistent snapshot while running. SQL/portable, selective restore, cross-versionMigrating a single DB, partial restore, small–mid scalePer-DB. Doesn't include roles/tablespaces
Physical (file copy)Copy the data files wholesaleSimple whole-cluster duplicationServer must be stopped. Cluster-level only
Continuous archiving (WAL)Base backup + WAL offload for recovery to any point in time (PITR)Production core, low RPO, large scaleConfiguration is complex, cluster-level

The official docs clearly warn about physical file backups: "To get a usable backup you must stop the database server. A half-measure like merely disallowing all connections won't work." So for a physical backup while running, use the pg_basebackup below or a filesystem snapshot.


3. Logical backup: pg_dump / pg_dumpall

pg_dump is the easiest and most portable method. Per the official docs, "pg_dump's dumps are internally consistent, representing a snapshot at the moment pg_dump started. pg_dump does not block other operations while running." It can be taken safely while running.

# 1) プレーンSQL(小規模・人間が読める)
pg_dump appdb > appdb.sql
psql -X -d appdb < appdb.sql                      # 復元

# 2) カスタム形式(推奨:圧縮・選択復元・並列復元が可能。復元は pg_restore)
pg_dump -Fc appdb > appdb.dump
pg_restore -d appdb appdb.dump

# 3) ディレクトリ形式 + 並列ダンプ/復元(大きいDBを速く)
pg_dump -Fd -j 4 -f appdb.dir appdb
pg_restore -d appdb -j 4 appdb.dir

The most important pitfall: pg_dump targets only a single DB and does not include cluster-wide definitions like roles (users) and tablespaces. Take those separately with pg_dumpall.

# クラスタ全体(ロール・テーブルスペース込み)
pg_dumpall > cluster.sql
# ロール/テーブルスペースだけ(グローバルのみ)— pg_dump と組み合わせる定番
pg_dumpall --globals-only > globals.sql

You thought "pg_dump is perfect," but the restore target had no roles and all the privileges were stripped — a common accident. Operate pg_dumpall -g (globals) + pg_dump -Fc (each DB) as a set.

In PostgreSQL 18, pg_dump/pg_restore gained --statistics (migrating optimizer statistics) and selective options like --no-data/--no-schema/--sequence-data.


4. Continuous archiving and PITR: roll back to any point in time

The main option for low RPO is continuous archiving. The official definition: "This technique enables Point-in-Time Recovery (PITR). The database can be recovered to any point in time after the moment the base backup was taken."

The mechanism is "base backup (the foundation) + ongoing WAL offload (all subsequent changes)." At recovery, restore the foundation, then replay WAL to the target point.

4.1 Configuration: archive the WAL

# postgresql.conf
wal_level = replica           # replica 以上(archive を含む)
archive_mode = on
# 成功時のみ 0 を返し、既存ファイルを上書きしない(安全装置)
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

Two ironclad rules the official docs stress:

  1. "The archive command must return zero if and only if it succeeds." Returning 0 makes PostgreSQL delete/reuse that WAL, so a false success invites WAL loss = unrecoverability.
  2. "It should be designed not to overwrite an existing archive file." An important safety feature protecting archive integrity from administrator mistakes (the test ! -f above is that).

In real operations, replace cp with transfer to pgBackRest, WAL-G, or cloud storage (S3, etc.). These handle parallelism, compression, encryption, retention policies, and even restore verification.

4.2 Base backup: pg_basebackup

# 稼働中にベースバックアップ(クラスタ全体)。-X stream で自己完結、-R で復旧設定を生成
pg_basebackup -D /backup/base -Ft -z -X stream -c fast -R
#   -Ft: tar形式 / -z: 圧縮 / -X stream: WALを並列ストリーム同梱 / -c fast: 即時チェックポイント
#   -R: standby.signal と接続設定を書き出す(スタンバイ構築にも流用)

Official: "pg_basebackup takes a base backup of a running cluster. It can be run without affecting other clients."

4.3 Recovery: replay to the target point

At failure, expand the base backup, write the recovery configuration, and start; PostgreSQL replays WAL and rolls back to the target point.

# 復旧先の postgresql.conf(または postgresql.auto.conf)
restore_command = 'cp /mnt/archive/%f %p'       # アーカイブからWALを取り出す
recovery_target_time = '2026-06-24 09:41:00+09' # ここまで再生(誤DELETEの直前など)
recovery_target_action = 'promote'              # 到達したら通常稼働へ昇格
# データディレクトリに recovery.signal を置くと「目標復旧→昇格」モードで起動する
touch /var/lib/postgresql/18/main/recovery.signal
pg_ctl start

Official key points:

  • "The one you absolutely must specify is restore_command" — how to retrieve WAL from the archive.
  • The recovery target is one of recovery_target_time / _lsn / _xid / _name (the practical ones are time and a named restore point).
  • With recovery.signal it's "recover to the target and promote"; with standby.signal it starts as a standby (the HA article). The signal file is auto-deleted on completion.
  • The stop point must be after the base backup's end time (you can't roll back to a point during the backup).
  • The backup_label file is "not mere information but essential to the correct operation of the recovery process" — don't delete it.

Rescue from a mistake: PITR's true value is being able to "roll back to just before an erroneous DELETE/DROP." Set recovery_target_time to just before the incident and you can unwind a human error. This is a feat a simple snapshot can't do.


5. Incremental backup (PostgreSQL 17+)

PostgreSQL 17 introduced incremental backup (note it's not a PG18 feature). It takes only the diff since the last backup, reducing capacity and time.

# フル → 増分(前回のマニフェストを参照)
pg_basebackup -D /backup/full -X stream
pg_basebackup -D /backup/incr1 --incremental=/backup/full/backup_manifest -X stream

# 増分は単独では使えない。pg_combinebackup でフルと合成してから復元する
pg_combinebackup /backup/full /backup/incr1 -o /backup/restored

Official: "An incremental backup can't be used directly. You must first combine it with the earlier backups it depends on, using pg_combinebackup." PostgreSQL 18 added -k/--link (hard links) to pg_combinebackup and tar-format verification support to pg_verifybackup.


6. Guarantee "it can be restored": automate the recovery drill

The only correct verification of a backup strategy is to actually try restoring. Build this into CI or a periodic job.

#!/usr/bin/env bash
# 定期実行する復旧訓練(最新バックアップを隔離環境に戻し、健全性を検証)
set -euo pipefail

RESTORE_DIR=$(mktemp -d)
trap 'rm -rf "$RESTORE_DIR"' EXIT          # 後始末(冪等・再実行安全)

# 1) 最新のフル+増分を合成して復元
pg_combinebackup /backup/full /backup/incr-latest -o "$RESTORE_DIR/data"

# 2) 隔離ポートで起動し、復旧完了を待つ
pg_ctl -D "$RESTORE_DIR/data" -o "-p 5499" start
until pg_isready -p 5499; do sleep 1; done

# 3) スモークチェック:行数や重要テーブルの存在を検証(戻ったことの証拠)
psql -p 5499 -d appdb -At -c "SELECT count(*) FROM orders" | grep -Eq '^[0-9]+$'

# 4) 停止(健全なら成功。失敗すれば非0で落ち、アラートが鳴る)
pg_ctl -D "$RESTORE_DIR/data" stop
echo "restore drill OK"

Observability: put the success/failure of this drill job on monitoring and alert immediately, treating failure = the backup is broken. Confirming daily not "is the backup being taken" but "can it be restored" is real reliability.


7. The managed case (RDS/Aurora/Cloud SQL)

Managed services automate the continuous archiving and PITR up to here (vendor feature, outside the official PostgreSQL docs). With automatic snapshots + ongoing WAL offload, "restore to a specific timestamp" is possible from the console or API. There's no need to operate archive_command or pg_basebackup by hand.

That said, a logical backup (pg_dump) is worth considering separately. Managed snapshots center on cluster-level physical restore, and a logical dump is effective for "restoring just a specific table to a separate environment," "cross-version migration," and "external storage independent of a managed failure." As defense in depth, combining managed PITR + a periodic pg_dump stored externally is solid.


8. Conclusion

  • "Can be restored" is the virtue. Turn RPO/RTO into numbers before choosing a method, and automate the recovery drill.
  • Three methods: logical (pg_dump, while running, portable, per-DB) / physical (must stop) / continuous archiving (PITR).
  • Low RPO is continuous archiving. archive_command returns 0 only on success and forbids overwrite; take the base with pg_basebackup; recovery is restore_command + recovery_target_time + recovery.signal.
  • pg_dump doesn't include roles/tablespaces — use pg_dumpall -g alongside it.
  • Incremental backup is PG17+ (--incremental + pg_combinebackup).
  • Managed automates PITR, but use a logical dump stored externally for defense in depth.

Once "don't break it" is nailed down, next is "don't stop even when it falls" — go to streaming replication and high availability.


References (PostgreSQL 18 official documentation)

友田

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.

Got a challenge?

From design to implementation and operations — solo × generative AI

Implementation like this article's, end to end from requirements to production. Start with a free 30-minute technical consult and tell me about your situation.

Available for both project-based (contract) and advisory engagements. Start with a free 30-minute consult.

Also worth reading