Servers always break. Disks die, AZs go down, kernels panic. Betting everything on a single PostgreSQL is the same as signing a contract to "pay in downtime" for the failure that will surely come.
The basis of high availability (HA) is replication — always have a nearly-latest replica on another server and switch over when it counts. This article explains how to build HA with PostgreSQL physical streaming replication and the reality of the most-misunderstood "automatic failover," faithful to the official docs. It's a deep dive of production-operations guide §3.
Rules for this article: replication specs, parameter defaults, and PostgreSQL 18 changes are based on the PostgreSQL 18 official documentation (as of June 2026). Failover-automation tools (Patroni, etc.) are within the scope the official docs collectively call "external tools," with specific names noted as community knowledge.
1. HA options: physical vs logical vs shared disk
PostgreSQL HA has multiple approaches (the official comparison).
| Method | Mechanism | Suited for | Constraint |
|---|---|---|---|
| Physical streaming | WAL to the standby byte-by-byte | Whole-cluster HA/DR, read replicas | Same major version, whole-replication only |
| Logical replication | Row changes via publish/subscribe | Table-level, cross-version, bidirectional | Needs conflict resolution, DDL not replicated (separate article) |
| Shared disk | Share a single storage | — | The storage itself is a SPOF, no replication |
This article's main subject is physical streaming replication. The official explanation:
Streaming replication allows a standby server to stay more up-to-date than is possible with file-based log shipping. The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled.
It's "nearly real-time, byte-level replication." Being physical, the standby is block-for-block identical to the primary — the same major version is the premise.
2. Building physical streaming replication
The flow of the minimal setup.
2.1 Primary-side configuration
# primary postgresql.conf
wal_level = replica # replica 以上(既定)
max_wal_senders = 10 # 同時WAL送信プロセス数(既定 10)
max_replication_slots = 10 # レプリケーションスロット数(既定 10)
-- 専用のレプリケーションロール(公式推奨:REPLICATION + LOGIN の最小権限)
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '***';
-- スロットを作っておくと、スタンバイが遅れてもWALを保持してくれる(§5)
SELECT pg_create_physical_replication_slot('standby1');
# primary pg_hba.conf(replication は専用キーワード。TLS強制+SCRAM)
hostssl replication replicator 10.0.0.0/24 scram-sha-256
2.2 Building the standby
# プライマリからベースバックアップを取得し、復旧設定まで書き出す(-R)
pg_basebackup -h primary.internal -U replicator -D /var/lib/postgresql/18/main \
-Ft -z -X stream -c fast -R -S standby1
# -R: standby.signal と primary_conninfo を自動生成 / -S: 上で作ったスロットを使う
The postgresql.auto.conf that -R generates is written like this.
primary_conninfo = 'host=primary.internal user=replicator passfile=... sslmode=verify-full'
primary_slot_name = 'standby1'
And because there's a standby.signal file, on startup it comes up in standby mode and keeps receiving the primary's WAL. Thanks to hot_standby (default on), you can run read queries on the standby (§4).
pg_ctl -D /var/lib/postgresql/18/main start
3. Sync vs async: a durability-vs-latency trade-off
"Whether to wait for the standby's replication before returning a commit" is sync/async. The default is async — officially, "if no synchronous standby name is specified in synchronous_standby_names, synchronous replication is disabled and commits don't wait for replication. This is the default."
Choose the durability level with the synchronous_commit value (default on). The lower you go, the stronger and slower.
| Value | What the commit waits for | What's protected |
|---|---|---|
off | Waits for nothing (not even local flush) | Fastest. Can lose the latest few on a crash (no corruption) |
local | Only local disk flush | Single-node durability |
remote_write | The standby receives and OS-writes the WAL | Survives a standby PostgreSQL crash (an OS crash is separate) |
on (default) | The standby disk-flushes the WAL | Not lost unless the primary + all sync standbys are corrupted simultaneously |
remote_apply | The standby applies (makes visible) the WAL | Immediately reflected in the standby's queries. Slowest |
# 「1件も失えない」要件:同期レプリケーション
synchronous_standby_names = 'standby1' # このスタンバイの確認を待つ
synchronous_commit = on # 受信+flush を待つ
Design decision: if "commit = absolutely never lost" is a requirement, like finance/payments, sync (
onor higher). General web apps are often fine with async (latency-first, accepting a slight data-loss risk).remote_applyis only when you want immediate reflection on the read replica — the commit delay grows large.
4. Read replicas: offload reads (and its price)
A hot standby can take read-only queries. Official: "when hot_standby is true, it accepts connections once recovery reaches a consistent state. All connections are strictly read-only (not even writing to a temp table)." The standard configuration of offloading heavy aggregations/reports to the replica and letting the primary concentrate on writes.
But there's the inherent problem of replication conflicts. Official: "the most common cause of conflict between standby queries and WAL replay is 'early cleanup.' … The primary's cleanup (VACUUM) may delete row versions that a standby's transaction can still see."
The standby is forced to either "delay WAL replay" or "cancel the conflicting query." Adjust with two knobs.
# ノブ1:競合時、どれだけWAL適用を遅らせるか(超えたらクエリをキャンセル)
max_standby_streaming_delay = 30s
# ノブ2:スタンバイの実行中クエリをプライマリにフィードバックし、VACUUMの早期クリーンを抑止
hot_standby_feedback = on # 既定 off
The official warning: hot_standby_feedback = on can eliminate conflicts but "delays cleanup of dead rows on the primary, potentially causing table bloat." It's a trade-off — hot_standby_feedback if you run long analytics queries on the replica, max_standby_streaming_delay if you want to avoid primary bloat — design the acceptable delay.
5. The WAL-retention safety device: slots vs wal_keep_size
If the primary deletes WAL the standby needs while the standby is behind/down, replication breaks. Two mechanisms prevent this.
wal_keep_size(default 0): the minimum size to retain inpg_wal. A standby that falls behind beyond this loses the needed WAL and disconnects. Just a fixed buffer.- Replication slot: reliably retains WAL until the standby consumes it. Safer.
The slot is safer but has a fatal side effect. Official (max_slot_wal_keep_size, default -1): "if -1, a replication slot may retain unlimited WAL." That is, leaving a standby down causes the primary's pg_wal to swell unboundedly and eat the disk.
# スロットは使うが、保持上限を設けて pg_wal の暴走を防ぐ
max_slot_wal_keep_size = 64GB # これを超えたら、遅れたスロットは無効化(複製は切れるがDBは守る)
The iron rule of operations: a slot is "convenient but dangerous." Always set a cap with
max_slot_wal_keep_sizeand monitor the lag inpg_replication_slots. PostgreSQL 18 addedidle_replication_slot_timeout, letting you auto-invalidate slots idle for a long time.
6. Failover: PostgreSQL doesn't do it automatically
This is the most misunderstood point. The operation to "promote" a standby itself is easy.
-- スタンバイをプライマリに昇格(手動フェイルオーバー)
SELECT pg_promote();
-- または: pg_ctl promote -D <datadir>
But who decides "when to promote"? The official docs state clearly.
PostgreSQL does not provide the system software required to detect a primary failure and notify the standby.
That is, automatic failover isn't in the core. Plus the danger of split-brain (both think they're the primary):
A mechanism is needed to tell the old primary "you're no longer the primary." This is known as STONITH (Shoot The Other Node In The Head), and is needed to avoid a situation where both systems think they're the primary, leading to data loss.
The official docs only state that "many such tools exist and are well-integrated with OS features (like IP-address migration)." In practice, you build it with the following external tools (community standard):
- Patroni (integrates with etcd/Consul, the de facto)
- repmgr, pg_auto_failover
This is managed's biggest value (operations guide §8). RDS/Aurora Multi-AZ does this "failure detection + automatic promotion + endpoint switching + STONITH-equivalent" for you. For self-operation, the real cost is building and operating this automatic-failover foundation, more than building replication.
7. Monitoring: watch the lag and the slots
-- プライマリ側:各スタンバイの遅延(write/flush/replay lag)
SELECT application_name, state, sync_state,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- スロットの滞留(restart_lsn が進まない=WALが溜まる)
SELECT slot_name, active, wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
replay_lag continuously increasing = the standby can't keep up (insufficient I/O or replication conflicts). retained_wal keeps increasing = the slot is stuck and pg_wal is swelling — always set a cap and alerts.
8. Summary
- Build HA with physical streaming replication:
wal_level=replica, a dedicated role,pg_basebackup -R, a slot. - Sync/async is a durability-vs-latency trade-off. The default is async. Choose
synchronous_commitby requirement. - Read replicas can offload reads, but watch for replication conflicts (
max_standby_streaming_delay/hot_standby_feedback). - The core has no automatic failover (officially stated). Do detection, promotion, and STONITH with an external tool like Patroni, or managed.
- Slots are safe but can run away — set a cap with
max_slot_wal_keep_sizeand monitor the lag.
Once "don't stop on failure" is settled, next is "don't stop on change" — non-stop schema changes, and cross-version logical replication & zero-downtime upgrade.