PostgreSQL production-operations guide (v18): the 5 principles of don't break, don't stop, don't clog, protect, and evolve

Raising a database's performance (the performance-tuning overview) and continuing to safely operate a database in production are different things. What's asked in the latter is not "speed" but — can you restore it when it breaks, does it stay up when it goes down, does it not clog when congested, can you protect it when attacked, and can you evolve it without stopping.

This article is the whole map for operating PostgreSQL in production. It rearranges backup, HA, connections, monitoring, security, and upgrades on the premise that incidents happen. It leaves the specifics to individual articles and concentrates on the decisions and priorities of operations. It's useful both for buyers discerning "is it OK to entrust Postgres" in contract development, and for engineers taking over operations.

Rules for this article: specs, defaults, and PostgreSQL 18 new features are based on the PostgreSQL 18 official documentation (as of June 2026). Vendor-specific matters for PgBouncer, RDS, etc., are noted as such. Managed services do much of the operation for you, so some of the manual work described here becomes unnecessary (the boundary is shown in the text too).

1. The 5 questions of production operation

Operations design is 80% complete if you can answer the following five. From the top, in order of "fatal to lose."

#	Question	If it fails	The weapon	Deep dive
1	Don't break	Data loss (= business-ending)	Backup, PITR	§2 / backup article
2	Don't stop (on failure)	Downtime, SPOF	Replication, HA	HA article
3	Don't stop (on change)	Total outage on deploy	Zero-downtime DDL	non-stop DDL article
4	Don't clog	Connection exhaustion, slowdown	Connection pooling, monitoring	connection-pooling article
5	Protect	Information leak, unauthorized access	Least privilege, TLS, SCRAM	security article
+	Evolve	Pickled, EOL	Non-stop upgrade	logical-replication article

2. Don't break: for backups, "it can be restored" is justice

From the heaviest question. The purpose of a backup is not "to take it" but "to be able to restore." An untested restore procedure is the same as nonexistent.

PostgreSQL backups officially come in three methods (details in the backup & PITR article).

Method	What it can do	Limit
Logical (`pg_dump`)	A consistent snapshot while running. Portable, selective restore, cross-version	Per DB. Roles/tablespaces need separate `pg_dumpall`
Physical (files)	Copy the whole cluster	Server stop required (or `pg_basebackup`)
Continuous archiving (WAL)	A base backup + continuous WAL evacuation to restore to any point (PITR)	Complex to set up, per cluster

The decisive difference is RPO (how much you lose).

A nightly pg_dump → at worst, lose everything since the last dump (up to 24 hours).
Continuous archiving → officially "can restore to any point since the base backup." Since WAL is continuously evacuated, data loss can be made nearly zero.

Implication for buyers/operators: "only a nightly dump" is dangerous for a production core DB. "Don't break" holds only on the premise of continuous archiving (or managed PITR), plus "regularly conducting recovery drills." Managed (RDS/Aurora/Cloud SQL) automates snapshot + WAL archiving and provides PITR without you hand-operating archive_command or pg_basebackup (a vendor feature).

3. Don't stop (on failure): replication and HA

Servers break. Disks die, AZs go down. Not creating a SPOF (single point of failure) is the core of "don't stop."

PostgreSQL's main HA is physical streaming replication — flow the primary's WAL to a standby and keep a nearly-latest replica waiting (details in the HA article). You can choose sync/async and offload reads to read replicas.

Here's the most important fact (officially stated):

PostgreSQL does not provide the system software to detect a primary failure and notify the standby.

In other words, the PostgreSQL core doesn't have built-in automatic failover. Failure detection, promotion (pg_promote()), IP reassignment, and split-brain prevention (STONITH) need to be built with an external tool like Patroni.

This is managed's biggest value. RDS/Aurora Multi-AZ does this "detection + automatic failover + endpoint switching" for you. For self-operation, you need to estimate the build/operation cost of Patroni, etc., and this one point alone can be a reason to choose managed.

Sync/async is a trade-off between durability and latency. synchronous_commit's default is on, and if synchronous_standby_names is empty it's async (commits don't wait for the standby = fast, but you can lose the latest commits on total primary loss). If the requirement is "can't lose a single record," choose sync and accept the latency increase.

4. Don't stop (on change): zero-downtime DDL

Not just failures — your own deploys also stop production. Many ALTER TABLEs take an ACCESS EXCLUSIVE lock and, per the official definition, block even SELECT.

The danger is the "lock-wait chain." If a migration enters lock-wait behind a long-running query, all new queries pile up behind it, becoming a de facto total outage. Countermeasures (details in the non-stop DDL article):

-- マイグレーションは「世界を止める前に諦める」。短い lock_timeout で即失敗させ、リトライする
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN note text;   -- 定数デフォルトなら PG11+ で書き換え不要＝高速

Adding a column: a constant default doesn't rewrite the table and is fast. A volatile default rewrites the whole thing (avoid it).
NOT NULL / constraints: two stages of ADD CONSTRAINT ... NOT VALID (commits immediately) → VALIDATE CONSTRAINT (verifies with a weak lock).
Indexes: CREATE INDEX CONCURRENTLY (doesn't stop writes).
PostgreSQL 18 made it possible to attach NOT VALID to a NOT NULL constraint, making non-stop NOT NULL addition more natural.

5. Don't clog: connection pooling and monitoring

A connection is a "process"

PostgreSQL forks one OS process per connection (the official architecture). max_connections's default is 100. Raising it isn't the solution — each backend consumes memory, and work_mem is multiplied per backend too.

The right answer is connection pooling. Place a pooler (PgBouncer, etc.) between app↔DB and reuse a small number of physical connections. In particular, serverless (Lambda/edge) must not connect directly — on a spike it breaks max_connections from a connection storm and goes down. A transaction-mode pooler (PgBouncer / RDS Proxy / Supabase Supavisor) is essential (details and caveats in the connection-pooling article).

Monitor by "symptom"

At minimum, watch this constantly.

-- 詰まりの兆候：長時間 active / idle in transaction / ロック待ち
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age, substring(query,1,50) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start;

pg_stat_statements: identify heavy queries (overview §2).
pg_stat_user_tables: bloat (n_dead_tup) and autovacuum effectiveness (MVCC/VACUUM article).
Replication lag: pg_stat_replication's replay_lag.
Connection usage: the number used against max_connections.
Wraparound risk: age(datfrozenxid) (MVCC/VACUUM article §7).

6. Protect: least privilege, network, encryption

Security starts from "don't trust the client" (details in the security-hardening article). Just the essentials.

Don't run the app as a superuser. Officially, a superuser "bypasses all permission checks." Create an app-dedicated least-privilege role (only the needed DML).
pg_hba.conf is decided by the first match from the top (official). Use scram-sha-256, force TLS with hostssl, and narrow by source IP, DB, and user. trust is strictly forbidden.
Passwords with SCRAM. password_encryption's default is scram-sha-256. MD5 is deprecated in PG18 (CREATE/ALTER ROLE warns, with future removal planned).
The client uses sslmode=verify-full. require encrypts but doesn't verify the certificate = can't prevent a man-in-the-middle attack.
PostgreSQL 18 adds OAuth 2.0 authentication (the oauth method in pg_hba.conf).

7. Evolve: upgrade without stopping

PostgreSQL majors are released yearly, and each major reaches EOL in about 5 years. Pickling is a vulnerability and EOL risk. Upgrades are two choices (details in the logical-replication & upgrade article).

pg_upgrade (in-place): a short stop suffices. PG18 shortens it further with parallel checks (--jobs), --swap (directory swap for the fastest), and statistics carry-over.
Logical replication (minimal stop): logically replicate to a new-version instance and switch over once it catches up. Officially "a few seconds of downtime" suffices. But DDL and sequences aren't replicated, so you need to manually advance sequences at switchover (this pitfall is detailed in the article).

8. Managed vs self-hosting: choose by operational load

Managed does much of the "manual work" above for you. The decision axis is "operational load" more than performance or cost.

Operations item	Self-hosting	Managed (RDS/Aurora/Cloud SQL, etc.)
Backup/PITR	Build `archive_command` / `pg_basebackup` yourself	Auto snapshot + PITR
HA/automatic failover	Build Patroni, etc., yourself (outside the core)	Automatic with Multi-AZ
Minor upgrade/patch	Plan and apply yourself	Done in a maintenance window
Monitoring	Yourself (Prometheus, etc.)	Integrated into CloudWatch, etc.
Tuning freedom	Maximum (extensions, OS settings free)	Some restrictions (allowlist, parameter groups)
Cost	Server actual cost (ops labor separate)	Premium (operations included)

Decision guideline: with a small team and thin ops staff, it's close to "managed, the only choice." The labor to correctly build automatic failover and PITR yourself is especially large. Conversely, if special extensions, OS-level control, or extreme cost optimization is a requirement, self-host. Much of "I want to entrust Postgres" commissioning is the need to outsource this operational load.

9. Production-operations checklist

Before release, can you answer these questions?

Can you restore: is continuous archiving/PITR enabled, and have you conducted a recovery drill.
Doesn't stop on failure: is there a standby, and is the failover procedure (auto/manual) established.
Doesn't stop on deploy: are migrations written with lock_timeout + non-stop patterns.
Doesn't clog: is there a connection pooler, and is there headroom in max_connections.
Is it protected: is the app a least-privilege role. TLS forced, SCRAM, minimal pg_hba.
Can it evolve: is the major-upgrade procedure (acceptable stop time) decided.
Is it visible: are heavy queries, bloat, replication lag, connection usage, and wraparound risk monitored.

10. Summary

Production operation is the 5 questions — don't break / don't stop (failure, change) / don't clog / protect / evolve.
For backups, "it can be restored" is justice. Continuous archiving/PITR + recovery drills.
The core has no automatic failover. Supplement with an external tool (Patroni, etc.) or managed.
A connection is a process. Serverless requires a transaction-mode pooler.
Change without stopping too. lock_timeout, NOT VALID→VALIDATE, CONCURRENTLY.
Managed is a means to outsource operational load. Strong for small teams.
PG18 boosts operations with async I/O, faster upgrades (--swap), and OAuth auth.

In this operations series, I go deep on each layer with real code and procedures. First, the heart of incident prevention — backup & PITR and connection pooling.

PostgreSQL production-operations guide (v18): the 5 principles of don't break, don't stop, don't clog, protect, and evolve

1. The 5 questions of production operation

2. Don't break: for backups, "it can be restored" is justice

3. Don't stop (on failure): replication and HA

4. Don't stop (on change): zero-downtime DDL

5. Don't clog: connection pooling and monitoring

A connection is a "process"

Monitor by "symptom"

6. Protect: least privilege, network, encryption

7. Evolve: upgrade without stopping

8. Managed vs self-hosting: choose by operational load

9. Production-operations checklist

10. Summary

References (PostgreSQL 18 official documentation)

PostgreSQL connection pooling in practice (PgBouncer / RDS Proxy / Supavisor, the transaction-mode traps, serverless support)

PostgreSQL backup & PITR in practice (pg_dump / continuous archiving / WAL / Point-in-Time Recovery, v18-ready)

PostgreSQL streaming replication and high availability (HA, sync/async, read replicas, failover, v18)

PostgreSQL logical replication in practice (publish/subscribe, CDC, cross-version zero-downtime major upgrade, v18)

Also worth reading

PostgreSQL production performance-tuning overview (v18 support): speed it up in the correct order of measure → index → execution plan → memory → VACUUM

Practical PostgreSQL index design (B-tree / GIN / GiST / BRIN, composite column order, covering, partial, expression indexes, CONCURRENTLY)

How to read PostgreSQL EXPLAIN ANALYZE and improve slow queries (reading the plan, the meaning of each node, auto_explain, v18 support)

1. The 5 questions of production operation

2. Don't break: for backups, "it can be restored" is justice

3. Don't stop (on failure): replication and HA

4. Don't stop (on change): zero-downtime DDL

5. Don't clog: connection pooling and monitoring

A connection is a "process"

Monitor by "symptom"

6. Protect: least privilege, network, encryption

7. Evolve: upgrade without stopping

8. Managed vs self-hosting: choose by operational load

9. Production-operations checklist

10. Summary

References (PostgreSQL 18 official documentation)

Related articles

PostgreSQL connection pooling in practice (PgBouncer / RDS Proxy / Supavisor, the transaction-mode traps, serverless support)

PostgreSQL backup & PITR in practice (pg_dump / continuous archiving / WAL / Point-in-Time Recovery, v18-ready)

PostgreSQL streaming replication and high availability (HA, sync/async, read replicas, failover, v18)

PostgreSQL logical replication in practice (publish/subscribe, CDC, cross-version zero-downtime major upgrade, v18)

Also worth reading

PostgreSQL production performance-tuning overview (v18 support): speed it up in the correct order of measure → index → execution plan → memory → VACUUM

Practical PostgreSQL index design (B-tree / GIN / GiST / BRIN, composite column order, covering, partial, expression indexes, CONCURRENTLY)

How to read PostgreSQL EXPLAIN ANALYZE and improve slow queries (reading the plan, the meaning of each node, auto_explain, v18 support)