# PostgreSQL production-operations guide (v18): the 5 principles of don't break, don't stop, don't clog, protect, and evolve

> A systematic guide to safely operating PostgreSQL in production. From backup and PITR (don't break), replication/HA and zero-downtime changes (don't stop), connection pooling and monitoring (don't clog), role/TLS/SCRAM security (protect), to major upgrades (evolve), it explains with real code and operational procedures faithful to the official docs (v18). Also the decision axes between managed (RDS/Aurora/Cloud SQL) and self-hosting.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: PostgreSQL, アーキテクチャ設計, パフォーマンス
- URL: https://tomodahinata.com/en/blog/postgresql-production-operations-guide
- Category: PostgreSQL operations & reliability

## Key points

- Production operation is answering five questions: ① don't break (backup/PITR) ② don't stop (HA, zero-downtime changes) ③ don't clog (connection pooling, monitoring) ④ protect (least privilege, TLS, SCRAM) ⑤ evolve (non-stop upgrade).
- For backups, 'it's taken' isn't justice; 'it can be restored' is. A nightly pg_dump loses up to 24 hours. With continuous archiving (WAL), you can restore to any point (PITR).
- The PostgreSQL core doesn't provide automatic failover (officially stated). Build detection, promotion, and IP switching with an external tool like Patroni. Managed services do this for you.
- One process is forked per connection. Serverless connecting directly goes down from a connection storm. A transaction-mode pooler (PgBouncer/RDS Proxy) is essential.
- Production schema changes can stop even SELECT with an ACCESS EXCLUSIVE lock. Make it non-stop with lock_timeout + retry, NOT VALID → VALIDATE, and CONCURRENTLY. PG18 adds async I/O, faster upgrades (--swap), and OAuth auth.

---

Raising a database's performance ([the performance-tuning overview](/blog/postgresql-performance-tuning-production-guide)) and **continuing to safely operate a database in production** are different things. What's asked in the latter is not "speed" but — **can you restore it when it breaks, does it stay up when it goes down, does it not clog when congested, can you protect it when attacked, and can you evolve it without stopping.**

This article is the **whole map** for operating PostgreSQL in production. It rearranges backup, HA, connections, monitoring, security, and upgrades **on the premise that incidents happen.** It leaves the specifics to individual articles and concentrates on the **decisions and priorities** of operations. It's useful both for buyers discerning "is it OK to entrust Postgres" in contract development, and for engineers taking over operations.

> **Rules for this article**: specs, defaults, and PostgreSQL 18 new features are based on the **PostgreSQL 18 official documentation (as of June 2026).** Vendor-specific matters for PgBouncer, RDS, etc., are noted as such. **Managed services do much of the operation for you**, so some of the manual work described here becomes unnecessary (the boundary is shown in the text too).

---

## 1. The 5 questions of production operation

Operations design is 80% complete if you can answer the following five. From the top, in order of "fatal to lose."

| # | Question | If it fails | The weapon | Deep dive |
| --- | --- | --- | --- | --- |
| 1 | **Don't break** | Data loss (= business-ending) | Backup, PITR | [§2](#) / [backup article](/blog/postgresql-backup-pitr-pg-dump-wal-archiving-guide) |
| 2 | **Don't stop (on failure)** | Downtime, SPOF | Replication, HA | [HA article](/blog/postgresql-streaming-replication-high-availability-failover-guide) |
| 3 | **Don't stop (on change)** | Total outage on deploy | Zero-downtime DDL | [non-stop DDL article](/blog/postgresql-zero-downtime-schema-migration-lock-safe-ddl-guide) |
| 4 | **Don't clog** | Connection exhaustion, slowdown | Connection pooling, monitoring | [connection-pooling article](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide) |
| 5 | **Protect** | Information leak, unauthorized access | Least privilege, TLS, SCRAM | [security article](/blog/postgresql-security-hardening-roles-privileges-ssl-scram-guide) |
| + | **Evolve** | Pickled, EOL | Non-stop upgrade | [logical-replication article](/blog/postgresql-logical-replication-cdc-zero-downtime-upgrade-guide) |

---

## 2. Don't break: for backups, "it can be restored" is justice

From the heaviest question. **The purpose of a backup is not "to take it" but "to be able to restore."** An untested restore procedure is the same as nonexistent.

PostgreSQL backups officially come in three methods (details in the [backup & PITR article](/blog/postgresql-backup-pitr-pg-dump-wal-archiving-guide)).

| Method | What it can do | Limit |
| --- | --- | --- |
| **Logical (`pg_dump`)** | A consistent snapshot while running. Portable, selective restore, cross-version | Per DB. Roles/tablespaces need separate `pg_dumpall` |
| **Physical (files)** | Copy the whole cluster | **Server stop required** (or `pg_basebackup`) |
| **Continuous archiving (WAL)** | A base backup + continuous WAL evacuation to **restore to any point (PITR)** | Complex to set up, per cluster |

The decisive difference is **RPO (how much you lose).**

- **A nightly `pg_dump`** → at worst, lose everything since the last dump (up to 24 hours).
- **Continuous archiving** → officially "can **restore to any point** since the base backup." Since WAL is continuously evacuated, data loss can be made **nearly zero.**

> **Implication for buyers/operators**: "only a nightly dump" is dangerous for a production core DB. "Don't break" holds only on the premise of **continuous archiving (or managed PITR)**, plus "**regularly conducting recovery drills.**" Managed (RDS/Aurora/Cloud SQL) automates snapshot + WAL archiving and provides PITR without you hand-operating `archive_command` or `pg_basebackup` (a vendor feature).

---

## 3. Don't stop (on failure): replication and HA

Servers break. Disks die, AZs go down. **Not creating a SPOF (single point of failure)** is the core of "don't stop."

PostgreSQL's main HA is **physical streaming replication** — flow the primary's WAL to a standby and keep a nearly-latest replica waiting (details in the [HA article](/blog/postgresql-streaming-replication-high-availability-failover-guide)). You can choose sync/async and offload reads to read replicas.

Here's **the most important fact** (officially stated):

> PostgreSQL does **not provide the system software** to detect a primary failure and notify the standby.

In other words, the PostgreSQL core **doesn't have built-in automatic failover.** Failure detection, promotion (`pg_promote()`), IP reassignment, and **split-brain prevention (STONITH)** need to be **built with an external tool** like Patroni.

> **This is managed's biggest value.** RDS/Aurora Multi-AZ does this "detection + automatic failover + endpoint switching" for you. For self-operation, you need to estimate the build/operation cost of Patroni, etc., and **this one point alone can be a reason to choose managed.**

Sync/async is a **trade-off between durability and latency.** `synchronous_commit`'s default is `on`, and if `synchronous_standby_names` is empty it's **async** (commits don't wait for the standby = fast, but you can lose the latest commits on total primary loss). If the requirement is "can't lose a single record," choose sync and accept the latency increase.

---

## 4. Don't stop (on change): zero-downtime DDL

Not just failures — **your own deploys** also stop production. Many `ALTER TABLE`s take an **`ACCESS EXCLUSIVE` lock** and, per the official definition, **block even SELECT.**

The danger is the "lock-wait chain." If a migration enters lock-wait behind a long-running query, **all new queries pile up behind it**, becoming a de facto total outage. Countermeasures (details in the [non-stop DDL article](/blog/postgresql-zero-downtime-schema-migration-lock-safe-ddl-guide)):

```sql
-- マイグレーションは「世界を止める前に諦める」。短い lock_timeout で即失敗させ、リトライする
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN note text;   -- 定数デフォルトなら PG11+ で書き換え不要＝高速
```

- **Adding a column**: a constant default doesn't rewrite the table and is fast. A `volatile` default rewrites the whole thing (avoid it).
- **NOT NULL / constraints**: two stages of `ADD CONSTRAINT ... NOT VALID` (commits immediately) → `VALIDATE CONSTRAINT` (verifies with a weak lock).
- **Indexes**: `CREATE INDEX CONCURRENTLY` (doesn't stop writes).
- PostgreSQL 18 made it **possible to attach `NOT VALID` to a `NOT NULL` constraint**, making non-stop NOT NULL addition more natural.

---

## 5. Don't clog: connection pooling and monitoring

### A connection is a "process"

PostgreSQL **forks one OS process** per connection (the official architecture). `max_connections`'s default is **100.** Raising it isn't the solution — each backend consumes memory, and `work_mem` is **multiplied per backend** too.

The right answer is **connection pooling.** Place a pooler (PgBouncer, etc.) between app↔DB and **reuse a small number of physical connections.** In particular, **serverless (Lambda/edge) must not connect directly** — on a spike it breaks `max_connections` from a connection storm and goes down. **A transaction-mode pooler (PgBouncer / RDS Proxy / Supabase Supavisor) is essential** (details and caveats in the [connection-pooling article](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide)).

### Monitor by "symptom"

At minimum, watch this constantly.

```sql
-- 詰まりの兆候：長時間 active / idle in transaction / ロック待ち
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age, substring(query,1,50) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start;
```

- **`pg_stat_statements`**: identify heavy queries ([overview §2](/blog/postgresql-performance-tuning-production-guide)).
- **`pg_stat_user_tables`**: bloat (`n_dead_tup`) and autovacuum effectiveness ([MVCC/VACUUM article](/blog/postgresql-mvcc-transaction-isolation-vacuum-autovacuum-guide)).
- **Replication lag**: `pg_stat_replication`'s `replay_lag`.
- **Connection usage**: the number used against `max_connections`.
- **Wraparound risk**: `age(datfrozenxid)` ([MVCC/VACUUM article §7](/blog/postgresql-mvcc-transaction-isolation-vacuum-autovacuum-guide)).

---

## 6. Protect: least privilege, network, encryption

Security starts from "don't trust the client" (details in the [security-hardening article](/blog/postgresql-security-hardening-roles-privileges-ssl-scram-guide)). Just the essentials.

- **Don't run the app as a superuser.** Officially, a superuser "**bypasses all permission checks.**" Create an app-dedicated **least-privilege role** (only the needed DML).
- **`pg_hba.conf` is decided by the first match from the top** (official). Use `scram-sha-256`, force TLS with `hostssl`, and narrow by source IP, DB, and user. `trust` is strictly forbidden.
- **Passwords with SCRAM.** `password_encryption`'s default is `scram-sha-256`. **MD5 is deprecated in PG18** (`CREATE/ALTER ROLE` warns, with future removal planned).
- **The client uses `sslmode=verify-full`.** `require` encrypts but **doesn't verify the certificate** = can't prevent a man-in-the-middle attack.
- PostgreSQL 18 adds **OAuth 2.0 authentication** (the `oauth` method in `pg_hba.conf`).

---

## 7. Evolve: upgrade without stopping

PostgreSQL majors are **released yearly, and each major reaches EOL in about 5 years.** Pickling is a vulnerability and EOL risk. Upgrades are two choices (details in the [logical-replication & upgrade article](/blog/postgresql-logical-replication-cdc-zero-downtime-upgrade-guide)).

- **`pg_upgrade` (in-place)**: a short stop suffices. PG18 shortens it further with **parallel checks (`--jobs`)**, **`--swap` (directory swap for the fastest)**, and statistics carry-over.
- **Logical replication (minimal stop)**: logically replicate to a new-version instance and switch over once it catches up. Officially "**a few seconds of downtime**" suffices. But **DDL and sequences aren't replicated**, so you need to manually advance sequences at switchover (this pitfall is detailed in the article).

---

## 8. Managed vs self-hosting: choose by operational load

Managed does much of the "manual work" above for you. The decision axis is "**operational load**" more than performance or cost.

| Operations item | Self-hosting | Managed (RDS/Aurora/Cloud SQL, etc.) |
| --- | --- | --- |
| Backup/PITR | Build `archive_command` / `pg_basebackup` yourself | Auto snapshot + PITR |
| HA/automatic failover | **Build Patroni, etc., yourself** (outside the core) | Automatic with Multi-AZ |
| Minor upgrade/patch | Plan and apply yourself | Done in a maintenance window |
| Monitoring | Yourself (Prometheus, etc.) | Integrated into CloudWatch, etc. |
| Tuning freedom | Maximum (extensions, OS settings free) | Some restrictions (allowlist, parameter groups) |
| Cost | Server actual cost (ops labor separate) | Premium (operations included) |

> **Decision guideline**: with a small team and thin ops staff, it's close to **"managed, the only choice."** The labor to correctly build automatic failover and PITR yourself is especially large. Conversely, if special extensions, OS-level control, or extreme cost optimization is a requirement, self-host. **Much of "I want to entrust Postgres" commissioning is the need to outsource this operational load.**

---

## 9. Production-operations checklist

Before release, can you answer these questions?

1. **Can you restore**: is continuous archiving/PITR enabled, and have you **conducted a recovery drill.**
2. **Doesn't stop on failure**: is there a standby, and is the failover procedure (auto/manual) established.
3. **Doesn't stop on deploy**: are migrations written with `lock_timeout` + non-stop patterns.
4. **Doesn't clog**: is there a connection pooler, and is there headroom in `max_connections`.
5. **Is it protected**: is the app a least-privilege role. TLS forced, SCRAM, minimal `pg_hba`.
6. **Can it evolve**: is the major-upgrade procedure (acceptable stop time) decided.
7. **Is it visible**: are heavy queries, bloat, replication lag, connection usage, and wraparound risk monitored.

---

## 10. Summary

- Production operation is the **5 questions** — don't break / don't stop (failure, change) / don't clog / protect / evolve.
- **For backups, "it can be restored" is justice.** Continuous archiving/PITR + recovery drills.
- **The core has no automatic failover.** Supplement with an external tool (Patroni, etc.) or managed.
- **A connection is a process.** Serverless requires a transaction-mode pooler.
- **Change without stopping too.** `lock_timeout`, `NOT VALID`→`VALIDATE`, `CONCURRENTLY`.
- **Managed is a means to outsource operational load.** Strong for small teams.
- **PG18** boosts operations with async I/O, faster upgrades (`--swap`), and OAuth auth.

In this operations series, I go deep on each layer with real code and procedures. First, the heart of incident prevention — [backup & PITR](/blog/postgresql-backup-pitr-pg-dump-wal-archiving-guide) and [connection pooling](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide).

---

### References (PostgreSQL 18 official documentation)

- [Chapter 25. Backup and Restore](https://www.postgresql.org/docs/18/backup.html)
- [Chapter 26. High Availability, Load Balancing, and Replication](https://www.postgresql.org/docs/18/high-availability.html)
- [Chapter 29. Logical Replication](https://www.postgresql.org/docs/18/logical-replication.html)
- [Chapter 22. Database Roles](https://www.postgresql.org/docs/18/user-manag.html)
- [pg_upgrade](https://www.postgresql.org/docs/18/pgupgrade.html)
- [PostgreSQL 18 Released!](https://www.postgresql.org/about/news/postgresql-18-released-3142/)
