Raising a database's performance (the performance-tuning overview) and continuing to safely operate a database in production are different things. What's asked in the latter is not "speed" but — can you restore it when it breaks, does it stay up when it goes down, does it not clog when congested, can you protect it when attacked, and can you evolve it without stopping.
This article is the whole map for operating PostgreSQL in production. It rearranges backup, HA, connections, monitoring, security, and upgrades on the premise that incidents happen. It leaves the specifics to individual articles and concentrates on the decisions and priorities of operations. It's useful both for buyers discerning "is it OK to entrust Postgres" in contract development, and for engineers taking over operations.
Rules for this article: specs, defaults, and PostgreSQL 18 new features are based on the PostgreSQL 18 official documentation (as of June 2026). Vendor-specific matters for PgBouncer, RDS, etc., are noted as such. Managed services do much of the operation for you, so some of the manual work described here becomes unnecessary (the boundary is shown in the text too).
1. The 5 questions of production operation
Operations design is 80% complete if you can answer the following five. From the top, in order of "fatal to lose."
| # | Question | If it fails | The weapon | Deep dive |
|---|---|---|---|---|
| 1 | Don't break | Data loss (= business-ending) | Backup, PITR | §2 / backup article |
| 2 | Don't stop (on failure) | Downtime, SPOF | Replication, HA | HA article |
| 3 | Don't stop (on change) | Total outage on deploy | Zero-downtime DDL | non-stop DDL article |
| 4 | Don't clog | Connection exhaustion, slowdown | Connection pooling, monitoring | connection-pooling article |
| 5 | Protect | Information leak, unauthorized access | Least privilege, TLS, SCRAM | security article |
| + | Evolve | Pickled, EOL | Non-stop upgrade | logical-replication article |
2. Don't break: for backups, "it can be restored" is justice
From the heaviest question. The purpose of a backup is not "to take it" but "to be able to restore." An untested restore procedure is the same as nonexistent.
PostgreSQL backups officially come in three methods (details in the backup & PITR article).
| Method | What it can do | Limit |
|---|---|---|
Logical (pg_dump) | A consistent snapshot while running. Portable, selective restore, cross-version | Per DB. Roles/tablespaces need separate pg_dumpall |
| Physical (files) | Copy the whole cluster | Server stop required (or pg_basebackup) |
| Continuous archiving (WAL) | A base backup + continuous WAL evacuation to restore to any point (PITR) | Complex to set up, per cluster |
The decisive difference is RPO (how much you lose).
- A nightly
pg_dump→ at worst, lose everything since the last dump (up to 24 hours). - Continuous archiving → officially "can restore to any point since the base backup." Since WAL is continuously evacuated, data loss can be made nearly zero.
Implication for buyers/operators: "only a nightly dump" is dangerous for a production core DB. "Don't break" holds only on the premise of continuous archiving (or managed PITR), plus "regularly conducting recovery drills." Managed (RDS/Aurora/Cloud SQL) automates snapshot + WAL archiving and provides PITR without you hand-operating
archive_commandorpg_basebackup(a vendor feature).
3. Don't stop (on failure): replication and HA
Servers break. Disks die, AZs go down. Not creating a SPOF (single point of failure) is the core of "don't stop."
PostgreSQL's main HA is physical streaming replication — flow the primary's WAL to a standby and keep a nearly-latest replica waiting (details in the HA article). You can choose sync/async and offload reads to read replicas.
Here's the most important fact (officially stated):
PostgreSQL does not provide the system software to detect a primary failure and notify the standby.
In other words, the PostgreSQL core doesn't have built-in automatic failover. Failure detection, promotion (pg_promote()), IP reassignment, and split-brain prevention (STONITH) need to be built with an external tool like Patroni.
This is managed's biggest value. RDS/Aurora Multi-AZ does this "detection + automatic failover + endpoint switching" for you. For self-operation, you need to estimate the build/operation cost of Patroni, etc., and this one point alone can be a reason to choose managed.
Sync/async is a trade-off between durability and latency. synchronous_commit's default is on, and if synchronous_standby_names is empty it's async (commits don't wait for the standby = fast, but you can lose the latest commits on total primary loss). If the requirement is "can't lose a single record," choose sync and accept the latency increase.
4. Don't stop (on change): zero-downtime DDL
Not just failures — your own deploys also stop production. Many ALTER TABLEs take an ACCESS EXCLUSIVE lock and, per the official definition, block even SELECT.
The danger is the "lock-wait chain." If a migration enters lock-wait behind a long-running query, all new queries pile up behind it, becoming a de facto total outage. Countermeasures (details in the non-stop DDL article):
-- マイグレーションは「世界を止める前に諦める」。短い lock_timeout で即失敗させ、リトライする
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN note text; -- 定数デフォルトなら PG11+ で書き換え不要=高速
- Adding a column: a constant default doesn't rewrite the table and is fast. A
volatiledefault rewrites the whole thing (avoid it). - NOT NULL / constraints: two stages of
ADD CONSTRAINT ... NOT VALID(commits immediately) →VALIDATE CONSTRAINT(verifies with a weak lock). - Indexes:
CREATE INDEX CONCURRENTLY(doesn't stop writes). - PostgreSQL 18 made it possible to attach
NOT VALIDto aNOT NULLconstraint, making non-stop NOT NULL addition more natural.
5. Don't clog: connection pooling and monitoring
A connection is a "process"
PostgreSQL forks one OS process per connection (the official architecture). max_connections's default is 100. Raising it isn't the solution — each backend consumes memory, and work_mem is multiplied per backend too.
The right answer is connection pooling. Place a pooler (PgBouncer, etc.) between app↔DB and reuse a small number of physical connections. In particular, serverless (Lambda/edge) must not connect directly — on a spike it breaks max_connections from a connection storm and goes down. A transaction-mode pooler (PgBouncer / RDS Proxy / Supabase Supavisor) is essential (details and caveats in the connection-pooling article).
Monitor by "symptom"
At minimum, watch this constantly.
-- 詰まりの兆候:長時間 active / idle in transaction / ロック待ち
SELECT pid, state, wait_event_type, wait_event,
now() - xact_start AS xact_age, substring(query,1,50) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start;
pg_stat_statements: identify heavy queries (overview §2).pg_stat_user_tables: bloat (n_dead_tup) and autovacuum effectiveness (MVCC/VACUUM article).- Replication lag:
pg_stat_replication'sreplay_lag. - Connection usage: the number used against
max_connections. - Wraparound risk:
age(datfrozenxid)(MVCC/VACUUM article §7).
6. Protect: least privilege, network, encryption
Security starts from "don't trust the client" (details in the security-hardening article). Just the essentials.
- Don't run the app as a superuser. Officially, a superuser "bypasses all permission checks." Create an app-dedicated least-privilege role (only the needed DML).
pg_hba.confis decided by the first match from the top (official). Usescram-sha-256, force TLS withhostssl, and narrow by source IP, DB, and user.trustis strictly forbidden.- Passwords with SCRAM.
password_encryption's default isscram-sha-256. MD5 is deprecated in PG18 (CREATE/ALTER ROLEwarns, with future removal planned). - The client uses
sslmode=verify-full.requireencrypts but doesn't verify the certificate = can't prevent a man-in-the-middle attack. - PostgreSQL 18 adds OAuth 2.0 authentication (the
oauthmethod inpg_hba.conf).
7. Evolve: upgrade without stopping
PostgreSQL majors are released yearly, and each major reaches EOL in about 5 years. Pickling is a vulnerability and EOL risk. Upgrades are two choices (details in the logical-replication & upgrade article).
pg_upgrade(in-place): a short stop suffices. PG18 shortens it further with parallel checks (--jobs),--swap(directory swap for the fastest), and statistics carry-over.- Logical replication (minimal stop): logically replicate to a new-version instance and switch over once it catches up. Officially "a few seconds of downtime" suffices. But DDL and sequences aren't replicated, so you need to manually advance sequences at switchover (this pitfall is detailed in the article).
8. Managed vs self-hosting: choose by operational load
Managed does much of the "manual work" above for you. The decision axis is "operational load" more than performance or cost.
| Operations item | Self-hosting | Managed (RDS/Aurora/Cloud SQL, etc.) |
|---|---|---|
| Backup/PITR | Build archive_command / pg_basebackup yourself | Auto snapshot + PITR |
| HA/automatic failover | Build Patroni, etc., yourself (outside the core) | Automatic with Multi-AZ |
| Minor upgrade/patch | Plan and apply yourself | Done in a maintenance window |
| Monitoring | Yourself (Prometheus, etc.) | Integrated into CloudWatch, etc. |
| Tuning freedom | Maximum (extensions, OS settings free) | Some restrictions (allowlist, parameter groups) |
| Cost | Server actual cost (ops labor separate) | Premium (operations included) |
Decision guideline: with a small team and thin ops staff, it's close to "managed, the only choice." The labor to correctly build automatic failover and PITR yourself is especially large. Conversely, if special extensions, OS-level control, or extreme cost optimization is a requirement, self-host. Much of "I want to entrust Postgres" commissioning is the need to outsource this operational load.
9. Production-operations checklist
Before release, can you answer these questions?
- Can you restore: is continuous archiving/PITR enabled, and have you conducted a recovery drill.
- Doesn't stop on failure: is there a standby, and is the failover procedure (auto/manual) established.
- Doesn't stop on deploy: are migrations written with
lock_timeout+ non-stop patterns. - Doesn't clog: is there a connection pooler, and is there headroom in
max_connections. - Is it protected: is the app a least-privilege role. TLS forced, SCRAM, minimal
pg_hba. - Can it evolve: is the major-upgrade procedure (acceptable stop time) decided.
- Is it visible: are heavy queries, bloat, replication lag, connection usage, and wraparound risk monitored.
10. Summary
- Production operation is the 5 questions — don't break / don't stop (failure, change) / don't clog / protect / evolve.
- For backups, "it can be restored" is justice. Continuous archiving/PITR + recovery drills.
- The core has no automatic failover. Supplement with an external tool (Patroni, etc.) or managed.
- A connection is a process. Serverless requires a transaction-mode pooler.
- Change without stopping too.
lock_timeout,NOT VALID→VALIDATE,CONCURRENTLY. - Managed is a means to outsource operational load. Strong for small teams.
- PG18 boosts operations with async I/O, faster upgrades (
--swap), and OAuth auth.
In this operations series, I go deep on each layer with real code and procedures. First, the heart of incident prevention — backup & PITR and connection pooling.