PostgreSQL's "speed" and "correctness" both ride on MVCC (Multi-Version Concurrency Control). Tune without understanding MVCC and you can't see the cause of problems like "it bloats somehow," "serialization errors come out," "VACUUM can't keep up."
This article explains, from the mechanism of MVCC, through the correct way to choose transaction isolation levels, a job queue with row locks, and the heart of bloat measures VACUUM/autovacuum, as production-operation practice faithful to the PostgreSQL official documentation. It's the deep dive into §4 (VACUUM) and §7 (HOT updates) of the performance overview.
Rules for this article: MVCC's behavior, the guarantees of isolation levels, VACUUM's spec, autovacuum's defaults, and the PostgreSQL 18 changes are all based on the PostgreSQL 18 official documentation (as of June 2026). A managed DB may override the autovacuum defaults, so confirm the actual state with
SHOWandpg_stat_user_tables.
1. MVCC: reads and writes don't block each other
The official states the core of PostgreSQL's concurrency control in one sentence.
In MVCC, the locks acquired for a query (read) and the locks acquired for writing don't conflict. Therefore reading never blocks writing, and writing never blocks reading.
What realizes this is the snapshot. Each statement (or each transaction) sees "a version of the database at a point in time." Even if someone else updates while you're aggregating a long report, your query keeps seeing a consistent past cross-section.
The mechanism: rows are managed by "versions"
Each row (tuple) has hidden system columns.
xmin: the transaction ID (XID) that inserted that row version.xmax: the XID that invalidated it by delete/update (0 if not deleted).ctid: a physical pointer to the new version of the same row (versions chain).
This is the starting point of bloat. The official says —
UPDATEandDELETEdon't immediately delete the old version of the row. Because it can't be deleted while it might still be visible to other transactions.
That is, UPDATE is not "overwrite" but "append a new row version + stamp xmax on the old version." The old version becomes a dead tuple at the point it's no longer visible to any transaction. What cleans up this dead tuple is VACUUM (§5).
-- 隠し列を見るとバージョン管理が分かる
SELECT ctid, xmin, xmax, * FROM accounts WHERE id = 1;
-- UPDATE すると同じ id でも ctid(物理位置)が変わり、古い版がデッドタプルになる
2. Transaction isolation levels: the default and the "true guarantee"
The SQL standard defines 4 levels, but PostgreSQL internally has 3 (Read Uncommitted behaves as Read Committed). The default is READ COMMITTED.
The anomalies each level prevents (PostgreSQL's actual behavior):
| Level | Dirty read | Non-repeatable read | Phantom read | Serialization anomaly |
|---|---|---|---|---|
| Read Committed (default) | doesn't happen | can happen | can happen | can happen |
| Repeatable Read | doesn't happen | doesn't happen | doesn't happen ※ | can happen |
| Serializable | doesn't happen | doesn't happen | doesn't happen | doesn't happen |
※ Here's PostgreSQL's characteristic. The official says "PostgreSQL's Repeatable Read implementation doesn't allow phantom reads" — stronger than the standard's minimum requirement, implemented as snapshot isolation.
When to use what
- READ COMMITTED (default): each statement sees a snapshot "at the statement's start." Most web apps are enough with this.
- REPEATABLE READ: the whole transaction sees a consistent snapshot "at the first statement's point." For reports/aggregations and balance calculations needing a consistent cross-section across multiple queries.
- SERIALIZABLE: guarantees the same result as serial execution (SSI = Serializable Snapshot Isolation). The strongest, but throws a serialization failure on conflict.
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- このトランザクション内のすべての SELECT は同じスナップショットを見る
SELECT sum(amount) FROM ledger WHERE account_id = 1;
-- ... 他のクエリ ...
COMMIT;
SERIALIZABLE assumes "retry"
The most important point the official emphasizes: in SERIALIZABLE (and REPEATABLE READ), a serialization failure happens on conflict and always returns SQLSTATE 40001. This is normal behavior, not a bug, and the app must retry that transaction wholesale.
// SQLSTATE 40001(直列化失敗)と 40P01(デッドロック)は「リトライすれば成功する」エラー。
// トランザクション全体を冪等に再実行する汎用ラッパー。
import type { PoolClient } from "pg";
import { pool } from "./db";
const RETRYABLE = new Set(["40001", "40P01"]); // serialization_failure / deadlock_detected
export async function withSerializableRetry<T>(
work: (tx: PoolClient) => Promise<T>,
maxAttempts = 5,
): Promise<T> {
for (let attempt = 1; ; attempt++) {
const tx = await pool.connect();
try {
await tx.query("BEGIN ISOLATION LEVEL SERIALIZABLE");
const result = await work(tx);
await tx.query("COMMIT");
return result;
} catch (err: unknown) {
await tx.query("ROLLBACK").catch(() => {});
const code = (err as { code?: string }).code;
if (code && RETRYABLE.has(code) && attempt < maxAttempts) {
// 指数バックオフ+ジッタで競合を散らす
await new Promise((r) => setTimeout(r, 2 ** attempt * 5 + Math.random() * 10));
continue;
}
throw err;
} finally {
tx.release();
}
}
}
The design implication: SERIALIZABLE is a powerful tool that can guarantee correctness without increasing locks, but the handling "always retry 40001" is a precondition. Use it without preparing this and you get an unstable system that occasionally fails.
3. Row locks: a job queue with FOR UPDATE and SKIP LOCKED
Even with MVCC, you need an explicit row lock to "definitely update this row myself." In order of strength (strong→weak):
| Lock | Use |
|---|---|
FOR UPDATE | exclusive lock on the premise of updating/deleting the row. Strongest |
FOR NO KEY UPDATE | for an update that doesn't change key columns. Doesn't block FOR KEY SHARE |
FOR SHARE | shared lock. Prevents other updates/deletes but allows shared reads |
FOR KEY SHARE | weakest. Blocks only updates that change the key value (used in foreign-key verification) |
The job-queue standard: FOR UPDATE SKIP LOCKED
The correct answer for "multiple workers fighting over work from a queue table" is SKIP LOCKED. The official definition:
With
SKIP LOCKED, rows that can't be locked immediately are skipped. … Because skipping locked rows doesn't give a consistent view of the data, it's not suited for general use, but it can be used to avoid lock contention when multiple consumers access a queue-like table.
By this, each worker can obtain "the next work not grabbed by another worker" without waiting.
-- 複数ワーカーが安全に仕事を取り合う(互いに待たず、同じ行を二重処理しない)
WITH next_job AS (
SELECT id
FROM jobs
WHERE status = 'queued'
ORDER BY priority DESC, created_at
FOR UPDATE SKIP LOCKED -- 他ワーカーがロック中の行は飛ばす
LIMIT 1
)
UPDATE jobs j
SET status = 'running', started_at = now()
FROM next_job
WHERE j.id = next_job.id
RETURNING j.*;
// ワーカー側:1行を掴んで処理し、結果に応じて done/failed に遷移(トランザクションで囲む)
export async function pullAndRunOneJob(run: (job: Job) => Promise<void>) {
const tx = await pool.connect();
try {
await tx.query("BEGIN");
const { rows } = await tx.query<Job>(`
WITH next_job AS (
SELECT id FROM jobs WHERE status = 'queued'
ORDER BY priority DESC, created_at
FOR UPDATE SKIP LOCKED LIMIT 1
)
UPDATE jobs j SET status='running', started_at=now()
FROM next_job WHERE j.id = next_job.id
RETURNING j.*`);
if (rows.length === 0) { await tx.query("COMMIT"); return null; }
const job = rows[0];
await run(job); // 実処理(冪等に設計する)
await tx.query("UPDATE jobs SET status='done', finished_at=now() WHERE id=$1", [job.id]);
await tx.query("COMMIT");
return job;
} catch (e) {
await tx.query("ROLLBACK").catch(() => {});
throw e;
} finally {
tx.release();
}
}
NOWAIT (error immediately without waiting) and advisory locks (app-defined lightweight locks, pg_advisory_xact_lock, etc.; they don't produce table bloat and are fast) are also effective depending on the situation.
4. Why it bloats
As in §1, UPDATE/DELETE leave dead tuples. When dead tuples accumulate:
- The table/index physically swells (wastes disk and cache).
- Scans read "live rows + dead rows," so they slow down.
- The index-only scan's visibility map gets dirty and
Heap Fetchesincreases (EXPLAIN article).
The first step of mitigation is the HOT update (overview §7.2). If "an update that doesn't change indexed columns" and "there's free space on the same page," it can reclaim the dead version during normal operation without creating a new index entry. For a heavily-updated table, lower fillfactor to ensure free space.
-- デッドタプルの溜まり具合と autovacuum の最終実行を監視
SELECT relname, n_live_tup, n_dead_tup,
round(n_dead_tup::numeric / nullif(n_live_tup,0), 3) AS dead_ratio,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
5. VACUUM and VACUUM FULL: similar but different
VACUUM does 4 jobs (official): ① make the dead-tuple space reusable, ② update planner statistics, ③ update the visibility map (speed up index-only scans), ④ protect from transaction-ID wraparound (§7).
Here I correct a fatally common misconception.
- Normal
VACUUM: it only marks the dead-tuple space as "reusable" and doesn't return the disk to the OS (except the special case of a completely-empty trailing page). It can run concurrently during production operation and doesn't stopSELECT/INSERT/UPDATE/DELETE. VACUUM FULL: it rewrites the whole table to eliminate dead space and also returns space to the OS. But it takes anACCESS EXCLUSIVElock, during which that table is completely unusable.
VACUUM (VERBOSE, ANALYZE) orders; -- 通常運用。並行可能。領域は再利用可能になる
-- VACUUM FULL orders; -- ⚠ テーブルを排他ロック。本番のオンライン中は基本使わない
The production iron rule: even if bloat is serious, don't lightly fire
VACUUM FULL(the table stops). To physically shrink online, considerpg_repack(an extension). Not accumulating bloat in the first place = making autovacuum work is the right path.
6. Make autovacuum work (with PG18's new feature)
autovacuum is a background mechanism that auto-VACUUMs/ANALYZEs tables where dead tuples have accumulated to a certain amount (default on). The trigger threshold is the official formula:
vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count
default = 50 + 0.2 (= 20%) × reltuples
The main defaults:
| Parameter | Default |
|---|---|
autovacuum_vacuum_threshold | 50 |
autovacuum_vacuum_scale_factor | 0.2 (20%) |
autovacuum_analyze_scale_factor | 0.1 (10%) |
autovacuum_max_workers | 3 |
autovacuum_naptime | 1min |
autovacuum_vacuum_cost_delay | 2ms |
The huge-table pitfall and PostgreSQL 18's solution
"20%" is reasonable for small tables but fatal for huge tables. For a 100-million-row table, VACUUM doesn't run "until 20 million rows are dead," and during that time bloat keeps progressing.
Conventionally, you addressed this by lowering scale_factor per table.
-- 巨大で更新の多い表は、行数比ではなく小さめのしきい値で頻繁に VACUUM
ALTER TABLE big_events SET (
autovacuum_vacuum_scale_factor = 0.02, -- 2% で発火
autovacuum_vacuum_cost_delay = 0 -- VACUUM を速く回す(I/O に余裕がある場合)
);
PostgreSQL 18 solved this head-on. The new parameter autovacuum_vacuum_max_threshold (default 100 million) sets a cap on the formula's computed value above. The official explanation:
Specifies the maximum number of updated/deleted tuples needed to trigger a VACUUM on a single table (a cap on the value computed by
autovacuum_vacuum_thresholdandautovacuum_vacuum_scale_factor).
The effect: even on a huge table, VACUUM runs "once dead tuples reach 100 million," and the "wait until 20%" problem is mitigated by default. Just upgrading to PG18 makes huge-table bloat less likely.
7. Transaction-ID wraparound: the scariest failure
This is not a performance story but a data-loss story. XID is 32-bit. The official warning:
Because transaction IDs are finite in size (32-bit), a long-running cluster (over 4 billion transactions) falls into transaction-ID wraparound. The XID counter wraps back to zero, past transactions suddenly look "in the future," and their output becomes invisible. In short, catastrophic data loss.
What prevents this is the freeze — freezing sufficiently-old row versions with a special XID regarded as "always in the past." What drives the freeze is autovacuum, and here's what's important: the official says the wraparound-prevention autovacuum runs even with autovacuum disabled (default autovacuum_freeze_max_age = 200 million transactions).
-- 各DBの「あと何トランザクションで危険か」を監視(age が大きいほど危険)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
-- age が autovacuum_freeze_max_age(既定2億)に近づいたら要対応
The operational implication: "autovacuum is heavy so turn it off" is strictly forbidden. Even turned off, the wraparound-prevention VACUUM runs, and what's more, with the accumulated amount, a huge anti-wraparound VACUUM runs all at once someday and saturates I/O. The correct answer is to "tune autovacuum to fit the workload," not "turn it off."
8. Monitoring checklist
-- ① 肥大化と autovacuum の効き(§4 のクエリ)
-- ② 長時間 idle in transaction(VACUUM を妨げ、デッドタプルを掃除できなくする元凶)
SELECT pid, state, now() - xact_start AS xact_age, substring(query,1,60) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
-- ③ 周回リスク(§7 のクエリ)
-- ④ ブロックしているロック
SELECT blocked.pid AS blocked_pid, blocking.pid AS blocking_pid,
substring(blocked.query,1,40) AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
In particular, leaving idle in transaction unattended is MVCC's biggest operational accident. Because it keeps holding an old snapshot, VACUUM can't clean up "dead tuples newer than the oldest running transaction," and bloat doesn't stop. On the app side, keeping transactions short and always COMMIT/ROLLBACK directly ties to performance too.
9. Conclusion
- MVCC: reads and writes don't block each other. The price is that
UPDATE/DELETEproduce dead tuples (bloat). - Isolation levels: default READ COMMITTED. PostgreSQL's REPEATABLE READ also prevents phantoms. Use SERIALIZABLE by always retrying 40001.
- Row locks: for a job queue,
FOR UPDATE SKIP LOCKEDis the standard. - VACUUM ≠ VACUUM FULL: normal VACUUM only makes space reusable (doesn't return to OS, can run concurrently). FULL shrinks but takes an exclusive lock — basically not used in production.
- autovacuum: tune it, don't turn it off. PG18's
autovacuum_vacuum_max_threshold(100 million) mitigates huge-table bloat by default. - The wraparound-prevention freeze runs even with autovacuum off. Leaving
idle in transactionunattended is the biggest operational accident.
Once MVCC's behavior sinks in, "bloat," "serialization errors," and "VACUUM clogging" become visible connected by causality. Next, on to JSONB in practice, which puts semi-structured data into an RDB.