# Practical PostgreSQL MVCC, transaction isolation, and VACUUM/autovacuum guide (bloat, row locks, wraparound prevention, v18 support)

> A practical explanation of MVCC, the foundation of PostgreSQL's correctness and performance. Faithful to the official docs, with real code it explains transaction isolation levels (Read Committed / Repeatable Read / Serializable) and retrying serialization failures, a job queue with FOR UPDATE SKIP LOCKED, dead tuples and bloat, the difference between VACUUM and VACUUM FULL, tuning autovacuum and PostgreSQL 18's autovacuum_vacuum_max_threshold, and preventing transaction-ID wraparound.

- Published: 2026-06-23
- Author: 友田 陽大
- Tags: PostgreSQL, アーキテクチャ設計, パフォーマンス
- URL: https://tomodahinata.com/en/blog/postgresql-mvcc-transaction-isolation-vacuum-autovacuum-guide
- Category: PostgreSQL internals & performance
- Pillar guide: https://tomodahinata.com/en/blog/postgresql-performance-tuning-production-guide

## Key points

- MVCC = 'reads don't block writes, writes don't block reads.' UPDATE doesn't delete the old row but appends a new version, so dead tuples (bloat) are born.
- The default is READ COMMITTED. PostgreSQL's REPEATABLE READ also prevents phantoms (snapshot isolation). SERIALIZABLE throws a serialization failure (SQLSTATE 40001), so app-side retry is mandatory.
- For a job queue, SELECT ... FOR UPDATE SKIP LOCKED is the standard. Multiple workers don't fight over the same row and avoid lock contention.
- VACUUM only makes dead tuples 'reusable' and doesn't return them to the OS. VACUUM FULL rewrites and shrinks but takes an ACCESS EXCLUSIVE lock (basically not used in production).
- autovacuum's trigger is '50 + 0.2 × row count.' PostgreSQL 18's autovacuum_vacuum_max_threshold (default 100M) caps the problem of too-late triggering on huge tables. The wraparound-prevention freeze runs even with autovacuum off.

---

PostgreSQL's "speed" and "correctness" both ride on **MVCC (Multi-Version Concurrency Control).** Tune without understanding MVCC and you **can't see the cause** of problems like "it bloats somehow," "serialization errors come out," "VACUUM can't keep up."

This article explains, from the mechanism of MVCC, through the correct way to choose transaction isolation levels, a job queue with row locks, and the heart of bloat measures VACUUM/autovacuum, as production-operation practice faithful to the PostgreSQL official documentation. It's the deep dive into §4 (VACUUM) and §7 (HOT updates) of the [performance overview](/blog/postgresql-performance-tuning-production-guide).

> **Rules for this article**: MVCC's behavior, the guarantees of isolation levels, VACUUM's spec, autovacuum's defaults, and the PostgreSQL 18 changes are all based on the **PostgreSQL 18 official documentation (as of June 2026).** A managed DB may override the autovacuum defaults, so confirm the actual state with `SHOW` and `pg_stat_user_tables`.

---

## 1. MVCC: reads and writes don't block each other

The official states the core of PostgreSQL's concurrency control in one sentence.

> In MVCC, **the locks acquired for a query (read) and the locks acquired for writing don't conflict. Therefore reading never blocks writing, and writing never blocks reading.**

What realizes this is the **snapshot.** Each statement (or each transaction) sees "**a version of the database at a point in time.**" Even if someone else updates while you're aggregating a long report, your query keeps seeing a consistent past cross-section.

### The mechanism: rows are managed by "versions"

Each row (tuple) has hidden system columns.

- **`xmin`**: the transaction ID (XID) that **inserted** that row version.
- **`xmax`**: the XID that **invalidated it by delete/update** (0 if not deleted).
- **`ctid`**: a physical pointer to the **new version** of the same row (versions chain).

This is the starting point of bloat. The official says —

> `UPDATE` and `DELETE` **don't immediately delete the old version of the row.** Because it can't be deleted while it might still be visible to other transactions.

That is, **`UPDATE` is not "overwrite" but "append a new row version + stamp xmax on the old version."** The old version becomes a **dead tuple** at the point it's no longer visible to any transaction. What cleans up this dead tuple is VACUUM (§5).

```sql
-- 隠し列を見るとバージョン管理が分かる
SELECT ctid, xmin, xmax, * FROM accounts WHERE id = 1;
-- UPDATE すると同じ id でも ctid（物理位置）が変わり、古い版がデッドタプルになる
```

---

## 2. Transaction isolation levels: the default and the "true guarantee"

The SQL standard defines 4 levels, but PostgreSQL internally has **3** (Read Uncommitted behaves as Read Committed). **The default is READ COMMITTED.**

The anomalies each level **prevents** (PostgreSQL's actual behavior):

| Level | Dirty read | Non-repeatable read | Phantom read | Serialization anomaly |
| --- | --- | --- | --- | --- |
| Read Committed (default) | doesn't happen | can happen | can happen | can happen |
| Repeatable Read | doesn't happen | doesn't happen | **doesn't happen** ※ | can happen |
| Serializable | doesn't happen | doesn't happen | doesn't happen | doesn't happen |

※ Here's PostgreSQL's characteristic. The official says "PostgreSQL's Repeatable Read implementation **doesn't allow phantom reads**" — stronger than the standard's minimum requirement, implemented as **snapshot isolation.**

### When to use what

- **READ COMMITTED (default)**: each statement sees a snapshot "at the statement's start." Most web apps are enough with this.
- **REPEATABLE READ**: the whole transaction sees a consistent snapshot "at the first statement's point." For **reports/aggregations and balance calculations needing a consistent cross-section across multiple queries.**
- **SERIALIZABLE**: **guarantees the same result as serial execution** (SSI = Serializable Snapshot Isolation). The strongest, but **throws a serialization failure** on conflict.

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- このトランザクション内のすべての SELECT は同じスナップショットを見る
SELECT sum(amount) FROM ledger WHERE account_id = 1;
-- ... 他のクエリ ...
COMMIT;
```

### SERIALIZABLE assumes "retry"

The most important point the official emphasizes: in SERIALIZABLE (and REPEATABLE READ), a **serialization failure** happens on conflict and always returns **SQLSTATE `40001`.** This is **normal behavior, not a bug**, and the app must **retry that transaction wholesale.**

```ts
// SQLSTATE 40001（直列化失敗）と 40P01（デッドロック）は「リトライすれば成功する」エラー。
// トランザクション全体を冪等に再実行する汎用ラッパー。
import type { PoolClient } from "pg";
import { pool } from "./db";

const RETRYABLE = new Set(["40001", "40P01"]); // serialization_failure / deadlock_detected

export async function withSerializableRetry<T>(
  work: (tx: PoolClient) => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    const tx = await pool.connect();
    try {
      await tx.query("BEGIN ISOLATION LEVEL SERIALIZABLE");
      const result = await work(tx);
      await tx.query("COMMIT");
      return result;
    } catch (err: unknown) {
      await tx.query("ROLLBACK").catch(() => {});
      const code = (err as { code?: string }).code;
      if (code && RETRYABLE.has(code) && attempt < maxAttempts) {
        // 指数バックオフ＋ジッタで競合を散らす
        await new Promise((r) => setTimeout(r, 2 ** attempt * 5 + Math.random() * 10));
        continue;
      }
      throw err;
    } finally {
      tx.release();
    }
  }
}
```

> **The design implication**: SERIALIZABLE is a powerful tool that can guarantee correctness without increasing locks, but the handling "**always retry 40001**" is **a precondition.** Use it without preparing this and you get an unstable system that occasionally fails.

---

## 3. Row locks: a job queue with FOR UPDATE and SKIP LOCKED

Even with MVCC, you need an **explicit row lock** to "definitely update this row myself." In order of strength (strong→weak):

| Lock | Use |
| --- | --- |
| `FOR UPDATE` | exclusive lock on the premise of updating/deleting the row. Strongest |
| `FOR NO KEY UPDATE` | for an update that doesn't change key columns. Doesn't block `FOR KEY SHARE` |
| `FOR SHARE` | shared lock. Prevents other updates/deletes but allows shared reads |
| `FOR KEY SHARE` | weakest. Blocks only updates that change the key value (used in foreign-key verification) |

### The job-queue standard: FOR UPDATE SKIP LOCKED

The correct answer for "multiple workers fighting over work from a queue table" is `SKIP LOCKED`. The official definition:

> With `SKIP LOCKED`, rows that can't be locked immediately are **skipped.** … Because skipping locked rows doesn't give a consistent view of the data, it's not suited for general use, but it can be used **to avoid lock contention when multiple consumers access a queue-like table.**

By this, each worker can obtain "the next work not grabbed by another worker" **without waiting.**

```sql
-- 複数ワーカーが安全に仕事を取り合う（互いに待たず、同じ行を二重処理しない）
WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE status = 'queued'
  ORDER BY priority DESC, created_at
  FOR UPDATE SKIP LOCKED          -- 他ワーカーがロック中の行は飛ばす
  LIMIT 1
)
UPDATE jobs j
SET status = 'running', started_at = now()
FROM next_job
WHERE j.id = next_job.id
RETURNING j.*;
```

```ts
// ワーカー側：1行を掴んで処理し、結果に応じて done/failed に遷移（トランザクションで囲む）
export async function pullAndRunOneJob(run: (job: Job) => Promise<void>) {
  const tx = await pool.connect();
  try {
    await tx.query("BEGIN");
    const { rows } = await tx.query<Job>(`
      WITH next_job AS (
        SELECT id FROM jobs WHERE status = 'queued'
        ORDER BY priority DESC, created_at
        FOR UPDATE SKIP LOCKED LIMIT 1
      )
      UPDATE jobs j SET status='running', started_at=now()
      FROM next_job WHERE j.id = next_job.id
      RETURNING j.*`);
    if (rows.length === 0) { await tx.query("COMMIT"); return null; }
    const job = rows[0];
    await run(job);                                  // 実処理（冪等に設計する）
    await tx.query("UPDATE jobs SET status='done', finished_at=now() WHERE id=$1", [job.id]);
    await tx.query("COMMIT");
    return job;
  } catch (e) {
    await tx.query("ROLLBACK").catch(() => {});
    throw e;
  } finally {
    tx.release();
  }
}
```

`NOWAIT` (error immediately without waiting) and **advisory locks** (app-defined lightweight locks, `pg_advisory_xact_lock`, etc.; they don't produce table bloat and are fast) are also effective depending on the situation.

---

## 4. Why it bloats

As in §1, `UPDATE`/`DELETE` leave dead tuples. When dead tuples accumulate:

- **The table/index physically swells** (wastes disk and cache).
- Scans read "live rows + dead rows," so they **slow down.**
- The index-only scan's visibility map gets dirty and `Heap Fetches` increases ([EXPLAIN article](/blog/postgresql-explain-analyze-slow-query-optimization-guide)).

The first step of mitigation is the **HOT update** ([overview §7.2](/blog/postgresql-performance-tuning-production-guide)). If "an update that doesn't change indexed columns" and "there's free space on the same page," it can reclaim the dead version during normal operation without creating a new index entry. For a heavily-updated table, lower `fillfactor` to ensure free space.

```sql
-- デッドタプルの溜まり具合と autovacuum の最終実行を監視
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup,0), 3) AS dead_ratio,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```

---

## 5. VACUUM and VACUUM FULL: similar but different

VACUUM does 4 jobs (official): ① **make** the dead-tuple space **reusable**, ② update planner statistics, ③ update the visibility map (speed up index-only scans), ④ **protect from transaction-ID wraparound** (§7).

Here I correct a **fatally common misconception.**

- **Normal `VACUUM`**: it **only marks** the dead-tuple space as **"reusable"** and **doesn't return the disk to the OS** (except the special case of a completely-empty trailing page). It can run concurrently during production operation and doesn't stop `SELECT`/`INSERT`/`UPDATE`/`DELETE`.
- **`VACUUM FULL`**: it **rewrites the whole table** to eliminate dead space and **also returns space to the OS.** But it takes an **`ACCESS EXCLUSIVE` lock**, during which that table is **completely unusable.**

```sql
VACUUM (VERBOSE, ANALYZE) orders;   -- 通常運用。並行可能。領域は再利用可能になる
-- VACUUM FULL orders;              -- ⚠ テーブルを排他ロック。本番のオンライン中は基本使わない
```

> **The production iron rule**: even if bloat is serious, don't lightly fire `VACUUM FULL` (the table stops). To physically shrink online, consider `pg_repack` (an extension). **Not accumulating bloat in the first place = making autovacuum work** is the right path.

---

## 6. Make autovacuum work (with PG18's new feature)

autovacuum is a background mechanism that auto-VACUUMs/ANALYZEs tables where dead tuples have accumulated to a certain amount (default `on`). The **trigger threshold** is the official formula:

```text
vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count
        default =            50               +          0.2 (= 20%)            × reltuples
```

The main defaults:

| Parameter | Default |
| --- | --- |
| `autovacuum_vacuum_threshold` | 50 |
| `autovacuum_vacuum_scale_factor` | 0.2 (20%) |
| `autovacuum_analyze_scale_factor` | 0.1 (10%) |
| `autovacuum_max_workers` | 3 |
| `autovacuum_naptime` | 1min |
| `autovacuum_vacuum_cost_delay` | 2ms |

### The huge-table pitfall and PostgreSQL 18's solution

"20%" is reasonable for small tables but **fatal for huge tables.** For a 100-million-row table, VACUUM doesn't run "until 20 million rows are dead," and during that time bloat keeps progressing.

Conventionally, you addressed this by lowering `scale_factor` per table.

```sql
-- 巨大で更新の多い表は、行数比ではなく小さめのしきい値で頻繁に VACUUM
ALTER TABLE big_events SET (
  autovacuum_vacuum_scale_factor = 0.02,    -- 2% で発火
  autovacuum_vacuum_cost_delay   = 0        -- VACUUM を速く回す（I/O に余裕がある場合）
);
```

**PostgreSQL 18 solved this head-on.** The new parameter **`autovacuum_vacuum_max_threshold` (default 100 million)** sets a **cap** on the formula's computed value above. The official explanation:

> Specifies the **maximum** number of updated/deleted tuples needed to trigger a VACUUM on a single table (a cap on the value computed by `autovacuum_vacuum_threshold` and `autovacuum_vacuum_scale_factor`).

The effect: even on a huge table, VACUUM runs "once dead tuples reach 100 million," and the "wait until 20%" problem is mitigated by default. **Just upgrading to PG18 makes huge-table bloat less likely.**

---

## 7. Transaction-ID wraparound: the scariest failure

This is not a performance story but a **data-loss story.** XID is 32-bit. The official warning:

> Because transaction IDs are finite in size (32-bit), a long-running cluster (over 4 billion transactions) falls into **transaction-ID wraparound.** The XID counter wraps back to zero, past transactions suddenly look "in the future," and their output becomes **invisible.** In short, **catastrophic data loss.**

What prevents this is the **freeze** — freezing sufficiently-old row versions with a special XID regarded as "always in the past." What drives the freeze is autovacuum, and **here's what's important**: the official says the wraparound-prevention autovacuum **runs even with `autovacuum` disabled** (default `autovacuum_freeze_max_age` = **200 million** transactions).

```sql
-- 各DBの「あと何トランザクションで危険か」を監視（age が大きいほど危険）
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
-- age が autovacuum_freeze_max_age（既定2億）に近づいたら要対応
```

> **The operational implication**: "autovacuum is heavy so turn it off" is **strictly forbidden.** Even turned off, the wraparound-prevention VACUUM runs, and what's more, with the accumulated amount, **a huge anti-wraparound VACUUM runs all at once someday and saturates I/O.** The correct answer is to "**tune autovacuum to fit the workload**," not "turn it off."

---

## 8. Monitoring checklist

```sql
-- ① 肥大化と autovacuum の効き（§4 のクエリ）
-- ② 長時間 idle in transaction（VACUUM を妨げ、デッドタプルを掃除できなくする元凶）
SELECT pid, state, now() - xact_start AS xact_age, substring(query,1,60) AS query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- ③ 周回リスク（§7 のクエリ）
-- ④ ブロックしているロック
SELECT blocked.pid AS blocked_pid, blocking.pid AS blocking_pid,
       substring(blocked.query,1,40) AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
```

In particular, **leaving `idle in transaction` unattended** is MVCC's biggest operational accident. Because it keeps holding an old snapshot, VACUUM can't clean up "dead tuples newer than the oldest running transaction," and bloat doesn't stop. On the app side, **keeping transactions short and always COMMIT/ROLLBACK** directly ties to performance too.

---

## 9. Conclusion

- **MVCC**: reads and writes don't block each other. The price is that `UPDATE`/`DELETE` produce **dead tuples (bloat).**
- **Isolation levels**: default READ COMMITTED. PostgreSQL's REPEATABLE READ also prevents phantoms. Use SERIALIZABLE by **always retrying 40001.**
- **Row locks**: for a job queue, `FOR UPDATE SKIP LOCKED` is the standard.
- **VACUUM ≠ VACUUM FULL**: normal VACUUM only makes space reusable (doesn't return to OS, can run concurrently). FULL shrinks but takes an exclusive lock — basically not used in production.
- **autovacuum**: tune it, don't turn it off. **PG18's `autovacuum_vacuum_max_threshold` (100 million)** mitigates huge-table bloat by default.
- The **wraparound-prevention** freeze runs even with autovacuum off. Leaving `idle in transaction` unattended is the biggest operational accident.

Once MVCC's behavior sinks in, "bloat," "serialization errors," and "VACUUM clogging" become visible **connected by causality.** Next, on to [JSONB in practice](/blog/postgresql-jsonb-operators-gin-index-design-guide), which puts semi-structured data into an RDB.

---

### References (PostgreSQL 18 official documentation)

- [Chapter 13. Concurrency Control (MVCC)](https://www.postgresql.org/docs/18/mvcc-intro.html)
- [13.2. Transaction Isolation](https://www.postgresql.org/docs/18/transaction-iso.html)
- [13.3. Explicit Locking (FOR UPDATE / SKIP LOCKED)](https://www.postgresql.org/docs/18/explicit-locking.html)
- [25.1. Routine Vacuuming (VACUUM, wraparound prevention)](https://www.postgresql.org/docs/18/routine-vacuuming.html)
- [19.10. Automatic Vacuuming (autovacuum defaults)](https://www.postgresql.org/docs/18/runtime-config-autovacuum.html)
