Skip to main content
友田 陽大
PostgreSQL operations & reliability
PostgreSQL
アーキテクチャ設計

PostgreSQL zero-downtime schema change (lock-safe DDL, lock_timeout, NOT VALID→VALIDATE, CONCURRENTLY, v18-ready)

A practical guide to changing the schema of production PostgreSQL with no downtime. It explains, in real code faithful to the official documentation: the mechanism by which an ACCESS EXCLUSIVE lock stops every SELECT, how not to stop the world with lock_timeout, adding columns (PG11+'s fast path), the two-stage NOT VALID→VALIDATE for NOT NULL/constraints, CONCURRENTLY indexes, the shadow-column strategy for type changes, a lock-level cheat sheet, and PostgreSQL 18's NOT NULL NOT VALID.

Published
Reading time
10 min read
Author
友田 陽大
Share

A migration of "just adding one column" took production fully down — if you operate PostgreSQL, it's a story you'll surely hear someday. The cause is locks. Many ALTER TABLEs take a powerful lock, and moreover, when made to wait behind a long-running query, they drag all queries behind them to a halt.

This article is a practical playbook for changing the schema with no downtime in production. It explains, faithful to the official documentation, from the lock mechanism through safe procedures per operation to the lock_timeout safety device. If a client is judging "is it safe to entrust migrations to this person," this is the watershed. It's a deep dive on §4 of the production-operations guide.

Rules for this article: lock levels, the behavior of each DDL, and PostgreSQL 18's new features are based on the PostgreSQL 18 official documentation (as of June 2026). The failure mechanism of "a lock wait stops all subsequent queries" is a derivation from the lock-conflict table and the 'wait indefinitely' spec (I note that there isn't a single sentence about it in the official docs).


1. The true identity of the danger: ACCESS EXCLUSIVE stops even SELECT

PostgreSQL's table locks have 8 levels, and the strongest is ACCESS EXCLUSIVE. Official:

ACCESS EXCLUSIVE conflicts with locks of all modes (ACCESS SHARE … ACCESS EXCLUSIVE all). This mode guarantees that the holder is the only transaction accessing the table in any way.

And the decisive sentence:

The only lock that blocks a SELECT (without FOR UPDATE/SHARE) is the ACCESS EXCLUSIVE lock.

ALTER TABLE officially "acquires ACCESS EXCLUSIVE unless stated otherwise." In other words, many schema changes can stop even reads while running.

Why it develops into a "total outage"

This is the essence. When ALTER TABLE requests a lock, if there's a transaction SELECTing the table for a long time, the ALTER can't take the lock and waits (official: "waits indefinitely until the conflicting lock is released").

The problem is behind that waiting ALTER. A newly arriving SELECT conflicts with the ACCESS EXCLUSIVE the ALTER is requesting, so it's made to wait behind the ALTER. As a result, just one long-running query + one ALTER dams up all new queries — this is the true identity of "total outage from adding a column."

  • The behavior of "subsequent ones pile up in a wait queue" is a derivation from the lock-conflict table and the 'wait indefinitely' spec (there's no direct sentence in the official docs). But it's a reality repeatedly observed in real operations and serves as the basis for the countermeasure.

2. The starting safety device: lock_timeout

The first step of the countermeasure is to make the migration "give up before stopping the world." Use lock_timeout. Official:

Aborts a statement that exceeds the specified time while waiting to acquire a lock. … Zero (default) disables the timeout. … Setting it in postgresql.conf is not recommended (because it affects all sessions).

→ Set a short lock_timeout only when running the migration, fail immediately if it can't take the lock, and retry. This prevents "waiting behind a long-running query and going to total outage."

-- マイグレーションの定石:短いロックタイムアウトで即失敗 → 後でリトライ
SET lock_timeout = '3s';
SET statement_timeout = '0';      -- 文自体の実行は別管理(lock_timeout は「待ち」だけ)
ALTER TABLE orders ADD COLUMN memo text;
// マイグレーションランナー側:lock_timeout で失敗したら指数バックオフでリトライ
// (混雑の谷を狙って、短いロック窓を捕まえる)
const LOCK_NOT_AVAILABLE = "55P03"; // lock_not_available

export async function runWithLockRetry(sql: string, maxAttempts = 10) {
  for (let attempt = 1; ; attempt++) {
    const tx = await pool.connect();
    try {
      await tx.query("BEGIN");
      await tx.query("SET LOCAL lock_timeout = '3s'");
      await tx.query(sql);                 // 強いロックを要するDDL
      await tx.query("COMMIT");
      return;
    } catch (e: unknown) {
      await tx.query("ROLLBACK").catch(() => {});
      if ((e as { code?: string }).code === LOCK_NOT_AVAILABLE && attempt < maxAttempts) {
        await new Promise((r) => setTimeout(r, Math.min(2 ** attempt * 100, 30_000)));
        continue;                          // ロックが空くタイミングを待って再挑戦
      }
      throw e;
    } finally {
      tx.release();
    }
  }
}

Another preventive measure: don't leave long-running transactions hanging. If an idle in transaction session keeps holding a lock, the DDL never succeeds even with lock_timeout (MVCC/VACUUM article §8).


3. Safe recipes per operation

3.1 Adding a column: a constant default is fast, volatile is dangerous

Official: "With a non-volatile DEFAULT, ADD COLUMN evaluates the default value at statement-execution time and stores it in the table's metadata. … It's returned when an existing row is accessed. No table rewrite is needed, and the ALTER is very fast even on a huge table."

Conversely, cases where a rewrite occurs (official): "Adding a volatile DEFAULT (e.g., clock_timestamp()), a stored generated column, an identity column, or a constrained domain type fully rewrites the table and indexes. Adding a virtual generated column needs no rewrite."

-- ✅ 高速:定数デフォルト(書き換えなし・PG11以降)
ALTER TABLE users ADD COLUMN plan text NOT NULL DEFAULT 'free';

-- ❌ 危険:volatile デフォルトは全行書き換え(巨大表で長時間ロック)
-- ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT clock_timestamp();
-- → 代わりに「null許容で追加 → バッチで埋める → 後でNOT NULL」(§3.2)

3.2 Adding NOT NULL: avoid the full-table scan

SET NOT NULL scans the whole table to confirm there are no NULLs (locking during that time). The official docs state the workaround clearly: "If a valid CHECK constraint exists that proves 'no NULLs exist,' the table scan is skipped."

PostgreSQL 18 adds an even more straightforward path — NOT VALID can be attached to a NOT NULL constraint.

-- 旧来の道:CHECK を NOT VALID で足し、VALIDATE してから SET NOT NULL(スキャンを分割)
ALTER TABLE orders ADD CONSTRAINT orders_amount_nn CHECK (amount IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_nn;   -- 弱いロックで検証
ALTER TABLE orders ALTER COLUMN amount SET NOT NULL;       -- 検証済みCHECKによりスキャン省略

-- PG18の道:NOT NULL を NOT VALID で追加 → VALIDATE(既存行の検証を切り離せる)
-- ALTER TABLE orders ALTER COLUMN amount SET NOT NULL NOT VALID;
-- ALTER TABLE orders VALIDATE CONSTRAINT ...;

3.3 Constraints and foreign keys safely: NOT VALID → VALIDATE

When adding a constraint to a large table, locking throughout the verification scan is dangerous. The two stages the official docs provide:

Attach NOT VALID and ADD CONSTRAINT can commit immediately without scanning the table. Then verify existing rows with VALIDATE CONSTRAINT. The verification doesn't lock out concurrent updatesVALIDATE takes only a SHARE UPDATE EXCLUSIVE lock.

-- 1) 即時:新規行にだけ制約を効かせる(既存行は未検証=スキャンなし)
ALTER TABLE orders
  ADD CONSTRAINT orders_amount_positive CHECK (amount >= 0) NOT VALID;

-- 2) 後で:既存行を検証(書き込みを止めずに)
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;

Foreign keys too. And, official: "Most ADD table_constraints require ACCESS EXCLUSIVE, but ADD FOREIGN KEY gets by with SHARE ROW EXCLUSIVE." Even so, the two stages NOT VALIDVALIDATE are safe.

ALTER TABLE orders
  ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;

3.4 Indexes: don't stop writes with CONCURRENTLY

-- 書き込みを止めずに索引を作る([索引設計の記事]も参照)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

Cautions (official, see also index-design article §6): slow due to two scans, cannot run inside a transaction block, and on failure an INVALID index remains (DROP INDEX CONCURRENTLY and re-run).

3.5 Type change: the shadow-column strategy

ALTER COLUMN ... TYPE officially "usually fully rewrites the table and indexes" (the exception is text↔varchar). It's dangerous on a huge table, so add a new column and migrate in stages.

-- 1) 新型の列を追加(高速・nullで)
ALTER TABLE events ADD COLUMN payload_jsonb jsonb;
-- 2) アプリを「両方書く(dual-write)」に。既存はバッチで少しずつ変換(lock_timeout+小バッチ)
--    UPDATE events SET payload_jsonb = payload_text::jsonb WHERE id BETWEEN $1 AND $2;
-- 3) 切替:読みを新列へ。検証後、旧列を DROP
ALTER TABLE events DROP COLUMN payload_text;

This "add a new column → dual-write → batch conversion → switch → drop the old column" is a general-purpose pattern effective for all heavy changes like type changes, column renames, and normalization. From the MVCC standpoint too, rewriting DDL (TRUNCATE or ALTER that involves a rewrite) is not MVCC-safe (official), so it's worth avoiding it and migrating in stages.


4. Lock-level cheat sheet

Knowing "which lock this operation takes" lets you avoid danger (official, excerpt).

LockMain operationsConcurrent readConcurrent write
ACCESS EXCLUSIVEDROP TABLE, TRUNCATE, ALTER COLUMN TYPE, many ALTER TABLEs
SHARE (standard index creation)CREATE INDEX (without CONCURRENTLY)
SHARE ROW EXCLUSIVEADD FOREIGN KEY, CREATE TRIGGER
SHARE UPDATE EXCLUSIVEVACUUM, ANALYZE, CREATE INDEX CONCURRENTLY, VALIDATE CONSTRAINT, SET STATISTICS, DETACH PARTITION CONCURRENTLY

The aim is the bottom-right — leaning toward SHARE UPDATE EXCLUSIVE (stops neither reads nor writes). CONCURRENTLY and NOT VALIDVALIDATE are the tools to replace a strong lock with a weak lock.


5. PostgreSQL 18's no-downtime DDL enhancements

PG18 practically improved online DDL (official release notes).

  • NOT VALID can be attached to a NOT NULL constraint (§3.2) — a no-downtime NOT NULL addition becomes straightforward.
  • NOT NULL is stored in pg_constraint — NOT NULL constraints can be named, and can also be attached to foreign tables.
  • NOT VALID foreign keys are allowed on partitioned tables — staged constraint addition becomes possible.
  • NOT ENFORCED constraints (a CHECK/foreign key that's "defined but not enforced") — flexibility during a transition period.
  • Virtual generated columns by default — computed at read time, so adding one needs no rewrite (stored needs a rewrite).

6. Migration checklist

When reviewing a production migration, answer these questions.

  1. Did you set lock_timeout? (fail immediately + retry if it can't take the lock)
  2. Does a table rewrite occur? (avoid volatile defaults, type changes, stored generated columns)
  3. Did you split the full-table scan? (NOT VALIDVALIDATE, NOT NULL via CHECK)
  4. Is the index CONCURRENTLY? (outside a transaction, and cleaning up the INVALID on failure)
  5. Did you stage heavy changes with shadow columns?
  6. Is a long-running transaction (idle in transaction) running?
  7. Is it one logical change per migration? (a huge bundle of DDLs has high risk on failure)

7. Conclusion

  • Many ALTER TABLEs are ACCESS EXCLUSIVE = stop even SELECT. Waiting behind a long-running query develops into a total outage.
  • lock_timeout + retry to "make it give up before stopping the world."
  • Add columns with a constant default (fast), and avoid volatile/type changes (rewrites).
  • Constraints are NOT VALIDVALIDATE, indexes are CONCURRENTLY — replace a strong lock with a weak lock.
  • Stage heavy changes with shadow columns (add → dual-write → batch conversion → switch → drop).
  • PG18 makes no-downtime DDL even more straightforward with NOT NULL ... NOT VALID, etc.

Once "don't stop on changes" is solid, the remaining corner of operations is "protect" — go to security hardening (roles, privileges, TLS, SCRAM).


References (PostgreSQL 18 official documentation)

友田

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.

Got a challenge?

From design to implementation and operations — solo × generative AI

Implementation like this article's, end to end from requirements to production. Start with a free 30-minute technical consult and tell me about your situation.

Available for both project-based (contract) and advisory engagements. Start with a free 30-minute consult.

Also worth reading