A migration of "just adding one column" took production fully down — if you operate PostgreSQL, it's a story you'll surely hear someday. The cause is locks. Many ALTER TABLEs take a powerful lock, and moreover, when made to wait behind a long-running query, they drag all queries behind them to a halt.
This article is a practical playbook for changing the schema with no downtime in production. It explains, faithful to the official documentation, from the lock mechanism through safe procedures per operation to the lock_timeout safety device. If a client is judging "is it safe to entrust migrations to this person," this is the watershed. It's a deep dive on §4 of the production-operations guide.
Rules for this article: lock levels, the behavior of each DDL, and PostgreSQL 18's new features are based on the PostgreSQL 18 official documentation (as of June 2026). The failure mechanism of "a lock wait stops all subsequent queries" is a derivation from the lock-conflict table and the 'wait indefinitely' spec (I note that there isn't a single sentence about it in the official docs).
1. The true identity of the danger: ACCESS EXCLUSIVE stops even SELECT
PostgreSQL's table locks have 8 levels, and the strongest is ACCESS EXCLUSIVE. Official:
ACCESS EXCLUSIVE conflicts with locks of all modes (ACCESS SHARE … ACCESS EXCLUSIVE all). This mode guarantees that the holder is the only transaction accessing the table in any way.
And the decisive sentence:
The only lock that blocks a SELECT (without FOR UPDATE/SHARE) is the ACCESS EXCLUSIVE lock.
ALTER TABLE officially "acquires ACCESS EXCLUSIVE unless stated otherwise." In other words, many schema changes can stop even reads while running.
Why it develops into a "total outage"
This is the essence. When ALTER TABLE requests a lock, if there's a transaction SELECTing the table for a long time, the ALTER can't take the lock and waits (official: "waits indefinitely until the conflicting lock is released").
The problem is behind that waiting ALTER. A newly arriving SELECT conflicts with the ACCESS EXCLUSIVE the ALTER is requesting, so it's made to wait behind the ALTER. As a result, just one long-running query + one ALTER dams up all new queries — this is the true identity of "total outage from adding a column."
- The behavior of "subsequent ones pile up in a wait queue" is a derivation from the lock-conflict table and the 'wait indefinitely' spec (there's no direct sentence in the official docs). But it's a reality repeatedly observed in real operations and serves as the basis for the countermeasure.
2. The starting safety device: lock_timeout
The first step of the countermeasure is to make the migration "give up before stopping the world." Use lock_timeout. Official:
Aborts a statement that exceeds the specified time while waiting to acquire a lock. … Zero (default) disables the timeout. … Setting it in
postgresql.confis not recommended (because it affects all sessions).
→ Set a short lock_timeout only when running the migration, fail immediately if it can't take the lock, and retry. This prevents "waiting behind a long-running query and going to total outage."
-- マイグレーションの定石:短いロックタイムアウトで即失敗 → 後でリトライ
SET lock_timeout = '3s';
SET statement_timeout = '0'; -- 文自体の実行は別管理(lock_timeout は「待ち」だけ)
ALTER TABLE orders ADD COLUMN memo text;
// マイグレーションランナー側:lock_timeout で失敗したら指数バックオフでリトライ
// (混雑の谷を狙って、短いロック窓を捕まえる)
const LOCK_NOT_AVAILABLE = "55P03"; // lock_not_available
export async function runWithLockRetry(sql: string, maxAttempts = 10) {
for (let attempt = 1; ; attempt++) {
const tx = await pool.connect();
try {
await tx.query("BEGIN");
await tx.query("SET LOCAL lock_timeout = '3s'");
await tx.query(sql); // 強いロックを要するDDL
await tx.query("COMMIT");
return;
} catch (e: unknown) {
await tx.query("ROLLBACK").catch(() => {});
if ((e as { code?: string }).code === LOCK_NOT_AVAILABLE && attempt < maxAttempts) {
await new Promise((r) => setTimeout(r, Math.min(2 ** attempt * 100, 30_000)));
continue; // ロックが空くタイミングを待って再挑戦
}
throw e;
} finally {
tx.release();
}
}
}
Another preventive measure: don't leave long-running transactions hanging. If an
idle in transactionsession keeps holding a lock, the DDL never succeeds even withlock_timeout(MVCC/VACUUM article §8).
3. Safe recipes per operation
3.1 Adding a column: a constant default is fast, volatile is dangerous
Official: "With a non-volatile DEFAULT, ADD COLUMN evaluates the default value at statement-execution time and stores it in the table's metadata. … It's returned when an existing row is accessed. No table rewrite is needed, and the ALTER is very fast even on a huge table."
Conversely, cases where a rewrite occurs (official): "Adding a volatile DEFAULT (e.g., clock_timestamp()), a stored generated column, an identity column, or a constrained domain type fully rewrites the table and indexes. Adding a virtual generated column needs no rewrite."
-- ✅ 高速:定数デフォルト(書き換えなし・PG11以降)
ALTER TABLE users ADD COLUMN plan text NOT NULL DEFAULT 'free';
-- ❌ 危険:volatile デフォルトは全行書き換え(巨大表で長時間ロック)
-- ALTER TABLE users ADD COLUMN created_at timestamptz DEFAULT clock_timestamp();
-- → 代わりに「null許容で追加 → バッチで埋める → 後でNOT NULL」(§3.2)
3.2 Adding NOT NULL: avoid the full-table scan
SET NOT NULL scans the whole table to confirm there are no NULLs (locking during that time). The official docs state the workaround clearly: "If a valid CHECK constraint exists that proves 'no NULLs exist,' the table scan is skipped."
PostgreSQL 18 adds an even more straightforward path — NOT VALID can be attached to a NOT NULL constraint.
-- 旧来の道:CHECK を NOT VALID で足し、VALIDATE してから SET NOT NULL(スキャンを分割)
ALTER TABLE orders ADD CONSTRAINT orders_amount_nn CHECK (amount IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_nn; -- 弱いロックで検証
ALTER TABLE orders ALTER COLUMN amount SET NOT NULL; -- 検証済みCHECKによりスキャン省略
-- PG18の道:NOT NULL を NOT VALID で追加 → VALIDATE(既存行の検証を切り離せる)
-- ALTER TABLE orders ALTER COLUMN amount SET NOT NULL NOT VALID;
-- ALTER TABLE orders VALIDATE CONSTRAINT ...;
3.3 Constraints and foreign keys safely: NOT VALID → VALIDATE
When adding a constraint to a large table, locking throughout the verification scan is dangerous. The two stages the official docs provide:
Attach
NOT VALIDandADD CONSTRAINTcan commit immediately without scanning the table. Then verify existing rows withVALIDATE CONSTRAINT. The verification doesn't lock out concurrent updates —VALIDATEtakes only a SHARE UPDATE EXCLUSIVE lock.
-- 1) 即時:新規行にだけ制約を効かせる(既存行は未検証=スキャンなし)
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive CHECK (amount >= 0) NOT VALID;
-- 2) 後で:既存行を検証(書き込みを止めずに)
ALTER TABLE orders VALIDATE CONSTRAINT orders_amount_positive;
Foreign keys too. And, official: "Most ADD table_constraints require ACCESS EXCLUSIVE, but ADD FOREIGN KEY gets by with SHARE ROW EXCLUSIVE." Even so, the two stages NOT VALID → VALIDATE are safe.
ALTER TABLE orders
ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id) REFERENCES users(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;
3.4 Indexes: don't stop writes with CONCURRENTLY
-- 書き込みを止めずに索引を作る([索引設計の記事]も参照)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
Cautions (official, see also index-design article §6): slow due to two scans, cannot run inside a transaction block, and on failure an INVALID index remains (DROP INDEX CONCURRENTLY and re-run).
3.5 Type change: the shadow-column strategy
ALTER COLUMN ... TYPE officially "usually fully rewrites the table and indexes" (the exception is text↔varchar). It's dangerous on a huge table, so add a new column and migrate in stages.
-- 1) 新型の列を追加(高速・nullで)
ALTER TABLE events ADD COLUMN payload_jsonb jsonb;
-- 2) アプリを「両方書く(dual-write)」に。既存はバッチで少しずつ変換(lock_timeout+小バッチ)
-- UPDATE events SET payload_jsonb = payload_text::jsonb WHERE id BETWEEN $1 AND $2;
-- 3) 切替:読みを新列へ。検証後、旧列を DROP
ALTER TABLE events DROP COLUMN payload_text;
This "add a new column → dual-write → batch conversion → switch → drop the old column" is a general-purpose pattern effective for all heavy changes like type changes, column renames, and normalization. From the MVCC standpoint too, rewriting DDL (
TRUNCATEorALTERthat involves a rewrite) is not MVCC-safe (official), so it's worth avoiding it and migrating in stages.
4. Lock-level cheat sheet
Knowing "which lock this operation takes" lets you avoid danger (official, excerpt).
| Lock | Main operations | Concurrent read | Concurrent write |
|---|---|---|---|
| ACCESS EXCLUSIVE | DROP TABLE, TRUNCATE, ALTER COLUMN TYPE, many ALTER TABLEs | ❌ | ❌ |
| SHARE (standard index creation) | CREATE INDEX (without CONCURRENTLY) | ✅ | ❌ |
| SHARE ROW EXCLUSIVE | ADD FOREIGN KEY, CREATE TRIGGER | ✅ | ❌ |
| SHARE UPDATE EXCLUSIVE | VACUUM, ANALYZE, CREATE INDEX CONCURRENTLY, VALIDATE CONSTRAINT, SET STATISTICS, DETACH PARTITION CONCURRENTLY | ✅ | ✅ |
The aim is the bottom-right — leaning toward SHARE UPDATE EXCLUSIVE (stops neither reads nor writes). CONCURRENTLY and NOT VALID→VALIDATE are the tools to replace a strong lock with a weak lock.
5. PostgreSQL 18's no-downtime DDL enhancements
PG18 practically improved online DDL (official release notes).
NOT VALIDcan be attached to aNOT NULLconstraint (§3.2) — a no-downtime NOT NULL addition becomes straightforward.NOT NULLis stored inpg_constraint— NOT NULL constraints can be named, and can also be attached to foreign tables.NOT VALIDforeign keys are allowed on partitioned tables — staged constraint addition becomes possible.NOT ENFORCEDconstraints (a CHECK/foreign key that's "defined but not enforced") — flexibility during a transition period.- Virtual generated columns by default — computed at read time, so adding one needs no rewrite (stored needs a rewrite).
6. Migration checklist
When reviewing a production migration, answer these questions.
- Did you set
lock_timeout? (fail immediately + retry if it can't take the lock) - Does a table rewrite occur? (avoid volatile defaults, type changes, stored generated columns)
- Did you split the full-table scan? (
NOT VALID→VALIDATE, NOT NULL via CHECK) - Is the index
CONCURRENTLY? (outside a transaction, and cleaning up the INVALID on failure) - Did you stage heavy changes with shadow columns?
- Is a long-running transaction (
idle in transaction) running? - Is it one logical change per migration? (a huge bundle of DDLs has high risk on failure)
7. Conclusion
- Many
ALTER TABLEs are ACCESS EXCLUSIVE = stop even SELECT. Waiting behind a long-running query develops into a total outage. lock_timeout+ retry to "make it give up before stopping the world."- Add columns with a constant default (fast), and avoid volatile/type changes (rewrites).
- Constraints are
NOT VALID→VALIDATE, indexes areCONCURRENTLY— replace a strong lock with a weak lock. - Stage heavy changes with shadow columns (add → dual-write → batch conversion → switch → drop).
- PG18 makes no-downtime DDL even more straightforward with
NOT NULL ... NOT VALID, etc.
Once "don't stop on changes" is solid, the remaining corner of operations is "protect" — go to security hardening (roles, privileges, TLS, SCRAM).