The Complete Guide to pgvector Tuning: Optimizing HNSW/IVFFlat Recall × Latency, and Quantization (halfvec, Binary Quantization) for Fast, Cheap, and Accurate Search

Getting vector search running in pgvector is surprisingly easy. Run CREATE EXTENSION vector, add a vector(1024) column, and query with ORDER BY embedding <=> $1 LIMIT 10, and you have semantic search that more or less works.

The trouble starts after that: the data grows to millions of rows, users hit it concurrently, and operations starts asking "why is this answer off the mark?" What helps at that point is not how you write the SQL but how you tune indexes and quantization.

This article is an implementation guide for taking pgvector from "working" to "working fast, cheaply, and accurately in production." Where to put the embeddings, the overall RAG design, hybrid search, and idempotent ingest are covered in the sister article Production RAG Built with pgvector, so this piece concentrates on what comes next: optimizing HNSW/IVFFlat recall × latency, reducing memory with quantization, and solving over-filtering. As running material, I draw on design decisions from a generative-AI voice chatbot I built (a RAG customer-service system that consolidates business data and embeddings in PostgreSQL + pgvector).

Ground rules for this article: the SQL syntax, parameters, defaults, and quantization syntax follow the official pgvector README and documentation (the v0.8.x series, as of June 2026). pgvector is under active development (iterative scans arrived in 0.8.0; hamming_distance and L1 HNSW support in 0.7.0), so always check the values for your version in the official documentation before going to production. The code is written to be usable in real operation, and it assumes connection strings and API keys come from environment variables (no hardcoding).

0. Mental model: tuning is choosing "where to stand in the triangle"

The first thing to keep in mind when tuning approximate-nearest-neighbor (ANN) search is that three quantities trade off against each other.

        Recall
         ／      ＼
        ／        ＼
  Latency ──── Memory / Cost

Recall: the fraction of the true nearest neighbors that the search actually returns. It ties directly to RAG answer quality.
Latency: the response time of a single query. It determines user experience and throughput.
Memory / Cost: whether the index fits in RAM. It is determined by dimension, row count, and quantization.

You can't maximize all three at once. ANN search buys speed by giving up a little accuracy relative to exact kNN. Tuning is the work of deciding, by measurement rather than feel, where in this triangle your project should stand.

The order of this article is therefore intentional: (1) understand the index internals → (2) measure recall → (3) speed up the build → (4) cut cost with quantization → (5) defuse the filter trap → (6) operations. In particular, fiddling with parameters while skipping step (2) is turning the dial blindfolded. Build the verification path first; it is the biggest lever on production quality.

1. The internals of the 2 ANN indexes: HNSW and IVFFlat

pgvector has two kinds of approximate index, HNSW and IVFFlat. To set the parameters well, you need to understand what each one does one level down.

HNSW: traverse a multi-layer graph

HNSW (Hierarchical Navigable Small World) is a multi-layer graph whose nodes are vectors and whose edges connect nearby vectors. A search enters at the sparse top layer, greedily moves to closer nodes, and descends layer by layer to the dense bottom layer.
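
To make the role of the candidate list concrete, here is a minimal single-layer sketch of that best-first traversal in Python. It is a toy, not pgvector's implementation: the chain-shaped graph, the one-dimensional vectors, and the names search_layer, GRAPH, and VECTORS are all hypothetical, and a real HNSW search repeats this step on every layer from top to bottom.

```python
import heapq


def search_layer(graph, vectors, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results."""
    def dist(node):
        # Squared Euclidean distance is enough for ranking in this toy.
        return sum((a - b) ** 2 for a, b in zip(vectors[node], query))

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored node first
    best = [(-dist(entry), entry)]        # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                         # closest candidate is worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst result
    return [node for _, node in sorted((-neg, node) for neg, node in best)]


# Hypothetical toy data: six points on a line, each linked to its neighbors.
VECTORS = [[float(i)] for i in range(6)]
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

Calling search_layer(GRAPH, VECTORS, [4.2], entry=0, ef=3) walks from node 0 toward the query and keeps the three closest nodes it saw. A wider ef keeps more candidates alive during the walk, which is the intuition behind the ef_construction and hnsw.ef_search parameters discussed next.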

-- For cosine distance use vector_cosine_ops (the operator class must always match the distance operator)
CREATE INDEX ON doc_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Parameter	Default	What it decides	Effect of raising it
`m`	16	The max number of edges per node (the graph's density)	recall↑, memory↑, build speed↓
`ef_construction`	64	The candidate-list width at build time	graph quality↑, build time↑
`hnsw.ef_search`	40	The candidate-list width at search time (runtime)	recall↑, query speed↓

The decisive point is that search quality can be adjusted afterwards with the runtime parameter hnsw.ef_search. You can turn the accuracy↔speed dial per session or per query without rebuilding the index.

SET hnsw.ef_search = 100;   -- default 40; larger means higher recall but slower
SELECT id, content
FROM doc_chunks
ORDER BY embedding <=> $1
LIMIT 10;

HNSW's biggest operational advantage, as the official README states, is that it has no training step, so the index can be created even on a table with zero rows. The straightforward order "create the table → create the index → pour data in later" therefore works, which is why HNSW pairs well with production RAG where ingest never stops. In the voice chatbot, too, I chose HNSW on the premise that FAQ and product information are updated from time to time.

IVFFlat: split into clusters, search only the close clusters

IVFFlat (Inverted File with Flat compression) partitions the vector space into lists clusters ahead of time, and at search time scans only the probes clusters closest to the query.

-- lists is the number of clusters. The iron rule: create the index *after* loading data
CREATE INDEX ON doc_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

Parameter	Default	Official guideline
`lists`	100	Up to 1M rows: `rows / 1000`; over 1M rows: `sqrt(rows)`
`ivfflat.probes`	1	The number of clusters to search; larger means higher recall but slower. Starting point: `sqrt(lists)`

SET ivfflat.probes = 10;   -- default 1; how many of the lists clusters to search

IVFFlat's decisive constraint: because it has a training step (clustering), recall suffers unless you create the index after a reasonable amount of data is loaded. Don't create it on an empty table. If the data volume later changes substantially, lists needs to be revisited, which means a rebuild.
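
The README's starting points for lists (rows / 1000 up to 1M rows, sqrt(rows) beyond that) and for probes (sqrt(lists)) are easy to encode. The helper below is a small sketch; the function names suggest_lists and suggest_probes are hypothetical, and the results are starting values to be validated by measuring recall, not final settings.

```python
import math


def suggest_lists(row_count: int) -> int:
    """Starting value for IVFFlat `lists`, following the README guideline."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


def suggest_probes(lists: int) -> int:
    """Starting value for `ivfflat.probes`: roughly sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))
```

For a table of 4 million rows this suggests lists = 2000; for the lists = 100 used in the example above, it suggests probes = 10.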

Decision table: when unsure, HNSW

Viewpoint	HNSW	IVFFlat
Query performance (recall × speed)	High	Relatively low
Build cost	Slow, lots of memory	Fast, little memory
Creation on an empty table	Possible (no training)	Not recommended (create after loading)
Resilience to continuous add/update	Strong	Weak (needs `lists` re-adjustment)
Main use case	First choice for production RAG	Bulk-loading large data and operating in batch

The conclusion is simple: when unsure, choose HNSW. The reason is ETC (Easy To Change). Because the index can be created on an empty table, the operational flow can be fixed as "create, then load," with no dependence on data-volume estimates. Choose IVFFlat only when build time and memory are tight and you have a clear bulk-load + batch-update workload.

Beware the dimension cap: the vector type can be indexed up to 2,000 dimensions. For embeddings beyond that, an expression index that casts to halfvec (up to 4,000 dimensions, covered in Chapter 4) is an escape route.

2. Tune "after measuring" recall (verification-first)

This is the most important chapter in this article. Before raising or lowering ef_search or probes by guesswork, build a mechanism that quantifies recall on your own data. ANN search is fast but occasionally drops a correct answer, and you can't know whether that "occasionally" is tolerable without measuring it.

Get the exact kNN (the truth)

The denominator of recall, the true answer, comes from a brute-force (exact kNN) search that doesn't use the index. In pgvector, disable the planner's index scan for the transaction; the query then computes the distance to every row and returns the exact ordering.

-- Disable index scans for this transaction only, to get the exact kNN (brute force)
BEGIN;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;
SELECT id FROM doc_chunks ORDER BY embedding <=> $1 LIMIT 10;
COMMIT;

The cost of exact kNN grows in proportion to the row count, so it's realistic to run the recall evaluation on a sample of 100–200 representative queries. To make the brute-force pass faster, the README suggests parallelizing it with SET max_parallel_workers_per_gather = 4;.
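
What "exact kNN" means can also be seen without a database. The following pure-Python sketch sorts every row by cosine distance (the quantity the <=> operator computes, 1 minus cosine similarity) and keeps the top k. The function names and the in-memory rows dictionary are hypothetical; it is meant only as a mental model or as a cross-check on a tiny sample.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, assuming neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_topk_ids(rows, query, k):
    """Brute force: compute the distance to every row, sort, keep the k closest IDs."""
    ranked = sorted(rows, key=lambda row_id: cosine_distance(rows[row_id], query))
    return ranked[:k]
```

Because every row is visited, the cost is linear in the row count, which is exactly why the evaluation above is limited to a sample of queries.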

The recall harness: the match rate of ANN vs exact kNN

The idea is simple: run the same query once through the ANN index and once as exact kNN, and measure the overlap of the returned IDs (recall@k). Repeat this while varying ef_search, and you get the recall↔latency curve for your own data.

"""Harness that measures recall@k for pgvector.
Compares ANN (with hnsw.ef_search varied) against exact kNN with the index disabled,
and prints recall and p95 latency per ef_search. Tune by looking at this table."""
from __future__ import annotations

import time
from statistics import quantiles

import psycopg  # connection info comes from env vars PGHOST/PGUSER/... (no hardcoding)

K = 10
EF_GRID = (40, 80, 120, 200)  # grid of hnsw.ef_search values to try


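def exact_topk(conn: psycopg.Connection, qvec: str, k: int) -> set[int]:
    """Exact kNN with the index disabled: the ground truth for recall."""
    # Run in its own transaction so SET LOCAL ends here and cannot leak
    # into the ANN queries that follow on the same connection.
    with conn.transaction(), conn.cursor() as cur:
        cur.execute("SET LOCAL enable_indexscan = off")
        cur.execute("SET LOCAL enable_bitmapscan = off")
        cur.execute(
            "SELECT id FROM doc_chunks ORDER BY embedding <=> %s LIMIT %s",
            (qvec, k),
        )
        return {r[0] for r in cur.fetchall()}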


def ann_topk(conn: psycopg.Connection, qvec: str, k: int, ef: int) -> tuple[set[int], float]:
    """Fetch via ANN (HNSW); return the set of IDs and the elapsed time in ms."""
    with conn.transaction(), conn.cursor() as cur:
        # SET cannot take a bound parameter in psycopg 3; set_config(..., true)
        # is the parameterizable equivalent of SET LOCAL.
        cur.execute("SELECT set_config('hnsw.ef_search', %s, true)", (str(ef),))
        t0 = time.perf_counter()
        cur.execute(
            "SELECT id FROM doc_chunks ORDER BY embedding <=> %s LIMIT %s",
            (qvec, k),
        )
        rows = cur.fetchall()
        elapsed_ms = (time.perf_counter() - t0) * 1_000
    return {r[0] for r in rows}, elapsed_ms


def evaluate(conn: psycopg.Connection, query_vecs: list[str], k: int = K) -> None:
    """Print mean recall@k and p95 latency for each ef_search."""
    # Exact kNN doesn't depend on ef, so compute it once and cache it
    truth = [exact_topk(conn, q, k) for q in query_vecs]

    for ef in EF_GRID:
        recalls: list[float] = []
        latencies: list[float] = []
        for q, gold in zip(query_vecs, truth):
            got, ms = ann_topk(conn, q, k, ef)
            recalls.append(len(got & gold) / k)  # overlap / k = recall@k
            latencies.append(ms)
        recall = sum(recalls) / len(recalls)
        p95 = quantiles(latencies, n=20)[18]  # 95th percentile
        print(f"ef_search={ef:>3}  recall@{k}={recall:.3f}  p95={p95:6.1f}ms")

The output looks like this, for example (values are data-dependent, illustrative).

ef_search= 40  recall@10=0.892  p95=  3.1ms
ef_search= 80  recall@10=0.961  p95=  5.4ms
ef_search=120  recall@10=0.984  p95=  8.2ms
ef_search=200  recall@10=0.995  p95= 14.7ms

With this table, tuning becomes a decision. For example: "RAG answer quality needs recall@10 ≥ 0.95 and the p95 latency budget is 10 ms" leads to an ef_search between 80 and 120. If recall won't rise however far you raise ef_search, raise the build-side m / ef_construction and rebuild the index.
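
That decision can be written down as a rule. The sketch below picks the smallest ef_search that satisfies both a recall floor and a latency budget, using the illustrative numbers from the table above. The names Measurement, pick_ef_search, and TABLE are hypothetical, and "smallest qualifying value" is one reasonable policy rather than the only one.

```python
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    ef_search: int
    recall: float
    p95_ms: float


def pick_ef_search(rows, min_recall, max_p95_ms) -> Optional[int]:
    """Return the cheapest ef_search meeting both targets, or None if none does."""
    ok = [m.ef_search for m in rows if m.recall >= min_recall and m.p95_ms <= max_p95_ms]
    return min(ok) if ok else None


# The illustrative measurements from the table above (data-dependent in practice).
TABLE = [
    Measurement(40, 0.892, 3.1),
    Measurement(80, 0.961, 5.4),
    Measurement(120, 0.984, 8.2),
    Measurement(200, 0.995, 14.7),
]
```

With a recall floor of 0.95 and a 10 ms budget, this returns 80. A None result is the signal that no runtime setting is enough and the build-side parameters need revisiting.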

Confirm with EXPLAIN whether the index is effective

Most cases of "it doesn't get faster" come down to the ANN index not being used in the first place. Always confirm with EXPLAIN.

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM doc_chunks ORDER BY embedding <=> $1 LIMIT 10;
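--   → OK if the output shows "Index Scan using ..._hnsw_...".
--     If it shows "Seq Scan", suspect a mismatch between the distance operator and the operator class,
--     a malformed ORDER BY, or an expression that differs from the index expression.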

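A pitfall: the index is used only for queries of the form ORDER BY embedding <operator> $1 LIMIT k, that is, an ORDER BY on the bare distance operator in ascending order together with a LIMIT. Wrapping the distance in an expression (for example 1 - (embedding <=> $1) with DESC) prevents the index from being used; returning the distance in the SELECT list for display is fine. If EXPLAIN shows Seq Scan, fix the query form first, since raising ef_search has no effect on a query that never touches the index.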

3. Speed up the index build

Left alone, an HNSW build on millions of rows takes tens of minutes to hours. Shorten it with the three levers the official README lists.

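-- 1) Give the build more memory. It is dramatically faster when the whole graph fits.
--    If it doesn't fit, a NOTICE "graph no longer fits into maintenance_work_mem" appears: the sign that the build is slowing down.
SET maintenance_work_mem = '8GB';

-- 2) Increase the parallel workers (default 2). The build runs on the workers plus the leader.
SET max_parallel_maintenance_workers = 7;
-- When raising the worker count, don't forget the global cap (default 8)
SET max_parallel_workers = 8;

-- 3) For initial data, create the index after loading (the README's basic advice)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);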

The key points, staying close to the README:

maintenance_work_mem: ideally, allocate enough that the whole HNSW graph fits. When it no longer fits, pgvector emits a NOTICE. Keeping the dimension down (the dimensions=1024 strategy from the sister article) directly reduces this cost as well.
max_parallel_maintenance_workers: raising it from the default of 2 parallelizes the build. Set it with your CPU core count in mind.
Load, then build, for the initial load: with a large amount of initial data, loading with COPY and then building the index in one go is faster than inserting rows one at a time into an empty index. This concerns the initial migration and is separate from continuous ingest.

Monitor the build progress

Waiting blind through a long build is nerve-racking. Build progress is visible through PostgreSQL's standard progress view.

SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%"
FROM pg_stat_progress_create_index;
--   HNSW phases: "initializing" → "loading tuples"

The standard play for large initial loads: the initial bulk insert of embeddings is fastest with COPY doc_chunks (...) FROM STDIN WITH (FORMAT BINARY). It is far faster than inserting one row at a time, which shortens the load that precedes the build.

4. Cut memory and cost with quantization (the heart of the pgvector consolidation strategy)

This chapter is what makes "consolidated in Postgres and still light" possible. vector(n) uses 4 bytes per dimension (float32), so a 1024-dimension embedding is about 4KB per row for the vector alone, with the HNSW graph on top. As the row count grows, this puts pressure on RAM, and once the index no longer fits in shared_buffers, performance falls off a cliff.

The countermeasure is quantization: give up a little accuracy to cut storage substantially. In pgvector this is done with expression indexes.
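
The per-row arithmetic is worth having on hand. The sketch below uses the storage sizes given in the pgvector README (4 × dimensions + 8 bytes for vector, 2 × dimensions + 8 for halfvec, dimensions / 8 + 8 for bit). The function names are hypothetical, and the totals cover only the raw vector values, not the HNSW graph edges or page overhead.

```python
def vector_bytes(dims: int) -> int:
    """float32 vector: 4 bytes per dimension plus an 8-byte header."""
    return 4 * dims + 8


def halfvec_bytes(dims: int) -> int:
    """float16 halfvec: 2 bytes per dimension plus an 8-byte header."""
    return 2 * dims + 8


def bit_bytes(dims: int) -> int:
    """bit vector: 1 bit per dimension (rounded up to bytes) plus an 8-byte header."""
    return (dims + 7) // 8 + 8


def total_gib(row_count: int, bytes_per_row: int) -> float:
    """Raw vector payload for row_count rows, in GiB."""
    return row_count * bytes_per_row / 2**30
```

At 1024 dimensions this gives 4,104, 2,056, and 136 bytes per row respectively, so one million rows carry roughly 3.8 GiB of float32 vector data before any index structure is counted.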

4-1. halfvec: half storage with half precision

halfvec is 2 bytes per dimension (float16). Storage is nearly half, and the index supports up to 4,000 dimensions (vector is up to 2,000). For text embeddings, the accuracy degradation is often very small, and it's the first low-risk move you should try.

-- When making the whole column halfvec
CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(1024));

-- Keep the existing vector column as-is and cast to halfvec only in the index (expression index)
CREATE INDEX ON doc_chunks
    USING hnsw ((embedding::halfvec(1024)) halfvec_cosine_ops);

-- Apply the same cast in the query (it must match the index expression)
SELECT id, content
FROM doc_chunks
ORDER BY embedding::halfvec(1024) <=> $1::halfvec(1024)
LIMIT 10;

4-2. Binary quantization: about 1/30 the size + accuracy recovery with re-ranking

The most aggressive cost reduction is binary quantization: binary_quantize keeps only the sign of each dimension as a single bit. bit(1024) is about 128 bytes, roughly 1/30 of the about 4KB of vector(1024). The standard play is a two-stage setup: narrow down coarsely and very quickly with Hamming distance (<~>), then re-rank only the top candidates with the original vectors to recover accuracy.

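-- For the coarse search: an expression index on the sign bits (bit_hamming_ops + <~>)
CREATE INDEX ON doc_chunks
    USING hnsw ((binary_quantize(embedding)::bit(1024)) bit_hamming_ops);

-- Two-stage search: (1) a generous 20 rows by bit → (2) refine to the top 10 by cosine distance on the original vectors
-- ($1 is cast to vector so its parameter type is explicit)
SELECT id, content
FROM (
    SELECT id, content, embedding
    FROM doc_chunks
    ORDER BY binary_quantize(embedding)::bit(1024) <~> binary_quantize($1::vector)
    LIMIT 20                                   -- coarse sieve (Hamming distance, very fast)
) AS coarse
ORDER BY embedding <=> $1                       -- precise re-rank (original vectors, cosine)
LIMIT 10;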

Why it works: the first-stage bit search is orders of magnitude lighter in both computation and memory. To avoid missing good candidates, fetch more than you need (LIMIT 20, or several times the final k), then re-order in the second stage by the exact distance on the original vectors. Because only the bit index needs to stay in RAM, the index fits in memory far more easily even at large scale. Decide the re-rank width (the LIMIT 20) while watching recall with the harness from Chapter 2.
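
The same two stages can be traced by hand in a few lines of Python. This is a sketch of the idea, not of pgvector's internals: the function names and the four toy vectors are hypothetical, and mapping positive components to 1 and everything else to 0 (including exact zeros) is an assumption of this sketch.

```python
import math


def binary_quantize(vec):
    """Keep only the sign of each dimension: positive -> 1, otherwise 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def two_stage_search(rows, query, coarse_k, final_k):
    """Stage 1: cheap Hamming sieve. Stage 2: exact cosine re-rank of the survivors."""
    qbits = binary_quantize(query)
    coarse = sorted(rows, key=lambda rid: hamming(binary_quantize(rows[rid]), qbits))[:coarse_k]
    return sorted(coarse, key=lambda rid: cosine_distance(rows[rid], query))[:final_k]


# Hypothetical toy rows: id -> embedding.
ROWS = {
    1: [0.9, 0.1, -0.2],
    2: [0.1, 0.9, -0.8],
    3: [-0.5, -0.5, 0.5],
    4: [0.8, 0.2, 0.1],
}
QUERY = [1.0, 0.2, -0.1]
```

Here rows 1 and 2 share the query's sign pattern, while row 4 differs in one bit even though it is the second-closest vector by cosine distance. With coarse_k=2 the sieve drops row 4 and the final answer contains row 2 instead; with coarse_k=3 the re-rank recovers rows 1 and 4. That is the reason for fetching more candidates than the final k.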

4-3. subvector (Matryoshka): coarse with the leading dimensions, precise with all dimensions

Matryoshka (MRL)-capable embeddings such as OpenAI's text-embedding-3-* pack the more important information toward the front of the vector. This lets you index only the leading few hundred dimensions via subvector for the coarse pass, then re-rank with all dimensions.

-- Index only the leading 256 dimensions (coarse and light)
CREATE INDEX ON doc_chunks
    USING hnsw ((subvector(embedding, 1, 256)::vector(256)) vector_cosine_ops);

-- Two-stage search: 20 rows by the leading 256 dimensions → re-rank with all 1024 dimensions
-- ($1 is cast to vector so that subvector() resolves to the vector overload)
SELECT id, content
FROM (
    SELECT id, content, embedding
    FROM doc_chunks
    ORDER BY subvector(embedding, 1, 256)::vector(256) <=> subvector($1::vector, 1, 256)
    LIMIT 20
) AS coarse
ORDER BY embedding <=> $1
LIMIT 10;
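
For intuition, the prefix-then-re-rank idea looks like this in plain Python. It is only a sketch: the function names and toy vectors are hypothetical, subvector here mimics the 1-based start and count arguments used in the SQL above, and the vectors are assumed to have non-zero prefixes.

```python
import math


def subvector(vec, start, count):
    """Slice `count` dimensions beginning at 1-based position `start`."""
    return vec[start - 1:start - 1 + count]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def prefix_then_rerank(rows, query, prefix_dims, coarse_k, final_k):
    """Coarse pass on the leading dimensions, then exact re-rank on all dimensions."""
    q_prefix = subvector(query, 1, prefix_dims)
    coarse = sorted(
        rows,
        key=lambda rid: cosine_distance(subvector(rows[rid], 1, prefix_dims), q_prefix),
    )[:coarse_k]
    return sorted(coarse, key=lambda rid: cosine_distance(rows[rid], query))[:final_k]
```

In the toy case used to check this sketch, two rows are indistinguishable on their first two dimensions, and only the full-dimension re-rank puts the true nearest neighbor first.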

A quick-reference of quantization

Method	Storage guideline (1024 dim)	Accuracy	Use case
`vector` (uncompressed, float32)	About 4KB/row	Baseline	Mid-scale row count, want straightforward accuracy
`halfvec` (float16)	About 2KB/row	Nearly equivalent	Try first. Halve memory at low risk
`subvector` + re-rank	The coarse stage is small	High (recovered by re-rank)	MRL embeddings (text-embedding-3, etc.) at large scale
`binary_quantize` + re-rank	About 128B/row (coarse stage)	Medium~high (recovered by re-rank)	Super-large-scale. Want to fit the index in RAM

Design decision: don't jump straight to the strongest option (binary). Go in the order halfvec → measure recall → if still insufficient, two-stage search (subvector / binary). The harness from Chapter 2 provides the evidence at each step. Keeping the voice chatbot's dimension at 1024 works together with this quantization lever, so the index keeps fitting inside Postgres without adding a dedicated vector DB.

5. The filter trap: "over-filtering" and the iterative scan (0.8.0+)

Production RAG almost always narrows by metadata: you want semantic search only over documents "of this tenant," "of this category," or "published." This is where the most commonly misunderstood trap of ANN indexes lies.

What is over-filtering

The README's own example goes straight to the core.

When a condition matches 10% of all rows, HNSW with the default hnsw.ef_search = 40 returns only about 4 matching rows on average.

The reason is that an ANN index first gathers ef_search candidates close to the query and only then applies the WHERE clause. If 10% of the 40 candidates satisfy the filter, about 4 remain on average, which is not enough for LIMIT 10. This is what is really happening when "I added a filter and suddenly got fewer results / worse accuracy."
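
The arithmetic behind that example generalizes. The sketch below assumes, as a simplification, that matching rows are spread evenly among the candidates; the function names are hypothetical.

```python
import math


def expected_matches(ef_search: int, selectivity: float) -> float:
    """Average number of candidates that survive the filter."""
    return ef_search * selectivity


def ef_search_needed(k: int, selectivity: float) -> int:
    """Candidate-list width at which about k rows survive on average."""
    # The small epsilon guards against floating-point noise pushing ceil() up by one.
    return math.ceil(k / selectivity - 1e-9)
```

With 10% selectivity, 40 candidates leave 4 rows, and about 100 candidates are needed before LIMIT 10 can be filled on average. At 1% selectivity the same calculation asks for about 1,000 candidates, which is why simply widening ef_search stops being practical for highly selective filters.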

Choose the remedy by selectivity

Here are the README's remedies, organized by the filter's selectivity (the fraction of rows that remain).

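-- (a) Ordinary index on the filter column: effective for queries with low selectivity (= most rows are filtered out).
--     The planner may pick the B-tree instead of the ANN index; filtering first and then computing distances can be faster.
CREATE INDEX ON doc_chunks (tenant_id);
CREATE INDEX ON doc_chunks (location_id, category_id);   -- composite index for composite conditions

-- (b) Partial index: when filtering by a small number of fixed values (e.g. dedicated to one tenant).
CREATE INDEX ON doc_chunks
    USING hnsw (embedding vector_cosine_ops)
    WHERE (tenant_id = 123);

-- (c) Partitioning: when splitting by many distinct values (e.g. many tenants).
CREATE TABLE doc_chunks (embedding vector(1024), tenant_id int /* ... */)
    PARTITION BY LIST (tenant_id);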

Iterative scans (pgvector 0.8.0)

The highlight of 0.8.0 is the iterative scan. When over-filtering occurs, it automatically scans more of the index until enough results are gathered. It is off by default (opt-in).

-- HNSW: strict_order (keeps exact distance order) or relaxed_order (slightly out of order, but higher recall)
SET hnsw.iterative_scan = strict_order;
SET hnsw.max_scan_tuples = 20000;       -- max number of tuples to scan (default 20000)
SET hnsw.scan_mem_multiplier = 2;       -- memory to use, as a multiple of work_mem (default 1)

-- IVFFlat: supports relaxed_order
SET ivfflat.iterative_scan = relaxed_order;
SET ivfflat.max_probes = 100;           -- upper bound on the clusters probed across iterations
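
As a mental model only, the difference between a single pass and an iterative scan can be simulated in Python. This is not how pgvector implements it; the function names are hypothetical, and the candidates are assumed to arrive already sorted by distance.

```python
from itertools import islice


def single_pass(candidates, passes_filter, k, ef_search):
    """No iterative scan: look at the first ef_search candidates, then filter."""
    return [c for c in islice(candidates, ef_search) if passes_filter(c)][:k]


def iterative_scan(candidates, passes_filter, k, max_scan_tuples):
    """Keep pulling candidates until k rows pass the filter or the scan cap is hit."""
    results = []
    for c in islice(candidates, max_scan_tuples):
        if passes_filter(c):
            results.append(c)
            if len(results) == k:
                break
    return results
```

With a filter that keeps every tenth candidate, single_pass over 40 candidates yields 4 rows, while iterative_scan continues until it has 10. Lowering max_scan_tuples caps the extra work at the price of possibly returning fewer than k rows, which is the trade-off the settings above control.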

relaxed_order can gain recall, but results may deviate slightly from strict distance order. If you need strict ordering, the README's approach is to capture the results in a MATERIALIZED CTE and re-sort them.

-- Re-sort the relaxed_order results into guaranteed distance order
SET hnsw.iterative_scan = relaxed_order;

WITH relaxed AS MATERIALIZED (
    SELECT id, content, embedding <=> $1 AS distance
    FROM doc_chunks
    WHERE tenant_id = 123
    ORDER BY distance
    LIMIT 20
)
SELECT id, content, distance
FROM relaxed
ORDER BY distance + 0      -- the "+ 0" idiom keeps the re-sort from being dropped on Postgres 17+
LIMIT 10;

The decision guideline: ① First, put an index on the filter column and look at EXPLAIN. If the planner narrows efficiently with the B-tree, that is often enough. ② If over-filtering remains, enable iterative_scan with strict_order. ③ If you need still more recall, use relaxed_order + a MATERIALIZED CTE. At every stage, confirm recall with the harness from Chapter 2.

Enforce access control on the DB side

A metadata filter is a performance concern and, at the same time, a security boundary. With a dedicated vector DB the pattern tends to become "search everything, then filter in the app," which is a breeding ground for information leaks. With everything consolidated in Postgres, row-level security (RLS) lets the database itself enforce that only your own tenant's rows are returned (see also Supabase RLS Design).

ALTER TABLE doc_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON doc_chunks
    USING (tenant_id = current_setting('app.tenant_id', true)::int);

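However the app errs, rows of other tenants are never returned; the principle of least privilege becomes a database invariant. This holds as long as the application connects as a role that is subject to the policy: by default, the table owner, superusers, and roles with BYPASSRLS skip RLS (use ALTER TABLE ... FORCE ROW LEVEL SECURITY to apply it to the owner as well).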
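
On the application side, the policy above needs app.tenant_id to be set for each request. One way to do that, sketched here under the assumption of a psycopg 3 style connection (conn.transaction() and conn.cursor()), is to set it with set_config(..., true) so the value lasts only for the current transaction. The function name search_as_tenant is hypothetical, and conn is left untyped so the sketch has no import dependencies.

```python
from typing import Any


def search_as_tenant(conn: Any, tenant_id: int, qvec: str, k: int = 10) -> list[int]:
    """Run a top-k search with app.tenant_id scoped to a single transaction."""
    with conn.transaction():
        with conn.cursor() as cur:
            # Transaction-local setting: it disappears at commit/rollback,
            # so a pooled connection cannot carry it over to the next request.
            cur.execute("SELECT set_config('app.tenant_id', %s, true)", (str(tenant_id),))
            cur.execute(
                "SELECT id FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
                (qvec, k),
            )
            return [row[0] for row in cur.fetchall()]
```

Note that the query itself contains no tenant condition; the RLS policy supplies it. Keeping the setting transaction-local matters with connection pools, where a session-level value could otherwise leak between tenants.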

6. Operations: HNSW's VACUUM, REINDEX, and replication

Tuning isn't "build it and you're done." In production, where updates and deletes keep coming, you need to know the operational quirks.

VACUUM tends to be heavy: the README notes that vacuuming an HNSW index can take a while and that reindexing first speeds it up. On a table with many deletes, rebuilding the index with REINDEX INDEX CONCURRENTLY and then running VACUUM is effective.

REINDEX INDEX CONCURRENTLY doc_chunks_embedding_idx;   -- rebuild without downtime
VACUUM doc_chunks;

Replication support: pgvector indexes are written to the WAL, so they replicate as-is with streaming replication. Offloading search to a read replica is straightforward.
Continuous monitoring items:
- Index size (\di+ / pg_relation_size) relative to shared_buffers. Outgrowing it is a leading indicator of the performance cliff.
- Recall as a tracked metric: run the harness from Chapter 2 periodically and check whether recall has dropped as the data distribution changes. If it has, revisit ef_search / m.
- The buffer hit rate from EXPLAIN (ANALYZE, BUFFERS): growing disk reads are a sign of insufficient memory.

A PII caveat: search queries can contain personal information. For observability, don't write raw queries or chunk bodies to logs. Limit what you record to metadata such as the query ID, the top-k IDs, distances, latency, and recall.
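
A log record that follows this rule might look like the sketch below. The function name, the field names, and the use of a random UUID as the query ID are all assumptions for illustration; the point is only that the function never receives the query text or chunk bodies, so it cannot log them.

```python
import json
import uuid
from typing import Optional


def build_search_log(
    topk_ids: list[int],
    distances: list[float],
    latency_ms: float,
    recall: Optional[float] = None,
) -> str:
    """Serialize search metadata only; no query text or chunk content is accepted."""
    record = {
        "query_id": uuid.uuid4().hex,   # random ID, not derived from the query text
        "topk_ids": topk_ids,
        "distances": distances,
        "latency_ms": latency_ms,
        "recall": recall,
    }
    return json.dumps(record, sort_keys=True)
```

If a query needs to be traced back for debugging, the query ID can be mapped to the original text in a separate, access-controlled store rather than in the general log stream.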

7. Summary: the pgvector tuning cheat sheet

A quick reference for when you're unsure.

Mental model: the triangle of recall × latency × memory. Move it after measuring.
Index: when unsure, HNSW (it can be created on an empty table first, so it handles continuous ingest well). Search quality is tuned with the rebuild-free hnsw.ef_search; build quality with m / ef_construction. For bulk-load + batch operation, IVFFlat (match lists to the data volume).
Always measure recall: recall@k of ANN against exact kNN with the index disabled. Confirm the Index Scan with EXPLAIN. If it shows Seq Scan, fix the query form first.
Speeding up the build: enlarge maintenance_work_mem, increase max_parallel_maintenance_workers, and build after loading. Track progress with pg_stat_progress_create_index.
Cost reduction with quantization: start with halfvec (half the size) → if that is insufficient, binary_quantize + re-rank with the original vectors (about 1/30) or subvector (Matryoshka) + re-rank.
The filter trap: over-filtering happens because candidates are gathered first and the WHERE is applied afterwards. Choose among a filter-column index, a partial index, and partitioning by selectivity, and let the iterative scan (0.8.0) fetch more candidates automatically. For more recall with strict order, use relaxed_order + a MATERIALIZED CTE (+ 0).
Access control: a metadata filter = a security boundary. Close it on the DB side with RLS.
Operations: HNSW VACUUM is heavy → REINDEX CONCURRENTLY first. Continuously monitor index size and recall.

Tuning vector search is not a hunt for a magic parameter. It comes down to building the verification path that measures recall first, and then deciding where in the triangle to stand. pgvector's strength is that you can make that decision inside familiar PostgreSQL, using existing tools such as EXPLAIN, expression indexes, and RLS.

For the generative-AI voice chatbot, I consolidated business data and embeddings in PostgreSQL + pgvector without adding a dedicated vector DB, and carried a production RAG (text-embedding-3-large at 1024 dimensions, HNSW, top-10 search) through design, implementation, and operation as one person working with generative AI (Claude Code). I chose the dimension, index, quantization, and filter strategy based on measurement, achieving both speed and accuracy on top of the operational simplicity of keeping a single data store.

"pgvector is slow / accuracy doesn't come out / cost is ballooning" — from isolating the cause to index design, quantization, filter optimization, and operations, I can accompany you end-to-end. If you want to start from the overall RAG design, the sister article Production RAG Built with pgvector is also for you. Feel free to consult us, even from the requirements-organizing stage.

Reference (Official Documentation)

pgvector (GitHub, README) — the vector / halfvec / bit / sparsevec types and dimension caps, distance operators, the HNSW / IVFFlat creation syntax and parameters (m / ef_construction / hnsw.ef_search / lists / ivfflat.probes), the iterative scan (hnsw.iterative_scan / max_scan_tuples / scan_mem_multiplier / ivfflat.max_probes), quantization (halfvec expression index / binary_quantize + bit_hamming_ops / subvector), filters and over-filtering, build speedup (maintenance_work_mem / max_parallel_maintenance_workers), pg_stat_progress_create_index, VACUUM/REINDEX
pgvector CHANGELOG — the per-version added features, like the iterative scan (0.8.0) and hamming_distance / jaccard_distance and L1 HNSW support (0.7.0)
OpenAI Embeddings guide — text-embedding-3-large / -small, dimension shortening via the dimensions parameter (Matryoshka), the premise of subvector re-ranking
