Production RAG Built with pgvector: A Design That Consolidates into PostgreSQL Without Adding a Dedicated Vector DB (HNSW, Hybrid Search, Idempotent Ingest)

"I want a mechanism where AI answers grounded in internal documents" — the requirement of RAG (Retrieval-Augmented Generation), looking only at the first sentence, is simple. But the moment you try to put it into production, the first big fork comes.

Where do you put the embedding vectors? Do you add a dedicated vector DB (Pinecone, etc.), or do you consolidate into PostgreSQL, which already holds the business data?

This article is an implementation guide for the latter — using pgvector to consolidate business data and embeddings into a single RDB and build production RAG. The dedicated-vector-DB-side design is handled in a separate article, Production RAG with LangChain × Pinecone, so this piece digs into the option of "leaning on the business RDB without adding a dedicated vector DB" in code faithful to the official documentation, more practically than the official. As subject matter, I'll mix in design decisions from a generative-AI voice chatbot I built — a RAG customer-service system that consolidates business data and embeddings into PostgreSQL + pgvector.

The rules of this article: the SQL syntax, operators, and index parameters are based on the pgvector official README, and the embedding-model specs on the OpenAI official documentation (both as of June 2026). Since versions and pricing get revised, always double-check the latest values in the official before going to production. The code is shaped to be usable in real operation, but secrets are assumed to be in environment variables (hardcoding connection strings/API keys is strictly forbidden).

0. The first fork: why Postgres + pgvector instead of a dedicated vector DB

Introduce a dedicated vector DB while leaving this ambiguous, and you take on the cost that takes effect later: the data store you operate increases by one. First, let me organize the decision axes.

Decision axis	Dedicated vector DB (Pinecone, etc.)	PostgreSQL + pgvector
Transactional consistency	A separate system from business data. Can't two-phase commit, and writes can drift	Can update the business row and the embedding in the same transaction
Operational burden	One more DB (monitoring, backup, permissions, incident response)	Consolidated into existing Postgres operations. Backup and PITR all at once
Metadata filters	A product-specific filter DSL	SQL itself (JOIN, WHERE, full-text search all usable)
Access control	Designed separately	Reuse row-level security (RLS) and existing permission design
Cost	The monthly fee of the dedicated service piles up	Inside Postgres. Keep the dimension down and the storage increase is limited too
Ultra-large scale, ultra-low latency	The dedicated DB has the edge (hundreds of millions of vectors / single-digit ms)	Sufficiently practical up to several million~tens of millions of rows. Beyond that needs design

If you regard "the embedding, after all, as one of that document's attributes," the necessity of carving it out into a dedicated DB disappears in many projects. The value of having users, orders, documents, and their embeddings within one transaction boundary takes effect in production operation. You soft-delete a RAG source document, but the vector remains in a separate DB and "information you thought you deleted" mixes into the answer — having a dedicated DB as a separate system can structurally cause such consistency accidents. With the same RDB, DELETE and UPDATE both close in the same commit.

Honestly, the "cases pgvector isn't suited for" too

Hundreds-of-millions-of-vectors scale, sub-millisecond latency SLA: this domain is the turf of a dedicated vector DB or a dedicated ANN service. pgvector can fight well up to tens of millions of rows too, but don't push it at extreme scale.
The team doesn't operate Postgres in the first place: if the merit of consolidation (unifying operations) doesn't take effect, the advantage thins.

Conversely, for projects where "Postgres already exists and the vectors fit within several million~tens of millions of rows" — i.e., the vast majority of B2B SaaS and internal tools — consolidating into pgvector becomes the first choice. This is YAGNI itself. There's no need to split your data store into two now for hundreds of millions of vectors that may or may not come.

That I leaned on PostgreSQL + pgvector without adding a dedicated vector DB in the voice chatbot was this judgment too. I placed, as the top operational priority, being able to handle business data (FAQ, products, support history) and the embeddings for its semantic search in the same DB, the same transaction.

1. Basics: extension, table, distance operators, nearest-neighbor search

First, a minimal configuration faithful to the official README. From CREATE EXTENSION to nearest-neighbor search, this is the foundation.

-- 1. 拡張を有効化（スーパーユーザー or 権限が必要）
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. チャンク（分割したドキュメント片）と埋め込みを格納するテーブル
CREATE TABLE doc_chunks (
    id            bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    document_id   bigint      NOT NULL,            -- 元ドキュメントへの参照（業務行とJOINできる）
    content       text        NOT NULL,            -- チャンク本文
    content_hash  bytea       NOT NULL,            -- 冪等インジェスト用（第4章）
    embedding     vector(1024) NOT NULL,           -- 埋め込み。次元は固定（第3章で1024を選ぶ理由）
    embed_model   text        NOT NULL,            -- 再埋め込み運用のためのモデル名（第6章）
    metadata      jsonb       NOT NULL DEFAULT '{}',-- テナント・カテゴリ等のフィルタ用
    created_at    timestamptz NOT NULL DEFAULT now()
);

The 1024 in vector(1024) is the number of dimensions. This must match the embedding model's output dimension (the reason is Chapter 3). pgvector's vector type supports up to 2,000 dimensions when you index it (if you need higher dimensions, use the halfvec described later, up to 4,000 dimensions).

Distance operators (get this wrong and search breaks)

pgvector's distance operators are as follows. The iron rule is to align the distance the embedding model assumes with the operator and the index's operator class.

Operator	Distance	Main use
`<->`	L2 (Euclidean) distance	Non-normalized vectors in general
`<#>`	Negative inner product	Inner-product-based models
`<=>`	Cosine distance	The standard for text embeddings (measures meaning by direction)
`<+>`	L1 (taxicab) distance	L1 space
`<~>`	Hamming distance	`bit` type (binary vectors)
`<%>`	Jaccard distance	`bit` type

Since text embeddings like OpenAI's text-embedding-3-* are standardly handled with cosine similarity, this article consistently uses <=> (cosine distance).

Caution: <#> returns the "negative inner product." It's the sign to preserve the ordering "the smaller as a distance, the closer." When showing the score to people, be careful about the sign's interpretation.

k-nearest-neighbor search (kNN)

Nearest-neighbor search takes the straightforward form ORDER BY <distance operator> LIMIT k.

-- クエリ埋め込み $1 に意味が近いチャンクを上位5件
SELECT id, document_id, content,
       embedding <=> $1 AS distance      -- 小さいほど近い（コサイン距離）
FROM doc_chunks
ORDER BY embedding <=> $1                 -- インデックスはこの ORDER BY に効く
LIMIT 5;

The point is to write the distance operator directly in the ORDER BY. The HNSW/IVFFlat index described later takes effect precisely for this form of "ORDER BY embedding <operator> ... LIMIT k." Outputting the distance in the SELECT clause is for display, and what makes the index effective is the ORDER BY side — understand this and you won't be confused.

2. Index: how to choose between HNSW and IVFFlat

Full-scanning millions of rows every time breaks latency. pgvector's approximate-nearest-neighbor (ANN) indexes are of 2 kinds. This choice decides RAG's search performance and operability.

HNSW

-- HNSW：グラフ構造のインデックス。コサイン距離なら vector_cosine_ops
CREATE INDEX ON doc_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

m: the max number of edges each node connects (default 16). The larger, the higher the recall but more memory.
ef_construction: the search width at build time (default 64). The larger, the higher-quality the graph but the slower the build.
The accuracy/speed at search time is tuned with runtime parameters (no index rebuild needed):

SET hnsw.ef_search = 100;  -- 探索幅。デフォルト40。大きいほど高再現率・低速

HNSW's biggest operational advantage is, as the official explicitly states, "unlike IVFFlat, it has no training step, so you can create the index even with zero rows." In other words, you can put the index on at table-creation time and pour data in later. This property is why it pairs well with production RAG where ingest continuously occurs.

IVFFlat

-- IVFFlat：転置ファイル。lists はクラスタ数。データ投入後に作成する
CREATE INDEX ON doc_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

lists: the number of clusters (lists). The official guideline is "rows / 1000 up to 1 million rows, sqrt(rows) above 1 million rows."
The runtime parameter at search time:

SET ivfflat.probes = 10;  -- 探索するリスト数。デフォルト1。大きいほど高再現率・低速

IVFFlat's decisive constraint: as the official says "one key to good recall is to create the index after some data is in the table," IVFFlat has a training step (clustering), so it needs to be created after data is loaded. Put it on an empty table and recall breaks.

Decision table: HNSW vs IVFFlat

The official expression is "HNSW has good query performance but is slow to build and uses a lot of memory," "IVFFlat is fast to build and uses little memory but has inferior query performance." Let me translate this into operational language.

Viewpoint	HNSW	IVFFlat
Query performance (recall × speed)	High	Relatively low
Build cost	Slow, lots of memory	Fast, little memory
Creation on an empty table	Possible (no training step)	Not possible (create after loading data)
Resilience to continuous add/update	Strong (can put it on first)	Weak (situations arise needing rebuild/retraining)
Robustness to hard-to-see data scale	Strong	Premised on matching `lists` to the data volume
Main use case	First choice for production RAG	Bulk-loading large data and batch-updating

The practical conclusion: when unsure, HNSW. The reason is ETC (Easy To Change). With HNSW, you can put the index on an empty table first and then begin ingest, so the operational flow becomes the straightforward order of "create → put in" and doesn't depend on estimating the data volume. IVFFlat creates a procedure dependency of "put some in then create / review lists when the volume changes," which tends to become operational debt in RAG with continuous updates. Choose IVFFlat only when build speed and memory are tight and you have a clear reason to bulk-load data and batch-update.

In the voice chatbot, because the premise was that FAQ and product information are updated from time to time — i.e., ingest occurs continuously — I chose HNSW.

Memory caution: HNSW is fast only when the index is on memory. Secure enough maintenance_work_mem at build time, and in operation be mindful of whether the index size fits in shared_buffers. Keeping the dimension down (Chapter 3) directly affects this cost too.

3. How to choose the embedding dimension: why 1024 is the "default"

How to decide the 1024 in vector(1024). This is the place to design the trade-off between cost and accuracy.

OpenAI's embedding models' output dimensions are officially as follows.

text-embedding-3-large: 3072 dimensions
text-embedding-3-small: 1536 dimensions
text-embedding-ada-002: 1536 dimensions

And text-embedding-3-* has a dimensions parameter, and the official says "even shortening the embedding (cutting the trailing numbers) doesn't lose the property of representing concepts." It even states "text-embedding-3-large shortened to 256 dimensions outperforms the non-shortened text-embedding-ada-002 (1536 dimensions)."

What this means is — holding the full 3072 dimensions as-is is over-spec in many cases. The dimension takes effect on all of storage, index memory, and search computation. Drop 3072 dimensions to 1024 and the vector's storage and index memory become roughly 1/3. This meshes perfectly with the pgvector consolidation strategy — "fit it inside Postgres."

Design decision: I make using text-embedding-3-large at dimensions=1024 this article's standard. Leveraging large's expressiveness, at 1024 dimensions it fits comfortably within pgvector's indexable 2,000-dimension limit, and keeps cost down too. If accuracy is severe, raise it to 1536 or 3072; if cost is the top priority, lower it to small or 512 — in any case, always match the column-definition dimension and the dimensions parameter. Get this misaligned and you get a dimension-mismatch error on INSERT (being protected by the type is an advantage).

In the voice chatbot, I operated text-embedding-3-large at 1024 dimensions. Keeping the accuracy needed for customer-service semantic search while fitting the storage inside Postgres without adding a dedicated vector DB — keeping the dimension down to 1024 was the judgment to make this "consolidated and still light" hold.

If you need ultra-high dimensions, halfvec: the half-precision halfvec(n) supports up to 4,000 dimensions when indexed, and halves the storage. It's good to remember as an escape route for when you want to hold 1536 or 3072 as-is but keep memory down (the accuracy degradation needs evaluation).

4. Ingest: chunk design and "idempotent" upsert

The quality of search is 80% decided by ingest design. And what goes wrong most in production is the non-idempotency of ingest: "re-ingesting the same document put the embedding in twice" or "it failed midway and ended up in a half-baked state."

Principles of chunk design

Cut by units of meaning: cut at the boundaries of headings, paragraphs, and bullet lists. Chopping at a fixed character count splits sentences and concepts, dropping search accuracy.
Moderate overlap: slightly overlapping between chunks picks up context spanning a boundary. Overlapping too much causes duplication and cost increase.
Always carry metadata: document_id, tenant, category, update time. This becomes the foundation of Chapter 5's filtering and access control.

Idempotent upsert: prevent double insertion with a content hash

Guarantee "a chunk with the same content is 1 row no matter how many times you ingest it." The key is making the chunk body's content hash the unique key. Re-run, and if the content hasn't changed, it doesn't even hit the embedding API (= idempotent = cost saving).

-- (document_id, content_hash) を一意に。同内容の再投入は何度やっても1行
CREATE UNIQUE INDEX uq_doc_chunk
    ON doc_chunks (document_id, content_hash);

"""ドキュメントを分割し、内容ハッシュで冪等にアップサートする。
再実行しても二重投入せず、変わっていないチャンクは埋め込みAPIを叩かない。"""
import hashlib
import os

import psycopg
from openai import OpenAI

client = OpenAI()  # APIキーは環境変数 OPENAI_API_KEY から（ハードコード禁止）

EMBED_MODEL = "text-embedding-3-large"
EMBED_DIM = 1024  # ← vector(1024) と必ず一致させる


def content_hash(text: str) -> bytes:
    """正規化した本文の決定的ハッシュ。同じ内容 → 同じキー → 重複しない。"""
    normalized = text.strip()
    return hashlib.sha256(normalized.encode("utf-8")).digest()


def embed(texts: list[str]) -> list[list[float]]:
    """OpenAI 埋め込み。dimensions で 1024 に短縮（コストとストレージを抑える）。"""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        dimensions=EMBED_DIM,  # text-embedding-3-* のみ対応
    )
    return [d.embedding for d in resp.data]


def ingest(conn: psycopg.Connection, document_id: int, chunks: list[str]) -> None:
    """チャンク群を冪等にアップサート。既存（同ハッシュ）は埋め込みを再生成しない。"""
    hashes = [content_hash(c) for c in chunks]

    # 1) 既に入っているハッシュを先に引く（= 埋め込みAPIの無駄打ちを避ける）
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content_hash FROM doc_chunks "
            "WHERE document_id = %s AND content_hash = ANY(%s)",
            (document_id, hashes),
        )
        existing = {row[0] for row in cur.fetchall()}

    # 2) 新規チャンクだけ埋め込み生成（コスト効率：差分だけ課金）
    new_items = [
        (c, h) for c, h in zip(chunks, hashes) if h not in existing
    ]
    if not new_items:
        return  # 全部既存 → 何もしない（完全に冪等）

    vectors = embed([c for c, _ in new_items])

    # 3) ON CONFLICT で二重投入を構造的に排除（競合時も1行に収束）
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO doc_chunks
                (document_id, content, content_hash, embedding, embed_model)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (document_id, content_hash) DO NOTHING
            """,
            [
                (document_id, c, h, v, EMBED_MODEL)
                for (c, h), v in zip(new_items, vectors)
            ],
        )
    conn.commit()  # 業務行の更新と同一トランザクションに束ねれば整合性も担保

Let me organize the design crux.

Make the content hash the key: if the body is identical, the same hash → with the UNIQUE constraint and ON CONFLICT DO NOTHING, it doesn't go in twice no matter how many times you run it. Guaranteeing idempotency not with app logic but with a DB constraint is solid (reliability).
Embed only the diff: pull the existing hashes first and hit client.embeddings.create only for the new portion. Since the embedding API is token-billed, this directly becomes cost reduction.
Can be bundled in the same transaction: this is the biggest fruit of consolidating into pgvector. If you put the update of the source document's business row (the documents table, etc.) and this chunk insertion in one commit, the inconsistency of "the business row was updated but the embedding is stale" structurally doesn't occur. A dedicated vector DB can't get this guarantee because it's a separate system.

DRY/SRP: separating content_hash (identity decision), embed (embedding generation), and ingest (persistence) lets you swap only embed for re-embedding (Chapter 6) too. Since chunk-splitting logic changes per project, putting it outside ingest (on the caller side) is ETC.

5. Hybrid search: vector × full-text search × metadata filter

Vector search alone doesn't reach production quality. Because it drops keywords you want an exact match for, like "part numbers," "proper nouns," and "dates," even when the meaning is close. The biggest strength of consolidating into pgvector is that here you can combine vector search and PostgreSQL's full-text search and WHERE filter in the same SQL. The part where a dedicated vector DB would interpose a separate system closes in 1 query.

Add a full-text-search column

-- 全文検索用の tsvector を生成列で持つ（本文から自動生成 → 同期ズレが起きない）
ALTER TABLE doc_chunks
    ADD COLUMN content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED;

CREATE INDEX idx_doc_chunks_tsv ON doc_chunks USING gin (content_tsv);

-- メタデータ（テナント・カテゴリ等）でのフィルタ用
CREATE INDEX idx_doc_chunks_meta ON doc_chunks USING gin (metadata);

Japanese full-text search can have weak word segmentation with to_tsvector('simple', ...). In production, consider introducing pg_bigm (bigram) or morphological analysis (textsearch_ja-family). Since this depends on the requirements and language, first run it with simple and measure the accuracy, then decide (YAGNI / measurement-first).

The hybrid search query: fuse the 2 systems with RRF

Because vector similarity and full-text search have different score scales, a simple weighted sum is unstable. RRF (Reciprocal Rank Fusion), which fuses by rank, is simple to implement and solid. It just "sums the reciprocals of the ranks in each list."

-- $1: クエリ埋め込み, $2: クエリ語(websearch構文), $3: テナントID
WITH vector_search AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM doc_chunks
    WHERE metadata->>'tenant_id' = $3        -- メタデータフィルタ＝アクセス制御の一部
    ORDER BY embedding <=> $1
    LIMIT 50
),
keyword_search AS (
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery('simple', $2)) DESC
           ) AS rank
    FROM doc_chunks
    WHERE metadata->>'tenant_id' = $3
      AND content_tsv @@ websearch_to_tsquery('simple', $2)
    LIMIT 50
)
SELECT c.id, c.content,
       COALESCE(1.0 / (60 + v.rank), 0.0)
     + COALESCE(1.0 / (60 + k.rank), 0.0) AS rrf_score   -- RRF: 順位の逆数を融合
FROM doc_chunks c
LEFT JOIN vector_search  v ON v.id = c.id
LEFT JOIN keyword_search k ON k.id = c.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;

The essence is that this query completes in one SQL.

The vector side (vector_search): the top 50 by closeness of meaning.
The keyword side (keyword_search): the top 50 by full-text search. Picks up exact matches of part numbers and proper nouns.
Fuse with RRF: re-rank by the sum of 1 / (60 + rank). 60 is the standard smoothing constant, suppressing the prominence of a specific list.
Metadata filter = access control: by putting metadata->>'tenant_id' = $3 in the WHERE of both systems, other tenants' rows don't enter the candidates in the first place.

Close the metadata filter and access control on the "DB side"

With a dedicated vector DB, it tends to become "search across all tenants, then filter on the app side," which is a hotbed of information leaks. If you consolidate into Postgres, you can enforce "you can only see your own tenant's rows" at the DB level with row-level security (RLS).

-- RLS：アプリがどう間違えても、他テナントのチャンクは物理的に返らない
ALTER TABLE doc_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON doc_chunks
    USING (metadata->>'tenant_id' = current_setting('app.tenant_id', true));

This is safety unique to pgvector consolidation. You can place the access-control boundary on the DB's invariant rather than the app's code (the principle of least privilege). The official also says to use, for filtering, "a normal index on the filter column," "a partial index," "table partitioning," and "enabling iterative index scans" as appropriate. For multi-tenant, if you narrow by a few fixed values a partial index takes effect, and if by many values partitioning does.

Filters and ANN compatibility: since HNSW/IVFFlat are approximate indexes, recall can drop if a strong WHERE greatly trims the candidates. Enabling the iterative index scans the official lists makes it "automatically scan more of the index as needed." For queries with a low filter selectivity (= low hit rate), a normal index on the filter column can be faster — measure first, then decide.

6. Re-embedding operations: so it doesn't break even when you change the model

What always comes in production RAG is the day "I want to change the embedding model." A new version of text-embedding-3-large came out, you want to raise the dimension from 1024 to 1536, you're switching to another provider — . What's fatal here is mixing old and new embeddings that aren't in the same space. Distance computation becomes meaningless, and search silently breaks.

The countermeasure is simple: give the embedding a version. Chapter 1's embed_model column is that. Holding an embedding_version that bundles model + dimension + normalization policy is even more solid.

-- どのモデル・次元で作った埋め込みかを必ず記録する
ALTER TABLE doc_chunks
    ADD COLUMN embedding_version text NOT NULL DEFAULT 'te3-large-1024-v1';

-- 検索は「現行バージョン」だけを対象にする（新旧混在を構造的に防ぐ）
-- WHERE embedding_version = 'te3-large-1024-v1' ... ORDER BY embedding <=> $1

The re-embedding strategy is safest with an online switch (blue/green).

Embed additionally: add a column for the new version (e.g., embedding_v2 vector(1536)) and re-embed existing chunks in the background to fill it. Since the body and content_hash are immutable, Chapter 4's idempotent-ingest mechanism can be reused as-is for the re-embedding job (just swap the embed function).
Run both systems in parallel: until re-embedding completes, search keeps using the old version. Being able to hold old and new in the same transaction boundary is the strength of consolidation — it doesn't become the tightrope of a dedicated DB's "migration to a separate system."
The switch is one flag: once everything is re-embedded, just switch the search's embedding_version to the new value. If there's a problem, roll back immediately.
Drop the old column: after confirming stability, DROP the old column and old index.

Cost: re-embedding incurs token billing for all chunks. Manage it with a content hash so it's done with only the diff (only changed chunks), and run a full re-embedding on a model change with a nightly batch + rate control. Keeping the dimension down to 1024 itself lightens both the cost and storage of re-embedding (Chapter 3's judgment takes effect here too).

7. Observability and cost: what to measure, where to tighten

If you operate pgvector RAG in production, continuously visualize "isn't search slow," "isn't accuracy dropping," "is cost reasonable."

What to measure

Search latency: confirm with EXPLAIN (ANALYZE, BUFFERS) whether the index is effective (Index Scan using ... hnsw). Grasp the curve of recall vs speed when you raise ef_search / probes.
A proxy metric for recall: for some queries, compare the ANN index's results with the results of a full scan (an exact kNN with the index disabled), and monitor the Top-k match rate. If recall drops, review ef_search / m.
Index size: monitor with \di+ or pg_relation_size. When it no longer fits in shared_buffers, it suddenly slows. A leading indicator against the increase of dimension and row count.
Embedding cost: at ingest time, log "the number of new embedding tokens." If idempotency (Chapter 4) is effective, this value should approach 0 on re-runs — if it doesn't become 0, it's a sign idempotency is broken.

Where to tighten cost

Keep the dimension down: as in Chapter 3, dimensions=1024. The most cost-effective lever, taking effect on all of storage, memory, and search computation.
Diff ingest: avoid re-embedding unchanged chunks with a content hash. Minimize the embedding API's token billing.
Fit the index in memory: keep the dimension and row count down, and design so the HNSW index rides in shared_buffers. Once swapping starts, both performance and cost worsen.
Don't add a dedicated DB: the very consolidation judgment (Chapter 0) makes the dedicated service's monthly fee and operational labor cost entirely unnecessary — this is the biggest.

Observability and PII: search query bodies can contain personal information. For observability, don't leave queries or chunk bodies in raw logs. Keep what you record to metadata like "query ID, the Top-k chunk IDs, distance, latency, token count," and not leaving PII in body logs is an absolute condition for internal-control projects.

8. Summary: the pgvector production-RAG cheat sheet

Finally, a quick reference for when you're unsure.

Where to put it: if Postgres already exists and it's several million~tens of millions of rows, consolidate into pgvector. For hundreds of millions of vectors / a sub-millisecond SLA, consider a dedicated vector DB.
Distance operator: for text embeddings, <=> (cosine distance) + vector_cosine_ops. Always align the operator, operator class, and the model's distance.
Index: when unsure, HNSW (can be put on an empty table first = strong at continuous ingest). If build cost takes effect with bulk-load + batch update, IVFFlat.
Dimension: text-embedding-3-large at dimensions=1024. Always match the column's vector(1024).
Ingest: idempotently with content hash + UNIQUE + ON CONFLICT DO NOTHING. Embed only the diff to keep cost down. Bundle in the same transaction as the business row.
Search: fuse vector + full-text search (tsvector) with RRF. Close access control on the DB side with metadata / RLS.
Re-embedding: version-manage with an embedding_version column. Don't mix old and new. Switch with blue/green.
Operation: measure recall, latency, index size, and embedding token count. Don't leave PII in body logs.

RAG looks like "a one-line requirement," but it's the work of designing the trade-offs of placement, distance, index, idempotency, consistency, and cost. And in production, the first move of whether to add a dedicated vector DB greatly affects the subsequent operational burden and cost.

In the generative-AI voice chatbot, I consolidated business data and embeddings into PostgreSQL + pgvector without adding a dedicated vector DB and saw a production RAG that runs semantic search with text-embedding-3-large (1024 dimensions) through to design, implementation, and operation, with one person × generative AI (Claude Code). I make idempotent ingest, hybrid search, and transactional consistency hold on top of the operational simplicity of "keeping the data store to one."

"Where and how to put RAG" — should you add a dedicated vector DB, or lean on Postgres? From that judgment through implementation and operation, I can accompany you end-to-end. Feel free to consult us, even from the requirements-organizing stage.

Reference (Official Documentation)

pgvector (GitHub, README) — CREATE EXTENSION vector, the vector/halfvec/bit/sparsevec types, the distance operators (<-> / <#> / <=> / <+> / <~> / <%>), the HNSW / IVFFlat creation syntax and trade-off, ef_search / probes, the lists guideline, filter strategies
OpenAI Embeddings guide — text-embedding-3-large (3072 dimensions) / text-embedding-3-small (1536 dimensions), dimension shortening via the dimensions parameter, client.embeddings.create

Production RAG Built with pgvector: A Design That Consolidates into PostgreSQL Without Adding a Dedicated Vector DB (HNSW, Hybrid Search, Idempotent Ingest)

0. The first fork: why Postgres + pgvector instead of a dedicated vector DB

Honestly, the "cases pgvector isn't suited for" too

1. Basics: extension, table, distance operators, nearest-neighbor search

Distance operators (get this wrong and search breaks)

k-nearest-neighbor search (kNN)

2. Index: how to choose between HNSW and IVFFlat

HNSW

IVFFlat

Decision table: HNSW vs IVFFlat

3. How to choose the embedding dimension: why 1024 is the "default"

4. Ingest: chunk design and "idempotent" upsert

Principles of chunk design

Idempotent upsert: prevent double insertion with a content hash

5. Hybrid search: vector × full-text search × metadata filter

Add a full-text-search column

The hybrid search query: fuse the 2 systems with RRF

Close the metadata filter and access control on the "DB side"

6. Re-embedding operations: so it doesn't break even when you change the model

7. Observability and cost: what to measure, where to tighten

What to measure

Where to tighten cost

8. Summary: the pgvector production-RAG cheat sheet

Reference (Official Documentation)

Building Production LLM Apps with Vercel AI SDK v6: Streaming, Tool Calling, Structured Output, and RAG in Real Code

Getting started with pgvector: from installation to your first vector search (Docker, Supabase, AWS RDS/Aurora, Neon, Cloud SQL, Azure)

pgvector vs dedicated vector DBs (Pinecone / Qdrant / Weaviate / Milvus): an in-depth comparison and tech-selection guide

The Complete Guide to pgvector Tuning: Optimizing HNSW/IVFFlat Recall × Latency, and Quantization (halfvec, Binary Quantization) for Fast, Cheap, and Accurate

Also worth reading

Vercel storage implementation guide: choosing Blob, Edge Config, and Marketplace (Neon/Upstash) correctly by use

Until you run generative-AI voice customer service 'in production': designing an unmanned kiosk with Bedrock × Whisper × Polly × pgvector

SQLAlchemy 2.0 Practical Guide: Designing a Type-Safe ORM Data-Access Layer That Survives Production

0. The first fork: why Postgres + pgvector instead of a dedicated vector DB

Honestly, the "cases pgvector isn't suited for" too

1. Basics: extension, table, distance operators, nearest-neighbor search

Distance operators (get this wrong and search breaks)

k-nearest-neighbor search (kNN)

2. Index: how to choose between HNSW and IVFFlat

HNSW

IVFFlat

Decision table: HNSW vs IVFFlat

3. How to choose the embedding dimension: why 1024 is the "default"

4. Ingest: chunk design and "idempotent" upsert

Principles of chunk design

Idempotent upsert: prevent double insertion with a content hash

5. Hybrid search: vector × full-text search × metadata filter

Add a full-text-search column

The hybrid search query: fuse the 2 systems with RRF

Close the metadata filter and access control on the "DB side"

6. Re-embedding operations: so it doesn't break even when you change the model

7. Observability and cost: what to measure, where to tighten

What to measure

Where to tighten cost

8. Summary: the pgvector production-RAG cheat sheet

Reference (Official Documentation)

Related articles

Building Production LLM Apps with Vercel AI SDK v6: Streaming, Tool Calling, Structured Output, and RAG in Real Code

Getting started with pgvector: from installation to your first vector search (Docker, Supabase, AWS RDS/Aurora, Neon, Cloud SQL, Azure)

pgvector vs dedicated vector DBs (Pinecone / Qdrant / Weaviate / Milvus): an in-depth comparison and tech-selection guide

The Complete Guide to pgvector Tuning: Optimizing HNSW/IVFFlat Recall × Latency, and Quantization (halfvec, Binary Quantization) for Fast, Cheap, and Accurate

Also worth reading

Vercel storage implementation guide: choosing Blob, Edge Config, and Marketplace (Neon/Upstash) correctly by use

Until you run generative-AI voice customer service 'in production': designing an unmanned kiosk with Bedrock × Whisper × Polly × pgvector

SQLAlchemy 2.0 Practical Guide: Designing a Type-Safe ORM Data-Access Layer That Survives Production