"I want a mechanism where AI answers grounded in internal documents" — the requirement of RAG (Retrieval-Augmented Generation), looking only at the first sentence, is simple. But the moment you try to put it into production, the first big fork comes.
Where do you put the embedding vectors? Do you add a dedicated vector DB (Pinecone, etc.), or do you consolidate into PostgreSQL, which already holds the business data?
This article is an implementation guide for the latter — using pgvector to consolidate business data and embeddings into a single RDB and build production RAG. The dedicated-vector-DB-side design is handled in a separate article, Production RAG with LangChain × Pinecone, so this piece digs into the option of "leaning on the business RDB without adding a dedicated vector DB" in code faithful to the official documentation, more practically than the official. As subject matter, I'll mix in design decisions from a generative-AI voice chatbot I built — a RAG customer-service system that consolidates business data and embeddings into PostgreSQL + pgvector.
The rules of this article: the SQL syntax, operators, and index parameters are based on the pgvector official README, and the embedding-model specs on the OpenAI official documentation (both as of June 2026). Since versions and pricing get revised, always double-check the latest values in the official before going to production. The code is shaped to be usable in real operation, but secrets are assumed to be in environment variables (hardcoding connection strings/API keys is strictly forbidden).
0. The first fork: why Postgres + pgvector instead of a dedicated vector DB
Introduce a dedicated vector DB while leaving this ambiguous, and you take on the cost that takes effect later: the data store you operate increases by one. First, let me organize the decision axes.
| Decision axis | Dedicated vector DB (Pinecone, etc.) | PostgreSQL + pgvector |
|---|---|---|
| Transactional consistency | A separate system from business data. Can't two-phase commit, and writes can drift | Can update the business row and the embedding in the same transaction |
| Operational burden | One more DB (monitoring, backup, permissions, incident response) | Consolidated into existing Postgres operations. Backup and PITR all at once |
| Metadata filters | A product-specific filter DSL | SQL itself (JOIN, WHERE, full-text search all usable) |
| Access control | Designed separately | Reuse row-level security (RLS) and existing permission design |
| Cost | The monthly fee of the dedicated service piles up | Inside Postgres. Keep the dimension down and the storage increase is limited too |
| Ultra-large scale, ultra-low latency | The dedicated DB has the edge (hundreds of millions of vectors / single-digit ms) | Sufficiently practical up to several million~tens of millions of rows. Beyond that needs design |
If you regard "the embedding, after all, as one of that document's attributes," the necessity of carving it out into a dedicated DB disappears in many projects. The value of having users, orders, documents, and their embeddings within one transaction boundary takes effect in production operation. You soft-delete a RAG source document, but the vector remains in a separate DB and "information you thought you deleted" mixes into the answer — having a dedicated DB as a separate system can structurally cause such consistency accidents. With the same RDB, DELETE and UPDATE both close in the same commit.
Honestly, the "cases pgvector isn't suited for" too
- Hundreds-of-millions-of-vectors scale, sub-millisecond latency SLA: this domain is the turf of a dedicated vector DB or a dedicated ANN service. pgvector can fight well up to tens of millions of rows too, but don't push it at extreme scale.
- The team doesn't operate Postgres in the first place: if the merit of consolidation (unifying operations) doesn't take effect, the advantage thins.
Conversely, for projects where "Postgres already exists and the vectors fit within several million~tens of millions of rows" — i.e., the vast majority of B2B SaaS and internal tools — consolidating into pgvector becomes the first choice. This is YAGNI itself. There's no need to split your data store into two now for hundreds of millions of vectors that may or may not come.
That I leaned on PostgreSQL + pgvector without adding a dedicated vector DB in the voice chatbot was this judgment too. I placed, as the top operational priority, being able to handle business data (FAQ, products, support history) and the embeddings for its semantic search in the same DB, the same transaction.
1. Basics: extension, table, distance operators, nearest-neighbor search
First, a minimal configuration faithful to the official README. From CREATE EXTENSION to nearest-neighbor search, this is the foundation.
-- 1. 拡張を有効化(スーパーユーザー or 権限が必要)
CREATE EXTENSION IF NOT EXISTS vector;
-- 2. チャンク(分割したドキュメント片)と埋め込みを格納するテーブル
CREATE TABLE doc_chunks (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
document_id bigint NOT NULL, -- 元ドキュメントへの参照(業務行とJOINできる)
content text NOT NULL, -- チャンク本文
content_hash bytea NOT NULL, -- 冪等インジェスト用(第4章)
embedding vector(1024) NOT NULL, -- 埋め込み。次元は固定(第3章で1024を選ぶ理由)
embed_model text NOT NULL, -- 再埋め込み運用のためのモデル名(第6章)
metadata jsonb NOT NULL DEFAULT '{}',-- テナント・カテゴリ等のフィルタ用
created_at timestamptz NOT NULL DEFAULT now()
);
The 1024 in vector(1024) is the number of dimensions. This must match the embedding model's output dimension (the reason is Chapter 3). pgvector's vector type supports up to 2,000 dimensions when you index it (if you need higher dimensions, use the halfvec described later, up to 4,000 dimensions).
Distance operators (get this wrong and search breaks)
pgvector's distance operators are as follows. The iron rule is to align the distance the embedding model assumes with the operator and the index's operator class.
| Operator | Distance | Main use |
|---|---|---|
<-> | L2 (Euclidean) distance | Non-normalized vectors in general |
<#> | Negative inner product | Inner-product-based models |
<=> | Cosine distance | The standard for text embeddings (measures meaning by direction) |
<+> | L1 (taxicab) distance | L1 space |
<~> | Hamming distance | bit type (binary vectors) |
<%> | Jaccard distance | bit type |
Since text embeddings like OpenAI's text-embedding-3-* are standardly handled with cosine similarity, this article consistently uses <=> (cosine distance).
Caution:
<#>returns the "negative inner product." It's the sign to preserve the ordering "the smaller as a distance, the closer." When showing the score to people, be careful about the sign's interpretation.
k-nearest-neighbor search (kNN)
Nearest-neighbor search takes the straightforward form ORDER BY <distance operator> LIMIT k.
-- クエリ埋め込み $1 に意味が近いチャンクを上位5件
SELECT id, document_id, content,
embedding <=> $1 AS distance -- 小さいほど近い(コサイン距離)
FROM doc_chunks
ORDER BY embedding <=> $1 -- インデックスはこの ORDER BY に効く
LIMIT 5;
The point is to write the distance operator directly in the ORDER BY. The HNSW/IVFFlat index described later takes effect precisely for this form of "ORDER BY embedding <operator> ... LIMIT k." Outputting the distance in the SELECT clause is for display, and what makes the index effective is the ORDER BY side — understand this and you won't be confused.
2. Index: how to choose between HNSW and IVFFlat
Full-scanning millions of rows every time breaks latency. pgvector's approximate-nearest-neighbor (ANN) indexes are of 2 kinds. This choice decides RAG's search performance and operability.
HNSW
-- HNSW:グラフ構造のインデックス。コサイン距離なら vector_cosine_ops
CREATE INDEX ON doc_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
m: the max number of edges each node connects (default 16). The larger, the higher the recall but more memory.ef_construction: the search width at build time (default 64). The larger, the higher-quality the graph but the slower the build.- The accuracy/speed at search time is tuned with runtime parameters (no index rebuild needed):
SET hnsw.ef_search = 100; -- 探索幅。デフォルト40。大きいほど高再現率・低速
HNSW's biggest operational advantage is, as the official explicitly states, "unlike IVFFlat, it has no training step, so you can create the index even with zero rows." In other words, you can put the index on at table-creation time and pour data in later. This property is why it pairs well with production RAG where ingest continuously occurs.
IVFFlat
-- IVFFlat:転置ファイル。lists はクラスタ数。データ投入後に作成する
CREATE INDEX ON doc_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
lists: the number of clusters (lists). The official guideline is "rows / 1000up to 1 million rows,sqrt(rows)above 1 million rows."- The runtime parameter at search time:
SET ivfflat.probes = 10; -- 探索するリスト数。デフォルト1。大きいほど高再現率・低速
IVFFlat's decisive constraint: as the official says "one key to good recall is to create the index after some data is in the table," IVFFlat has a training step (clustering), so it needs to be created after data is loaded. Put it on an empty table and recall breaks.
Decision table: HNSW vs IVFFlat
The official expression is "HNSW has good query performance but is slow to build and uses a lot of memory," "IVFFlat is fast to build and uses little memory but has inferior query performance." Let me translate this into operational language.
| Viewpoint | HNSW | IVFFlat |
|---|---|---|
| Query performance (recall × speed) | High | Relatively low |
| Build cost | Slow, lots of memory | Fast, little memory |
| Creation on an empty table | Possible (no training step) | Not possible (create after loading data) |
| Resilience to continuous add/update | Strong (can put it on first) | Weak (situations arise needing rebuild/retraining) |
| Robustness to hard-to-see data scale | Strong | Premised on matching lists to the data volume |
| Main use case | First choice for production RAG | Bulk-loading large data and batch-updating |
The practical conclusion: when unsure, HNSW. The reason is ETC (Easy To Change). With HNSW, you can put the index on an empty table first and then begin ingest, so the operational flow becomes the straightforward order of "create → put in" and doesn't depend on estimating the data volume. IVFFlat creates a procedure dependency of "put some in then create / review lists when the volume changes," which tends to become operational debt in RAG with continuous updates. Choose IVFFlat only when build speed and memory are tight and you have a clear reason to bulk-load data and batch-update.
In the voice chatbot, because the premise was that FAQ and product information are updated from time to time — i.e., ingest occurs continuously — I chose HNSW.
Memory caution: HNSW is fast only when the index is on memory. Secure enough
maintenance_work_memat build time, and in operation be mindful of whether the index size fits inshared_buffers. Keeping the dimension down (Chapter 3) directly affects this cost too.
3. How to choose the embedding dimension: why 1024 is the "default"
How to decide the 1024 in vector(1024). This is the place to design the trade-off between cost and accuracy.
OpenAI's embedding models' output dimensions are officially as follows.
text-embedding-3-large: 3072 dimensionstext-embedding-3-small: 1536 dimensionstext-embedding-ada-002: 1536 dimensions
And text-embedding-3-* has a dimensions parameter, and the official says "even shortening the embedding (cutting the trailing numbers) doesn't lose the property of representing concepts." It even states "text-embedding-3-large shortened to 256 dimensions outperforms the non-shortened text-embedding-ada-002 (1536 dimensions)."
What this means is — holding the full 3072 dimensions as-is is over-spec in many cases. The dimension takes effect on all of storage, index memory, and search computation. Drop 3072 dimensions to 1024 and the vector's storage and index memory become roughly 1/3. This meshes perfectly with the pgvector consolidation strategy — "fit it inside Postgres."
Design decision: I make using
text-embedding-3-largeatdimensions=1024this article's standard. Leveraging large's expressiveness, at 1024 dimensions it fits comfortably within pgvector's indexable 2,000-dimension limit, and keeps cost down too. If accuracy is severe, raise it to 1536 or 3072; if cost is the top priority, lower it to small or 512 — in any case, always match the column-definition dimension and thedimensionsparameter. Get this misaligned and you get a dimension-mismatch error onINSERT(being protected by the type is an advantage).
In the voice chatbot, I operated text-embedding-3-large at 1024 dimensions. Keeping the accuracy needed for customer-service semantic search while fitting the storage inside Postgres without adding a dedicated vector DB — keeping the dimension down to 1024 was the judgment to make this "consolidated and still light" hold.
If you need ultra-high dimensions,
halfvec: the half-precisionhalfvec(n)supports up to 4,000 dimensions when indexed, and halves the storage. It's good to remember as an escape route for when you want to hold 1536 or 3072 as-is but keep memory down (the accuracy degradation needs evaluation).
4. Ingest: chunk design and "idempotent" upsert
The quality of search is 80% decided by ingest design. And what goes wrong most in production is the non-idempotency of ingest: "re-ingesting the same document put the embedding in twice" or "it failed midway and ended up in a half-baked state."
Principles of chunk design
- Cut by units of meaning: cut at the boundaries of headings, paragraphs, and bullet lists. Chopping at a fixed character count splits sentences and concepts, dropping search accuracy.
- Moderate overlap: slightly overlapping between chunks picks up context spanning a boundary. Overlapping too much causes duplication and cost increase.
- Always carry metadata:
document_id, tenant, category, update time. This becomes the foundation of Chapter 5's filtering and access control.
Idempotent upsert: prevent double insertion with a content hash
Guarantee "a chunk with the same content is 1 row no matter how many times you ingest it." The key is making the chunk body's content hash the unique key. Re-run, and if the content hasn't changed, it doesn't even hit the embedding API (= idempotent = cost saving).
-- (document_id, content_hash) を一意に。同内容の再投入は何度やっても1行
CREATE UNIQUE INDEX uq_doc_chunk
ON doc_chunks (document_id, content_hash);
"""ドキュメントを分割し、内容ハッシュで冪等にアップサートする。
再実行しても二重投入せず、変わっていないチャンクは埋め込みAPIを叩かない。"""
import hashlib
import os
import psycopg
from openai import OpenAI
client = OpenAI() # APIキーは環境変数 OPENAI_API_KEY から(ハードコード禁止)
EMBED_MODEL = "text-embedding-3-large"
EMBED_DIM = 1024 # ← vector(1024) と必ず一致させる
def content_hash(text: str) -> bytes:
"""正規化した本文の決定的ハッシュ。同じ内容 → 同じキー → 重複しない。"""
normalized = text.strip()
return hashlib.sha256(normalized.encode("utf-8")).digest()
def embed(texts: list[str]) -> list[list[float]]:
"""OpenAI 埋め込み。dimensions で 1024 に短縮(コストとストレージを抑える)。"""
resp = client.embeddings.create(
model=EMBED_MODEL,
input=texts,
dimensions=EMBED_DIM, # text-embedding-3-* のみ対応
)
return [d.embedding for d in resp.data]
def ingest(conn: psycopg.Connection, document_id: int, chunks: list[str]) -> None:
"""チャンク群を冪等にアップサート。既存(同ハッシュ)は埋め込みを再生成しない。"""
hashes = [content_hash(c) for c in chunks]
# 1) 既に入っているハッシュを先に引く(= 埋め込みAPIの無駄打ちを避ける)
with conn.cursor() as cur:
cur.execute(
"SELECT content_hash FROM doc_chunks "
"WHERE document_id = %s AND content_hash = ANY(%s)",
(document_id, hashes),
)
existing = {row[0] for row in cur.fetchall()}
# 2) 新規チャンクだけ埋め込み生成(コスト効率:差分だけ課金)
new_items = [
(c, h) for c, h in zip(chunks, hashes) if h not in existing
]
if not new_items:
return # 全部既存 → 何もしない(完全に冪等)
vectors = embed([c for c, _ in new_items])
# 3) ON CONFLICT で二重投入を構造的に排除(競合時も1行に収束)
with conn.cursor() as cur:
cur.executemany(
"""
INSERT INTO doc_chunks
(document_id, content, content_hash, embedding, embed_model)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (document_id, content_hash) DO NOTHING
""",
[
(document_id, c, h, v, EMBED_MODEL)
for (c, h), v in zip(new_items, vectors)
],
)
conn.commit() # 業務行の更新と同一トランザクションに束ねれば整合性も担保
Let me organize the design crux.
- Make the content hash the key: if the body is identical, the same hash → with the
UNIQUEconstraint andON CONFLICT DO NOTHING, it doesn't go in twice no matter how many times you run it. Guaranteeing idempotency not with app logic but with a DB constraint is solid (reliability). - Embed only the diff: pull the existing hashes first and hit
client.embeddings.createonly for the new portion. Since the embedding API is token-billed, this directly becomes cost reduction. - Can be bundled in the same transaction: this is the biggest fruit of consolidating into pgvector. If you put the update of the source document's business row (the
documentstable, etc.) and this chunk insertion in onecommit, the inconsistency of "the business row was updated but the embedding is stale" structurally doesn't occur. A dedicated vector DB can't get this guarantee because it's a separate system.
DRY/SRP: separating
content_hash(identity decision),embed(embedding generation), andingest(persistence) lets you swap onlyembedfor re-embedding (Chapter 6) too. Since chunk-splitting logic changes per project, putting it outsideingest(on the caller side) is ETC.
5. Hybrid search: vector × full-text search × metadata filter
Vector search alone doesn't reach production quality. Because it drops keywords you want an exact match for, like "part numbers," "proper nouns," and "dates," even when the meaning is close. The biggest strength of consolidating into pgvector is that here you can combine vector search and PostgreSQL's full-text search and WHERE filter in the same SQL. The part where a dedicated vector DB would interpose a separate system closes in 1 query.
Add a full-text-search column
-- 全文検索用の tsvector を生成列で持つ(本文から自動生成 → 同期ズレが起きない)
ALTER TABLE doc_chunks
ADD COLUMN content_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('simple', content)) STORED;
CREATE INDEX idx_doc_chunks_tsv ON doc_chunks USING gin (content_tsv);
-- メタデータ(テナント・カテゴリ等)でのフィルタ用
CREATE INDEX idx_doc_chunks_meta ON doc_chunks USING gin (metadata);
Japanese full-text search can have weak word segmentation with
to_tsvector('simple', ...). In production, consider introducingpg_bigm(bigram) or morphological analysis (textsearch_ja-family). Since this depends on the requirements and language, first run it withsimpleand measure the accuracy, then decide (YAGNI / measurement-first).
The hybrid search query: fuse the 2 systems with RRF
Because vector similarity and full-text search have different score scales, a simple weighted sum is unstable. RRF (Reciprocal Rank Fusion), which fuses by rank, is simple to implement and solid. It just "sums the reciprocals of the ranks in each list."
-- $1: クエリ埋め込み, $2: クエリ語(websearch構文), $3: テナントID
WITH vector_search AS (
SELECT id,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM doc_chunks
WHERE metadata->>'tenant_id' = $3 -- メタデータフィルタ=アクセス制御の一部
ORDER BY embedding <=> $1
LIMIT 50
),
keyword_search AS (
SELECT id,
ROW_NUMBER() OVER (
ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery('simple', $2)) DESC
) AS rank
FROM doc_chunks
WHERE metadata->>'tenant_id' = $3
AND content_tsv @@ websearch_to_tsquery('simple', $2)
LIMIT 50
)
SELECT c.id, c.content,
COALESCE(1.0 / (60 + v.rank), 0.0)
+ COALESCE(1.0 / (60 + k.rank), 0.0) AS rrf_score -- RRF: 順位の逆数を融合
FROM doc_chunks c
LEFT JOIN vector_search v ON v.id = c.id
LEFT JOIN keyword_search k ON k.id = c.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 10;
The essence is that this query completes in one SQL.
- The vector side (
vector_search): the top 50 by closeness of meaning. - The keyword side (
keyword_search): the top 50 by full-text search. Picks up exact matches of part numbers and proper nouns. - Fuse with RRF: re-rank by the sum of
1 / (60 + rank).60is the standard smoothing constant, suppressing the prominence of a specific list. - Metadata filter = access control: by putting
metadata->>'tenant_id' = $3in theWHEREof both systems, other tenants' rows don't enter the candidates in the first place.
Close the metadata filter and access control on the "DB side"
With a dedicated vector DB, it tends to become "search across all tenants, then filter on the app side," which is a hotbed of information leaks. If you consolidate into Postgres, you can enforce "you can only see your own tenant's rows" at the DB level with row-level security (RLS).
-- RLS:アプリがどう間違えても、他テナントのチャンクは物理的に返らない
ALTER TABLE doc_chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON doc_chunks
USING (metadata->>'tenant_id' = current_setting('app.tenant_id', true));
This is safety unique to pgvector consolidation. You can place the access-control boundary on the DB's invariant rather than the app's code (the principle of least privilege). The official also says to use, for filtering, "a normal index on the filter column," "a partial index," "table partitioning," and "enabling iterative index scans" as appropriate. For multi-tenant, if you narrow by a few fixed values a partial index takes effect, and if by many values partitioning does.
Filters and ANN compatibility: since HNSW/IVFFlat are approximate indexes, recall can drop if a strong WHERE greatly trims the candidates. Enabling the iterative index scans the official lists makes it "automatically scan more of the index as needed." For queries with a low filter selectivity (= low hit rate), a normal index on the filter column can be faster — measure first, then decide.
6. Re-embedding operations: so it doesn't break even when you change the model
What always comes in production RAG is the day "I want to change the embedding model." A new version of text-embedding-3-large came out, you want to raise the dimension from 1024 to 1536, you're switching to another provider — . What's fatal here is mixing old and new embeddings that aren't in the same space. Distance computation becomes meaningless, and search silently breaks.
The countermeasure is simple: give the embedding a version. Chapter 1's embed_model column is that. Holding an embedding_version that bundles model + dimension + normalization policy is even more solid.
-- どのモデル・次元で作った埋め込みかを必ず記録する
ALTER TABLE doc_chunks
ADD COLUMN embedding_version text NOT NULL DEFAULT 'te3-large-1024-v1';
-- 検索は「現行バージョン」だけを対象にする(新旧混在を構造的に防ぐ)
-- WHERE embedding_version = 'te3-large-1024-v1' ... ORDER BY embedding <=> $1
The re-embedding strategy is safest with an online switch (blue/green).
- Embed additionally: add a column for the new version (e.g.,
embedding_v2 vector(1536)) and re-embed existing chunks in the background to fill it. Since the body andcontent_hashare immutable, Chapter 4's idempotent-ingest mechanism can be reused as-is for the re-embedding job (just swap theembedfunction). - Run both systems in parallel: until re-embedding completes, search keeps using the old version. Being able to hold old and new in the same transaction boundary is the strength of consolidation — it doesn't become the tightrope of a dedicated DB's "migration to a separate system."
- The switch is one flag: once everything is re-embedded, just switch the search's
embedding_versionto the new value. If there's a problem, roll back immediately. - Drop the old column: after confirming stability,
DROPthe old column and old index.
Cost: re-embedding incurs token billing for all chunks. Manage it with a content hash so it's done with only the diff (only changed chunks), and run a full re-embedding on a model change with a nightly batch + rate control. Keeping the dimension down to 1024 itself lightens both the cost and storage of re-embedding (Chapter 3's judgment takes effect here too).
7. Observability and cost: what to measure, where to tighten
If you operate pgvector RAG in production, continuously visualize "isn't search slow," "isn't accuracy dropping," "is cost reasonable."
What to measure
- Search latency: confirm with
EXPLAIN (ANALYZE, BUFFERS)whether the index is effective (Index Scan using ... hnsw). Grasp the curve of recall vs speed when you raiseef_search/probes. - A proxy metric for recall: for some queries, compare the ANN index's results with the results of a full scan (an exact kNN with the index disabled), and monitor the Top-k match rate. If recall drops, review
ef_search/m. - Index size: monitor with
\di+orpg_relation_size. When it no longer fits inshared_buffers, it suddenly slows. A leading indicator against the increase of dimension and row count. - Embedding cost: at ingest time, log "the number of new embedding tokens." If idempotency (Chapter 4) is effective, this value should approach 0 on re-runs — if it doesn't become 0, it's a sign idempotency is broken.
Where to tighten cost
- Keep the dimension down: as in Chapter 3,
dimensions=1024. The most cost-effective lever, taking effect on all of storage, memory, and search computation. - Diff ingest: avoid re-embedding unchanged chunks with a content hash. Minimize the embedding API's token billing.
- Fit the index in memory: keep the dimension and row count down, and design so the HNSW index rides in
shared_buffers. Once swapping starts, both performance and cost worsen. - Don't add a dedicated DB: the very consolidation judgment (Chapter 0) makes the dedicated service's monthly fee and operational labor cost entirely unnecessary — this is the biggest.
Observability and PII: search query bodies can contain personal information. For observability, don't leave queries or chunk bodies in raw logs. Keep what you record to metadata like "query ID, the Top-k chunk IDs, distance, latency, token count," and not leaving PII in body logs is an absolute condition for internal-control projects.
8. Summary: the pgvector production-RAG cheat sheet
Finally, a quick reference for when you're unsure.
- Where to put it: if Postgres already exists and it's several million~tens of millions of rows, consolidate into pgvector. For hundreds of millions of vectors / a sub-millisecond SLA, consider a dedicated vector DB.
- Distance operator: for text embeddings,
<=>(cosine distance) +vector_cosine_ops. Always align the operator, operator class, and the model's distance. - Index: when unsure, HNSW (can be put on an empty table first = strong at continuous ingest). If build cost takes effect with bulk-load + batch update, IVFFlat.
- Dimension:
text-embedding-3-largeatdimensions=1024. Always match the column'svector(1024). - Ingest: idempotently with content hash +
UNIQUE+ON CONFLICT DO NOTHING. Embed only the diff to keep cost down. Bundle in the same transaction as the business row. - Search: fuse vector + full-text search (
tsvector) with RRF. Close access control on the DB side with metadata / RLS. - Re-embedding: version-manage with an
embedding_versioncolumn. Don't mix old and new. Switch with blue/green. - Operation: measure recall, latency, index size, and embedding token count. Don't leave PII in body logs.
RAG looks like "a one-line requirement," but it's the work of designing the trade-offs of placement, distance, index, idempotency, consistency, and cost. And in production, the first move of whether to add a dedicated vector DB greatly affects the subsequent operational burden and cost.
In the generative-AI voice chatbot, I consolidated business data and embeddings into PostgreSQL + pgvector without adding a dedicated vector DB and saw a production RAG that runs semantic search with text-embedding-3-large (1024 dimensions) through to design, implementation, and operation, with one person × generative AI (Claude Code). I make idempotent ingest, hybrid search, and transactional consistency hold on top of the operational simplicity of "keeping the data store to one."
"Where and how to put RAG" — should you add a dedicated vector DB, or lean on Postgres? From that judgment through implementation and operation, I can accompany you end-to-end. Feel free to consult us, even from the requirements-organizing stage.
Reference (Official Documentation)
- pgvector (GitHub, README) —
CREATE EXTENSION vector, thevector/halfvec/bit/sparsevectypes, the distance operators (<->/<#>/<=>/<+>/<~>/<%>), the HNSW / IVFFlat creation syntax and trade-off,ef_search/probes, thelistsguideline, filter strategies - OpenAI Embeddings guide —
text-embedding-3-large(3072 dimensions) /text-embedding-3-small(1536 dimensions), dimension shortening via thedimensionsparameter,client.embeddings.create