The conclusion first. When RAG (retrieval-augmented generation) works in a demo but gives wrong answers, runs slowly, or leaks information in production, the cause is usually not that the LLM lacks intelligence. The cause is low retrieval quality. RAG accuracy is determined almost entirely by the quality of the grounding context you pass to the model, not by the model's capability. Pass the wrong context, and even the strongest model will answer incorrectly. Naive RAG, which simply loads documents into a vector DB and passes along the similarity-search results, is very likely to fall short on accuracy in production.
This article draws on my experience running a voice-concierge AI in production, where RAG was used to structurally eliminate wrong answers about specialized products. It covers (1) the typical pitfalls where naive RAG fails, (2) the design that raises it to practical quality, including a hybrid-search implementation, and (3) what buyers should demand. For deciding whether to adopt RAG at all, see the RAG vs fine-tuning article; for a deeper implementation, see the article on production RAG with pgvector.
1. Why "naive RAG" fails in production
A RAG prototype is astonishingly easy to build. Put documents in a vector DB, vectorize the question, run a similarity search, and pass the matching documents to the LLM. This works plausibly well. The problem is that the accuracy of this naive configuration collapses once it is exposed to the diverse questions of production.
The failures occur in predictable places.
| Pitfall | What happens |
|---|---|
| Insufficient search accuracy | Vector similarity alone misses proper nouns, model numbers, and abbreviations, so the wrong context is retrieved |
| Chunking failure | Chunks that are too coarse dilute the context; chunks that are too fine fragment it |
| No reranking | The top search hits are not necessarily ordered by usefulness for the answer |
| No evaluation | Accuracy cannot be measured, so there is no way to improve it |
| Missing access control | In a multi-tenant setup, another company's documents mix into the results (information leak) |
| Poor freshness management | An outdated document is used as context and contradicts the latest information |
It is tempting to think that switching to a smarter LLM will fix things, but if the retrieved context is wrong, changing the model won't help. What needs fixing is the search layer.
2. Pitfall ①: vector search alone isn't enough → hybrid search
The most common failure is relying on vector similarity search alone. Vector search is good at finding semantically close documents, but weak at exact matching of proper nouns, model numbers, technical terms, and abbreviations. Asked for "the specs of model number XR-200," vector search may return a different model number that merely looks similar.
The standard answer in production is hybrid search: a design that combines vector search (meaning) with keyword/full-text search (exact match) and merges the two result sets.
Here is an implementation example with PostgreSQL (pgvector). It runs vector search and full-text search separately and merges the rankings with Reciprocal Rank Fusion (RRF). The key security point is that access control (tenant separation) is built into the search query itself.
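// Minimal shape assumed for the database client (the original does not define `Database`).
interface Database {
  query<T>(sql: string, params: readonly unknown[]): Promise<T[]>;
}

// Assumed helper: serializes an embedding into pgvector's text format, e.g. "[0.1,0.2]".
const toVector = (embedding: readonly number[]): string => `[${embedding.join(",")}]`;

interface RetrievedChunk {
  readonly id: string;
  readonly content: string;
  readonly score: number;
}

interface HybridSearchParams {
  readonly tenantId: string; // Access control: always included in the search condition
  readonly queryEmbedding: number[]; // Embedding vector of the question
  readonly queryText: string; // Original question text (for full-text search)
  readonly limit: number;
}

const RRF_K = 60; // RRF smoothing constant (a conventional value that dampens the influence of rank)

/**
 * Hybrid search that combines vector search and full-text search.
 * - Always puts tenant_id in the WHERE clause, structurally excluding other tenants' documents (prevents leaks)
 * - Uses parameterized queries only (guards against SQL injection)
 * - Fuses the two rankings with Reciprocal Rank Fusion to get both semantic and exact matches
 */
export async function hybridSearch(
  db: Database,
  params: HybridSearchParams,
): Promise<readonly RetrievedChunk[]> {
  const { tenantId, queryEmbedding, queryText, limit } = params;
  const rows = await db.query<{ id: string; content: string; rrf: number }>(
    `
    WITH vector_hits AS (
      SELECT id, content,
             ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
      FROM chunks
      WHERE tenant_id = $1            -- access control is enforced at the search layer
      ORDER BY embedding <=> $2       -- nearest first by cosine distance
      LIMIT 50
    ),
    keyword_hits AS (
      SELECT id, content,
             ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('simple', $3)) DESC
             ) AS rank
      FROM chunks
      WHERE tenant_id = $1
        AND content_tsv @@ plainto_tsquery('simple', $3)  -- full-text search (strong at exact matches)
      ORDER BY rank                   -- without this, LIMIT would keep an arbitrary 50 rows
      LIMIT 50
    )
    SELECT COALESCE(v.id, k.id) AS id,
           COALESCE(v.content, k.content) AS content,
           -- Reciprocal Rank Fusion: sum 1/(k+rank) across both rankings
           COALESCE(1.0 / ($4 + v.rank), 0) + COALESCE(1.0 / ($4 + k.rank), 0) AS rrf
    FROM vector_hits v
    FULL OUTER JOIN keyword_hits k ON v.id = k.id
    ORDER BY rrf DESC
    LIMIT $5
    `,
    [tenantId, toVector(queryEmbedding), queryText, RRF_K, limit],
  );
  // Number(): some drivers return NUMERIC columns as strings.
  return rows.map(({ id, content, rrf }) => ({ id, content, score: Number(rrf) }));
}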
My voice-concierge AI was likewise designed to combine semantic search with exact-match search so that it correctly picks up the model numbers and proper nouns of specialized products. I was able to structurally eliminate wrong answers not because I made the model smarter, but because I secured the quality of the context at the search layer.
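To make the fusion step in the query above easier to follow, here is a minimal in-memory sketch of Reciprocal Rank Fusion. The function name `fuseRankings`, the `FusedHit` type, and the chunk IDs are hypothetical and exist only for illustration; the production version is the SQL shown earlier.

```typescript
// In-memory Reciprocal Rank Fusion (RRF) sketch. All names and IDs here are illustrative.
interface FusedHit {
  readonly id: string;
  readonly score: number;
}

/**
 * Fuses several ranked ID lists (best first) into one ranking.
 * Each list contributes 1 / (k + rank) for every ID it contains, with rank starting at 1.
 */
function fuseRankings(rankings: readonly (readonly string[])[], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties are broken by ID so the output is deterministic.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Made-up rankings for the question "specs of model number XR-200".
const vectorRanking = ["xr-100-specs", "xr-200-specs", "xr-series-overview"];
const keywordRanking = ["xr-200-specs", "xr-200-manual"];

const fused = fuseRankings([vectorRanking, keywordRanking]);
console.log(fused.map((hit) => hit.id));
// → ["xr-200-specs", "xr-100-specs", "xr-200-manual", "xr-series-overview"]
```

The chunk that appears in both rankings rises to the top even though vector search alone ranked a different model number first, which is the behavior hybrid search is meant to provide.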
3. Pitfall ②: chunking and reranking
Chunking (splitting documents)
In RAG, you split long documents into "chunks" and index them. This splitting design greatly affects accuracy.
- Too coarse (each chunk is long) → the relevant passage is diluted by irrelevant text, and the context becomes blurry.
- Too fine (each chunk is short) → surrounding context is lost, and the fragments make no sense on their own.
In practice, the norm is to chunk by semantic units (headings, paragraphs, clauses) and slightly overlap with adjacent chunks. The optimal split changes by document type (manual, regulation, FAQ, chat history), so this requires tuning.
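As a rough illustration of chunking by semantic units with overlap, the sketch below treats blank-line-separated paragraphs as the unit and repeats the trailing paragraph of each chunk at the start of the next. The function `chunkByParagraph`, its options, and the sample text are assumptions made for this example; real documents usually need heading-aware or clause-aware splitting tuned per document type.

```typescript
interface ChunkOptions {
  readonly maxChars: number; // soft upper bound on chunk size
  readonly overlapParagraphs: number; // trailing paragraphs repeated in the next chunk
}

/** Packs paragraphs into chunks of roughly maxChars, carrying an overlap between chunks. */
function chunkByParagraph(text: string, options: ChunkOptions): string[] {
  const { maxChars, overlapParagraphs } = options;
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let freshCount = 0; // paragraphs in `current` that are not carried-over overlap

  for (const paragraph of paragraphs) {
    const candidate = current.concat(paragraph).join("\n\n");
    if (freshCount > 0 && candidate.length > maxChars) {
      chunks.push(current.join("\n\n"));
      // slice(-0) would return the whole array, so handle zero overlap explicitly.
      current = overlapParagraphs > 0 ? current.slice(-overlapParagraphs) : [];
      freshCount = 0;
    }
    current.push(paragraph);
    freshCount += 1;
  }
  if (freshCount > 0) chunks.push(current.join("\n\n"));
  return chunks;
}

// Made-up sample document.
const sampleDocument = [
  "Section 1: Warranty",
  "The XR-200 is covered for 24 months.",
  "Claims require the original receipt.",
  "Section 2: Returns",
  "Returns are accepted within 30 days.",
].join("\n\n");

const chunks = chunkByParagraph(sampleDocument, { maxChars: 80, overlapParagraphs: 1 });
console.log(chunks);
```

Note that `maxChars` is a soft limit in this sketch: a chunk can exceed it when the carried-over overlap plus a single paragraph is already longer than the limit.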
Reranking
The top search results are not necessarily ordered by usefulness for the answer. So take the candidates retrieved by search (for example, the top 50), score each one for relevance to the question with a more precise model, and reorder them. This is reranking. It raises the quality of the final context passed to the LLM. It trades cost for accuracy, but the effect is large in accuracy-critical work.
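The shape of a reranking step can be sketched as follows. The scorer is injected so that any relevance model can be plugged in; `termOverlapScorer` is a deliberately crude stand-in, not a real reranking model, and all names and sample passages are hypothetical. A real scorer would normally be an asynchronous model or API call; it is kept synchronous here to keep the sketch short.

```typescript
interface Candidate {
  readonly id: string;
  readonly content: string;
}

interface RankedCandidate extends Candidate {
  readonly relevance: number;
}

// Returns a relevance score for (question, passage); higher means more relevant.
type RelevanceScorer = (question: string, passage: string) => number;

/** Rescores every candidate against the question and keeps the topN most relevant. */
function rerank(
  question: string,
  candidates: readonly Candidate[],
  scorer: RelevanceScorer,
  topN: number,
): RankedCandidate[] {
  return candidates
    .map((c) => ({ id: c.id, content: c.content, relevance: scorer(question, c.content) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN);
}

// Toy stand-in scorer: the fraction of question terms that appear in the passage.
const termOverlapScorer: RelevanceScorer = (question, passage) => {
  const terms = question.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const text = passage.toLowerCase();
  const matched = terms.filter((t) => text.indexOf(t) !== -1).length;
  return terms.length === 0 ? 0 : matched / terms.length;
};

// Made-up candidates in the order a first-stage search might have returned them.
const candidates: Candidate[] = [
  { id: "a", content: "The XR-100 has a battery capacity of 3000 mAh." },
  { id: "b", content: "The XR-200 ships in black and silver." },
  { id: "c", content: "XR-200 battery capacity: 4500 mAh." },
];

const top = rerank("XR-200 battery capacity", candidates, termOverlapScorer, 2);
console.log(top.map((c) => c.id)); // → ["c", "a"]
```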
4. Pitfall ③: access control, a "quiet information leak"
In multi-tenant RAG (multiple customers or departments sharing the same platform), the most dangerous pitfall is missing access control. If you put every customer's documents in one vector DB and don't filter by tenant at search time, a confidential document from company B can match company A's question and be passed to the LLM as context. This kind of information leak happens quietly.
This can't be prevented by "being careful later." As in the code example in section 2, you need to build the tenant ID into the search query's conditions and structurally exclude other tenants' documents. In a business system it is also essential to narrow the visible documents further by each user's viewing permissions (row-level access control).
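One way to keep both conditions structural is to generate the access-control part of the WHERE clause from the authenticated user in a single place, so no query can be written without it. The sketch below assumes a hypothetical `allowed_group_ids text[]` column on `chunks` and a hypothetical `Principal` type; neither appears in the query shown earlier. It relies on PostgreSQL's array overlap operator `&&`.

```typescript
// The authenticated caller. It must come from the session, never from request input.
interface Principal {
  readonly tenantId: string;
  readonly groupIds: readonly string[];
}

interface AccessFilter {
  readonly sql: string; // parameterized WHERE fragment
  readonly params: readonly unknown[]; // values for the placeholders, in order
}

/**
 * Builds a WHERE fragment that restricts rows to the caller's tenant and to chunks
 * whose allowed groups overlap the caller's groups (hypothetical column name).
 */
function buildAccessFilter(principal: Principal, firstParamIndex = 1): AccessFilter {
  if (principal.tenantId.trim() === "") {
    throw new Error("tenantId is required for every search");
  }
  const tenantParam = `$${firstParamIndex}`;
  const groupsParam = `$${firstParamIndex + 1}`;
  return {
    sql: `tenant_id = ${tenantParam} AND allowed_group_ids && ${groupsParam}::text[]`,
    params: [principal.tenantId, principal.groupIds.slice()],
  };
}

const filter = buildAccessFilter({ tenantId: "tenant-a", groupIds: ["sales", "support"] });
console.log(filter.sql);
// → tenant_id = $1 AND allowed_group_ids && $2::text[]
```

A user with no groups produces an empty array, which overlaps nothing, so the filter fails closed and returns no rows rather than all rows.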
Implication for buyers: when commissioning RAG, always ask "what guarantees that other customers' or departments' documents don't mix into the search results?" A vendor who can't answer "we build the tenant ID into the search condition and separate tenants structurally" carries an information-leak risk. As with idempotency in payments, correctness should be ensured by structure, not by operational care.
5. Pitfall ④: you can't improve without evaluation
Finally, the most overlooked and most important pitfall: having no mechanism for evaluation (accuracy measurement).
You can't improve RAG on "it kind of feels better." Production-quality RAG needs a mechanism that:
- prepares a set of representative questions with their correct context and answers (an evaluation dataset),
- measures whether the search retrieves the correct context (retrieval accuracy) and whether the answer is faithful to that context (no hallucination),
- runs these measurements automatically on every change.
Only with this can you judge in numbers whether, say, a chunking change raised or lowered accuracy, and keep improving. In a generative-AI review-support tool for a broadcaster, I attached grounding-derived citations to the answers so that the source document behind each answer was traceable. Traceable context pays off for both evaluation and root-cause investigation.
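The retrieval half of such an evaluation can be sketched in a few lines. The example below computes hit rate and mean recall over the top k results and fails when recall drops below a baseline, which is the kind of check that can run on every change. The types, the sample data, and the threshold are all assumptions for illustration; measuring answer faithfulness needs a separate mechanism and is not covered here.

```typescript
interface EvalResult {
  readonly question: string;
  readonly relevantChunkIds: readonly string[]; // labeled correct chunks
  readonly retrievedChunkIds: readonly string[]; // what the search returned, best first
}

interface RetrievalReport {
  readonly hitRate: number; // share of questions with at least one relevant chunk in the top k
  readonly meanRecall: number; // average share of relevant chunks found in the top k
}

function evaluateRetrieval(results: readonly EvalResult[], k: number): RetrievalReport {
  if (results.length === 0) throw new Error("evaluation dataset is empty");
  let hits = 0;
  let recallSum = 0;
  for (const result of results) {
    const topK = new Set(result.retrievedChunkIds.slice(0, k));
    const found = result.relevantChunkIds.filter((id) => topK.has(id)).length;
    if (found > 0) hits += 1;
    if (result.relevantChunkIds.length > 0) {
      recallSum += found / result.relevantChunkIds.length;
    }
  }
  return { hitRate: hits / results.length, meanRecall: recallSum / results.length };
}

// Made-up evaluation run: three questions with labeled and retrieved chunk IDs.
const evalResults: EvalResult[] = [
  { question: "XR-200 specs?", relevantChunkIds: ["a", "b"], retrievedChunkIds: ["a", "x", "y"] },
  { question: "Return period?", relevantChunkIds: ["c"], retrievedChunkIds: ["z", "c"] },
  { question: "Warranty length?", relevantChunkIds: ["d"], retrievedChunkIds: ["q", "r"] },
];

const report = evaluateRetrieval(evalResults, 3);
console.log(report); // meanRecall is 0.5, hitRate is 2/3

// Hypothetical regression gate: fail the run when recall falls below the agreed baseline.
const MIN_MEAN_RECALL = 0.4;
if (report.meanRecall < MIN_MEAN_RECALL) {
  throw new Error(`mean recall ${report.meanRecall} is below baseline ${MIN_MEAN_RECALL}`);
}
```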
FAQ
Q. RAG isn't answering well. Will switching the LLM to a smarter model fix it?
In most cases, no. RAG accuracy is determined almost entirely by the quality of the context passed to the model, so if the retrieved context is wrong, even the strongest model answers incorrectly. What needs fixing is the search layer. Start by raising the quality of the context: use hybrid search that adds keyword search instead of relying on vector search alone, revisit chunking, and introduce reranking.
Q. Is RAG complete once I put documents in a vector DB?
That is only the prototype stage. Production quality needs hybrid search (meaning + exact match), proper chunking, reranking, access control, and an evaluation mechanism. In particular, vector search alone misses proper nouns, model numbers, and technical terms, so combining it with full-text search is close to essential in practice.
Q. In multi-tenant RAG, can other companies' information leak?
Yes, if the design is lax. If you put all tenants' documents in one DB and don't filter by tenant at search time, other companies' confidential documents mix into the results. Preventing this requires a design that builds the tenant ID into the search query's conditions and structurally excludes other tenants. Always confirm this point when commissioning the work.
Q. How do you measure RAG's accuracy?
Prepare a set of representative questions with their correct context and answers (an evaluation dataset), and automatically measure, on every change, whether the search retrieves the correct context (retrieval accuracy) and whether the answer is faithful to that context (whether it hallucinates). Without this evaluation foundation, you are stuck at "it feels better" and improvement stalls. When commissioning, always ask how accuracy will be measured.
Q. I want to outsource RAG introduction. What should I confirm?
Ask "do you use hybrid search?", "how do you design chunking and reranking?", "how do you guarantee multi-tenant information separation?", and "how do you measure and improve accuracy?" A vendor who can answer these in terms of concrete design is trustworthy. A proposal that amounts to "just put it in a vector DB and search" carries a high risk of insufficient accuracy in production.
Summary: RAG's accuracy is decided by the "quality of retrieval"
To run RAG in production at practical quality, these are the key points.
- The cause of failure is retrieval quality, not the LLM's intelligence: the quality of the context you pass decides accuracy.
- Vector search alone isn't enough: hybrid search that adds keyword search, followed by reranking, is the standard.
- Chunking design affects accuracy: split by semantic units and overlap appropriately.
- In multi-tenant systems, enforce access control at the search layer: build the tenant ID into the search condition and separate tenants structurally.
- You can't improve without an evaluation mechanism: prepare an accuracy-measurement foundation from the start.
"I put in RAG but accuracy isn't there." "I want to build an internal-document AI." Whether such projects succeed is decided by the search-layer design, access control, and evaluation foundation. I take on this work from requirements definition through accuracy design, security, and operations, at the same standard that structurally eliminated wrong answers about specialized products.