Why production RAG fails: the design that raises accuracy to practical quality, and what buyers should demand

Let me state the conclusion first. When RAG (retrieval-augmented generation) works in a demo but answers wrong, runs slow, or leaks information in production, the cause is usually not that the LLM isn't smart enough. It's that the quality of retrieval is low. RAG's accuracy is decided far less by the model's intelligence than by the quality of the "basis" (the context) you pass to it. Pass the wrong basis, and even the smartest model gets it wrong. Naive RAG that just puts documents in a vector DB and passes along the similarity-search results is very likely to fall short on accuracy in production.

This article draws on my experience running, in production, a voice-concierge AI that used RAG to structurally eliminate wrong answers about specialized products. It covers (1) the typical pitfalls where naive RAG fails, (2) the design that raises it to practical quality (with a hybrid-search implementation), and (3) what buyers should demand. For the decision of whether to adopt RAG at all, see the RAG vs fine-tuning article; for a deeper implementation, see production RAG with pgvector.

1. Why "naive RAG" fails in production

A RAG prototype is astonishingly easy to build. Put documents in a vector DB, vectorize the question, run a similarity search, and pass the hits to the LLM, and it works plausibly enough. The problem is that the accuracy of this naive configuration collapses once it is exposed to the diverse questions of production.

The failures happen in predictable places.

Pitfall	What happens
Insufficient search accuracy	Vector similarity alone can't pick up proper nouns, model numbers, or abbreviations, so the retrieved basis misses the mark
Chunking failure	If documents are split too coarsely the basis is diluted; too finely and it is fragmented
Absence of reranking	The top search hits aren't necessarily ordered by usefulness for the answer
Absence of evaluation	Accuracy can't be measured, so there's no way to improve
Missing access control	In a multi-tenant setup, another company's documents mix into the results (information leak)
Poor freshness management	An outdated document becomes the basis and contradicts the latest information

People tend to think "switch the LLM to a smarter model and it'll be fixed," but if the basis is wrong, changing the model won't fix it. What should be fixed is the search layer.

2. Pitfall ①: vector search alone isn't enough → hybrid search

The most common failure is relying on vector similarity search alone. Vector search is good at finding semantically near documents, but weak at exact matching of proper nouns, model numbers, technical terms, and abbreviations. Asked for "the specs of model number XR-200," it may return a different model number that merely looks similar.

The standard answer in production is hybrid search: a design that combines vector search (meaning) with keyword/full-text search (exact match) and merges the two result sets.

Here's an implementation example with PostgreSQL (pgvector). Run vector search and full-text search separately and integrate the rankings with Reciprocal Rank Fusion (RRF). The security point is to build access control (tenant separation) into the search query itself.

interface RetrievedChunk {
  readonly id: string;
  readonly content: string;
  readonly score: number;
}

interface HybridSearchParams {
  readonly tenantId: string;          // access control: always include in the search condition
  readonly queryEmbedding: number[];  // embedding vector of the question
  readonly queryText: string;         // original question text (for full-text search)
  readonly limit: number;
}

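const RRF_K = 60; // RRF smoothing constant (a conventional value that softens the influence of rank)

// Assumed minimal shape of the database client used below; adapt it to your driver.
interface Database {
  query<T>(sql: string, params: readonly unknown[]): Promise<T[]>;
}

// Assumed helper: serializes an embedding into pgvector's text form, e.g. "[0.1,0.2]".
function toVector(embedding: readonly number[]): string {
  return `[${embedding.join(",")}]`;
}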

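/**
 * Hybrid search that merges vector search and full-text search.
 * - Always includes tenant_id in WHERE, structurally excluding other tenants' documents (prevents information leaks)
 * - Parameterized queries only (guards against SQL injection)
 * - Merges the two rankings with Reciprocal Rank Fusion to get both semantic and exact matches
 */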
export async function hybridSearch(
  db: Database,
  params: HybridSearchParams,
): Promise<readonly RetrievedChunk[]> {
  const { tenantId, queryEmbedding, queryText, limit } = params;

  // The query returns id, content, and rrf; score is derived from rrf below.
  const rows = await db.query<{ id: string; content: string; rrf: number }>(
    `
    WITH vector_hits AS (
      SELECT id, content,
             ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
      FROM chunks
      WHERE tenant_id = $1                       -- access control is enforced in the search layer
      ORDER BY embedding <=> $2                  -- nearest first by cosine distance
      LIMIT 50
    ),
    keyword_hits AS (
      SELECT id, content,
             ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('simple', $3)) DESC
             ) AS rank
      FROM chunks
      WHERE tenant_id = $1
        AND content_tsv @@ plainto_tsquery('simple', $3)  -- full-text search (strong at exact matches)
      ORDER BY rank                              -- without this, LIMIT may keep an arbitrary 50 rows
      LIMIT 50
    )
    SELECT COALESCE(v.id, k.id) AS id,
           COALESCE(v.content, k.content) AS content,
           -- Reciprocal Rank Fusion: sum 1/(k+rank) across both rankings
           COALESCE(1.0 / ($4 + v.rank), 0) + COALESCE(1.0 / ($4 + k.rank), 0) AS rrf
    FROM vector_hits v
    FULL OUTER JOIN keyword_hits k ON v.id = k.id
    ORDER BY rrf DESC
    LIMIT $5
    `,
    [tenantId, toVector(queryEmbedding), queryText, RRF_K, limit],
  );

  return rows.map(({ id, content, rrf }) => ({ id, content, score: rrf }));
}

In my voice-concierge AI too, I combined semantic search with exact-match search so that the model numbers and proper nouns of specialized products are picked up correctly. I was able to structurally eliminate wrong answers not because I made the model smarter, but because I secured the quality of the basis at the search layer.
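
To make the rank-fusion step in the query above concrete, here is a minimal in-memory sketch of Reciprocal Rank Fusion. It is an illustration rather than production code: the function name reciprocalRankFusion and the chunk IDs are made up for this example, and it assumes every input list is already ordered best-first.

```typescript
// Reciprocal Rank Fusion over any number of ranked ID lists.
// Each list is ordered best-first; ranks start at 1.
function reciprocalRankFusion(
  rankedLists: readonly (readonly string[])[],
  k = 60, // conventional smoothing constant, same value as RRF_K above
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      // An ID missing from a list simply contributes nothing for that list.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical chunk IDs, for illustration only.
const vectorRanking = ["chunk-a", "chunk-b", "chunk-c"];
const keywordRanking = ["chunk-c", "chunk-a", "chunk-d"];
const fused = reciprocalRankFusion([vectorRanking, keywordRanking]);
```

With these inputs, chunk-a and chunk-c come out on top because both lists contain them. That is the behavior you want when a passage matches a model number exactly and is also semantically close to the question.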

3. Pitfall ②: chunking and reranking

Chunking (splitting documents)

In RAG, you split long documents into "chunks" and index them. This splitting design greatly affects accuracy.

Too coarse (one chunk is long) → the relevant spot is diluted by irrelevant text, blurring the basis.
Too fine (one chunk is short) → context is lost, and fragments alone make no sense.

In practice, the norm is to chunk by semantic units (headings, paragraphs, clauses) and slightly overlap with adjacent chunks. The optimal split changes by document type (manual, regulation, FAQ, chat history), so this requires tuning.
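
As a rough sketch of "split by semantic units and overlap slightly," the function below treats blank-line-separated paragraphs as the semantic unit, packs them into chunks up to a character budget, and repeats the last paragraph of each chunk at the start of the next. The name chunkByParagraph, the 800-character default, and the one-paragraph overlap are assumptions for illustration; real documents usually need heading-aware or clause-aware splitting and per-type tuning.

```typescript
interface TextChunk {
  readonly index: number;
  readonly text: string;
}

// Packs paragraphs into chunks that aim for at most maxChars characters.
// A single oversized paragraph, or the overlap plus one paragraph, can still exceed the budget.
function chunkByParagraph(
  source: string,
  maxChars = 800, // assumed budget; tune per document type
  overlapParagraphs = 1, // assumed overlap; 0 disables it
): TextChunk[] {
  const paragraphs = source
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: TextChunk[] = [];
  let current: string[] = [];
  let freshCount = 0; // paragraphs in `current` that are not carried-over overlap

  const flush = (): void => {
    if (freshCount === 0) return; // nothing new since the last chunk
    chunks.push({ index: chunks.length, text: current.join("\n\n") });
    current = overlapParagraphs > 0 ? current.slice(-overlapParagraphs) : [];
    freshCount = 0;
  };

  for (const paragraph of paragraphs) {
    const candidateLength = current.concat(paragraph).join("\n\n").length;
    if (freshCount > 0 && candidateLength > maxChars) flush();
    current.push(paragraph);
    freshCount += 1;
  }
  flush();
  return chunks;
}
```

Overlapping by whole paragraphs rather than by a fixed character count keeps each chunk readable on its own, at the cost of less predictable chunk sizes.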

Reranking

The top search results aren't necessarily ordered by how useful they are for the answer. So take the candidates retrieved by search (for example, the top 50), score each one's relevance to the question with a more precise model, and reorder them (reranking). This raises the quality of the final basis passed to the LLM. It trades cost for accuracy, but it pays off substantially in accuracy-critical work.
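
The reranking step can be sketched as a small function that takes the retrieved candidates, asks a scorer for the relevance of each one, and keeps the best few. Everything named here is hypothetical: RelevanceScorer stands in for whatever cross-encoder or reranking service you choose, and termOverlapScorer is a toy replacement so the example runs without any external model.

```typescript
interface Candidate {
  readonly id: string;
  readonly content: string;
}

// Hypothetical scorer: returns a relevance score for a (question, passage) pair.
// In practice this would call a more precise model than the first-stage search.
type RelevanceScorer = (question: string, passage: string) => Promise<number>;

async function rerank(
  question: string,
  candidates: readonly Candidate[],
  scoreRelevance: RelevanceScorer,
  topN: number,
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (candidate) => ({
      candidate,
      relevance: await scoreRelevance(question, candidate.content),
    })),
  );
  return scored
    .sort((a, b) => b.relevance - a.relevance) // most relevant first
    .slice(0, topN)
    .map((entry) => entry.candidate);
}

// Toy scorer for demonstration only: counts question terms that appear in the passage.
const termOverlapScorer: RelevanceScorer = async (question, passage) => {
  const terms = question.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = passage.toLowerCase();
  return terms.filter((term) => haystack.includes(term)).length;
};
```

Because every candidate costs one scorer call, the number of candidates you pass in (50 in the example above) is the main lever on the cost side of the trade-off.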

4. Pitfall ③: access control, a "quiet information leak"

In multi-tenant RAG (multiple customers or departments sharing the same platform), the most dangerous pitfall is missing access control. If you put every customer's documents in one vector DB and don't filter by tenant at search time, company B's confidential document can match company A's question and be passed along as the basis. The leak happens quietly.

This can't be prevented by "being careful later." As in the code example in section 2, you need to build the tenant ID into the search query's condition itself and structurally exclude other tenants' documents. In a business system it is also essential to narrow the searchable documents by each user's viewing permissions (row-level access control).
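
One way to carry the same idea into application code is to bind the tenant once, from the authenticated session, so that individual request handlers can neither pass nor forget a tenant ID. The sketch below is an assumption-laden illustration: forTenant, SearchFn, and the tenantId field on returned chunks are hypothetical names, and the final check is defense in depth on top of the WHERE clause, not a replacement for it.

```typescript
interface ScopedSearchParams {
  readonly tenantId: string;
  readonly queryText: string;
  readonly limit: number;
}

interface ScopedChunk {
  readonly id: string;
  readonly tenantId: string; // assumed to be returned by the search layer
  readonly content: string;
}

type SearchFn = (params: ScopedSearchParams) => Promise<readonly ScopedChunk[]>;

// Returns a search function that is permanently bound to one tenant.
function forTenant(search: SearchFn, tenantId: string) {
  if (tenantId.trim() === "") throw new Error("tenantId is required");
  return async (queryText: string, limit = 10): Promise<readonly ScopedChunk[]> => {
    const hits = await search({ tenantId, queryText, limit });
    // Fail closed if the search layer ever returns another tenant's row.
    const foreign = hits.find((hit) => hit.tenantId !== tenantId);
    if (foreign) throw new Error(`cross-tenant chunk returned: ${foreign.id}`);
    return hits;
  };
}
```

Handlers then receive only the bound function, so the question "did this call site remember the tenant filter?" no longer comes up in code review.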

Implication for buyers: when commissioning RAG, always ask "what guarantees that other customers' or departments' documents don't mix into the search results?" A vendor who can't answer "we build the tenant ID into the search condition and separate tenants structurally" carries an information-leak risk. As with idempotency in payments, correctness should be ensured by structure, not by operational care.

5. Pitfall ④: you can't improve without evaluation

Finally, the most overlooked and most important pitfall: having no mechanism for evaluation (accuracy measurement).

"It kind of feels better" can't improve RAG. Production-quality RAG needs a mechanism that:

prepares a set of representative questions with their "correct basis/answers" (an evaluation dataset),
measures whether retrieval returns the correct basis (retrieval accuracy) and whether the answer is faithful to the basis (no hallucination),
and does both automatically on every change.
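
The retrieval half of such a mechanism can be sketched in a few lines. The example below assumes the evaluation dataset records, for each question, the IDs of the chunks that should be retrieved, and it computes recall and hit rate over the top k results. The names EvalCase, scoreRetrieval, and summarize are made up for this sketch, and measuring faithfulness of the generated answer is a separate step that it does not cover.

```typescript
interface EvalCase {
  readonly question: string;
  readonly expectedChunkIds: readonly string[]; // the "correct basis" for this question
}

interface EvalResult {
  readonly question: string;
  readonly recall: number; // share of expected chunks found in the top k
  readonly hit: boolean; // at least one expected chunk found in the top k
}

function scoreRetrieval(
  evalCase: EvalCase,
  retrievedIds: readonly string[], // search output, best-first
  k: number,
): EvalResult {
  const topK = new Set(retrievedIds.slice(0, k));
  const found = evalCase.expectedChunkIds.filter((id) => topK.has(id)).length;
  const total = evalCase.expectedChunkIds.length;
  return {
    question: evalCase.question,
    recall: total === 0 ? 0 : found / total,
    hit: found > 0,
  };
}

function summarize(results: readonly EvalResult[]): { meanRecall: number; hitRate: number } {
  if (results.length === 0) return { meanRecall: 0, hitRate: 0 };
  const meanRecall = results.reduce((sum, r) => sum + r.recall, 0) / results.length;
  const hitRate = results.filter((r) => r.hit).length / results.length;
  return { meanRecall, hitRate };
}
```

Running this over the whole dataset in CI, and failing the build when the numbers drop below the previous run, is one way to make "on every change" automatic.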

Only with this can you judge in numbers whether, say, a chunking change raised or lowered accuracy, and keep improving. In a generative-AI review-support tool for a broadcaster, I attached grounding-derived citations to the answers so that the document behind each answer was traceable. A traceable basis pays off in both evaluation and root-cause investigation.
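
One simple way to get this kind of traceability, not necessarily how the tool mentioned above did it, is to number each chunk in the context passed to the model and keep a lookup table so that a label in the answer can be resolved back to its source. The names buildCitedContext, resolveCitations, and documentTitle are assumptions for this sketch, and it presumes the model has been instructed to cite sources as [1], [2], and so on.

```typescript
interface SourceChunk {
  readonly id: string;
  readonly documentTitle: string;
  readonly content: string;
}

interface LabeledSource {
  readonly label: string; // "[1]", "[2]", ...
  readonly chunk: SourceChunk;
}

// Numbers each chunk and renders the context block that goes into the prompt.
function buildCitedContext(chunks: readonly SourceChunk[]): {
  contextBlock: string;
  sources: LabeledSource[];
} {
  const sources = chunks.map((chunk, i) => ({ label: `[${i + 1}]`, chunk }));
  const contextBlock = sources
    .map(({ label, chunk }) => `${label} (${chunk.documentTitle})\n${chunk.content}`)
    .join("\n\n");
  return { contextBlock, sources };
}

// Resolves the labels that actually appear in an answer back to their chunks.
function resolveCitations(answer: string, sources: readonly LabeledSource[]): SourceChunk[] {
  return sources.filter(({ label }) => answer.includes(label)).map(({ chunk }) => chunk);
}
```

An answer with no resolvable citation is itself a useful signal: it can be flagged for review or counted against faithfulness in the evaluation run.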

FAQ

Q. RAG isn't answering well. Will switching the LLM to a smarter model fix it?

In most cases, no. RAG's accuracy is decided almost entirely by the quality of the basis passed to the AI, so if the basis misses, even the smartest model gets it wrong. What should be fixed is the search layer. Add keyword search to form a hybrid search instead of relying on vector search alone, revisit chunking, and introduce reranking. Raising the quality of the basis with these comes first.

Q. Is RAG complete once I put documents in a vector DB?

That's the prototype stage. Production quality needs hybrid search (meaning + exact match), proper chunking, reranking, access control, and an evaluation mechanism. In particular, proper nouns, model numbers, and technical terms can't be picked up by vector search alone, so combining it with full-text search is close to essential in practice.

Q. In multi-tenant RAG, can other companies' information leak?

It can if the design is lax. If you put all tenants' documents in one DB and don't filter by tenant at search time, other companies' confidential documents mix into the results. To prevent this, a design that builds the tenant ID into the search query's condition and structurally excludes other tenants is essential. Always confirm this point when commissioning.

Q. How do you measure RAG's accuracy?

Prepare a set of representative questions with their "correct basis/answers" (an evaluation dataset), and automatically measure, on every change, whether retrieval returns the correct basis (retrieval accuracy) and whether the answer is faithful to the basis (absence of hallucination). Without this evaluation foundation, you end at "it feels better" and improvement stalls. When commissioning, always ask "how do you measure accuracy?"

Q. I want to outsource RAG introduction. What should I confirm?

Ask "do you use hybrid search," "how do you design chunking and reranking," "how do you guarantee multi-tenant information separation," and "how do you measure and improve accuracy." A vendor who can answer these in structural terms is trustworthy. A proposal of "just put it in a vector DB and search" carries a high risk of falling short on accuracy in production.

Summary: RAG's accuracy is decided by the "quality of retrieval"

To run RAG in production at practical quality, here are the points to keep in mind.

The cause of failure isn't the LLM's smarts but the quality of retrieval: the quality of the basis passed decides accuracy.
Vector search alone isn't enough: hybrid search that adds keyword search, followed by reranking, is the standard.
The design of chunking affects accuracy: split by semantic units and overlap appropriately.
Multi-tenant means access control at the search layer: build the tenant ID into the search condition and separate tenants structurally.
You can't improve without an evaluation mechanism: prepare an accuracy-measurement foundation from the start.

Whether the starting point is "I put in RAG but accuracy isn't there" or "I want to build an internal-document AI," success or failure is decided by the search-layer design, access control, and the evaluation foundation. Drawing on my track record of structurally eliminating wrong answers about specialized products, I take projects on from requirements definition through accuracy design, security, and operations.

Related articles

The cost and break-even of generative AI: a decision guide for API usage vs self-hosting

RAG vs fine-tuning: the cost-effectiveness of which to invest in, and the decision

Self-hosting speech synthesis (TTS) vs ElevenLabs: choose by cost, data sovereignty, and lock-in

Also worth reading

Taking AI-generated code (vibe coding) to production: why the demo works but production breaks, and how to recover quality

This is how the 'you can see other people's data' IDOR vulnerability is born in Supabase — a practical guide to finding and fixing the authorization flaws lurking in AI-generated Next.js code

Breaking out of 'stuck at PoC' when adopting generative AI for your business: the walls to production, and a guide to commissioning in-housing support