友田 陽大

Building a Production RAG System with LangChain + Pinecone: Hallucination Countermeasures and Accuracy Improvement in Practice

A guide to building a RAG system for production use rather than a proof-of-concept environment. It covers five hallucination countermeasures, accuracy-evaluation methods, and cost-optimization strategies, implemented with LangChain + Pinecone + FastAPI and explained with real code.

Author
友田 陽大

"I want to put a RAG system into production, but I can't solve the hallucination problem." "The LangChain tutorial worked, but the accuracy is too low to be practical." "The cost estimate is unclear, and I can't read the budget when scaling."

These are the problems faced by development leads integrating AI features into an existing SaaS, and by engineers taking a RAG system from PoC to production. I hit the same wall.

In this article, I share the implementation of a production-level RAG system built with LangChain + Pinecone + FastAPI. It goes beyond a tutorial and covers the realities of production operation: the 5 hallucination countermeasures, quantitative accuracy evaluation, and a cost estimate at 10,000 queries/month.


Premise: What Is a RAG System

The Basics of RAG (Retrieval-Augmented Generation)

RAG is a technique that retrieves external knowledge and injects it into the prompt of a large language model (LLM) to raise answer accuracy.

The limits of a conventional LLM (ChatGPT, etc.):

  • It doesn't know about anything after its training-data cutoff
  • It can't access a company's internal documents
  • It hallucinates (gives answers contrary to fact)

The solution via RAG:

  1. Search a vector DB for documents relevant to the user's question (Retrieval)
  2. Inject the search results into the prompt and send to the LLM (Augmentation)
  3. The LLM generates an accurate answer based on the search results (Generation)
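
The three steps above can be sketched without any external service. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the `DOCS` dictionary stands in for the vector DB, and `generate` is a placeholder for the LLM call. All names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for the vector DB (hypothetical content).
DOCS = {
    "doc_1": "API keys are issued from the dashboard settings page",
    "doc_2": "The premium plan includes priority support and higher rate limits",
    "doc_3": "Passwords can be reset from the login screen",
}

def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Step 1 (Retrieval): rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(DOCS.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:k]

def augment(question: str, hits: list[tuple[str, str]]) -> str:
    """Step 2 (Augmentation): inject the retrieved text into the prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    """Step 3 (Generation): placeholder for the real LLM call."""
    return f"(LLM answer based on a {len(prompt)}-character prompt)"

question = "Where are API keys issued?"
hits = retrieve(question)
prompt = augment(question, hits)
answer = generate(prompt)
```

In the real system, LangChain, Pinecone, and the OpenAI API take the place of these three stand-ins, as the architecture below shows.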

The Overall System Architecture

┌─────────────┐
│ User        │
└──────┬──────┘
       │ HTTP Request
       ▼
┌──────────────────────────┐
│ FastAPI (API gateway)     │
│ - Input validation        │
│ - Authentication/authz    │
│ - Rate limiting           │
└──────┬───────────────────┘
       │
       ▼
┌──────────────────────────┐
│ LangChain                 │
│ - Query embedding         │
│ - Pinecone search         │
│ - Prompt construction     │
│ - OpenAI API call         │
└──────┬───────────────────┘
       │
       ├─────────────┐
       ▼             ▼
┌────────────┐  ┌──────────────┐
│ Pinecone   │  │ OpenAI API   │
│ Vector DB  │  │ (GPT-4)      │
│ - Doc search│ │ - Answer gen │
└────────────┘  └──────────────┘

The Tech Stack

Layer             Technology               Role
API layer         FastAPI 0.109+           REST API, validation
Orchestration     LangChain 0.1+           RAG-flow control, prompt management
Vector DB         Pinecone                 Embedding-vector search
LLM               OpenAI GPT-4 Turbo       Answer generation
Embedding model   text-embedding-3-small   Vectorization (1536 dimensions)
Deploy            AWS ECS Fargate          Container execution environment

The 5 Hallucination Countermeasures

Countermeasure ①: Forcing Source Citation

Problem: the LLM ignores the search results and generates an answer from its training data.

Solution: state explicitly in the prompt, "answer only from the search results."

from langchain.prompts import PromptTemplate

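# Prompt designed to prevent hallucination
PROMPT_TEMPLATE = """You are an AI assistant that prioritizes providing accurate information.

Answer the user's question using only the search results below.
If the information is not contained in the search results, answer "No information was found."
Never include guesses or speculation.

Search results:
{context}

User's question: {question}

Answer format:
1. Answer (quoted from the search results)
2. Citation (source file name and page number)

Answer:"""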

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

Effect: hallucination rate 40% → 10% (measured on in-house test data)
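
The prompt asks for a file name and page number, so the `{context}` string has to carry them. Below is a minimal sketch of one way to build that string, assuming each retrieved chunk exposes its text, source file, and page. `RetrievedChunk`, `format_context`, `answer_or_refuse`, and the `call_llm` callback are hypothetical names, not part of LangChain.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RetrievedChunk:
    text: str
    source: str  # file name, e.g. "product_manual_v2.pdf"
    page: int

NOT_FOUND = "No information was found."

def format_context(chunks: list[RetrievedChunk]) -> str:
    """Label each chunk with its source so the model can cite file name and page."""
    return "\n\n".join(
        f"[{i}] (source: {chunk.source}, page: {chunk.page})\n{chunk.text}"
        for i, chunk in enumerate(chunks, 1)
    )

def answer_or_refuse(
    chunks: list[RetrievedChunk],
    question: str,
    call_llm: Callable[..., str],
) -> str:
    """Skip the LLM call entirely when retrieval returned nothing."""
    if not chunks:
        return NOT_FOUND
    return call_llm(context=format_context(chunks), question=question)
```

Returning the fixed "not found" message before calling the model means an empty retrieval can never be papered over with an answer from training data.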


Countermeasure ②: Attaching a Confidence Score

Problem: the user can't tell when the LLM is answering with low confidence.

Solution: attach a confidence score (0.0–1.0) to the answer.

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class AnswerWithConfidence(BaseModel):
    """Answer plus a confidence score."""
    answer: str = Field(description="Answer shown to the user")
    confidence: float = Field(
        description="Confidence score (0.0-1.0): 1.0 when the search results contain a clear answer, 0.5 or lower when guessing",
        ge=0.0,
        le=1.0
    )
    sources: list[str] = Field(description="IDs of the cited documents")

# Structured output via format instructions in the prompt.
# PydanticOutputParser does not use OpenAI function calling: it asks the
# model for JSON and validates the reply against the schema above.
parser = PydanticOutputParser(pydantic_object=AnswerWithConfidence)

# Append these instructions to the prompt
format_instructions = parser.get_format_instructions()

Use examples:

  • confidence < 0.7: show a warning "this answer may contain speculation"
  • confidence < 0.5: escalate to a human operator
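
The two thresholds above can be expressed as a small routing function. This is a sketch under the thresholds stated in the list; the `Action` enum and the function name are hypothetical. Keep in mind that the score is self-reported by the model, so the cut-offs are worth re-checking against human-labeled answers.

```python
from enum import Enum

class Action(str, Enum):
    SHOW = "show"          # display the answer as-is
    WARN = "warn"          # display with a "may contain speculation" warning
    ESCALATE = "escalate"  # hand over to a human operator

def route_by_confidence(confidence: float) -> Action:
    """Map a confidence score in [0.0, 1.0] to the action described above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within [0.0, 1.0], got {confidence}")
    if confidence < 0.5:
        return Action.ESCALATE
    if confidence < 0.7:
        return Action.WARN
    return Action.SHOW
```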

Countermeasure ③: Improving Retrieval Accuracy (Chunk-Size Optimization)

Problem: the document's splitting method is inappropriate, and context is lost.

Solution: optimize chunk size and overlap.

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

# Chunk settings (chosen by experiment)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # measured by length_function: with len this is characters, not tokens
    chunk_overlap=50,      # overlap between neighboring chunks (preserves context)
    length_function=len,   # to count tokens, use RecursiveCharacterTextSplitter.from_tiktoken_encoder
    separators=["\n\n", "\n", "。", ".", " ", ""]  # includes the Japanese full stop
)

# Split the document
chunks = text_splitter.split_text(document_text)

# Upload to Pinecone
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Pinecone.from_texts(
    texts=chunks,
    embedding=embeddings,
    index_name="my-knowledge-base",
    namespace="product-docs"
)

Experimental results of parameter selection:

Chunk size   Recall@5   Response time
256          0.65       1.2s
512          0.82       1.5s
1024         0.78       2.1s
2048         0.71       3.5s

Conclusion: chunk_size=512 is the optimal balance of accuracy and speed
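
To make the two parameters concrete, here is a deliberately naive fixed-window chunker. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on the configured separators); it only illustrates how `chunk_size` and `chunk_overlap` interact. The function name is hypothetical and sizes are in characters.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows where neighbors share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

# With chunk_size=4 and overlap=1, each chunk starts with the last
# character of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

A larger overlap keeps more context at chunk boundaries but stores more duplicated text, which is why it is tuned together with the chunk size.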


Countermeasure ④: Metadata Filtering

Problem: irrelevant documents mix into the search results (e.g., an old version of a document).

Solution: narrow down with Pinecone's metadata filter.

# Upload a document chunk together with its metadata
metadata = {
    "source": "product_manual_v2.pdf",
    "page": 42,
    "version": "2.0",
    "category": "API spec",
    "last_updated": "2024-12-01"
}

vectorstore.add_texts(
    texts=[chunk_text],
    metadatas=[metadata]
)

# Filtering at search time
from langchain.vectorstores import Pinecone

vectorstore = Pinecone.from_existing_index(
    index_name="my-knowledge-base",
    embedding=embeddings,
    namespace="product-docs"
)

# Search only the latest version
results = vectorstore.similarity_search(
    query="How do I authenticate with the API?",
    k=5,
    filter={
        "version": {"$eq": "2.0"},
        "category": {"$eq": "API spec"}
    }
)

Effect: contamination rate of irrelevant results 30% → 5%
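
When the filter values come from request parameters, a small builder keeps the filter dictionary consistent. This is a sketch with a hypothetical helper name. It relies on the fact that Pinecone combines multiple top-level fields in a filter with AND.

```python
from typing import Any, Optional

def build_filter(
    version: Optional[str] = None,
    category: Optional[str] = None,
) -> dict[str, Any]:
    """Build a metadata filter containing only the fields that were given."""
    conditions = {"version": version, "category": category}
    return {
        field: {"$eq": value}
        for field, value in conditions.items()
        if value is not None
    }

# Only the version is constrained here; the category is left open.
print(build_filter(version="2.0"))
```

When no condition is given the function returns an empty dictionary, so a caller can pass `build_filter(...) or None` to skip filtering entirely.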


Countermeasure ⑤: The Human-Review Feedback Loop

Problem: even if you detect a hallucination, there's no automatic improvement.

Solution: reflect human feedback into the system.

from datetime import datetime

from fastapi import FastAPI
from pydantic import BaseModel

# save_feedback_to_db, send_alert_to_team, flag_document_for_review,
# get_feedbacks_last_week and send_weekly_report are application-specific
# helpers that are not shown here.

class UserFeedback(BaseModel):
    """User feedback on one answer."""
    query_id: str
    is_helpful: bool
    is_accurate: bool
    comment: str | None = None
    timestamp: datetime

# Feedback-collection API
app = FastAPI()

@app.post("/api/feedback")
async def submit_feedback(feedback: UserFeedback):
    # Save the feedback to the DB
    await save_feedback_to_db(feedback)

    # Detect inaccurate answers
    if not feedback.is_accurate:
        # Send an alert (Slack, etc.)
        await send_alert_to_team(feedback)

        # Flag the documents behind this answer for review
        await flag_document_for_review(feedback.query_id)

    return {"status": "success"}

# Weekly accuracy analysis
async def analyze_feedback_trends():
    feedbacks = await get_feedbacks_last_week()
    if not feedbacks:
        return  # no feedback this week; avoids dividing by zero

    accuracy_rate = sum(f.is_accurate for f in feedbacks) / len(feedbacks)

    if accuracy_rate < 0.85:
        # Accuracy-drop alert
        await send_weekly_report(accuracy_rate)

Operational flow:

  1. Place 👍/👎 buttons on answers
  2. A human reviews inaccurate answers
  3. Improve the relevant document or adjust the chunk splitting
  4. Monitor accuracy trends with a weekly report
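
For steps 2 and 3, it helps to know which documents sit behind the inaccurate answers. The sketch below aggregates a week of feedback. It assumes each stored record also keeps the IDs of the documents cited in the rated answer; that `source_doc_ids` field and the other names are hypothetical and not part of the `UserFeedback` model above.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class FeedbackRecord:
    query_id: str
    is_accurate: bool
    # Hypothetical field: IDs of the documents cited in the rated answer.
    source_doc_ids: list[str] = field(default_factory=list)

def weekly_summary(records: list[FeedbackRecord], top_n: int = 3) -> dict:
    """Return the accuracy rate and the documents most often tied to inaccurate answers."""
    if not records:
        return {"accuracy_rate": None, "worst_documents": []}

    accurate = sum(record.is_accurate for record in records)
    bad_docs = Counter(
        doc_id
        for record in records
        if not record.is_accurate
        for doc_id in record.source_doc_ids
    )
    return {
        "accuracy_rate": accurate / len(records),
        "worst_documents": bad_docs.most_common(top_n),
    }
```

Documents at the top of `worst_documents` are the first candidates for rewriting or re-chunking.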

Pinecone Index Design

The Namespace-Separation Strategy

In Pinecone, you can use namespaces within a single index to logically separate data.

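import os

import pinecone  # pinecone-client v2 API; v3 and later replaced pinecone.init with a Pinecone client class
from langchain.vectorstores import Pinecone

# Initialize Pinecone (the key is read from the environment instead of being hard-coded)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west1-gcp"
)

# Create the index (first run only)
if "my-knowledge-base" not in pinecone.list_indexes():
    pinecone.create_index(
        name="my-knowledge-base",
        dimension=1536,  # dimensionality of text-embedding-3-small
        metric="cosine"   # cosine similarity
    )

# Namespaces and the data each one holds
namespaces = {
    "product-docs": "Product documentation",
    "api-reference": "API reference",
    "faq": "Frequently asked questions",
    "internal-kb": "Internal knowledge base (employees only)"
}

# Example: load the product documentation
vectorstore = Pinecone.from_texts(
    texts=product_doc_chunks,
    embedding=embeddings,
    index_name="my-knowledge-base",
    namespace="product-docs"
)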

Multi-Tenant Support

When separating data per customer in a B2B SaaS:

# One namespace per customer.
# customer_id should come from the authenticated session, never from user input.
customer_namespace = f"customer-{customer_id}"

vectorstore = Pinecone.from_existing_index(
    index_name="my-knowledge-base",
    embedding=embeddings,
    namespace=customer_namespace
)

# Specify the same namespace at search time
results = vectorstore.similarity_search(
    query=user_query,
    k=5,
    namespace=customer_namespace  # prevents cross-tenant data leakage
)
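
Because the namespace is the only boundary between tenants, it is worth validating the ID before building the namespace string. A minimal sketch follows; the allowed character set and the function name are assumptions to adapt to your own ID format.

```python
import re

# Assumed ID format: 1-64 characters of letters, digits, underscore or hyphen.
_CUSTOMER_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")

def tenant_namespace(customer_id: str) -> str:
    """Return the Pinecone namespace for a customer, rejecting malformed IDs."""
    if not _CUSTOMER_ID.fullmatch(customer_id):
        raise ValueError("invalid customer id")
    return f"customer-{customer_id}"
```

For example, `tenant_namespace("acme_01")` gives `customer-acme_01`, while an ID containing spaces or slashes raises `ValueError` before any query is issued.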

Quantitative Methods of Accuracy Evaluation

Evaluation Metrics

Metric                       Definition                                                      Target value
Recall@5                     Proportion of the relevant documents that appear in the top 5   > 0.85
Precision@5                  Proportion of the top 5 results that are relevant               > 0.70
MRR (Mean Reciprocal Rank)   Mean of the reciprocal rank of the first relevant document      > 0.75
Accuracy                     Correctness of the answer (human evaluation)                    > 0.90
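
A worked example for a single query makes the three retrieval metrics concrete. The document IDs are made up for illustration.

```python
retrieved = ["d1", "d7", "d3", "d9", "d4"]  # ranked search results for one query
relevant = {"d3", "d4", "d8"}               # ground-truth documents for that query

k = 5
hits = [doc for doc in retrieved[:k] if doc in relevant]

recall_at_k = len(hits) / len(relevant)  # 2 of the 3 relevant documents were retrieved
precision_at_k = len(hits) / k           # 2 of the 5 results are relevant

# Reciprocal rank: 1 / rank of the first relevant result (d3 is at rank 3).
first_hit_rank = next(
    (rank for rank, doc in enumerate(retrieved, 1) if doc in relevant),
    None,
)
reciprocal_rank = 1 / first_hit_rank if first_hit_rank else 0.0
```

Here Recall@5 is 2/3, Precision@5 is 2/5, and the reciprocal rank is 1/3. MRR is the mean of the reciprocal rank over all test queries.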

The Evaluation Script

from typing import List, Tuple
import numpy as np

class RAGEvaluator:
    """Retrieval-accuracy evaluation for a RAG system."""

    def __init__(self, test_data: List[Tuple[str, List[str]]]):
        """
        Args:
            test_data: [(question, list of ground-truth document IDs), ...]
        """
        self.test_data = test_data

    def calculate_recall_at_k(
        self,
        retrieved_docs: List[str],
        ground_truth: List[str],
        k: int = 5
    ) -> float:
        """Recall@K: share of the relevant documents found in the top K."""
        top_k = set(retrieved_docs[:k])
        relevant = set(ground_truth)

        if not relevant:
            return 0.0

        return len(top_k & relevant) / len(relevant)

    def calculate_precision_at_k(
        self,
        retrieved_docs: List[str],
        ground_truth: List[str],
        k: int = 5
    ) -> float:
        """Precision@K: share of the top K results that are relevant."""
        top_k = set(retrieved_docs[:k])
        relevant = set(ground_truth)

        if not top_k:
            return 0.0

        return len(top_k & relevant) / k

    def calculate_mrr(
        self,
        retrieved_docs: List[str],
        ground_truth: List[str]
    ) -> float:
        """Reciprocal rank for one query (evaluate() averages it into MRR)."""
        for rank, doc_id in enumerate(retrieved_docs, 1):
            if doc_id in ground_truth:
                return 1.0 / rank
        return 0.0

    def evaluate(self, rag_system) -> dict:
        """Evaluate the retrieval side of the whole RAG system."""
        if not self.test_data:
            raise ValueError("test_data is empty")

        recalls = []
        precisions = []
        mrrs = []

        for question, ground_truth in self.test_data:
            # Retrieve with the RAG system (assumes each document stores its ID in metadata["id"])
            retrieved = rag_system.retrieve(question, k=10)
            retrieved_ids = [doc.metadata["id"] for doc in retrieved]

            # Compute the metrics
            recalls.append(self.calculate_recall_at_k(retrieved_ids, ground_truth, k=5))
            precisions.append(self.calculate_precision_at_k(retrieved_ids, ground_truth, k=5))
            mrrs.append(self.calculate_mrr(retrieved_ids, ground_truth))

        return {
            "recall@5": float(np.mean(recalls)),
            "precision@5": float(np.mean(precisions)),
            "mrr": float(np.mean(mrrs))
        }

# Usage example
test_data = [
    ("How do I get an API key?", ["doc_123", "doc_456"]),
    ("What is the difference between the pricing plans?", ["doc_789"]),
    # ... 100 or more test cases
]

evaluator = RAGEvaluator(test_data)
metrics = evaluator.evaluate(my_rag_system)

print(f"Recall@5: {metrics['recall@5']:.2f}")
print(f"Precision@5: {metrics['precision@5']:.2f}")
print(f"MRR: {metrics['mrr']:.2f}")

Evaluation Results in a Real Project

Before improvement:

  • Recall@5: 0.68
  • Precision@5: 0.52
  • MRR: 0.61

After improvement (countermeasures ①–④ applied):

  • Recall@5: 0.87 (+28%)
  • Precision@5: 0.78 (+50%)
  • MRR: 0.81 (+33%)

Cost Estimate: A Real Example at 10,000 Queries/Month

OpenAI API Cost

[Assumptions]
- Queries: 10,000/month
- Average prompt length: 2,000 tokens (5 search results + question)
- Average response length: 500 tokens
- Model: GPT-4 Turbo

[Calculation]
Input:  10,000 queries × 2,000 tokens × $0.01/1K = $200
Output: 10,000 queries × 500 tokens × $0.03/1K = $150
Total: $350/month

Pinecone Cost

[Assumptions]
- Number of indexes: 1
- Number of vectors: 100,000
- Plan: Starter ($70/month for 100K vectors)

[Calculation]
Base fee: $70/month

Embedding-Generation Cost (OpenAI)

[Assumptions]
- New documents added: 1,000/month
- Average document length: 1,000 tokens
- Model: text-embedding-3-small

[Calculation]
1,000 docs × 1,000 tokens × $0.00002/1K = $0.02/month
(virtually negligible)

Infrastructure Cost (AWS ECS Fargate)

[Assumptions]
- Fargate Task: 0.5 vCPU, 1GB RAM
- Uptime: 24 hours × 30 days

[Calculation]
Fargate: $35/month
ALB: $23/month
CloudWatch Logs: $3/month
Total: $61/month

Total Cost

OpenAI API (GPT-4 Turbo): $350/month
Pinecone: $70/month
OpenAI Embeddings: $0.02/month (negligible)
AWS infrastructure: $61/month
────────────────────────
Total: about $481/month (about ¥68,000)
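
The estimate above is simple enough to keep as a small calculator, which makes it easy to re-run when traffic or prices change. The defaults below are the assumptions listed in this section; the function and parameter names are hypothetical, and the Pinecone and AWS fees are treated as fixed monthly amounts.

```python
def monthly_cost(
    queries: int = 10_000,
    prompt_tokens: int = 2_000,
    completion_tokens: int = 500,
    input_price_per_1k: float = 0.01,   # GPT-4 Turbo input, USD per 1K tokens
    output_price_per_1k: float = 0.03,  # GPT-4 Turbo output, USD per 1K tokens
    new_docs: int = 1_000,
    doc_tokens: int = 1_000,
    embedding_price_per_1k: float = 0.00002,  # text-embedding-3-small
    pinecone_fee: float = 70.0,
    aws_fee: float = 61.0,
) -> dict[str, float]:
    """Return the monthly cost breakdown in USD, including a 'total' entry."""
    llm = queries * (
        prompt_tokens * input_price_per_1k + completion_tokens * output_price_per_1k
    ) / 1000
    embeddings = new_docs * doc_tokens * embedding_price_per_1k / 1000

    breakdown = {
        "llm": llm,
        "pinecone": pinecone_fee,
        "embeddings": embeddings,
        "aws": aws_fee,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown

costs = monthly_cost()
print({name: round(value, 2) for name, value in costs.items()})
```

With the defaults this reproduces the $350 LLM cost and the total of about $481. Calling `monthly_cost(queries=50_000)` shows that the LLM line is the one that grows with traffic in this model.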

Cost-Reduction Strategies

Strategy ①: Partial Downgrade to GPT-3.5 Turbo

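import tiktoken

# cl100k_base is the tokenizer used by the GPT-3.5 Turbo and GPT-4 model families
_encoding = tiktoken.get_encoding("cl100k_base")

# Simple questions go to GPT-3.5, complex ones to GPT-4
def select_model(query: str, context: str) -> str:
    # Decide by token count. Splitting on whitespace badly undercounts
    # Japanese text, so count real tokens instead.
    total_tokens = len(_encoding.encode(query)) + len(_encoding.encode(context))

    if total_tokens < 500:
        return "gpt-3.5-turbo"  # $0.001/1K (input)
    return "gpt-4-turbo"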

Effect: route 50% of queries to GPT-3.5 → 30% cost reduction ($350 → $245)

Strategy ②: Caching

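import hashlib

class RAGCache:
    """Cache answers to frequently asked questions."""

    def __init__(self, redis_client):
        # Assumes an async Redis client (for example redis.asyncio.Redis)
        # created with decode_responses=True, so get() returns str rather than bytes.
        self.redis = redis_client
        self.ttl = 3600  # 1 hour

    def get_cache_key(self, query: str, namespace: str) -> str:
        # Include the namespace so tenants never share cached answers.
        # MD5 only shortens the key here; it is not used for security.
        normalized = f"{namespace}:{query.strip().lower()}"
        return f"rag_cache:{hashlib.md5(normalized.encode()).hexdigest()}"

    async def get(self, query: str, namespace: str) -> str | None:
        return await self.redis.get(self.get_cache_key(query, namespace))

    async def set(self, query: str, namespace: str, answer: str):
        await self.redis.setex(self.get_cache_key(query, namespace), self.ttl, answer)

# Usage example
cache = RAGCache(redis_client)

async def rag_with_cache(query: str, namespace: str = "product-docs") -> str:
    # Check the cache
    cached = await cache.get(query, namespace)
    if cached is not None:
        return cached

    # Run RAG
    answer = await rag_system.query(query, namespace=namespace)

    # Store in the cache
    await cache.set(query, namespace, answer)

    return answer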

Effect: cache hit rate 30% → 30% reduction in API calls ($350 → $245)


Production-Operation Pitfalls and Error Handling

Pitfall ①: OpenAI API Rate Limits

Problem: a sudden traffic surge hits the API-call limit.

Solution: Exponential Backoff + Retry

from openai import AsyncOpenAI, RateLimitError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type
)

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5)
)
async def call_openai_with_retry(prompt: str) -> str:
    """Call the OpenAI API, retrying on rate-limit errors."""
    response = await openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    return response.choices[0].message.content
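
For readers who want to see the mechanism without a library, here is a standard-library sketch of exponential backoff with added random jitter, so that many clients hit by the same surge do not all retry at the same instant. It is a simplified model, not tenacity's exact wait schedule; `TransientError` and `retry_with_backoff` are hypothetical names.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate-limit response."""

async def retry_with_backoff(
    func: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    """Call func, retrying on TransientError with capped exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            # The upper bound doubles on every attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Sleep a random time in [0, upper] ("full jitter").
            await asyncio.sleep(random.uniform(0, upper))
    raise RuntimeError("attempts must be at least 1")
```

With tenacity, a comparable effect is usually obtained by choosing a randomized wait strategy instead of writing the loop by hand.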

Pitfall ②: Pinecone Search Timeout

Problem: a search over a large index can take more than 10 seconds and time out.

Solution: a timeout setting + fallback

import asyncio
from typing import List

from langchain.schema import Document

async def search_with_timeout(
    vectorstore,
    query: str,
    k: int = 5,
    timeout: float = 5.0
) -> List[Document]:
    """Vector search with a timeout."""
    try:
        result = await asyncio.wait_for(
            vectorstore.asimilarity_search(query, k=k),
            timeout=timeout
        )
        return result
    except asyncio.TimeoutError:
        # Fallback: keyword search
        # (fallback_keyword_search is an application-specific helper, not shown here)
        return await fallback_keyword_search(query, k=k)

Pitfall ③: Memory Leaks

Problem: LangChain objects keep lingering in memory.

Solution: explicit cleanup

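import gc
from contextlib import asynccontextmanager

from langchain.chat_models import ChatOpenAI
from langchain.vectorstores import Pinecone

@asynccontextmanager
async def rag_session():
    """RAG session management (releases resources automatically)."""
    # Initialization
    vectorstore = Pinecone.from_existing_index(...)
    llm = ChatOpenAI(...)

    try:
        yield vectorstore, llm
    finally:
        # Cleanup: drop this function's references, then ask the GC to collect
        del vectorstore
        del llm
        gc.collect()

# Usage example
async def query_rag(user_query: str) -> str:
    async with rag_session() as (vectorstore, llm):
        # RAG processing
        results = await vectorstore.asimilarity_search(user_query)
        # ...
    # Resources are released automatically on exit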

FastAPI Integration: Implementing the Production API

import logging
import time
from typing import AsyncGenerator

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

app = FastAPI(title="Production RAG API")

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=500)
    # In a multi-tenant setup, check this value against the caller's
    # permissions (or derive it from the auth token) before searching.
    namespace: str = Field(default="product-docs")
    max_results: int = Field(default=5, ge=1, le=10)

class QueryResponse(BaseModel):
    answer: str
    confidence: float
    sources: list[dict]
    processing_time_ms: float

# Initialize the RAG system at startup
# (initialize_rag_system is an application-specific factory, not shown here;
# newer FastAPI versions recommend the lifespan parameter over on_event)
@app.on_event("startup")
async def startup_event():
    global rag_system
    rag_system = initialize_rag_system()

@app.post("/api/query", response_model=QueryResponse)
async def query_endpoint(request: QueryRequest):
    """RAG query endpoint."""
    start_time = time.perf_counter()

    try:
        # Run RAG
        result = await rag_system.query(
            query=request.query,
            namespace=request.namespace,
            k=request.max_results
        )

        processing_time = (time.perf_counter() - start_time) * 1000

        return QueryResponse(
            answer=result.answer,
            confidence=result.confidence,
            sources=[
                {
                    "id": src.id,
                    "title": src.metadata.get("title"),
                    "page": src.metadata.get("page"),
                    "score": src.score
                }
                for src in result.sources
            ],
            processing_time_ms=processing_time
        )

    except Exception:
        # Log the details server-side; don't leak internals to the client
        logger.exception("RAG query failed")
        raise HTTPException(status_code=500, detail="Internal server error")

# Streaming response (real-time answers)
@app.post("/api/query/stream")
async def query_stream_endpoint(request: QueryRequest):
    """Streaming RAG query."""

    async def generate() -> AsyncGenerator[str, None]:
        async for chunk in rag_system.query_stream(request.query):
            yield f"data: {chunk}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream"
    )

# Health check
@app.get("/health")
async def health_check():
    return {"status": "healthy", "service": "rag-api"}
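
One detail of the streaming endpoint deserves care: in the Server-Sent Events format a blank line ends a message, so a chunk that itself contains a newline would be split into separate events on the client. A small framing helper avoids this by putting every line of the chunk on its own `data:` field. The function name is hypothetical.

```python
def sse_event(data: str) -> str:
    """Frame a text chunk as a single Server-Sent Events message."""
    # Normalize line endings, then emit one "data:" field per line.
    lines = data.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    # The trailing blank line marks the end of the event.
    return "".join(f"data: {line}\n" for line in lines) + "\n"
```

Inside `generate()`, yielding `sse_event(chunk)` in place of the hand-built f-string keeps multi-line chunks intact, because the client joins the `data:` lines of one event with newlines again.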

Summary: Success Factors of a Production RAG

Technical Points

  1. Hallucination countermeasures are defense-in-depth: prompt + score + human review
  2. Accuracy evaluation must be quantified: continuous monitoring with Recall/Precision/MRR
  3. Cost is optimized by operational design: caching + model selection + scaling

Operational Points

  1. The feedback loop is the lifeblood: reflect human evaluation weekly
  2. Error handling to an almost-excessive degree: retry + timeout + fallback
  3. Staged release: internal-only → beta → full release

Real-World Data (My Project)

  • Accuracy: Recall@5 0.87, Accuracy 0.92 (human evaluation)
  • Response time: average 1.8s (95th percentile 3.2s)
  • Cost: about ¥70,000/month at 10,000 queries/month (within budget)
  • Availability: 99.8% (monthly downtime < 1 hour)

Next Steps

If you're struggling with RAG-system implementation, feel free to reach out. I'll bring the practical know-how of LangChain/Pinecone/OpenAI to your project.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
