# Build an AI that answers from your own documents, locally: an intro to private RAG (your data never leaves)

> An introduction, written by an engineer who runs RAG in production, to private RAG: building an AI that answers questions about your local PDFs, notes, and internal materials entirely on your own machine, without sending any data outside. It covers how RAG works, a minimal implementation using Ollama's embedding API and cosine similarity, tips for improving accuracy, and the path to production, with type-safe code.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Generative AI, RAG, Ollama, Local LLM, Self-hosting, Type safety
- URL: https://tomodahinata.com/en/blog/private-rag-local-llm-chat-with-your-own-documents
- Category: Local LLMs: AI on your own PC
- Pillar guide: https://tomodahinata.com/en/blog/local-llm-getting-started-ollama-lm-studio-vram-model-selection-guide

## Key points

- RAG (retrieval-augmented generation) is a technique that searches for documents relevant to a question and passes them to the LLM as the basis for its answer. If every step runs locally, your data never leaves your machine.
- It works in three stages: embed the documents (vectorize them), search for documents close to the question, then pass those documents to a local LLM as the basis. All three can run on your own PC.
- A minimal implementation needs only Ollama's embedding API plus cosine similarity. For a few documents, no extra database is needed.
- Accuracy depends mainly on the quality of retrieval. Vector search is weak at proper nouns, so supplement it with keyword search and careful chunking.
- There is a path to production: as documents grow, move to a vector DB (pgvector, etc.); for business use, add access control and evaluation.

---

Let me state the conclusion first. **You can build an AI that answers questions about your local PDFs, notes, and internal materials entirely on your own PC, without sending any data outside.** The mechanism behind it is called **RAG (retrieval-augmented generation).** Instead of pasting materials into ChatGPT and asking, you combine RAG with a local LLM to keep **everything completely private**; that is private RAG. Writing as an engineer who runs RAG in production, I cover the shortest path to understanding RAG, a minimal implementation that runs with just Ollama, tips for improving accuracy, and the path to production.

> This article is an applied edition of the [getting-started guide for local LLMs](/blog/local-llm-getting-started-ollama-lm-studio-vram-model-selection-guide). For the basics of Ollama, see that one.

---

## What is RAG? A technique to "hand it a cheat sheet and let it answer"

LLMs are smart, but **they don't know your local materials (internal regulations, meeting minutes, personal notes).** RAG solves this. The idea is simple: **find documents relevant to the question, pass them to the LLM along with the question as the "basis" (the cheat sheet), and have it answer from that basis.**

Unlike fine-tuning (additionally training the model), RAG **injects knowledge on the spot.** Even if the materials change, you just swap the data, with no need to re-train (this distinction is detailed in [RAG vs fine-tuning](/blog/rag-vs-fine-tuning-cost-effectiveness-decision-guide)).

RAG processing has three stages.

```text
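1. Prepare (once): convert your local documents into embeddings (vectors) and store them
   ↓
2. Search: vectorize the question the same way and find documents close in meaning
   ↓
3. Generate: pass the documents found, together with the question, to the local LLM as the basis for its answer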
```

**Run all three stages inside your own PC, and your data never leaves.** This is the greatest value of private RAG.

---

## Minimal implementation: Ollama's embedding API + cosine similarity

You might assume you need a vector DB, but **for a few documents, you can build it with just Ollama's embedding API and cosine similarity.** No extra database is needed.

### Step 1: Embed documents (vectorize)

"Embedding" means converting text into **an array of numbers that represents its meaning (a vector).** Texts close in meaning have close vectors. Ollama provides embedding models, so you can convert locally.

```ts
/** Converts text to a vector with Ollama's local embedding API. Traffic stays on localhost, so the data never leaves the machine. */
async function embed(
  text: string,
  model = "nomic-embed-text",
  endpoint = "http://localhost:11434",
): Promise<readonly number[]> {
  const res = await fetch(`${endpoint}/api/embeddings`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
    signal: AbortSignal.timeout(30_000), // abort after 30 s so a stalled request cannot hang forever (resilience)
  });
  if (!res.ok) throw new Error(`embed failed: ${res.status} ${res.statusText}`);
  const data = (await res.json()) as { embedding: readonly number[] };
  return data.embedding;
}
```
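
To make the "prepare once" stage concrete, here is a minimal sketch that embeds a list of texts and keeps the results in memory. `buildIndex`, `IndexedChunk`, and `EmbedFn` are hypothetical names I introduce for illustration, not part of Ollama. The embedding function is injected, so you can pass in the `embed` function above, or a fake one in tests.

```ts
/** A document fragment plus its vector (hypothetical type for this sketch). */
interface IndexedChunk {
  readonly text: string;
  readonly embedding: readonly number[];
}

/** Any function that turns text into a vector, e.g. the `embed` function above. */
type EmbedFn = (text: string) => Promise<readonly number[]>;

/**
 * Embeds each text once and returns an in-memory index.
 * Texts are embedded one at a time so a local model is not flooded with requests.
 */
async function buildIndex(
  texts: readonly string[],
  embedFn: EmbedFn,
): Promise<readonly IndexedChunk[]> {
  const index: IndexedChunk[] = [];
  for (const text of texts) {
    const trimmed = text.trim();
    if (trimmed.length === 0) continue; // skip empty fragments
    index.push({ text: trimmed, embedding: await embedFn(trimmed) });
  }
  return index;
}
```

With the real client this would be called as `await buildIndex(texts, (t) => embed(t))`. Because the index is a plain array, you could also serialize it to a JSON file so that the documents are embedded only once.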

### Step 2: Search for documents near the question (cosine similarity)

The "nearness" between vectors is measured with **cosine similarity.** This is a side-effect-free pure function you can write without an external library.

```ts
/** Similarity of two vectors (-1 to 1; closer to 1 means more similar). A pure function, so it is easy to unit-test. */
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // guard against division by zero (robustness)
}

interface Chunk {
  readonly text: string;
  readonly embedding: readonly number[];
}

/** Returns the top k documents in descending order of similarity to the question vector. Retrieval is kept separate from generation (SRP). */
function retrieveTopK(
  queryEmbedding: readonly number[],
  chunks: readonly Chunk[],
  k: number,
): readonly Chunk[] {
  return chunks
    .map((c) => ({ chunk: c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.chunk);
}
```
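
To see the ranking behave as described, the following sketch runs the same logic on three hand-made vectors. It restates the similarity function in compact form (as `toyCosine`) only so that the example runs on its own. The 3-dimensional vectors and the sample sentences are invented for illustration; real embedding models return vectors with hundreds of dimensions.

```ts
// ASSUMPTION: toy data only. Real embeddings come from the embedding model.
interface ToyChunk {
  readonly text: string;
  readonly embedding: readonly number[];
}

/** Same formula as cosineSimilarity above, restated so this block is self-contained. */
function toyCosine(a: readonly number[], b: readonly number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

/** Scores every chunk against the query and sorts by descending similarity. */
function rankToyChunks(
  query: readonly number[],
  chunks: readonly ToyChunk[],
): { text: string; score: number }[] {
  return chunks
    .map((c) => ({ text: c.text, score: toyCosine(query, c.embedding) }))
    .sort((a, b) => b.score - a.score);
}

const toyChunks: readonly ToyChunk[] = [
  { text: "The office closes at 18:00.", embedding: [0.1, 0.9, 0.1] },
  { text: "Expense reports are due on the 5th.", embedding: [0.9, 0.1, 0.0] },
  { text: "Travel costs need a receipt.", embedding: [0.8, 0.2, 0.1] },
];

// Pretend this is the embedding of "When do I submit expenses?"
const toyQuery: readonly number[] = [1, 0, 0];

for (const r of rankToyChunks(toyQuery, toyChunks)) {
  console.log(r.score.toFixed(3), r.text);
}
```

The two expense-related sentences point in nearly the same direction as the query, so they rank above the office-hours sentence. That is the whole search step: no index structure, just one similarity score per chunk and a sort.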

### Step 3: Pass to the local LLM with the basis

Insert the retrieved documents into the prompt as the "basis" and pass it to the local LLM (using the [streaming client from the getting-started guide](/blog/local-llm-getting-started-ollama-lm-studio-vram-model-selection-guide)).

```ts
function buildPrompt(question: string, context: readonly Chunk[]): string {
  const grounding = context.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");
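  // Tell the model explicitly not to answer beyond the basis, to curb hallucination (fabrication)
  return [
    "Answer the question in Japanese, based only on the basis below.",
    "If the basis does not contain the answer, reply \"The documents do not say.\"",
    "",
    `# Basis\n${grounding}`,
    "",
    `# Question\n${question}`,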
  ].join("\n");
}
```
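
The three steps can be wired together roughly as follows. This is a sketch under assumptions, not the streaming client from the getting-started guide: `answerQuestion`, `RagDeps`, and `generateWithOllama` are hypothetical names, the model name `llama3.2` is only a placeholder for whatever model you have pulled, and the generation call uses Ollama's non-streaming `/api/generate` request for brevity. The steps are passed in as dependencies so the pipeline can be tested without a running Ollama.

```ts
interface SourceChunk {
  readonly text: string;
  readonly embedding: readonly number[];
}

/** The four steps of the pipeline, injected so each can be swapped or faked. */
interface RagDeps {
  embed(text: string): Promise<readonly number[]>;
  retrieve(
    queryEmbedding: readonly number[],
    chunks: readonly SourceChunk[],
    k: number,
  ): readonly SourceChunk[];
  buildPrompt(question: string, context: readonly SourceChunk[]): string;
  generate(prompt: string): Promise<string>;
}

/** Non-streaming generation against a local Ollama. ASSUMPTION: model name is a placeholder. */
async function generateWithOllama(
  prompt: string,
  model = "llama3.2",
  endpoint = "http://localhost:11434",
): Promise<string> {
  const res = await fetch(`${endpoint}/api/generate`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`generate failed: ${res.status} ${res.statusText}`);
  const data = (await res.json()) as { response: string };
  return data.response;
}

/** Embed the question, retrieve the top-k chunks, build the prompt, generate. */
async function answerQuestion(
  question: string,
  chunks: readonly SourceChunk[],
  deps: RagDeps,
  k = 3,
): Promise<string> {
  const queryEmbedding = await deps.embed(question);
  const context = deps.retrieve(queryEmbedding, chunks, k);
  return deps.generate(deps.buildPrompt(question, context));
}

// Usage with the functions defined in this article (requires a running Ollama):
// const answer = await answerQuestion("When are expense reports due?", index, {
//   embed,
//   retrieve: retrieveTopK,
//   buildPrompt,
//   generate: generateWithOllama,
// });
```

Keeping the orchestration this thin means the only network traffic is the two `localhost` calls, which makes the "data never leaves" claim easy to verify by reading the code.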

With this, **an AI that answers from your local documents is complete, and it runs entirely locally.** Document embedding, search, and generation all run on `localhost`, so your data never leaves your machine.

---

## Three tips to raise accuracy

The minimal implementation works, but sometimes the answers come back more off-target than you expected. RAG's accuracy is decided almost entirely **not by how smart the LLM is but by the "quality of retrieval."** Here are three tips that pay off immediately.

1. **Tidy up the chunking (how documents are split)**: split long documents at semantic boundaries (headings, paragraphs). Chunks that are too coarse blur the basis; chunks that are too fine lose context. Letting adjacent chunks overlap slightly stabilizes results.
2. **Supplement proper nouns with keyword search**: vector search is good at "semantic nearness" but **weak at exact matches for model numbers, proper nouns, and abbreviations.** Combining it with keyword search (full-text search) into "hybrid search" raises accuracy considerably.
3. **Have it present the basis**: including which document an answer was based on (the `[1][2]` numbering in the example above) lets you notice errors and makes the answers easier to trust.
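
As one way to apply the first tip, the sketch below splits a document at blank lines, packs paragraphs into chunks up to a size limit, and repeats the last paragraph of each chunk at the start of the next one. `chunkByParagraph` is a hypothetical helper, and the defaults (800 characters, one paragraph of overlap) are assumptions to tune against your own documents, not recommended values.

```ts
/**
 * Splits text into paragraph-aligned chunks with overlap.
 * - Paragraphs are separated by blank lines.
 * - A chunk is closed when adding the next paragraph would exceed maxChars
 *   (size counts paragraph characters only, so it is approximate).
 * - The last `overlapParagraphs` paragraphs are carried into the next chunk.
 * A single paragraph longer than maxChars is kept whole rather than cut mid-sentence.
 */
function chunkByParagraph(
  text: string,
  maxChars = 800,
  overlapParagraphs = 1,
): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;

  for (const p of paragraphs) {
    if (current.length > 0 && size + p.length > maxChars) {
      chunks.push(current.join("\n\n"));
      // Carry the tail of the finished chunk forward as shared context.
      current = overlapParagraphs > 0 ? current.slice(-overlapParagraphs) : [];
      size = current.reduce((n, q) => n + q.length, 0);
    }
    current.push(p);
    size += p.length;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

For Markdown notes, splitting at headings first and then applying this function within each section would keep chunks aligned with the document's own structure.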

To push retrieval quality to production level, designs such as hybrid search, reranking, and evaluation pay off. For details, see [production RAG pitfalls and accuracy improvement](/blog/production-rag-pitfalls-accuracy-improvement-guide). In the voice-concierge AI I worked on, I "structurally eliminated wrong answers about specialized products" by building up exactly this quality of retrieval.
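
Hybrid search can be sketched in a few lines. The example below merges a vector ranking with a keyword ranking using reciprocal rank fusion (RRF), one common way to combine rankings; it is my choice for illustration, not necessarily what the linked article uses. The `id` field, `keywordScore`, and `hybridSearch` are hypothetical, and the keyword matcher is a naive substring count standing in for a real full-text search. It splits the query on whitespace, so it would need a proper tokenizer for languages such as Japanese.

```ts
interface TextChunk {
  readonly id: string;
  readonly text: string;
}

/** Naive keyword score: how many query terms appear in the text (case-insensitive). */
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const haystack = text.toLowerCase();
  return terms.filter((t) => haystack.includes(t)).length;
}

/**
 * Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per id,
 * with rank starting at 1. k = 60 is a commonly used default.
 */
function reciprocalRankFusion(
  rankings: readonly (readonly string[])[],
  k = 60,
): Map<string, number> {
  const fused = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      fused.set(id, (fused.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return fused;
}

/**
 * Combines an existing vector ranking (ids, best first) with a keyword ranking.
 * Chunks with no keyword hit are left out of the keyword ranking entirely.
 */
function hybridSearch(
  query: string,
  vectorRanking: readonly string[],
  chunks: readonly TextChunk[],
  topK: number,
): string[] {
  const keywordRanking = chunks
    .map((c) => ({ id: c.id, score: keywordScore(query, c.text) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((r) => r.id);

  const fused = reciprocalRankFusion([vectorRanking, keywordRanking]);
  return Array.from(fused.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id]) => id);
}
```

A chunk that contains the exact model number gets credit from the keyword ranking even when the vector ranking places it low, which is the failure mode hybrid search is meant to cover.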

---

## The path to production: when documents grow, when used in business

The minimal implementation is optimal for the stage of "a few documents, used by yourself." Beyond that, evolve it by scale and use.

| Stage | What to do |
|---|---|
| **Documents exceed hundreds to thousands** | From in-memory brute force to a vector DB (pgvector, etc.). Search becomes fast and scalable |
| **Used by multiple people/departments** | Build access control into the search layer (structurally separate so others' documents don't mix in) |
| **Used for business decisions** | Prepare an evaluation mechanism to measure accuracy and run improvement |
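
An evaluation mechanism can start very small. The sketch below measures how often the chunk you expect appears in the top k results for a hand-written set of questions, often called hit rate at k. `EvalCase`, `Retriever`, and `hitRateAtK` are hypothetical names, and it assumes each chunk has a stable id and that you have labeled which chunk should answer each question.

```ts
/** One labeled question: which chunk id should be retrieved for it. */
interface EvalCase {
  readonly question: string;
  readonly expectedChunkId: string;
}

/** Any retrieval function that returns chunk ids, best first. */
type Retriever = (question: string, k: number) => Promise<readonly string[]>;

/**
 * Fraction of cases whose expected chunk appears in the top k results.
 * Returns 0 for an empty case list to avoid dividing by zero.
 */
async function hitRateAtK(
  cases: readonly EvalCase[],
  retrieve: Retriever,
  k: number,
): Promise<number> {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const ids = await retrieve(c.question, k);
    if (ids.includes(c.expectedChunkId)) hits++;
  }
  return hits / cases.length;
}
```

Running this before and after a change to chunking or search tells you whether retrieval actually improved, instead of judging from a few answers by feel.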

In particular, **for business use, access control (keeping other people's and other departments' documents out of the search results) is essential.** Bolting it on later is dangerous; it needs to be designed into the search condition from the start (see [production RAG pitfalls](/blog/production-rag-pitfalls-accuracy-improvement-guide)). The minimal implementation you try on your own and the production RAG used across a company sit on the same continuum, but production needs production-grade engineering.
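
"Built into the search condition" can be illustrated with the in-memory version. In this sketch, each chunk carries the groups allowed to read it, and chunks the user cannot see are removed before any scoring happens, so they can never reach the prompt. The `allowedGroups` field, `visibleChunks`, and `retrieveForUser` are hypothetical; a real system would take group membership from its authentication layer and apply the same filter inside the database query.

```ts
interface SecuredChunk {
  readonly text: string;
  readonly embedding: readonly number[];
  /** Groups permitted to read this chunk (hypothetical field). */
  readonly allowedGroups: readonly string[];
}

/** Keeps only chunks that share at least one group with the user. */
function visibleChunks(
  chunks: readonly SecuredChunk[],
  userGroups: readonly string[],
): SecuredChunk[] {
  const groups = new Set(userGroups);
  return chunks.filter((c) => c.allowedGroups.some((g) => groups.has(g)));
}

/**
 * Filters first, then ranks. The similarity function is injected,
 * e.g. the cosineSimilarity function from Step 2.
 */
function retrieveForUser(
  queryEmbedding: readonly number[],
  chunks: readonly SecuredChunk[],
  userGroups: readonly string[],
  k: number,
  similarity: (a: readonly number[], b: readonly number[]) => number,
): SecuredChunk[] {
  return visibleChunks(chunks, userGroups)
    .map((c) => ({ chunk: c, score: similarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.chunk);
}
```

Filtering after ranking, or asking the LLM to ignore restricted text, would leave a gap; filtering before scoring makes leakage structurally impossible at this layer.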

---

## FAQ

### Q. What is private RAG?

A way to build an AI that answers questions about your local documents (PDFs, notes, internal materials) without sending data outside. RAG (retrieval-augmented generation) searches for documents relevant to the question and passes them to an LLM as the basis for its answer; private RAG runs that whole process, with a local LLM, inside your own PC. It suits cases where you don't want to enter confidential information into an external AI.

### Q. Does the data really not leave?

If you run document embedding, search, and generation all locally (with Ollama, etc.), the data isn't sent outside. All communication stays on `localhost`. This is the greatest value of private RAG: it lets you handle confidential information, personal data, and unpublished materials with peace of mind.

### Q. Do I need a vector database?

For a few documents, no. You can build it with just Ollama's embedding API and cosine similarity (a pure function writable without an external library). When documents exceed hundreds to thousands and search gets slow, migrating to a vector DB like pgvector makes it fast and scalable.

### Q. RAG or fine-tuning — which should I use?

If you just want it answered based on your local materials, RAG first. Since RAG injects knowledge on the spot, even if the materials change, you just swap the data. Fine-tuning (additional training) is an option for changing the style or behavior, and is unsuited to updating knowledge. For details, see the "RAG vs fine-tuning" article.

### Q. The answers are off-target. How can I raise accuracy?

RAG's accuracy is decided almost entirely by the "quality of retrieval." Before switching the LLM to a smarter model, try (1) tidying up the document chunking, (2) combining keyword search with vector search, which is weak at proper nouns (hybrid search), and (3) having it present the basis. If that's still insufficient, you need production-RAG design like reranking and an evaluation mechanism.

---

## Summary: a "personal AI" that completes locally

To privately build an AI that answers from your local materials, these are the points to grasp.

1. **RAG is a technique to "search for relevant documents and pass them as the basis to the LLM to answer"** — if it completes locally, the data never leaves.
2. **The mechanism is three stages** — embed → search → generate. It can all run inside your own PC.
3. **The minimal implementation is just Ollama's embedding API + cosine similarity** — for a few documents, no DB needed.
4. **The key to accuracy is the "quality of retrieval"** — raise it with chunking, hybrid search, and presenting the basis.
5. **There's a path to production** — it can evolve in stages to a vector DB, access control, and evaluation.

"I want to introduce an AI that answers from internal materials into our business, without sending data outside" — from a minimal implementation you try personally to a production internal RAG with access control and accuracy evaluation, I build it in practice. Feel free to reach out.
