# The complete guide to getting started with local LLMs: run AI on your own PC with Ollama / LM Studio (with model selection by VRAM)

> An engineer who runs LLMs in production explains how to get started with local LLMs: running AI for free, privately, and offline on your own PC. Covers choosing between Ollama and LM Studio, a model-selection table by VRAM that answers the biggest question ("which model runs on my GPU?"), quantization (Q4_K_M), the reality of speed, and code for building your own app on the Ollama API.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Generative AI, LLM, Ollama, Local LLM, Self-hosting, Type safety
- URL: https://tomodahinata.com/en/blog/local-llm-getting-started-ollama-lm-studio-vram-model-selection-guide
- Category: Local LLMs: AI on your own PC

## Key points

- Local LLMs offer three benefits: they are free, private (data never leaves your machine), and work offline. Practical AI runs even on an ordinary PC.
- The biggest question, "which model runs on my GPU," is determined mostly by VRAM. Rule of thumb (Q4 quantization): 7-8B at 8GB, 14B at 16GB, 32B at 24GB, and a 48GB-class setup for 70B.
- The tools are Ollama (easy, with an API) or LM Studio (GUI-focused). With Ollama you can start with the single command `ollama run`.
- Quantization (Q4_K_M) is the practical default: it cuts size to about 1/4 while largely preserving quality. Speed (tok/s) depends on the environment, so always measure on your own PC.
- Ollama exposes an HTTP API locally, so you can build your own chat or internal-document AI (private RAG) without data leaving your machine.

---

Let me state the conclusion first. **Yes, even without a special server, you can run AI (an LLM) "for free, privately, and offline" on an ordinary PC.** A local LLM gives you a ChatGPT-like AI with no monthly subscription, without sending data outside, and without an internet connection. Only one question really matters when getting started: **"which model runs on my GPU (VRAM)?"** This article answers that question with a table first, then covers choosing between Ollama and LM Studio, quantization, the reality of speed, and how to build your own app, all from **the viewpoint of an engineer who actually runs LLMs in production**, so you can get it working quickly and reliably.

> Related articles: the [comparison of AWQ/GPTQ/FP8/GGUF quantization](/blog/qwen3-quantization-awq-gptq-fp8-gguf-comparison-guide) for quantization details, the [comparison of local vs ChatGPT](/blog/local-llm-vs-chatgpt-cost-privacy-offline-comparison) for costs, and the [introduction to private RAG](/blog/private-rag-local-llm-chat-with-your-own-documents) for making the model answer from your own materials.

---

## What is a local LLM? Why is it so popular now?

A local LLM means **running a large language model (LLM) like ChatGPT on your own PC instead of in the cloud.** The reasons its popularity is surging are clear.

| Benefit | Content |
|---|---|
| **Free** | No monthly subscription. Once set up, use it as much as you like (electricity aside) |
| **Privacy** | Input data isn't sent outside. You can handle confidential/personal information |
| **Offline** | Runs without an internet connection: on a plane, or in a company's closed network |
| **No limits, full control** | No usage cap. Freely choose and modify models |

Two things lie behind this: **open-weight models (Llama, Qwen, Gemma, etc.) have become smart enough for practical use**, and **tools like Ollama have made adoption dramatically easier.** Ollama's downloads have grown rapidly over the past few years, a desktop app is now available, and even people unfamiliar with the command line can use it. "Putting AI in your own hands" is no longer just for a handful of engineers.

---

## [The core] Which model runs on my GPU? A model-selection table by VRAM

The first wall you hit with local LLMs is "which model runs on my PC?" The answer is **mostly determined by the amount of GPU VRAM (video memory).** Models with a larger parameter count (7B = 7 billion, and so on) tend to be smarter, but they need proportionally more memory.

The table below gives **a rule of thumb for the required VRAM, assuming practical quantization (Q4_K_M, explained below).** The weight size itself can be computed directly as "parameter count × bits per parameter"; the rest is headroom for context.

| Model scale | VRAM needed (Q4, approx.) | Example hardware | Typical use |
|---|---|---|---|
| **3B** | About 3GB or more | 8GB VRAM / many laptops / CPU-only possible | Light summarization, classification, chat |
| **7–8B** | About 6GB or more | 8GB VRAM (more headroom at 12GB) / M1 or later | Everyday all-rounder (first recommendation) |
| **13–14B** | About 10GB or more | 12–16GB VRAM / M2 Pro or later | A notch smarter, general-purpose |
| **32–34B** | About 20GB or more | 24GB VRAM (RTX 3090/4090) / M-series Pro with 32GB | High quality, reasoning-oriented |
| **70B** | About 42GB or more | 48GB-class or 2 GPUs / Mac with 64GB+ | Top tier (demanding on hardware) |

> **How to read it**: roughly "your GPU's VRAM ÷ 1.3–1.5" is the upper bound for the weight size that runs comfortably. For example, **a 16GB GPU is comfortable up to the 14B class**, and **24GB can aim for the 32B class.** Apple Silicon (M-series) can **use main memory like VRAM**, so a 64GB machine can run 70B, though speed is modest because of memory bandwidth.

Whether a model fits can be checked by calculation. Written as a function with its assumptions stated explicitly, it looks like this.

```ts
/**
 * Roughly estimates whether a model fits on your GPU.
 * Assumes Q4_K_M ≈ 4.5 bits per parameter (a rule of thumb; it varies by quantization method).
 * Adds headroom for context (KV cache). Confirm the final answer on real hardware.
 */
function estimateModelVramGiB(paramsBillions: number, bitsPerParam = 4.5): number {
  const weightBytes = paramsBillions * 1e9 * (bitsPerParam / 8);
  const contextOverheadGiB = 1.5; // rule of thumb; grows with context length and concurrency
  return weightBytes / 1024 ** 3 + contextOverheadGiB;
}

function fitsOnGpu(paramsBillions: number, gpuVramGiB: number): boolean {
  return estimateModelVramGiB(paramsBillions) <= gpuVramGiB;
}

// Example: 8B roughly fits on an 8GB GPU / 70B needs the 48GB class
fitsOnGpu(8, 8);   // → true (limited headroom; keep the context short)
fitsOnGpu(70, 24); // → false (does not fit in 24GB)
```
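
The "VRAM ÷ 1.3–1.5" reading of the table can also be sketched in code. The following is a minimal illustration, not a definitive sizing tool: the list of model classes, the 4.5 bits-per-parameter figure, and the names `weightGiB` and `largestModelClass` are assumptions introduced here for the example.

```ts
// Model classes (in billions of parameters) taken from the table above.
const MODEL_CLASSES_B = [3, 8, 14, 32, 70] as const;

/** Weight size in GiB, assuming ~4.5 bits per parameter (Q4_K_M rule of thumb). */
function weightGiB(paramsBillions: number, bitsPerParam = 4.5): number {
  return (paramsBillions * 1e9 * bitsPerParam) / 8 / 1024 ** 3;
}

/**
 * Largest model class whose weights stay within "VRAM ÷ divisor".
 * divisor = 1.5 is the conservative end of the rule, 1.3 the optimistic end.
 * Returns null when even the smallest class does not fit.
 */
function largestModelClass(gpuVramGiB: number, divisor = 1.5): number | null {
  const weightBudgetGiB = gpuVramGiB / divisor;
  let best: number | null = null;
  for (const paramsBillions of MODEL_CLASSES_B) {
    if (weightGiB(paramsBillions) <= weightBudgetGiB) best = paramsBillions;
  }
  return best;
}

largestModelClass(16);      // → 14 (16GB is comfortable up to the 14B class)
largestModelClass(24);      // → 14 (conservative divisor of 1.5)
largestModelClass(24, 1.3); // → 32 (24GB "can aim for" 32B at the optimistic end)
```

The gap between the two 24GB results is the point of the range: 32B fits only with the optimistic divisor, which matches "can aim for" rather than "comfortable."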

---

## Which tool to use? Ollama vs LM Studio

There are various tools for running local LLMs, but **the first choice is effectively these two.**

| | Ollama | LM Studio |
|---|---|---|
| **Features** | Command line + local API. Lightweight; suited to automation and your own apps | GUI-centric. Searching for models and chatting are intuitive |
| **Who it suits** | Developers, people who want to build things or connect via API | People who want to try a GUI first, non-engineers |
| **How to start** | After installing, run the single command `ollama run <model>` | Choose a model in the app, download it, then chat |
| **API** | Yes (`localhost:11434`) | Yes (OpenAI-compatible server) |

**If in doubt, I recommend Ollama.** It's light to set up and, as described later, **comes with a local API, so it's easy to develop into your own app or internal-document AI.** If you want to start casually with a GUI, LM Studio is comfortable.

---

## Quantization (Q4_K_M) in one line: about 1/4 the size with quality almost preserved

Labels such as **`Q4_K_M` and `Q5`** in model names indicate the quantization level. Quantization is a technique that represents model weights with fewer bits to **make the model smaller.**

- **No quantization (FP16, 16-bit)**: an 8B model is about 16GB for the weights alone → doesn't fit on an ordinary PC.
- **Q4_K_M (about 4.5-bit)**: the same 8B model compresses to about 4.5GB → runs on an 8GB GPU.

In practice, **`Q4_K_M` is the standard choice that "cuts size to about 1/4 with almost no quality loss."** Starting with it is a safe bet. The trade-off is simple: `Q5_K_M` or `Q6` for more quality, `Q3` for a smaller size. The mechanism is explained in detail in the [comparison of quantization methods](/blog/qwen3-quantization-awq-gptq-fp8-gguf-comparison-guide) and [the serving economics of quantization](/blog/llm-quantization-serving-economics-awq-fp8-kv-cache-vram-budget).
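
The 16GB and 4.5GB figures above are plain arithmetic, which the following small sketch reproduces. The helper name `weightSizeGB` is made up for this example, and 4.5 bits per parameter is the article's rule of thumb for `Q4_K_M`, not an exact value for every model file.

```ts
/** Weight size in decimal GB: parameters × bits per parameter ÷ 8 bits per byte. */
function weightSizeGB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * 1e9 * bitsPerParam) / 8 / 1e9;
}

const fp16GB = weightSizeGB(8, 16); // → 16  (8B at FP16)
const q4GB = weightSizeGB(8, 4.5);  // → 4.5 (8B at ~4.5 bits)
const ratio = q4GB / fp16GB;        // → 0.28125, i.e. a little over 1/4
```

This counts weights only. Context (KV cache) comes on top, which is why the VRAM table leaves headroom.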

---

## The reality of speed (tok/s): always measure on your own PC

"How fast does it run?" is the most misunderstood point about local LLMs. To be honest: **speed (generated tokens per second = tok/s) varies greatly with the GPU, model size, quantization, and inference backend, so don't take the "I got XX tok/s" numbers you see online at face value.**

The qualitative reality is this.

- **A small model (3–8B, Q4) on a recent discrete GPU** → usually comfortably faster than human reading speed, so fine for dialogue.
- **A large model (70B, etc.), Apple Silicon's unified memory, or CPU offload** → considerably slower, so suited only to some uses.
- **Decode (generating token by token) tends to be bottlenecked by memory bandwidth.** So shrinking the weights with quantization helps speed as well as size.
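
To see why smaller weights help speed, a rough ceiling can be sketched under a simplifying assumption: for a dense model generating one sequence at a time, every generated token reads all the weights once, so tok/s cannot exceed memory bandwidth divided by weight size. The function name and the 400 GB/s bandwidth below are hypothetical values for illustration. Real speed will be lower, so this does not replace measuring.

```ts
/**
 * Rough upper bound on decode speed (tok/s), assuming a dense model at batch size 1
 * where each generated token reads all weights once. Ignores KV cache and compute.
 */
function decodeCeilingTokPerSec(memoryBandwidthGBps: number, weightSizeGB: number): number {
  if (weightSizeGB <= 0) throw new RangeError("weightSizeGB must be positive");
  return memoryBandwidthGBps / weightSizeGB;
}

// Hypothetical 400 GB/s memory bandwidth, 8B model.
const fp16Ceiling = decodeCeilingTokPerSec(400, 16); // → 25
const q4Ceiling = decodeCeilingTokPerSec(400, 4.5);  // ≈ 88.9
```

On the same hardware, the quantized model's ceiling is 16 ÷ 4.5 ≈ 3.6 times higher, which is the qualitative effect the last bullet describes.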

I usually **run quantized open models in production with vLLM on a GPU server**, and from that experience I can say with confidence: **"you can't know how it feels on your own workload without measuring on your own machine."** With Ollama, running with the `--verbose` flag (for example `ollama run llama3.1:8b --verbose`) prints the generation speed (eval rate) at the end of each answer. Start with a 7–8B model at `Q4_K_M`; if it is slow, go smaller, and if it is fast with room to spare, go larger. **Tune against your own PC.**
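
Speed can also be measured from code. The sketch below assumes the final (non-streaming) response of Ollama's `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (in nanoseconds). The helper names `tokensPerSecond` and `measureDecodeSpeed` are made up for this example, and the prompt is whatever represents your own workload.

```ts
/** Subset of the stats Ollama reports in its final response. */
interface OllamaEvalStats {
  readonly eval_count: number;    // number of generated tokens
  readonly eval_duration: number; // time spent generating, in nanoseconds
}

/** Generated tokens per second, from Ollama's eval stats. */
function tokensPerSecond(stats: OllamaEvalStats): number {
  if (stats.eval_duration <= 0) return 0;
  return stats.eval_count / (stats.eval_duration / 1e9);
}

/** Runs one non-streaming generation against a local Ollama and returns tok/s. */
async function measureDecodeSpeed(
  model: string,
  prompt: string,
  endpoint = "http://localhost:11434",
): Promise<number> {
  const res = await fetch(`${endpoint}/api/generate`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) {
    throw new Error(`Ollama request failed: ${res.status} ${res.statusText}`);
  }
  return tokensPerSecond((await res.json()) as OllamaEvalStats);
}

// Usage (requires a running Ollama with the model pulled):
// console.log(await measureDecodeSpeed("llama3.1:8b", "Explain quantization briefly."));
```

Run it a few times with prompts of realistic length. The first call includes model loading, so later runs reflect steady-state speed better.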

---

## Actually running it: Ollama's shortest procedure, and building your own with the API

### The shortest 3 steps

```bash
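# 1. Install Ollama (official site or a package manager)
# 2. Pull a model and chat with it (the first run downloads the model)
ollama run llama3.1:8b

# 3. That's it. Type a question at the prompt and the answer is generated entirely locally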
```

### Build your own app with the local API (data doesn't leave)

Ollama's true value is that it **exposes an HTTP API locally.** Just by sending requests to `http://localhost:11434`, you can build your own chat app or business tool **without sending any data outside.** A typed streaming client can be written like this.

```ts
interface ChatMessage {
  readonly role: "system" | "user" | "assistant";
  readonly content: string;
}

interface OllamaChunk {
  readonly message?: { readonly content?: string };
  readonly done: boolean;
}

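/**
 * Async generator that sends a chat to the local Ollama and yields generated tokens as they arrive.
 * - Talks only to localhost, so data never leaves the machine (privacy)
 * - Timeout/cancellation via AbortSignal (resilience)
 * - Parses NDJSON line by line; an incomplete line is carried over to the next chunk (robustness)
 */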
async function* streamOllamaChat(
  params: { model: string; messages: readonly ChatMessage[]; signal?: AbortSignal },
  endpoint = "http://localhost:11434",
): AsyncGenerator<string, void, unknown> {
  const res = await fetch(`${endpoint}/api/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model: params.model, messages: params.messages, stream: true }),
    signal: params.signal ?? AbortSignal.timeout(120_000),
  });
  if (!res.ok || res.body === null) {
    throw new Error(`Ollama request failed: ${res.status} ${res.statusText}`);
  }

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  try {
    for (;;) {
      const { value, done } = await reader.read();
      if (done) break;
      buffer += value;
      // NDJSON: process only complete, newline-terminated lines
      let newlineIndex: number;
      while ((newlineIndex = buffer.indexOf("\n")) !== -1) {
        const line = buffer.slice(0, newlineIndex).trim();
        buffer = buffer.slice(newlineIndex + 1);
        if (line === "") continue;
        const chunk = JSON.parse(line) as OllamaChunk;
        const delta = chunk.message?.content;
        if (delta !== undefined && delta !== "") yield delta;
        if (chunk.done) return;
      }
    }
  } finally {
    reader.releaseLock(); // always release the reader's lock on the stream (cleanup)
  }
}

// Usage: write each token to the screen as it arrives (streaming UX)
// for await (const token of streamOllamaChat({ model: "llama3.1:8b", messages })) {
//   process.stdout.write(token);
// }
```

This code has **type safety (no `any`), timeout and cancellation (resilience), line-buffered NDJSON parsing that carries an incomplete line over to the next chunk, and release of the stream reader in `finally`.** Note that a malformed line still makes `JSON.parse` throw, so wrap it in `try`/`catch` if you would rather skip such lines. Building on the local API, you can develop from here into "AI that answers from your own materials ([private RAG](/blog/private-rag-local-llm-chat-with-your-own-documents))."
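
If you want a broken line to be skipped rather than abort the whole stream, the line handling can be pulled into a small pure helper. This is one possible sketch, not part of Ollama's API: the name `parseNdjsonBuffer` and its result shape are invented for this example.

```ts
interface NdjsonParseResult<T> {
  readonly values: T[];     // successfully parsed lines, in order
  readonly rest: string;    // incomplete trailing line to carry over to the next chunk
  readonly skipped: number; // malformed lines that were ignored
}

/** Parses every complete line in the buffer, tolerating malformed ones. */
function parseNdjsonBuffer<T>(buffer: string): NdjsonParseResult<T> {
  const lines = buffer.split("\n");
  // Whatever follows the last newline may be an incomplete line.
  const rest = lines.pop() ?? "";
  const values: T[] = [];
  let skipped = 0;
  for (const raw of lines) {
    const line = raw.trim();
    if (line === "") continue;
    try {
      values.push(JSON.parse(line) as T);
    } catch {
      skipped += 1; // skip the broken line instead of failing the stream
    }
  }
  return { values, rest, skipped };
}

const parsed = parseNdjsonBuffer<{ a: number }>('{"a":1}\n{broken\n{"a":2}\n{"a":');
// parsed.values → [{ a: 1 }, { a: 2 }], parsed.skipped → 1, parsed.rest → '{"a":'
```

In a streaming loop, you would append each new chunk to `rest`, call the helper again, and yield from `values`. Whether silently skipping is acceptable depends on the application, and logging `skipped` is a reasonable middle ground.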

---

## Local vs ChatGPT, which is better in the end?

To state just the conclusion: **using them differently by purpose is the right answer.**

- **A local LLM suits you if**: privacy matters (confidential/personal info), you want to work offline, you use it heavily and want to keep API bills down, or you want to modify it freely.
- **ChatGPT (cloud) suits you if**: you simply want the highest quality, you don't want to invest in hardware, or you want to start right away.

"Local is always cheaper because it's free" isn't necessarily true. An **honest cost comparison** including hardware cost, electricity, and effort is verified in detail in the [local LLM vs ChatGPT article](/blog/local-llm-vs-chatgpt-cost-privacy-offline-comparison).

---

## Frequently asked questions (FAQ)

### Q. Are local LLMs really free?

The software (Ollama / LM Studio) and open-weight models can be used for free, with no monthly subscription. It is not "completely free," though: you pay for the PC's electricity and for the initial hardware investment (GPU, etc.) needed to run it comfortably. If you already have a capable PC, the additional cost is almost only electricity.

### Q. Which model runs on my PC?

It's determined mostly by the amount of GPU VRAM. As a rule of thumb: 7–8B at 8GB, 13–14B at 12–16GB, 32B at 24GB, and 48GB-class VRAM (or 2 GPUs, or a Mac with 64GB+) for 70B. All of these assume Q4 quantization, and "VRAM ÷ 1.3–1.5" is the rough upper bound for weights that fit comfortably. See the model-selection table by VRAM above.

### Q. Does it run even without a GPU (an ordinary laptop or CPU only)?

It does. A small model around 3B often runs at a practical speed on a CPU without a GPU. An Apple Silicon (M-series) Mac can use main memory like VRAM, so it can run relatively large models even without a discrete GPU. First, try a small model.

### Q. Is data really not sent outside?

A local LLM runs the model on your own PC, so input data isn't sent outside (the only network traffic is downloading the model). Requests to the Ollama API also stay on `localhost`. This is the biggest privacy advantage, and it matters most for uses that handle confidential or personal information.

### Q. What is Q4_K_M? Which should I choose?

It's a quantization method (compression that makes the model lighter). `Q4_K_M` is the standard that "cuts size to about 1/4 with almost no quality loss," and starting with it is a safe bet. The trade-off is simple: `Q5_K_M` / `Q6` for higher quality, `Q3` for a smaller size.

### Q. How much speed do I get?

It varies greatly with the GPU, model size, and quantization, so there is no general answer. A small model (3–8B, Q4) on a recent GPU is usually fast enough for dialogue. Large models, CPU, or unified memory are slower. Don't take online numbers at face value; always measure on your own PC, using the generation speed Ollama prints when run with `--verbose`.

---

## Conclusion: first run a Q4 7–8B with `ollama run`

To start local LLMs in the shortest time, the points to grasp are as follows.

1. **The benefits are "free, privacy, offline"**: practical AI runs even on an ordinary PC.
2. **The model that runs is determined by VRAM**: rule of thumb (Q4) is 7-8B at 8GB, 14B at 16GB, 32B at 24GB, and a 48GB-class setup for 70B.
3. **Ollama (easy, with an API) is the first tool to try**: start with the single command `ollama run llama3.1:8b`.
4. **Default to `Q4_K_M` for quantization**: quality almost preserved at about 1/4 the size. Always measure speed on your own PC.
5. **Build your own with the local API**: develop into chat or internal-document AI (private RAG) without data leaving your machine.

If you want to use AI without letting confidential data leave, or to adopt local LLMs seriously at work, I handle this in practice, from individual trials through production operation (own GPU, vLLM, internal RAG). See also [the cost design of self-hosting](/blog/generative-ai-cost-api-vs-self-hosting-decision-guide) and [the design of production RAG](/blog/production-rag-pitfalls-accuracy-improvement-guide); I tailor the work to your requirements.
