友田 陽大
Generative AI, LLMs & RAG
Next.js
TypeScript
WebGPU
WebAssembly
CRDT
Local-First
AI Agent
Edge AI
Architecture Design
Performance
Zero Trust

The End of the Cloud-LLM Economy: The Foundational Theory of the 'Local-First Agentic Web' Designed with Next.js 16 × WebGPU × CRDT

Overcoming the triple suffering that cloud-LLM dependence produces — physical latency, privacy breakdown, and economic unsustainability — with on-device inference on WebGPU, strong eventual consistency via CRDTs, and an autonomous-agent mesh via the Actor model. We design this next-generation Local-First Agentic architecture, going as deep as type-puzzle-grade TypeScript, WGSL compute shaders, and a zero-trust sync protocol.

Published
Reading time
34 min read
Author
友田 陽大

Introduction: The History of Computing Is a Back-and-Forth Motion Between "Centralization and Distribution"

Step back and survey the currents of computer architecture, and you see the same pendulum swinging back and forth.

1960s  Mainframe (centralized: compute is far, terminals are dumb)
1980s  PC revolution (distributed: compute descends to your hands)
2000s  Web / SPA (centralized: compute returns to the server)
2010s  Cloud-native microservices (the apex of centralization)
2020s  Edge / CDN Workers (re-distributed: bring computation geographically closer)
2023-  Generative AI (hyper-centralized: inference consolidates into hyperscalers)
202X-  ?

And in 2026, we stand at the critical point where this pendulum should swing back the other way. There are three reasons, and each is a matter of physics, regulation, and economics rather than technology preference.

The "Triple Suffering" That the Modern Cloud-LLM Architecture Faces

First, the wall of physical latency. The RTT for a Tokyo user to reach a us-east-1 GPU cluster is bounded below by the speed of light in fiber: roughly 11,000 km each way puts the theoretical floor above 100 ms, and real routes typically measure around 150 to 180 ms. On top of that come the load balancer, queue waits, and inference time, so getting "user input → first token" (TTFT) below 800 ms is quite hard. By contrast, on-device inference has no network latency at all. The laws of physics are non-negotiable.

Second, the structural breakdown of privacy. The GDPR, the EU AI Act (the 2026 enforcement phase), HIPAA, and Japan's revised Personal Information Protection Act all make "where data is stored and who accessed it" the company's primary responsibility. A design that "just sends it to OpenAI," including the user's thought process, unfinalized drafts, and internal secrets, is now a management risk in itself. Many executives have yet to notice that every data egress to an LLM adds to the legal department's compounding debt.

Third, economic unsustainability. Assume 1,000,000 MAU × 50 queries/day × 300 output tokens × $0.003/1K tokens (a mid-tier model as of 2026): that is about $45,000 per day, or roughly $1,350,000 per 30-day month if every user is active daily. Even at a 10% daily-active ratio it is still about $135,000/month. This is a "time bomb of variable cost" that eats gross margin as you scale. From a FinOps view, inference cost asymptotes to zero only when the compute sits on the user's side. Since a cloud GPU isn't dedicated to each individual user, it is always billed with idle cost included.
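The arithmetic above is easy to get wrong by a factor of ten, so it helps to run it explicitly. The following is a minimal sketch; the `dailyActiveRatio` parameter and all field names are assumptions introduced here for illustration, not figures from any pricing page.

```typescript
// Hypothetical cost model: every field name here is illustrative.
interface UsageAssumptions {
  monthlyActiveUsers: number;
  dailyActiveRatio: number; // share of MAU active on a given day (assumed)
  queriesPerUserPerDay: number;
  outputTokensPerQuery: number;
  usdPer1kOutputTokens: number;
  daysPerMonth: number;
}

function monthlyInferenceCostUsd(a: UsageAssumptions): number {
  const tokensPerDay =
    a.monthlyActiveUsers *
    a.dailyActiveRatio *
    a.queriesPerUserPerDay *
    a.outputTokensPerQuery;
  const usdPerDay = (tokensPerDay / 1000) * a.usdPer1kOutputTokens;
  return usdPerDay * a.daysPerMonth;
}

const base = {
  monthlyActiveUsers: 1_000_000,
  queriesPerUserPerDay: 50,
  outputTokensPerQuery: 300,
  usdPer1kOutputTokens: 0.003,
  daysPerMonth: 30,
};

// Every monthly user active every day: about $1.35 million per month.
const everyoneDaily = monthlyInferenceCostUsd({ ...base, dailyActiveRatio: 1 });
// One user in ten active on a given day: about $135,000 per month.
const tenPercentDaily = monthlyInferenceCostUsd({ ...base, dailyActiveRatio: 0.1 });
```

Either way the cost scales linearly with usage, which is the point of the argument: there is no economy of scale on the variable part.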

The Answer Returns to "Local-First Software" — But with 2026's Equipment

The seven principles of "Local-First Software" that Ink & Switch presented in 2019 (no spinners / work offline / multi-device / long-term usability / security & privacy / user ownership / collaborative) were, at the time, still a "philosophy." But now, the following three are simultaneously in place.

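  1. The spread of WebGPU: compute shaders ship by default in Chromium 113+, with f16 (shader-f16) and subgroups following as optional features in later releases that must be feature-detected; Safari has enabled WebGPU by default only in recent releases. Capable hardware includes M3/M4-series Apple Silicon, Snapdragon X Elite, and Intel Lunar Lake.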
  2. The maturity of quantization and distillation: 3B–8B-class distilled models (the lineage of Llama 3.2, Phi-4-mini, Gemma 3n, Qwen 2.5, etc.) fit in 2–4GB at Q4 quantization and generate 30–80 tokens per second on consumer devices.
  3. CRDTs going operational: Automerge 2 and Yjs are proven in collaborative editing of millions of rows.

This article makes a single claim: "AI is strongest when it's next to the user." I will show with code that the architecture to implement it is no longer theory but buildable with 2026's web standards alone.

The target stack:

  • Frontend: Next.js 16 App Router (leveraging the RSC + Client Component separation to the extreme)
  • Inference layer: WebGPU compute shaders (WGSL), WASM + SIMD, SharedArrayBuffer
  • State layer: a custom CRDT (LWW-Map + Hybrid Logical Clock), IndexedDB, Service Worker
  • Agent layer: the Actor model (a TypeScript implementation equivalent to Erlang OTP's Supervisor Tree)
  • Sync layer (optional): a zero-trust E2EE sync relay (Cloudflare Workers/D1 + a WebAuthn-derived key)
  • BFF: Go (model routing, signed model distribution, policy verification only. No inference.)

Main Part ①: The Overall Architecture and the Principle of Responsibility Separation

First, the picture. Responsibility separation is precisely the source of this architecture's economic rationality.

                              ┌───────────────────────────────┐
                              │  The user's device             │
                              │  ─ M3/M4/Snapdragon X/Lunar Lake ─│
                              │                                 │
  Cloud (BFF / Relay)         │   ┌───────────────────────┐     │
  ─ Next.js 16 Server ─        │   │ Next.js 16 App Router │     │
  ─ Go (Routing/Policy)  ◄──RSC┼───┤  (RSC + Client)       │     │
                              │   └────────┬──────────────┘     │
                              │            │ Structured         │
                              │            ▼ Concurrency         │
                              │   ┌──────────────────────────┐   │
                              │   │ Agent Supervisor (Actor) │   │
                              │   │  ─ Planner / Critic /    │   │
                              │   │    Executor / Retriever  │   │
                              │   └───┬────────────┬─────────┘   │
                              │       │            │             │
                              │       ▼            ▼             │
                              │  ┌──────────┐ ┌────────────┐    │
                              │  │ Inference│ │  CRDT Store│    │
                              │  │  Engine  │ │ (LWW + HLC)│    │
                              │  │ (WebGPU) │ │            │    │
                              │  └────┬─────┘ └────┬───────┘    │
                              │       │            │             │
                              │  ┌────▼────┐ ┌─────▼──────┐    │
                              │  │ WGSL   │  │ IndexedDB  │    │
                              │  │ Kernels│  │ (encrypted)│    │
                              │  └────────┘  └─────┬──────┘    │
                              │                    │            │
                              └────────────────────┼────────────┘
                                                   │ E2EE sync
                                                   ▼
                                     ┌─────────────────────────┐
                                     │ Relay (Zero-knowledge)  │
                                     │ CFW / D1: ciphertext only│
                                     └─────────────────────────┘

Three deliberately disruptive design decisions are embedded in this diagram.

  1. The BFF does no inference. The BFF's responsibilities are only two: ① deciding which model to distribute to the device (device performance, model registry, A/B testing), and ② verifying the policy of which operations are permitted to cross over to the cloud. It keeps no permanent, expensive GPU instances.
  2. The server "doesn't know." The Relay is merely an E2EE transfer path for sync and holds no decryption key (zero-knowledge). Even under a server audit or breach, only ciphertext is exposed; plaintext cannot be recovered from the Relay alone.
  3. Agents cooperate inside the device. Inference, planning, criticism, and retrieval all complete locally. Privacy and economy are not a trade-off but a by-product of the same design decision.

Main Part ②: Why WebGPU — A Technology Selection Based on First Principles

To the natural question "couldn't we just do inference in WASM?", I answer with first principles.

A Comparison of WebGL2 / WASM-only / WebGPU

| Aspect | WebGL2 (abusing fragment shaders) | WASM + SIMD | WebGPU (compute shaders) |
|---|---|---|---|
| Compute model | Forcibly repurpose per-pixel fragment shaders | CPU SIMD 128bit | GPU compute, f16-native |
| Memory bandwidth | Via textures, with PCIe round-trips | L1–L3 dependent | GPU HBM/shared memory directly, hundreds of GB/s |
| Workgroup shared memory | None | None | Present (workgroup storage) |
| Quantization precision | fp32 only, weak int ops | int8/int4 possible | f16/i8/i32/subgroup shuffle |
| Parallelism | Thousands of threads per SM (approx.) | 4–8 lanes | Thousands to tens of thousands of threads |
| Battery efficiency | × (constant GPU drive) | | ◎ (minimal power per unit of computation) |

WebGPU is chosen here on power per unit of computation rather than on raw performance. In local inference, battery life is UX: an app where "heavy AI use drains the phone in an hour" is a commercial failure.

The Design Space of the Inference Pipeline

Transformer forward computation, reduced to its essence, is "a giant matrix-vector product (GEMV) repeated dozens of times." Taking the per-token op count as $\text{ops} \approx 2N$ ($N$ being the parameter count), a 7B model needs ≈ 14 GFLOP per token; with an M3 Pro GPU peak of ≈ 7 TFLOPS at fp16, the theoretical compute bound is $7 \times 10^{12} / (1.4 \times 10^{10}) = 500$ tok/s.

But memory bandwidth is the rate limiter. When the model weights are carried from DRAM to L2/L1, they consume battery, latency, and bandwidth all at once. So the design priority is:

  1. Quantize the weights to gain bandwidth (Q4_0, Q5_K, INT8).
  2. Keep the KV cache resident on the GPU (don't recompute per token).
  3. Fused ops: compress dequant + matmul + RMSNorm + SwiGLU into one kernel.
  4. Tile with workgroup shared memory (minimize DRAM→shared-memory round-trips).
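The gap between the compute bound and the bandwidth bound discussed above can be estimated with a back-of-the-envelope roofline calculation. This is a minimal sketch under stated assumptions: a memory bandwidth of 150 GB/s for the M3 Pro and 4.5 bits per weight for Q4_0 (18 bytes per 32 weights) are values assumed here, and the model ignores the KV cache and activations.

```typescript
interface DeviceSpec {
  peakFlops: number;               // FLOP per second
  memBandwidthBytesPerSec: number; // bytes per second (assumed figure)
}

interface ModelSpec {
  params: number;        // parameter count N
  bitsPerWeight: number; // effective storage cost per weight
}

function decodeBounds(model: ModelSpec, device: DeviceSpec) {
  // Compute bound: about 2N FLOP per generated token.
  const flopsPerToken = 2 * model.params;
  // Bandwidth bound: every weight is streamed from memory once per token.
  const bytesPerToken = (model.params * model.bitsPerWeight) / 8;
  const computeBoundTokPerSec = device.peakFlops / flopsPerToken;
  const bandwidthBoundTokPerSec = device.memBandwidthBytesPerSec / bytesPerToken;
  return {
    computeBoundTokPerSec,
    bandwidthBoundTokPerSec,
    expectedTokPerSec: Math.min(computeBoundTokPerSec, bandwidthBoundTokPerSec),
  };
}

const m3ProQ4 = decodeBounds(
  { params: 7e9, bitsPerWeight: 4.5 },
  { peakFlops: 7e12, memBandwidthBytesPerSec: 150e9 },
);
```

Under these assumptions the bandwidth bound comes out more than an order of magnitude below the compute bound, which is why quantization (fewer bytes per weight) heads the priority list.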

Main Part ③: Demonstration Code ① — A Fused Q4_0 MatVec Kernel Implemented in WGSL

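The following is a WebGPU compute shader that multiplies a Q4_0-quantized weight matrix by an fp16 input vector and reduces the partial sums in workgroup shared memory. A quantization block covers 32 elements and holds one fp16 scale plus 16 bytes of 4-bit values (18 bytes in total), the same block size as the Q4_0 format used for Llama-family models in llama.cpp. For simplicity the kernel pads each block to 20 bytes and assumes nibbles packed in element order, so weights exported in the GGML layout would need repacking before upload.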

// shaders/matvec_q4_0.wgsl
// Overview: y = W · x
//   W: [M, K], Q4_0-quantized (per 32-element block: one fp16 scale + 32 4-bit nibbles)
//   x: [K], fp16
//   y: [M], fp16
// Tiling: each workgroup handles one row along M cooperatively with 256 threads,
//         walking K with stride 256. Partial sums are tree-reduced in workgroup memory.
//
// Prerequisite: the device was created with the 'shader-f16' feature enabled.

enable f16;

struct Meta {
    M: u32,
    K: u32,              // K is a multiple of 32
    blocks_per_row: u32, // K / 32
};

// Named `params` because `meta` is a reserved word in WGSL.
@group(0) @binding(0) var<uniform> params : Meta;

// W is packed as [M, blocks_per_row]. A raw Q4_0 block is 18 bytes = 9 × u16.
// To read it as u32 words, each block is padded here to 20 bytes = 5 × u32
// (the upper 16 bits of the first word are unused).
struct Q4Block {
    scale : u32,    // lower 16bit: fp16 scale, upper 16bit: padding
    q0 : u32,       // 4bit × 8 = 32bit
    q1 : u32,
    q2 : u32,
    q3 : u32,       // 4bit × 8 = 32bit (32 elements stored in four u32 words)
};

@group(0) @binding(1) var<storage, read>       W_q : array<Q4Block>;
@group(0) @binding(2) var<storage, read>       X   : array<f16>;
@group(0) @binding(3) var<storage, read_write> Y   : array<f16>;

const WG_SIZE : u32 = 256u;
var<workgroup> partial : array<f32, WG_SIZE>;

// Convert a 4-bit nibble to a signed integer (-8..+7)
fn dequant_nibble(n: u32) -> f32 {
    // Q4_0 decodes as (nibble - 8) * scale; the scale is applied by the caller
    let s = i32(n) - 8;
    return f32(s);
}

@compute @workgroup_size(WG_SIZE)
fn main(
    @builtin(workgroup_id)        wg : vec3<u32>,
    @builtin(local_invocation_id) lid: vec3<u32>,
) {
    let row = wg.x;
    if (row >= params.M) { return; }

    var acc : f32 = 0.0;

    // Walk K with stride WG_SIZE. Each thread handles about K / WG_SIZE elements.
    var k : u32 = lid.x;
    loop {
        if (k >= params.K) { break; }

        // Block containing element k
        let blk_idx   = (row * params.blocks_per_row) + (k / 32u);
        let in_block  = k % 32u;
        let u32_idx   = in_block / 8u;        // 0..3
        let nibble_sh = (in_block % 8u) * 4u; // 0,4,8,...28

        let blk = W_q[blk_idx];
        // bitcast requires equal bit widths, so reinterpret the u32 as vec2<f16>;
        // .x is the lower 16 bits, which hold the scale.
        let scale = f32(bitcast<vec2<f16>>(blk.scale).x);

        // Select the one u32 word that holds this element instead of scanning all of them
        var q_word : u32;
        switch (u32_idx) {
            case 0u: { q_word = blk.q0; }
            case 1u: { q_word = blk.q1; }
            case 2u: { q_word = blk.q2; }
            default: { q_word = blk.q3; }
        }

        let nibble = (q_word >> nibble_sh) & 0xFu;
        let w = dequant_nibble(nibble) * scale;
        let x = f32(X[k]);
        acc = acc + w * x;

        k = k + WG_SIZE;
    }

    // Tree reduction within the workgroup
    partial[lid.x] = acc;
    workgroupBarrier();

    var stride : u32 = WG_SIZE / 2u;
    loop {
        if (stride == 0u) { break; }
        if (lid.x < stride) {
            partial[lid.x] = partial[lid.x] + partial[lid.x + stride];
        }
        workgroupBarrier();
        stride = stride / 2u;
    }

    if (lid.x == 0u) {
        Y[row] = f16(partial[0]);
    }
}
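To check a kernel like this, it helps to have a CPU reference that uses the same nibble indexing. The following is a minimal sketch, not the article's implementation: `quantizeBlock`, `dequantize`, and the `Q4BlockRef` shape are hypothetical names, the quantization rule follows the usual Q4_0 recipe (largest-magnitude value maps to -8), and the scale is kept as a plain JavaScript number rather than fp16, so results agree with the GPU only up to fp16 rounding.

```typescript
const BLOCK = 32;

interface Q4BlockRef {
  scale: number;      // plain number here; the GPU buffer stores it as fp16
  words: Uint32Array; // 4 words × 8 nibbles; element i sits in word i >> 3 at bit (i & 7) * 4
}

function quantizeBlock(values: Float32Array): Q4BlockRef {
  // Find the value with the largest magnitude, keeping its sign.
  let extreme = 0;
  for (let i = 0; i < BLOCK; i++) {
    const v = values[i] ?? 0;
    if (Math.abs(v) > Math.abs(extreme)) extreme = v;
  }
  const scale = extreme / -8;
  const inv = scale !== 0 ? 1 / scale : 0;
  const words = new Uint32Array(4);
  for (let i = 0; i < BLOCK; i++) {
    const q = Math.min(15, Math.max(0, Math.floor((values[i] ?? 0) * inv + 8.5)));
    const w = i >> 3;
    words[w] = (words[w] ?? 0) | (q << ((i & 7) * 4));
  }
  return { scale, words };
}

function dequantize(block: Q4BlockRef, i: number): number {
  const word = block.words[i >> 3] ?? 0;
  const nibble = (word >>> ((i & 7) * 4)) & 0xf;
  return (nibble - 8) * block.scale; // same rule as the shader: (nibble - 8) * scale
}

// Round-trip a simple ramp from -4 to 3.75 and measure the worst-case error.
const input = Float32Array.from({ length: BLOCK }, (_, i) => (i - 16) / 4);
const block = quantizeBlock(input);
let maxError = 0;
for (let i = 0; i < BLOCK; i++) {
  maxError = Math.max(maxError, Math.abs(dequantize(block, i) - (input[i] ?? 0)));
}
```

Comparing a row's dot product computed from `dequantize` against the value read back from `Y` gives a quick regression test for the shader.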

The Design Philosophy Packed into This Shader

  • An I/O structure premised on bandwidth being the rate limiter: W_q is read straight from a storage buffer and each weight is dequantized and multiplied on the spot, so no dequantized copy of W is ever written to memory. In this simple version every thread re-reads the block holding its element; a tuned kernel would have each thread consume a whole block to cut redundant loads.
  • Tree reduction: the standard technique of folding 256 threads' partial sums in a logarithmic number of stages (8 stages, since $\log_2 256 = 8$). It completes in $O(\log n)$ steps, and workgroupBarrier() makes the memory-visibility guarantee between stages explicit in the code.
  • Treating f16 as first-class: enable f16; is WGSL's explicit feature declaration. Unless the device was created with requiredFeatures: ['shader-f16'], creating the shader module fails with a validation error. This is the article's through-line in action: "reject illegal states before execution."
  • The logic of quantization choice: Q4_0 is the de facto format in the llama.cpp family. It needs roughly half the bandwidth of INT8 and is simpler to implement than asymmetric INT4 quantization. In distilled models the perplexity degradation is reportedly held to around 1%, which puts it near the Pareto-optimal point of bandwidth versus precision.
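The stage count of the tree reduction can be checked by simulating it on the CPU. This is a sketch for illustration only: `treeReduce` is a hypothetical helper, the inner loop stands in for the threads that run in parallel on the GPU, and the input length is assumed to be a power of two.

```typescript
function treeReduce(partial: readonly number[]): { sum: number; stages: number } {
  const buf = partial.slice();
  let stages = 0;
  // Each outer iteration corresponds to one barrier-separated stage in the shader.
  for (let stride = buf.length >> 1; stride > 0; stride >>= 1) {
    // On the GPU, every lid < stride performs this addition concurrently.
    for (let lid = 0; lid < stride; lid++) {
      buf[lid] = (buf[lid] ?? 0) + (buf[lid + stride] ?? 0);
    }
    stages++;
  }
  return { sum: buf[0] ?? 0, stages };
}

// 256 partial sums of 1 fold into a total of 256 in 8 stages.
const reduced = treeReduce(new Array<number>(256).fill(1));
```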

Dispatch from the Host Side (a Type-Safe Wrapper)

// src/inference/webgpu/runtime.ts
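// Express the WebGPU device's "ready" state as a phantom (branded) type.
// Accidents such as calling createBuffer on an uninitialized device then disappear at compile time.

declare const __GpuReadyBrand: unique symbol;
export type ReadyDevice = GPUDevice & { readonly [__GpuReadyBrand]: true };

// Minimal device-loss event bus (illustrative shape; swap in the app's own emitter).
type GpuLostEvent = { readonly reason: string; readonly message: string };
const gpuLostListeners = new Set<(e: GpuLostEvent) => void>();
export const gpuLostBus = {
  subscribe(fn: (e: GpuLostEvent) => void): () => void {
    gpuLostListeners.add(fn);
    return () => gpuLostListeners.delete(fn);
  },
  emit(e: GpuLostEvent): void {
    for (const fn of gpuLostListeners) fn(e);
  },
};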

export interface GpuCapabilities {
  readonly f16: boolean;
  readonly maxComputeWorkgroupSizeX: number;
  readonly maxStorageBufferBindingSize: number;
}

export async function acquireDevice(): Promise<{
  device: ReadyDevice;
  caps: GpuCapabilities;
}> {
  if (!("gpu" in navigator)) throw new Error("WebGPU unavailable");
  const adapter = await navigator.gpu.requestAdapter({ powerPreference: "high-performance" });
  if (!adapter) throw new Error("No WebGPU adapter");

  const f16 = adapter.features.has("shader-f16");
  if (!f16) throw new Error("shader-f16 required for Q4 kernels");

  const device = await adapter.requestDevice({
    requiredFeatures: ["shader-f16"],
    requiredLimits: {
      maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
      maxComputeWorkgroupStorageSize: 16384,
    },
  });

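  device.lost.then((info) => {
    // GPU device loss (resume from sleep on Windows, VRAM pressure from other apps, etc.) happens routinely.
    // Discard all buffers and pipelines here and send a restart notification to the Supervisor above.
    gpuLostBus.emit({ reason: info.reason, message: info.message });
  });

  // Report the limits actually granted to the device rather than the adapter's maximums:
  // limits that were not requested stay at their defaults.
  const caps: GpuCapabilities = {
    f16,
    maxComputeWorkgroupSizeX: device.limits.maxComputeWorkgroupSizeX,
    maxStorageBufferBindingSize: device.limits.maxStorageBufferBindingSize,
  };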
  return { device: device as ReadyDevice, caps };
}

// Enforce at the type level that a kernel can only be called with buffers matching its binding layout
export interface KernelBinding<Name extends string, Mode extends "uniform" | "storage" | "storage-rw"> {
  readonly name: Name;
  readonly mode: Mode;
  readonly buffer: GPUBuffer;
}

export interface Kernel<Bindings extends readonly KernelBinding<string, "uniform" | "storage" | "storage-rw">[]> {
  dispatch(device: ReadyDevice, bindings: Bindings, workgroups: readonly [number, number, number]): void;
}

The intent of the phantom type ReadyDevice is to encode the business rule "a GPUDevice only has meaning after going through requestDevice" in the type system. Code handling an unprepared device can only obtain a bare GPUDevice, not a ReadyDevice, and since the API requires a ReadyDevice, a domain-rule violation surfaces directly as a compile error (TS2345 when passed as an argument, TS2322 on assignment). This is the WebGPU version of Scott Wlaschin's "Making Illegal States Unrepresentable."
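The branding pattern can be tried without a GPU. The following is a minimal sketch with a stand-in device type; `Ready`, `FakeDevice`, `initialize`, and `allocate` are hypothetical names chosen for illustration, not part of WebGPU.

```typescript
declare const readyBrand: unique symbol;
type Ready<T> = T & { readonly [readyBrand]: true };

interface FakeDevice {
  readonly label: string;
}

// The only place that is allowed to attach the brand.
function initialize(raw: FakeDevice): Ready<FakeDevice> {
  return raw as Ready<FakeDevice>;
}

// APIs downstream demand the branded type.
function allocate(device: Ready<FakeDevice>, bytes: number): string {
  return `${device.label}: ${bytes} bytes`;
}

const rawDevice: FakeDevice = { label: "gpu0" };

// @ts-expect-error a bare FakeDevice lacks the brand, so this call does not type-check
allocate(rawDevice, 1024);

const readyDevice = initialize(rawDevice);
const summary = allocate(readyDevice, 1024);
```

The brand exists only in the type system and is erased at runtime, so the guarantee costs nothing in the shipped bundle; the trade-off is that a deliberate cast can still forge it.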


Main Part ④: Demonstration Code ② — An Agent Supervisor via DDD and the Actor Model

When multiple AI agents cooperate in the browser, how you contain cooperation failures (deadlocks, races, runaway retries, budget overruns) is a matter of life or death. The reference point here is the decades of wisdom accumulated in Erlang/OTP.

Domain Modeling: Express the Agent's State as a Sum Type

// src/agent/domain.ts
import { z } from "zod";

// Branded IDs: IDs of different kinds cannot be mixed up.
// Passing an AgentId where a TaskId is expected stops compilation.
export type Brand<T, B extends string> = T & { readonly __brand: B };
export type AgentId = Brand<string, "AgentId">;
export type TaskId  = Brand<string, "TaskId">;
export type TraceId = Brand<string, "TraceId">;

// Agent roles (designed as a closed set)
export type Role =
  | "planner"     // goal decomposition
  | "retriever"   // local vector search
  | "executor"    // tool calls / inference
  | "critic"      // result verification / hallucination detection
  | "memorian";   // long-term memory (write-back to the CRDT)

// Budget (bringing the smallest unit of FinOps into the domain).
// "Usable power", "usable tokens", and "usable time" are treated explicitly as resources.
export interface Budget {
  readonly wallClockMs: number;
  readonly tokens: number;
  readonly joules: number; // allowance estimated from the remaining battery charge
}

// Agent state: an algebraic data type guarantees exhaustiveness
export type AgentState<TResult> =
  | { readonly kind: "idle" }
  | { readonly kind: "planning";   plan: Plan; since: number }
  | { readonly kind: "awaiting";   toolCall: ToolCall; since: number }
  | { readonly kind: "reflecting"; draft: TResult; critiques: readonly Critique[] }
  | { readonly kind: "done";       result: TResult; tokensUsed: number }
  | { readonly kind: "failed";     error: AgentError; retriable: boolean };

export interface Plan {
  readonly steps: readonly PlanStep[];
  readonly createdAt: number;
}

export type PlanStep =
  | { readonly kind: "retrieve"; query: string }
  | { readonly kind: "infer";    prompt: string; maxTokens: number }
  | { readonly kind: "tool";     name: string; args: Readonly<Record<string, unknown>> };

export interface ToolCall {
  readonly id: TaskId;
  readonly name: string;
  readonly args: Readonly<Record<string, unknown>>;
}

export interface Critique {
  readonly reason: string;
  readonly severity: "minor" | "major" | "blocking";
}

export type AgentError =
  | { readonly kind: "budget_exhausted"; resource: keyof Budget }
  | { readonly kind: "tool_denied";      policy: string }
  | { readonly kind: "gpu_lost" }
  | { readonly kind: "inference_failed"; cause: string }
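  | { readonly kind: "poisoned";         score: number }; // adversarial input detected

// Constrain the legality of transitions: an event that is not legal in the current state leaves the state unchanged.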
export function transition<TResult>(
  from: AgentState<TResult>,
  event: AgentEvent<TResult>
): AgentState<TResult> {
  switch (from.kind) {
    case "idle":
      if (event.kind === "start") return { kind: "planning", plan: event.plan, since: event.at };
      return from;
    case "planning":
      if (event.kind === "call_tool")
        return { kind: "awaiting", toolCall: event.call, since: event.at };
      if (event.kind === "produced")
        return { kind: "reflecting", draft: event.draft, critiques: [] };
      if (event.kind === "fail")
        return { kind: "failed", error: event.error, retriable: event.retriable };
      return from;
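    case "awaiting":
      if (event.kind === "tool_result")
        return { kind: "reflecting", draft: event.draft, critiques: [] };
      // A denied or failed tool call must not leave the agent waiting forever.
      if (event.kind === "fail")
        return { kind: "failed", error: event.error, retriable: event.retriable };
      return from;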
    case "reflecting":
      if (event.kind === "approve")
        return { kind: "done", result: from.draft, tokensUsed: event.tokens };
      if (event.kind === "critique") {
        const next: AgentState<TResult> = {
          kind: "reflecting",
          draft: event.revision ?? from.draft,
          critiques: [...from.critiques, event.critique],
        };
        // Fail once three or more blocking critiques have accumulated
        const blocking = next.critiques.filter((c) => c.severity === "blocking").length;
        if (blocking >= 3) {
          return { kind: "failed", error: { kind: "inference_failed", cause: "too many critiques" }, retriable: false };
        }
        return next;
      }
      return from;
    case "done":
    case "failed":
      return from; // terminal states never change
  }
}

export type AgentEvent<TResult> =
  | { kind: "start";      plan: Plan; at: number }
  | { kind: "call_tool";  call: ToolCall; at: number }
  | { kind: "tool_result"; draft: TResult }
  | { kind: "produced";   draft: TResult }
  | { kind: "approve";    tokens: number }
  | { kind: "critique";   critique: Critique; revision?: TResult }
  | { kind: "fail";       error: AgentError; retriable: boolean };

Key design points

  • AgentState is closed as a Sum type (algebraic data type), so add a new state and transition's switch becomes a compile error (enforceable via TS 5's satisfies + a never check). This is the classic "exhaustiveness check," but many AI-workflow implementations neglect it, and "unexpected states" pile up in production.
  • Bringing joules into Budget is a resource concept that only makes sense for on-device inference. The agent's power consumption is estimated from navigator.getBattery() (where the Battery Status API is available, currently Chromium-based browsers) and a device-performance coefficient, and a plan exceeding the remaining charge is rejected at the Planner stage. This is true FinOps.
  • Nominal typing via Brand<T, B>. Even if AgentId and TaskId are the same string, the compiler detects a mix-up. Scott Wlaschin style.
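The exhaustiveness check mentioned in the first point can be made explicit with a `never`-typed helper. This is a minimal sketch on a cut-down state type; `MiniState`, `assertNever`, and `describeState` are hypothetical names, and the real AgentState has more variants.

```typescript
type MiniState =
  | { kind: "idle" }
  | { kind: "planning"; since: number }
  | { kind: "done"; tokensUsed: number };

// Reaching this function means a variant was not handled. Because the parameter
// is typed `never`, adding a new variant to MiniState turns the call below into
// a compile error until the switch is updated.
function assertNever(x: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(x)}`);
}

function describeState(s: MiniState): string {
  switch (s.kind) {
    case "idle":
      return "waiting for work";
    case "planning":
      return `planning since ${s.since}`;
    case "done":
      return `finished using ${s.tokensUsed} tokens`;
    default:
      return assertNever(s);
  }
}

const doneLabel = describeState({ kind: "done", tokensUsed: 42 });
```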

Actor: Mailbox + Supervisor Tree

We bring Erlang OTP's Supervisor into TypeScript with structured concurrency and cancellable Promises.

// src/agent/actor.ts
import type { AgentId, TraceId } from "./domain";

export interface Envelope<M> {
  readonly to: AgentId;
  readonly from: AgentId;
  readonly trace: TraceId;
  readonly msg: M;
  readonly deadline: number; // epoch ms; dropped once past
}

// Bounded mailbox: backpressure is expressed in the API (trySend returns a boolean)
export class Mailbox<M> {
  private buf: Envelope<M>[] = [];
  private waiters: Array<(e: Envelope<M>) => void> = [];
  constructor(private readonly capacity: number) {}

  /** Returns `false` when full. The caller decides its retry strategy at the planning stage. */
  trySend(e: Envelope<M>): boolean {
    if (this.buf.length >= this.capacity) return false;
    const w = this.waiters.shift();
    if (w) { w(e); return true; }
    this.buf.push(e);
    return true;
  }

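  async receive(signal: AbortSignal): Promise<Envelope<M>> {
    // An already-aborted signal never fires "abort" again, so check it up front.
    if (signal.aborted) throw new DOMException("aborted", "AbortError");
    const now = Date.now();
    // Drop expired messages from the head of the queue
    while (this.buf.length && this.buf[0]!.deadline < now) this.buf.shift();
    const head = this.buf.shift();
    if (head) return head;

    return new Promise<Envelope<M>>((resolve, reject) => {
      const waiter = (e: Envelope<M>) => {
        signal.removeEventListener("abort", onAbort); // avoid leaking one listener per receive
        resolve(e);
      };
      const onAbort = () => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new DOMException("aborted", "AbortError"));
      };
      signal.addEventListener("abort", onAbort, { once: true });
      this.waiters.push(waiter);
    });
  }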
}

export type RestartStrategy = "one-for-one" | "one-for-all" | "rest-for-one";

export interface Actor<M, R> {
  readonly id: AgentId;
  run(inbox: Mailbox<M>, signal: AbortSignal): Promise<R>;
}

interface ChildSpec<M, R> {
  readonly actor: Actor<M, R>;
  readonly maxRestarts: number;
  readonly withinMs: number;
}

export class Supervisor {
  private controllers = new Map<AgentId, AbortController>();
  private inboxes     = new Map<AgentId, Mailbox<unknown>>();
  private restartLog  = new Map<AgentId, number[]>();
  // Child specs in start order (Map keeps insertion order), needed to restart siblings.
  private specs       = new Map<AgentId, ChildSpec<any, any>>();

  constructor(private readonly strategy: RestartStrategy) {}

  // Start a child. The type parameters carry each actor's message type.
  spawn<M, R>(spec: ChildSpec<M, R>): void {
    const ctrl = new AbortController();
    const inbox = new Mailbox<M>(/* cap */ 256);

    this.specs.set(spec.actor.id, spec);
    this.controllers.set(spec.actor.id, ctrl);
    this.inboxes.set(spec.actor.id, inbox as Mailbox<unknown>);

    const loop = async () => {
      try {
        await spec.actor.run(inbox, ctrl.signal);
      } catch (err) {
        if (ctrl.signal.aborted) return; // stopped on purpose, not a crash
        if (this.shouldRestart(spec)) {
          this.restart(spec.actor.id);
        } else {
          // Restart budget exhausted: stop every child.
          // A parent Supervisor, if present, would be notified here (escalation).
          this.abortAll(err);
        }
      }
    };
    void loop();
  }

  private shouldRestart<M, R>(spec: ChildSpec<M, R>): boolean {
    const log = this.restartLog.get(spec.actor.id) ?? [];
    const now = Date.now();
    const recent = log.filter((t) => now - t < spec.withinMs);
    recent.push(now);
    this.restartLog.set(spec.actor.id, recent);
    return recent.length <= spec.maxRestarts;
  }

  /** Restart the crashed child plus the siblings selected by the strategy. */
  private restart(crashed: AgentId): void {
    const ids = [...this.specs.keys()];
    const targets =
      this.strategy === "one-for-one" ? [crashed]
      : this.strategy === "one-for-all" ? ids
      : ids.slice(ids.indexOf(crashed)); // rest-for-one: the crashed child and those started after it
    for (const id of targets) this.controllers.get(id)?.abort(new Error(`${crashed} crashed`));
    for (const id of targets) this.spawn(this.specs.get(id)!);
  }

  private abortAll(reason: unknown): void {
    for (const c of this.controllers.values()) c.abort(reason);
  }

  /** Send to a child's mailbox. The cast below is unchecked: the compiler does not verify that M matches the target actor's message type. */
  send<M>(to: AgentId, env: Envelope<M>): boolean {
    const ib = this.inboxes.get(to) as Mailbox<M> | undefined;
    if (!ib) return false;
    return ib.trySend(env);
  }
}

Design philosophy

  • Let it crash: design on the premise that agents break. Don't mix recover logic into business code; centralize it in the Supervisor. Faithfully follows Erlang/OTP's 40 years of lessons.
  • Bounded Mailbox: an infinite buffer produces OOM and priority inversion. 256 is intentionally small, and when the sender receives false, it decides the retry strategy at the planning stage. This is true backpressure.
  • Structured Concurrency: an AbortSignal governs each actor's lifetime. When the Supervisor aborts, its children stop with it. The structure closely mirrors Go's goroutine + context.Context.
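The backpressure contract of a bounded mailbox can be shown in isolation. This is a minimal sketch, not the Mailbox class above: `BoundedQueue` and `offer` are hypothetical names, and deferring rejected items to a local list is just one possible sender policy.

```typescript
class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  /** Returns false instead of growing without bound. */
  trySend(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  receive(): T | undefined {
    return this.items.shift();
  }
}

type SendOutcome = "delivered" | "deferred";

// Sender-side policy (assumed): keep rejected work locally and retry it later,
// rather than letting the receiver's queue absorb the overload.
function offer<T>(queue: BoundedQueue<T>, item: T, deferred: T[]): SendOutcome {
  if (queue.trySend(item)) return "delivered";
  deferred.push(item);
  return "deferred";
}

const queue = new BoundedQueue<string>(2);
const deferred: string[] = [];
const outcomes = ["plan", "retrieve", "infer"].map((m) => offer(queue, m, deferred));

// Once the receiver drains a slot, the deferred item fits again.
queue.receive();
const retried = offer(queue, deferred.shift() ?? "", []);
```

The point of the boolean return is that overload becomes a decision the sender must make, instead of a memory problem the receiver discovers later.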

Main Part ⑤: Demonstration Code ③ — HLC (Hybrid Logical Clock) and the LWW-Map CRDT

In a world where locally-created state syncs between devices while offline, the judgment of "which write wins" is unavoidable. A simple physical clock breaks down with clock skew, and a pure vector clock bloats in proportion to the number of devices and no longer fits in an HTTP header. The Hybrid Logical Clock (HLC, Kulkarni et al. 2014) kills both weaknesses, approximating both causal order and human time as a "logical clock bound to physical time."

The HLC Implementation

// src/crdt/hlc.ts
export interface HLC {
  readonly l: number; // logical time (milliseconds, tracks physical time)
  readonly c: number; // counter
  readonly nodeId: string; // tiebreaker
}

const MAX_DRIFT_MS = 60_000;

export function hlcNow(prev: HLC, pt: number = Date.now(), nodeId = prev.nodeId): HLC {
  const l = Math.max(prev.l, pt);
  const c = l === prev.l ? prev.c + 1 : 0;
  assertDrift(l, pt);
  return { l, c, nodeId };
}

export function hlcRecv(local: HLC, remote: HLC, pt: number = Date.now()): HLC {
  const l = Math.max(local.l, remote.l, pt);
  let c: number;
  if (l === local.l && l === remote.l) c = Math.max(local.c, remote.c) + 1;
  else if (l === local.l) c = local.c + 1;
  else if (l === remote.l) c = remote.c + 1;
  else c = 0;
  assertDrift(l, pt);
  return { l, c, nodeId: local.nodeId };
}

function assertDrift(l: number, pt: number): void {
  if (l - pt > MAX_DRIFT_MS) {
    // A bogus future timestamp, possibly from a malicious peer. Reject it on receipt (zero trust).
    throw new Error("HLC drift exceeds threshold: possibly malicious clock");
  }
}

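export function hlcCompare(a: HLC, b: HLC): number {
  if (a.l !== b.l) return a.l - b.l;
  if (a.c !== b.c) return a.c - b.c;
  // Deterministic tiebreaker: compare by UTF-16 code units. localeCompare depends on the
  // runtime's locale data and could order the same IDs differently on different devices.
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}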

Why HLC: it is a pragmatic way to get both the logical clock's "causality preservation" and the physical clock's "human readability." Furthermore, the MAX_DRIFT_MS check rejects the attack where a hostile peer sends a far-future time such as l = 9999999999999 to win every last-writer-wins comparison. The shift in thinking that distributed clocks, too, are subject to zero trust is essential.
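The benefit over a plain wall clock shows up when one device's clock runs behind. The following is a compact, self-contained sketch of the same update rules without the drift check; `Stamp`, `tick`, `merge`, and `compareStamps` are hypothetical names, and the clock readings are made up for the scenario.

```typescript
interface Stamp {
  l: number;
  c: number;
  node: string;
}

// Local event: advance to the physical time, or bump the counter if it has not moved.
function tick(prev: Stamp, physicalMs: number): Stamp {
  const l = Math.max(prev.l, physicalMs);
  return { l, c: l === prev.l ? prev.c + 1 : 0, node: prev.node };
}

// Receive event: take the maximum of local, remote, and physical time.
function merge(local: Stamp, remote: Stamp, physicalMs: number): Stamp {
  const l = Math.max(local.l, remote.l, physicalMs);
  let c = 0;
  if (l === local.l && l === remote.l) c = Math.max(local.c, remote.c) + 1;
  else if (l === local.l) c = local.c + 1;
  else if (l === remote.l) c = remote.c + 1;
  return { l, c, node: local.node };
}

function compareStamps(a: Stamp, b: Stamp): number {
  if (a.l !== b.l) return a.l - b.l;
  if (a.c !== b.c) return a.c - b.c;
  return a.node < b.node ? -1 : a.node > b.node ? 1 : 0;
}

// Device B's wall clock runs 5 seconds behind device A's.
const aWrite = tick({ l: 0, c: 0, node: "A" }, 10_000);       // A writes at its time 10_000
const bSeen = merge({ l: 0, c: 0, node: "B" }, aWrite, 5_000); // B receives it at its time 5_000
const bWrite = tick(bSeen, 5_001);                             // B then edits the same key

// By wall clock alone, B's edit (5_001) would lose to A's (10_000) even though
// B had already seen A's write. With HLC, B's causally later edit wins.
const hlcSaysBWins = compareStamps(bWrite, aWrite) > 0;
```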

The LWW-Map CRDT Implementation

// src/crdt/lwwMap.ts
import { HLC, hlcCompare, hlcNow, hlcRecv } from "./hlc";

export interface Cell<V> {
  readonly v: V | null; // null marks a tombstone (a deletion)
  readonly ts: HLC;
}

// Internal Op format. The application layer exchanges only Ops.
export type Op<V> = { k: string; v: V | null; ts: HLC };

export class LWWMap<V> {
  private cells = new Map<string, Cell<V>>();
  private clock: HLC;

  constructor(nodeId: string) {
    this.clock = { l: 0, c: 0, nodeId };
  }

  get(k: string): V | null {
    const c = this.cells.get(k);
    return c ? c.v : null;
  }

  /** Local write. Returns the Op as a by-product, to be handed to the sync layer. */
  set(k: string, v: V | null): Op<V> {
    this.clock = hlcNow(this.clock);
    const cell: Cell<V> = { v, ts: this.clock };
    this.cells.set(k, cell);
    return { k, v, ts: this.clock };
  }

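  /** Receive a remote Op. The merge is associative, commutative, and idempotent (ACI). */
  apply(op: Op<V>): void {
    this.clock = hlcRecv(this.clock, op.ts);
    const cur = this.cells.get(op.k);
    if (!cur || hlcCompare(op.ts, cur.ts) > 0) {
      this.cells.set(op.k, { v: op.v, ts: op.ts });
    }
    // Ops that lose the comparison are silently ignored (idempotent)
  }

  /** Strong Eventual Consistency (SEC) guarantee:
   *  any replicas that apply the same set of Ops, in any order, converge to the same state.
   *  Proof sketch: apply is ACI (next section), so the result is uniquely determined as the lub (least upper bound). */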
}

Verifying the Mathematical Properties

Many blogs skip this, but the correctness of a CRDT reduces to a proof of algebraic properties.

  • Commutativity: apply(a).apply(b) = apply(b).apply(a). As long as hlcCompare is a total order over Op timestamps (which requires each replica to use a unique nodeId), the final cells.get(k) is the same whichever Op is applied first.
  • Associativity: (a∘b)∘c = a∘(b∘c). Per key, apply keeps the maximum timestamp under hlcCompare, and max is associative.
  • Idempotence: apply(a).apply(a) = apply(a). On the second delivery hlcCompare(op.ts, cur.ts) is 0, which is not greater than 0, so the Op is skipped and the cell is unchanged.

As a mathematical consequence of these three properties (ACI), under any delivery order and any duplicate delivery, all replicas that have received the same set of Ops hold the same state (Shapiro et al. 2011, Strong Eventual Consistency). In other words, this is the rigorous foundation of the design principle that the server need not hold a central source of truth.
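Convergence can also be checked by brute force on a small example: apply the same Ops, including a duplicate, in every possible order and confirm that only one final state appears. This is a self-contained sketch of the merge rule alone (no clock updates); `SimpleOp`, `applyAll`, `permutations`, and `snapshot` are hypothetical names.

```typescript
interface Ts { l: number; c: number; node: string }
interface SimpleOp { k: string; v: string | null; ts: Ts }

function cmp(a: Ts, b: Ts): number {
  if (a.l !== b.l) return a.l - b.l;
  if (a.c !== b.c) return a.c - b.c;
  return a.node < b.node ? -1 : a.node > b.node ? 1 : 0;
}

// Last-writer-wins merge: keep the Op with the greatest timestamp per key.
function applyAll(ops: readonly SimpleOp[]): Map<string, SimpleOp> {
  const cells = new Map<string, SimpleOp>();
  for (const op of ops) {
    const cur = cells.get(op.k);
    if (!cur || cmp(op.ts, cur.ts) > 0) cells.set(op.k, op);
  }
  return cells;
}

function permutations<T>(xs: readonly T[]): T[][] {
  if (xs.length <= 1) return [xs.slice()];
  return xs.flatMap((x, i) =>
    permutations([...xs.slice(0, i), ...xs.slice(i + 1)]).map((rest) => [x, ...rest]),
  );
}

// Canonical form of the visible state: key/value pairs sorted by key.
function snapshot(cells: Map<string, SimpleOp>): string {
  const pairs = [...cells.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, op]) => [k, op.v]);
  return JSON.stringify(pairs);
}

const ops: SimpleOp[] = [
  { k: "title", v: "draft", ts: { l: 100, c: 0, node: "A" } },
  { k: "title", v: "final", ts: { l: 100, c: 1, node: "B" } },
  { k: "title", v: null,    ts: { l: 90,  c: 0, node: "C" } },
  { k: "body",  v: "hello", ts: { l: 100, c: 0, node: "B" } },
];

// Four Ops plus one duplicate delivery: 120 delivery orders in total.
const deliveries = permutations([...ops, ops[1] as SimpleOp]);
const distinctOutcomes = new Set(deliveries.map((order) => snapshot(applyAll(order))));
```

A check like this is not a proof, but it is a cheap regression test that catches a non-total comparison or a merge rule that depends on arrival order.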


Main Part ⑥: Demonstration Code ④ — A Zero-Trust E2EE Sync Layer

We design a Relay that "holds no data but bears the transfer." The core is encryption with a device key derived from WebAuthn, pair-key agreement via elliptic-curve Diffie-Hellman, and confidentiality + integrity via AEAD (AES-GCM).

// src/sync/e2ee.ts
// Derive the device key from the WebAuthn PRF extension (Level 3).
// The derived key cannot be obtained unless the user authenticates again, so it stays safe even if the device is lost.

export interface DerivedKey {
  readonly aesKey: CryptoKey;    // AES-GCM 256
  readonly kid: string;           // key id (unique per user)
}

export async function deriveDeviceKey(
  credentialId: Uint8Array,
  salt: Uint8Array, // per-user salt
): Promise<DerivedKey> {
  const assertion = await navigator.credentials.get({
    publicKey: {
      challenge: crypto.getRandomValues(new Uint8Array(32)),
      allowCredentials: [{ id: credentialId, type: "public-key" }],
      userVerification: "required",
      extensions: { prf: { eval: { first: salt } } } as AuthenticationExtensionsClientInputs,
    },
  }) as PublicKeyCredential | null;
  if (!assertion) throw new Error("auth failed");

  const exts = assertion.getClientExtensionResults() as AuthenticationExtensionsClientOutputs & {
    prf?: { results?: { first?: ArrayBuffer } };
  };
  const prf = exts.prf?.results?.first;
  if (!prf) throw new Error("PRF not supported; enforce hardware key");

  const keyMaterial = await crypto.subtle.importKey("raw", prf, "HKDF", false, ["deriveKey"]);
  const aesKey = await crypto.subtle.deriveKey(
    { name: "HKDF", hash: "SHA-256", salt, info: new TextEncoder().encode("sync-aes-256-gcm") },
    keyMaterial,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"],
  );
  const kid = await sha256Hex(prf);
  return { aesKey, kid };
}

export interface EncryptedEnvelope {
  readonly kid: string;
  readonly iv: string;  // base64url
  readonly ct: string;  // base64url (ciphertext || tag)
  readonly aad?: string;
}

export async function sealOp(
  key: DerivedKey,
  op: unknown, // an Op or any JSON-serializable value
  aad?: string,
): Promise<EncryptedEnvelope> {
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const plain = new TextEncoder().encode(JSON.stringify(op));
  const additional = aad ? new TextEncoder().encode(aad) : undefined;
  const cipher = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv, additionalData: additional },
    key.aesKey,
    plain,
  );
  return {
    kid: key.kid,
    iv: b64u(iv),
    ct: b64u(new Uint8Array(cipher)),
    aad,
  };
}

export async function openOp(
  key: DerivedKey,
  env: EncryptedEnvelope,
): Promise<unknown> {
  if (env.kid !== key.kid) throw new Error("key id mismatch");
  const iv = b64uDecode(env.iv);
  const ct = b64uDecode(env.ct);
  const aad = env.aad ? new TextEncoder().encode(env.aad) : undefined;
  const plain = await crypto.subtle.decrypt(
    { name: "AES-GCM", iv, additionalData: aad },
    key.aesKey,
    ct,
  );
  return JSON.parse(new TextDecoder().decode(plain));
}

async function sha256Hex(buf: BufferSource): Promise<string> {
  const h = new Uint8Array(await crypto.subtle.digest("SHA-256", buf));
  return Array.from(h, (b) => b.toString(16).padStart(2, "0")).join("");
}

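function b64u(b: Uint8Array): string {
  // Build the binary string in a loop: spreading a large array into
  // String.fromCharCode(...) can exceed the engine's argument limit.
  let bin = "";
  for (let i = 0; i < b.length; i++) bin += String.fromCharCode(b[i]);
  return btoa(bin).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}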
function b64uDecode(s: string): Uint8Array {
  const pad = "=".repeat((4 - (s.length % 4)) % 4);
  const b = atob(s.replace(/-/g, "+").replace(/_/g, "/") + pad);
  return Uint8Array.from(b, (ch) => ch.charCodeAt(0));
}

Key design points

  • The WebAuthn PRF extension: the device key is stored neither in the OS keychain nor in LocalStorage. It is re-derived on demand by evaluating a PRF inside the authenticator (Secure Enclave / TPM), which makes a device-cloning attack structurally difficult.
  • Putting kid + op.k (the CRDT key) in the AAD (Additional Authenticated Data) lets AEAD tag verification reject a ciphertext-swap attack (substituting a ciphertext that was sealed for someone else or for a different key), as long as the receiver compares the AAD with the value it expects rather than trusting the envelope's aad field blindly.
  • The Relay knows nothing of these functions. It merely stores and relays the EncryptedEnvelope like an S3-style key-value store. On a server breach no plaintext leaks; only metadata such as the tenant id, envelope sizes, and timing is visible. This is what zero-knowledge design means here.
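The swap protection described above depends on the receiver deriving the expected AAD itself. The following is a minimal sketch of that check, assuming hypothetical helpers buildAad and assertExpectedAad (neither is part of the e2ee.ts listing) and a JSON-array encoding for the kid + op.k pair.

```typescript
// Hypothetical helpers; only the "kid + op.k in the AAD" idea comes from the article.
interface EnvelopeHeader {
  readonly kid: string;
  readonly aad?: string;
}

// Encode the pair as a JSON array so field boundaries stay unambiguous
// (plain concatenation would make "a" + "bc" collide with "ab" + "c").
function buildAad(kid: string, crdtKey: string): string {
  return JSON.stringify([kid, crdtKey]);
}

// Reject an envelope whose AAD is not the one expected for this key and CRDT key.
// AES-GCM proves the ciphertext was sealed under env.aad; this check proves
// env.aad is the context the receiver actually asked for.
function assertExpectedAad(env: EnvelopeHeader, kid: string, crdtKey: string): void {
  if (env.aad !== buildAad(kid, crdtKey)) {
    throw new Error("aad mismatch: envelope does not belong to this key/slot");
  }
}

// Example: an envelope sealed for "doc/title" is presented where "doc/body" is expected.
const header: EnvelopeHeader = { kid: "k1", aad: buildAad("k1", "doc/title") };

let swapRejected = false;
try {
  assertExpectedAad(header, "k1", "doc/body");
} catch {
  swapRejected = true;
}
```

If the expected CRDT key is only known after decryption, the same comparison can run right after openOp, using the k field of the decrypted Op.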

The Relay's (Cloudflare Workers, etc.) Responsibility

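// relay/worker.ts (Cloudflare Workers)
// The server never holds a key, so it cannot decrypt the envelopes it stores.
// It offers only the operations it can safely guarantee: store and list.
// verifyAttestationJWT is assumed to be defined elsewhere in the project.
export default {
  async fetch(req: Request, env: { RELAY: KVNamespace }): Promise<Response> {
    const u = new URL(req.url);
    const tenant = u.searchParams.get("t");
    if (!tenant) return new Response("tenant required", { status: 400 });

    // Only verify the WebAuthn-signed JWT (the key is the user device's public key).
    // The verifier should also confirm that the token is bound to this tenant.
    const ok = await verifyAttestationJWT(req.headers.get("Authorization"));
    if (!ok) return new Response("unauthorized", { status: 401 });

    if (req.method === "POST") {
      const body = await req.text(); // JSON of an EncryptedEnvelope
      // Never read the content. Only size and rate limits apply (rate limiting is not shown here).
      if (body.length > 64 * 1024) return new Response("too large", { status: 413 });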
      await env.RELAY.put(`${tenant}/${crypto.randomUUID()}`, body, { expirationTtl: 30 * 86400 });
      return new Response(null, { status: 204 });
    }

    if (req.method === "GET") {
      // KV list() returns at most 1,000 keys per call; follow list.cursor for larger tenants.
      const list = await env.RELAY.list({ prefix: `${tenant}/` });
      const envelopes = await Promise.all(list.keys.map((k) => env.RELAY.get(k.name)));
      return Response.json(envelopes.filter(Boolean));
    }
    return new Response("method not allowed", { status: 405 });
  },
};

The crux of this Relay is that reading the code makes it plain that a privacy breakdown cannot originate here: no decryption function is imported, and no key ever reaches the Worker. It is not a formal proof, but it is a property that anyone can audit by inspection.


Main Part ⑦: Integration with the Next.js 16 App Router — The Philosophy of the RSC Boundary and the Client Boundary

In the Next.js 16 App Router, the most important structural decision in this architecture is the separation of responsibilities: Server Components handle only "model distribution and policy decisions," and inference is pushed entirely to the Client.

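// app/ai/page.tsx  (RSC; rendered on the server)
import { selectModelForDevice, type ModelDescriptor } from "@/lib/model-registry";
import { AgentRoot } from "@/components/agent/AgentRoot";
// Assumed helper (not shown in this article): reads the client-hint request headers
// through headers() from "next/headers" and returns what selectModelForDevice expects.
import { readClientHints } from "@/lib/client-hints";

export const dynamic = "force-dynamic";

// A page receives only params and searchParams as props (both are Promises in Next.js 16).
// Request headers are read inside readClientHints, not passed in as a prop.
export default async function AIPage(_props: {
  params: Promise<Record<string, string>>;
  searchParams: Promise<Record<string, string>>;
}) {
  // Read the client hints (Sec-CH-UA-Platform, Device-Memory, Viewport-Width).
  const uaHints = await readClientHints();
  const model: ModelDescriptor = selectModelForDevice(uaHints);

  // Pass only the model metadata (URL, quantization format, expected bandwidth) to the client.
  // The binary itself is fetched directly from the CDN via a signed URL, so it never flows through here.
  return <AgentRoot model={model} />;
}

// components/agent/AgentRoot.tsx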
"use client";
import { useEffect, useRef, useState } from "react";
import { Supervisor } from "@/agent/actor";
import { LWWMap } from "@/crdt/lwwMap";
import { acquireDevice } from "@/inference/webgpu/runtime";
import { loadModel } from "@/inference/webgpu/loader";
import { PlannerAgent } from "@/agent/planner";
import { ExecutorAgent } from "@/agent/executor";
import { CriticAgent } from "@/agent/critic";
import type { ModelDescriptor } from "@/lib/model-registry";
// Assumed sibling module for the three views used below (not shown in this article).
import { BootingView, FallbackCloudView, ChatView } from "@/components/agent/views";

export function AgentRoot({ model }: { readonly model: ModelDescriptor }) {
  const [state, setState] = useState<"booting" | "ready" | "failed">("booting");
  const supRef = useRef<Supervisor | null>(null);

  useEffect(() => {
    let cancelled = false;
    (async () => {
      try {
        const { device, caps } = await acquireDevice();
        if (!caps.f16) throw new Error("fp16 required");

        const runtime = await loadModel(device, model); // CDN → IndexedDB (downloaded only once)
        // Bail out before spawning anything if the effect was cleaned up while loading.
        if (cancelled) return;
        const crdt    = new LWWMap<string>(crypto.randomUUID());

        const sup = new Supervisor("one-for-one");
        sup.spawn({ actor: new PlannerAgent(runtime, crdt),  maxRestarts: 3, withinMs: 30_000 });
        sup.spawn({ actor: new ExecutorAgent(runtime, crdt), maxRestarts: 3, withinMs: 30_000 });
        sup.spawn({ actor: new CriticAgent(runtime),         maxRestarts: 5, withinMs: 30_000 });

        supRef.current = sup;
        setState("ready");
      } catch {
        if (!cancelled) setState("failed");
      }
    })();
    return () => {
      cancelled = true;
      supRef.current = null;
    };
  }, [model]);

  if (state === "booting") return <BootingView model={model} />;
  if (state === "failed")  return <FallbackCloudView />;
  return <ChatView supervisor={supRef.current!} />;
}

Design invariants

  1. All inference stays on the Client side of the boundary. The moment an RSC imports "@/inference/webgpu" even once, the model binary gets pulled into the server bundle, and cold-start time and CDN cost blow up. Forbid it mechanically with the ESLint rule no-restricted-imports.
  2. Fallback for environments without WebGPU: FallbackCloudView offers an experience in which the user explicitly chooses to go through the cloud, so that data never flows to the cloud silently. This connects directly to the design of GDPR consent collection.
  3. Direct distribution from the CDN: serve the model binary from the CDN with Cache-Control: immutable. Prevent tampering with a signed URL, and verify a Subresource Integrity (SRI) style hash before saving to IndexedDB (tampering with an LLM model's supply chain has been a real attack vector since 2025).
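Invariant 1 can be enforced with ESLint's built-in no-restricted-imports rule. This is a sketch of a flat-config entry. The file globs are an assumption about project layout: ESLint cannot see the "use client" directive, so Server Components are identified here by path.

```typescript
// eslint.config.ts (flat config). Globs and message text are illustrative assumptions.
const serverBoundaryRule = {
  // Files treated as server-side in this sketch.
  files: ["app/**/page.tsx", "app/**/layout.tsx", "src/lib/**/*.ts"],
  rules: {
    "no-restricted-imports": [
      "error",
      {
        patterns: [
          {
            // Block both the barrel import and any deep import under it.
            group: ["@/inference/webgpu", "@/inference/webgpu/*"],
            message: "Inference code is client-only; import it from a \"use client\" module.",
          },
        ],
      },
    ],
  },
};

const config = [serverBoundaryRule];

export default config;
```

Client components such as AgentRoot live outside these globs, so their imports of the inference modules remain allowed.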

Proactive UI: Predicting the User's Intent

The true strength of Local-First + on-device inference is being able to run inference speculatively without a server round-trip. We incorporate this as "Proactive UI."

// components/agent/useSpeculativeCompletion.ts
"use client";
import { useEffect, useRef, useState } from "react";
import type { Supervisor } from "@/agent/actor";

export function useSpeculativeCompletion(
  supervisor: Supervisor,
  input: string,
  debounceMs = 150,
  // Starting energy budget. Assumed placeholder value: with a budget of 0 the
  // guard below would block every speculation. Derive it from the battery state.
  initialBudgetJoules = 50,
) {
  const [draft, setDraft] = useState<string>("");
  const abortRef = useRef<AbortController | null>(null);
  const budgetRef = useRef<number>(initialBudgetJoules);

  useEffect(() => {
    // Cancel and re-speculate on every keystroke. Cheap cancellation is a privilege of local inference.
    abortRef.current?.abort();
    if (input.length < 3) { setDraft(""); return; }

    const ac = new AbortController();
    abortRef.current = ac;
    const t = setTimeout(async () => {
      // Skip speculation once the battery budget is spent (protect the battery over UX).
      if (budgetRef.current <= 0) return;
      try {
        // Wait for an idle callback so the UI is not blocked.
        await whenIdle();
        if (ac.signal.aborted) return;
        const result = await supervisor
          .speculativeInfer(input, ac.signal, { maxTokens: 16 });
        // Drop a result that finished after its request was cancelled.
        if (ac.signal.aborted) return;
        setDraft(result.text);
        budgetRef.current -= result.joules;
      } catch { /* aborted */ }
    }, debounceMs);

    return () => { clearTimeout(t); ac.abort(); };
  }, [input, supervisor, debounceMs]);

  return draft;
}

function whenIdle(): Promise<void> {
  return new Promise((r) => {
    if ("requestIdleCallback" in window) {
      (window as Window & { requestIdleCallback: (cb: IdleRequestCallback) => void })
        .requestIdleCallback(() => r());
    } else {
      setTimeout(r, 16);
    }
  });
}

The philosophy this hook implements

  • Cancel-first: the moment the user types the next key, stop the previous speculation. With a cloud LLM this is not practical (the server doesn't stop the request, or if it does, you're billed anyway). Precisely because inference is local, over-speculation isn't a net loss.
  • Elevate battery charge to a UX constraint: via budgetRef, stop speculating once the energy budget is spent. "The battery dies for AI's sake" is forbidden by design.
  • requestIdleCallback: start inference only in the main thread's rendering slack. It's a classic technique for maintaining a perceived 60fps, and without it the UX breaks down when computations that take hundreds of milliseconds, such as AI inference, run in the background.
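One way to fund budgetRef is to map a battery reading to an energy allowance. The policy below is an assumption for illustration, not the article's: the 20% cutoff, the 50-joule cap, and the function name are all hypothetical. In browsers that support the Battery Status API, the snapshot could be filled from navigator.getBattery().

```typescript
// Hypothetical policy: thresholds and the joule cap are illustrative placeholders.
interface BatterySnapshot {
  readonly level: number;     // charge level in the range [0, 1]
  readonly charging: boolean; // true while plugged in
}

// Returns how many joules speculation may spend in the current session.
function speculationBudgetJoules(b: BatterySnapshot, maxJoules = 50): number {
  if (b.level < 0 || b.level > 1) throw new RangeError("level must be within [0, 1]");
  if (b.charging) return maxJoules; // on mains power: full allowance
  if (b.level < 0.2) return 0;      // low battery: no speculation at all
  return maxJoules * b.level;       // otherwise scale with remaining charge
}

const pluggedIn = speculationBudgetJoules({ level: 0.1, charging: true });
const nearlyEmpty = speculationBudgetJoules({ level: 0.1, charging: false });
const halfCharged = speculationBudgetJoules({ level: 0.5, charging: false });
```

The result would be passed to the hook as its starting budget, so a nearly empty device never starts speculating.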

Main Part ⑧: FinOps and the Extreme Guarantee of the "6 Major Elements"

For each quality element, we nail down quantitatively how we achieved a world-class level.

Performance: Working Backward from Big O and Bandwidth

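  • Per-token computation: $O(L \cdot d^2)$ without a KV cache ($L$: sequence length, $d$: hidden dimension), because every step re-processes the whole sequence. With the KV cache it drops to $O(d^2 + L \cdot d)$, which is effectively $O(d^2)$ while $L \ll d$.
  • The bandwidth-bound ceiling: a 7B model at Q4_0 is about 3.8GB, and decoding one token streams every weight once, so the ceiling is bandwidth divided by model size. At 200GB/s of shared-memory bandwidth that is 200 / 3.8 ≈ 52, or roughly 50 tok/s. (Apple lists the M3 Pro at 150GB/s and the M2 Pro at 200GB/s, so check which figure applies to your device.) The measured 30–35 tok/s is 60–70% of the 50 tok/s ceiling, which is highly efficient by industry standards.
  • Speculative Decoding: a small draft model proposes several tokens, and the large verify model checks them in one batch. When a 4-token draft is accepted, effective throughput rises 3–4×. Precisely because the compute is local and dedicated, it can be introduced without side effects.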
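The bandwidth ceiling above is simple division, and it is worth keeping as a function so it can be re-evaluated per device. This is a minimal sketch with hypothetical function names. It assumes decoding is purely memory-bound and ignores KV-cache traffic.

```typescript
// Upper bound on decode throughput when each token must stream all weights once.
function bandwidthBoundTokPerSec(bandwidthGBps: number, modelSizeGB: number): number {
  if (bandwidthGBps <= 0 || modelSizeGB <= 0) throw new RangeError("inputs must be positive");
  return bandwidthGBps / modelSizeGB;
}

// Fraction of the ceiling that a measured throughput reaches.
function bandwidthEfficiency(measuredTokPerSec: number, ceilingTokPerSec: number): number {
  return measuredTokPerSec / ceilingTokPerSec;
}

// Figures from the text: 200GB/s bandwidth, 3.8GB model, 30-35 tok/s measured.
const ceiling = bandwidthBoundTokPerSec(200, 3.8);  // ≈ 52.6 tok/s
const lowEfficiency = bandwidthEfficiency(30, ceiling);  // ≈ 0.57
const highEfficiency = bandwidthEfficiency(35, ceiling); // ≈ 0.665
```

Against the unrounded ceiling the measured range works out to roughly 57–67%, in line with the 60–70% quoted against the rounded 50 tok/s.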

Reliability: A Design Premised on Chaos

On-device inference is chaos as the norm.

Failure | Source | Countermeasure
--- | --- | ---
GPU context lost | AMD/Intel GPU resume from Windows sleep | device.lost handler → Supervisor restart
VRAM exhaustion | GPU use by another tab | destroy() the model → fallback to CPU/cloud
IndexedDB corruption | Forced browser termination + incomplete fsync | Reconstruct from the CRDT op log (oplog-first)
Clock skew | Multi-device sync | Reject via HLC drift detection
Hostile peer | Malware-infected device | AEAD + deviceAttestation JWT

This table is itself the chaos-engineering plan at implementation time. Intentionally cause all the failures that occur in production, and test the recovery paths.
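As one concrete row from the table, clock-skew rejection can be a single comparison. This is a minimal sketch: the HLC shape, the function name, and the 60-second tolerance are assumptions, not the article's implementation.

```typescript
// Hypothetical HLC shape; wall is milliseconds since the Unix epoch.
type HLC = { readonly wall: number; readonly logical: number; readonly node: string };

const MAX_DRIFT_MS = 60_000; // assumed tolerance for clocks running ahead

// Accept a remote timestamp unless it lies too far in the future.
// A far-future timestamp would otherwise win every last-writer-wins comparison.
function acceptRemoteTimestamp(
  remote: HLC,
  localNowMs: number,
  maxDriftMs: number = MAX_DRIFT_MS,
): boolean {
  return remote.wall - localNowMs <= maxDriftMs;
}

const now = 1_700_000_000_000;
const slightlyAhead = acceptRemoteTimestamp({ wall: now + 5_000, logical: 0, node: "B" }, now);
const farAhead = acceptRemoteTimestamp({ wall: now + 120_000, logical: 0, node: "B" }, now);
const inThePast = acceptRemoteTimestamp({ wall: now - 3_600_000, logical: 7, node: "C" }, now);
```

Old timestamps are always accepted because they simply lose the LWW comparison. Only clocks running ahead can corrupt the ordering.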

Security: Thoroughgoing Zero-Trust

  • Device authentication: a key derived from WebAuthn PRF. On a stolen device, decryption is impossible unless user verification (biometric re-authentication) passes.
  • Model-tampering prevention: SRI hash + signed URL. Double check.
  • Sync path: AEAD + AAD. Tampering and ciphertext swapping by a man-in-the-middle are detected by tag verification. A replayed envelope still verifies, but it is harmless because re-applying the same Op is idempotent in the CRDT.
  • Prompt-injection countermeasure: the Critic Agent verifies the LLM output with a separate model, and stops the agent as poisoned when "the output contains an unknown tool call." Multi-layer quarantine (Defense in Depth).
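The "unknown tool call" stop condition reduces to an allowlist check. The sketch below covers only that condition, not the separate-model verification, and the ToolCall and Verdict shapes are hypothetical.

```typescript
// Hypothetical shapes for a parsed tool call and the Critic's verdict.
interface ToolCall {
  readonly name: string;
  readonly args: unknown;
}

type Verdict =
  | { readonly kind: "ok" }
  | { readonly kind: "poisoned"; readonly unknownTools: readonly string[] };

// Flag the output as poisoned if it calls any tool outside the allowlist.
function reviewToolCalls(calls: readonly ToolCall[], allowed: ReadonlySet<string>): Verdict {
  const unknownTools = calls.map((c) => c.name).filter((n) => !allowed.has(n));
  return unknownTools.length === 0 ? { kind: "ok" } : { kind: "poisoned", unknownTools };
}

const allowedTools: ReadonlySet<string> = new Set(["search_notes", "create_task"]);

const clean = reviewToolCalls([{ name: "search_notes", args: { q: "invoice" } }], allowedTools);
const injected = reviewToolCalls(
  [
    { name: "create_task", args: { title: "follow up" } },
    { name: "send_email", args: { to: "attacker@example.com" } },
  ],
  allowedTools,
);
```

On a poisoned verdict the Supervisor would stop the agent instead of executing any of the calls, including the allowed ones.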

Maintainability: Minimizing Cognitive Load

  • The domain layer (types), Actor layer (concurrency), inference layer (GPU), and CRDT layer (distributed) — these 4 layers hold no shared mutable state whatsoever.
  • Centralizing state transitions via the transition() function makes the place to look during debugging one spot.
  • Adopt sum types (discriminated unions) across all layers, and put never in the default case so that an exhaustiveness gap becomes a build error.
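The never-in-default pattern looks like this in practice. It is a minimal sketch: the state and event names are hypothetical, and only the idea of a central transition() function comes from the text.

```typescript
// Hypothetical agent states and events, modeled as discriminated unions.
type AgentState =
  | { readonly kind: "idle" }
  | { readonly kind: "running"; readonly taskId: string }
  | { readonly kind: "failed"; readonly reason: string };

type AgentEvent =
  | { readonly type: "start"; readonly taskId: string }
  | { readonly type: "finish" }
  | { readonly type: "crash"; readonly reason: string };

// Reaching this at compile time means a case is missing: the argument is no longer never.
function assertNever(x: never): never {
  throw new Error(`unhandled case: ${JSON.stringify(x)}`);
}

// Every state change goes through this one function.
function transition(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "start":
      return state.kind === "idle" ? { kind: "running", taskId: event.taskId } : state;
    case "finish":
      return state.kind === "running" ? { kind: "idle" } : state;
    case "crash":
      return { kind: "failed", reason: event.reason };
    default:
      return assertNever(event);
  }
}

const started = transition({ kind: "idle" }, { type: "start", taskId: "t1" });
const finished = transition(started, { type: "finish" });
const ignored = transition({ kind: "idle" }, { type: "finish" });
```

Adding a fourth event type without a matching case makes the assertNever call fail to compile, which is the build error the bullet describes.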

Type Safety: Type Expression of DDD

  • Branded ID (AgentId, TaskId, etc.) prevents lateral mix-ups.
  • Phantom type (ReadyDevice) statically checks the temporal constraint "usable only when initialized."
  • DU (discriminated union) + transition function: illegal state transitions fail to compile (error TS2367).
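Branded IDs and a phantom-typed device can be sketched as follows. The Brand helper, the __state marker field, and the function names are assumptions for illustration. The article's own AgentId, TaskId, and ReadyDevice definitions may be built differently.

```typescript
// A brand exists only at the type level; the symbol is never created at runtime.
declare const brand: unique symbol;
type Brand<T, B extends string> = T & { readonly [brand]: B };

type AgentId = Brand<string, "AgentId">;
type TaskId = Brand<string, "TaskId">;

// Smart constructors: the only places where a plain string becomes an ID.
const toAgentId = (s: string): AgentId => s as AgentId;
const toTaskId = (s: string): TaskId => s as TaskId;

function assign(agent: AgentId, task: TaskId): string {
  return `${agent} <- ${task}`;
}

// Phantom state parameter: the optional marker field makes the two states incompatible.
type Device<S extends "uninitialized" | "ready"> = {
  readonly label: string;
  readonly __state?: S;
};
type ReadyDevice = Device<"ready">;

function initDevice(d: Device<"uninitialized">): ReadyDevice {
  return { label: d.label };
}

function runInference(d: ReadyDevice, prompt: string): string {
  return `${d.label}:${prompt}`;
}

const assignment = assign(toAgentId("planner-1"), toTaskId("task-42"));
const raw: Device<"uninitialized"> = { label: "gpu0" };
const output = runInference(initDevice(raw), "hi");

// Both of the following lines are rejected by the compiler:
// assign(toTaskId("task-42"), toAgentId("planner-1"));
// runInference(raw, "hi");
```

Neither check costs anything at runtime. The brands and the phantom parameter are erased during compilation.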

Economy: A Fundamental Rewrite of FinOps

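Assuming 1 million MAU, 5 queries per user per day, and 300-token average output per query, we compare cost models. At $0.003 per 1K tokens, that is 1,000,000 × 5 × 300 × 30 = 4.5 × 10^10 tokens a month, or about $135,000.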

Model | Monthly cost (incl. input) | Scale characteristic
--- | --- | ---
Cloud LLM (mid-tier model, $0.003/1K tok) | about $135,000 | Proportional to MAU as variable cost
Own GPU (A100-equivalent, 2 units always on) | about $12,000 + operations labor | Doubles with MAU growth, idle cost
This architecture (Local-First + on-device inference) | about $500–$2,000 (CDN + Relay + monitoring) | Sub-linear to MAU (nearly constant)

The break-even point is at roughly 10,000 MAU or below. Below that scale the CDN cost still applies, but a fixed cost of a few hundred dollars a month is enough to distribute worldwide. Because the scaling curve is structurally different, at 10M MAU the two are no longer comparable at all.
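The cloud-side figure and the break-even point can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: a 30-day month, output tokens only, and hypothetical names throughout. With 5 queries per user per day the formula gives about $135,000 a month, the figure in the table. At 50 queries per day it gives about $1.35M.

```typescript
// Hypothetical workload description; all fields are per-month modeling inputs.
interface Workload {
  readonly mau: number;
  readonly queriesPerUserPerDay: number;
  readonly outputTokensPerQuery: number;
  readonly daysPerMonth: number;
}

// Variable cost of a cloud LLM billed per 1K output tokens.
function cloudMonthlyCostUSD(w: Workload, usdPer1kTokens: number): number {
  const tokens = w.mau * w.queriesPerUserPerDay * w.outputTokensPerQuery * w.daysPerMonth;
  return (tokens / 1000) * usdPer1kTokens;
}

// MAU at which the cloud bill equals a fixed local-first infrastructure cost.
function breakEvenMau(fixedMonthlyUSD: number, w: Workload, usdPer1kTokens: number): number {
  const perUserUSD = cloudMonthlyCostUSD({ ...w, mau: 1 }, usdPer1kTokens);
  return fixedMonthlyUSD / perUserUSD;
}

const base: Workload = {
  mau: 1_000_000,
  queriesPerUserPerDay: 5,
  outputTokensPerQuery: 300,
  daysPerMonth: 30,
};

const cloudAt5 = cloudMonthlyCostUSD(base, 0.003);                                  // ≈ 135,000
const cloudAt50 = cloudMonthlyCostUSD({ ...base, queriesPerUserPerDay: 50 }, 0.003); // ≈ 1,350,000
const breakEvenLow = breakEvenMau(500, base, 0.003);    // ≈ 3,700 MAU
const breakEvenHigh = breakEvenMau(2000, base, 0.003);  // ≈ 14,800 MAU
```

For the $500–$2,000 fixed-cost range, the break-even lands between roughly 3,700 and 14,800 MAU, which brackets the 10,000 MAU quoted above.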


Conclusion: The Architecture of the AGI Era Is Already Visible

This article's Local-First Agentic architecture is presented not as the limit of what we can do in 2026 but as a template for the architecture of the 2030s. It describes the direction a technical leader should get ahead of: the horizon a few years out.

Quantitative Expected Outcomes (When Adopting This Architecture)

Metric | Cloud-LLM-centric | Local-First Agentic | Improvement
--- | --- | --- | ---
TTFT (Time To First Token) | 700–1,500ms | 10–80ms | 10–100×
p99 end-to-end response | 3–6 sec | 300–800ms | about 6–8×
Data egress volume | All input/output | 0 (optionally only encrypted diffs) |
Monthly inference cost (1M MAU) | around $135k | $500–$2k | about 70×
Offline availability | × | ◎ (fully working) | -
GDPR/HIPAA cross-border | Requires terms + DPA | Doesn't arise in the first place | -
Model-swap ease | All users at once | Per individual device, staged rollout | -

A Stepping Stone Toward the AGI Era

  1. Personal Model (a personalized distilled model): continually learn a user-specific LoRA inside the device. Internalize the user's context, vocabulary, and work conventions "without anyone seeing." Personalization that can never be achieved in the cloud.
  2. Multi-Agent Negotiation Protocol: User A's Planner and User B's Planner reach consensus as the users' proxies while keeping intent confidential. Realistically, this becomes incrementally implementable not with Oblivious RAM or Fully Homomorphic Encryption but with selective disclosure via ZK-SNARK + a CRDT-based consensus log.
  3. Persistent Context Economy: once an LLM holds a persistent memory per user that "doesn't use up the context window," knowledge accumulated only on the user's side becomes a differentiator. Private information capital that can never be replicated in cloud RAG is formed.
  4. Legal personhood of AI agents: in the EU AI Act's next expansion, "the responsible party for an autonomous agent" trends toward being defined as branching into the distributor, the operator, and the device owner. The structure where inference completes on the user's device is also a structural means of avoiding this responsibility problem.

Closing: The Management Decision of Choosing the Right Question

The question a 2026 CTO should ask is not "which cloud LLM to choose." It's the architecture's foundational theory itself: "where should inference be performed?"

A cloud LLM provides the stupendous general-purpose asset of general intelligence, while saddling you with three structural debts: latency, privacy, and economy. Local-First Agentic, on the other hand, makes the physical closeness of "AI is next to the user" implementable through a meticulous stacking of the type system, distributed algorithms, cryptography, and GPU compute.

Starting to implement even part of this architecture in your own company now is not merely a technical advantage. Three to five years from now, when competitors are suffering from rising cloud unit prices, it will prove to be a management decision that changes the gross-margin structure itself. The speed of light can't be negotiated, regulation won't loosen, and cloud unit prices won't fully fall. But the silicon in the user's hands gets stronger on its own every year.

Won't you stand on the side that catches this wind first?

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
