友田 陽大

Llama 4 multimodal in practice: use image understanding for production-grade 'type-safe structured extraction'

Llama 4 is natively multimodal. This article walks through, with real code, a production pipeline that turns images (invoices, receipts, business cards, drawings, screenshots) into structured data without guessing. It covers AWS Bedrock Converse image input, boundary validation with Zod, a confidence gate, human review, and PII protection.

Reading time: 8 min read
Author: 友田 陽大

The goal of this article

Llama 4 is natively multimodal: text and images are handled by the same model from the start (early fusion) rather than by a vision component bolted on afterward. This pays off most in the task of "look at an image and return structured data." Invoices, receipts, business cards, drawings, screenshots, identity documents: real-world operations overflow with paper and screenshots.

This article takes that image understanding past the demo stage and into production: don't let the model guess, constrain its output with types, protect the values you can't afford to get wrong, and don't leak PII.

A note on where this comes from: the design principle "don't let generation guess numbers; split the work so that values you can't afford to get wrong are matched against a master" is the core philosophy I used to structurally eliminate wrong answers about model numbers and sizes in a voice-concierge project. The image extraction in this article follows the same discipline: the LLM does the reading, and verification establishes the truth.


Where it pays off

| Input image | Structure to extract | Value |
|---|---|---|
| Invoices, receipts | Counterparty, amount, date, line items | Zero manual entry in accounting |
| Business cards | Name, company, title, contact | Auto-registration into CRM |
| Drawings, spec sheets | Model number, dimensions, material | Preprocessing for quotes/orders |
| Screenshots | Error text, UI state | Auto-classification of support/QA |
| Identity documents | Name, date of birth, number | KYC (PII protection is a prerequisite) |

What these share is the shape "unstructured image → structured data a downstream system can use." Let the LLM's multimodality cover that step and manual transcription disappears.


Feeding it in: Bedrock Converse image input

Bedrock's Converse API lets you mix an image block into a message. The block consists of a format (png / jpeg / gif / webp) plus source.bytes. The AWS SDK handles the base64 encoding, so you pass raw bytes.

# llama_vision.py: pass an image plus an instruction to Llama 4 and get its reading back as text
import boto3
from botocore.config import Config

_bedrock = boto3.client(
    "bedrock-runtime", region_name="us-east-1",
    config=Config(retries={"max_attempts": 4, "mode": "adaptive"}, read_timeout=60),
)
MODEL_ID = "us.meta.llama4-maverick-17b-instruct-v1:0"  # the larger Maverick is the more reliable choice for image understanding
SUPPORTED_FORMATS = {"png", "jpeg", "gif", "webp"}  # formats accepted by the Converse image block

def read_image(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()  # raw bytes are fine: the SDK handles the base64 encoding
    ext = path.rsplit(".", 1)[-1].lower()
    fmt = "jpeg" if ext == "jpg" else ext  # normalize jpg to jpeg
    if fmt not in SUPPORTED_FORMATS:
        # Fail at the boundary instead of sending an unsupported format to the API
        raise ValueError(f"unsupported image format: {fmt!r}")
    return {"image": {"format": fmt, "source": {"bytes": data}}}

def describe(image_path: str, instruction: str) -> str:
    resp = _bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [read_image(image_path), {"text": instruction}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},  # temperature 0 keeps extraction from varying between runs
    )
    return resp["output"]["message"]["content"][0]["text"]

📌 Constraints (official): up to 20 images per message, each at most 3.75 MB and 8000 px in height and width. Split multi-page invoices into one image per page before sending. For extraction, use temperature 0.0 as the baseline: creativity isn't needed, and reading the same way every time is what quality means here.


Binding with types: from image to structured data (don't make it guess)

The output of image understanding also sits outside the trust boundary. Demand schema-conformant JSON rather than free text, and pass it downstream only after validating it at the boundary with Zod. With the Vercel AI SDK, you can combine an image part with generateObject.

// lib/extract-receipt.ts: receipt image → type-safe structured data
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

const Receipt = z.object({
  merchant: z.string().min(1),
  total: z.number().nonnegative(),
  currency: z.enum(["JPY", "USD", "EUR"]),
  purchasedAt: z.string().date(),                 // YYYY-MM-DD
  items: z.array(z.object({ name: z.string(), price: z.number() })).max(100),
  confidence: z.number().min(0).max(1),           // reading confidence, self-reported by the model
});

export async function extractReceipt(image: Uint8Array) {
  const { object, usage } = await generateObject({
    model: bedrock("us.meta.llama4-maverick-17b-instruct-v1:0"),
    schema: Receipt,
    messages: [{
      role: "user",
      content: [
        { type: "image", image },
        { type: "text", text: "Extract only structured data from the receipt image. Do not guess values you cannot read; lower confidence instead." },
      ],
    }],
  });
  // object has been validated against Receipt and is typed accordingly.
  // usage gives observability into image-token billing.
  return { receipt: object, usage };
}

Whatever the model returns, only a value that has passed the Receipt schema flows downstream. This structurally excludes "broken JSON" and "unexpected fields."
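For pipelines that stay in Python with the describe() function shown earlier, the same boundary check could be sketched with the standard library alone. The field names mirror the Receipt schema above; parse_receipt and ReceiptError are hypothetical names introduced for illustration, and a validation library such as Pydantic would be a more usual choice in practice.

```python
import json
from dataclasses import dataclass
from datetime import date


class ReceiptError(ValueError):
    """Raised when model output does not satisfy the receipt contract."""


@dataclass(frozen=True)
class Receipt:
    merchant: str
    total: float
    currency: str
    purchased_at: date
    items: tuple  # tuple of (name, price) pairs
    confidence: float


def _number(value, field):
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ReceiptError(f"{field} must be a number")
    return float(value)


def parse_receipt(raw_text: str) -> Receipt:
    """Parse model output and return a Receipt, or raise ReceiptError."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ReceiptError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ReceiptError("output must be a JSON object")
    try:
        merchant = data["merchant"]
        if not isinstance(merchant, str) or not merchant:
            raise ReceiptError("merchant must be a non-empty string")
        total = _number(data["total"], "total")
        if total < 0:
            raise ReceiptError("total must be non-negative")
        if data["currency"] not in ("JPY", "USD", "EUR"):
            raise ReceiptError("unsupported currency")
        purchased_at = date.fromisoformat(data["purchasedAt"])
        if len(data["items"]) > 100:
            raise ReceiptError("too many line items")
        items = tuple(
            (str(item["name"]), _number(item["price"], "price"))
            for item in data["items"]
        )
        confidence = _number(data["confidence"], "confidence")
        if not 0.0 <= confidence <= 1.0:
            raise ReceiptError("confidence must be within [0, 1]")
    except ReceiptError:
        raise
    except (KeyError, TypeError, ValueError) as exc:
        # Missing keys and wrong shapes are reported uniformly.
        raise ReceiptError(f"invalid or missing field: {exc}") from exc
    return Receipt(merchant, total, data["currency"], purchased_at, items, confidence)
```

A caller would wrap parse_receipt(describe(path, instruction)) in a try/except and send any ReceiptError to the human-review path rather than retrying blindly.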


Protecting values you can't get wrong: a confidence gate and division of labor

This is the heart of a production system. Values that cause an incident if wrong (amounts, model numbers, personal information) must not be finalized on the model's output alone. Apply to image extraction the same division of labor as in the voice concierge: numbers come from master matching, not from the LLM.

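// Route the extraction result by confidence and verifiability (never finalize on the model's output alone)
// Assumes the Receipt Zod schema from lib/extract-receipt.ts is in scope.
type Receipt = z.infer<typeof Receipt>;

type Routed =
  | { status: "auto"; receipt: Receipt }       // high confidence and the arithmetic agrees → accept automatically
  | { status: "review"; receipt: Receipt };     // low confidence or arithmetic mismatch → human review

function route(receipt: Receipt): Routed {
  const sumOfItems = receipt.items.reduce((a, b) => a + b.price, 0);
  const arithmeticOk = Math.abs(sumOfItems - receipt.total) < 1; // check the line-item sum against the total
  const confident = receipt.confidence >= 0.9;
  return confident && arithmeticOk ? { status: "auto", receipt } : { status: "review", receipt };
}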

The point is to back the result with a truth that does not depend on the model: the arithmetic check (sum of line items = total). The LLM does the work of reading, and code verifies correctness. This narrows human review from "everything" to "only the suspicious portion," balancing quality and cost.
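The article names master matching alongside the arithmetic check but only shows the latter, so here is a minimal sketch of the former for a model number read from a drawing. MASTER_MODEL_NUMBERS and its values are made up for illustration, and the normalization rules (NFKC folding, whitespace removal, upper-casing) are assumptions that would need to match your own master data.

```python
import unicodedata

# Hypothetical master data; in practice this would come from a product database.
MASTER_MODEL_NUMBERS = {"AB-1200X", "AB-1200Z", "CX-77"}


def normalize(code: str) -> str:
    # NFKC folds full-width characters (common in Japanese documents) to ASCII.
    folded = unicodedata.normalize("NFKC", code)
    return "".join(folded.split()).upper()


def match_model_number(extracted: str, master=MASTER_MODEL_NUMBERS):
    """Return (status, value); status is 'auto' only on an exact master hit."""
    candidate = normalize(extracted)
    if candidate in master:
        return ("auto", candidate)
    # No fuzzy correction: a near miss goes to a human instead of being guessed.
    return ("review", extracted)
```

Deliberately, there is no "closest match" fallback here. Picking the nearest master entry would reintroduce guessing, which is exactly what the division of labor is meant to remove.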


Pitfalls and countermeasures

  • Hallucinated fields: the model fills an unreadable field with something plausible. → State "no guessing; unknown values are null with low confidence" in the prompt and set temperature=0. Make fields nullable in the schema so that "empty" can be expressed.
  • Low resolution, skew, shadows: reading accuracy drops. → Add a preprocessing stage (resize, rotation correction, contrast adjustment). Use the same stage to bring images under the 8000 px / 3.75 MB limits.
  • Multi-page documents: cramming too much into one request. → Split by page, extract per page, then merge the results.
  • Image-token billing: images consume many input tokens. → Reduce resolution to the minimum that still reads reliably, and route only hard images to a larger model (cost design).
  • No evaluation: judging by how the output looks. → Measure field-level match rates (precision/recall) against a labeled evaluation set, and ship only after checking for regressions.
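For the last item, one way to compute field-level precision and recall is sketched below. It assumes predictions and labels are dicts keyed by field name, with None meaning "not extracted" or "not present on the document", and that a field counts as correct only on exact equality. field_scores is a hypothetical helper, and the micro-averaging across fields is one choice among several.

```python
def field_scores(predictions, gold, fields):
    """Micro-averaged (precision, recall) over the listed fields.

    predictions and gold are parallel lists of dicts, one pair per document.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = predicted = expected = 0
    for pred, truth in zip(predictions, gold):
        for field in fields:
            p, t = pred.get(field), truth.get(field)
            if p is not None:
                predicted += 1  # the model committed to a value
            if t is not None:
                expected += 1   # the label says a value exists
            if p is not None and p == t:
                correct += 1
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall
```

Low precision points to hallucinated values, while low recall points to fields the model left empty. A regression check can then compare both numbers against the previous release before shipping.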

Security: an image is a mass of PII

Invoices, business cards, and identity documents are personal information in themselves. Build protection into the design.

  • Don't log image bytes or extracted content: metadata alone (model, token count, confidence, processing time) is enough for observability.
  • Least privilege: keep image storage (S3, etc.) under least-privilege access, encrypted, and served through short-lived URLs. Delete images after processing in line with the retention policy.
  • Masking: pass identifier-type values downstream only to the extent necessary. Mask them in the display UI, and do so accessibly (don't expose all digits when read aloud).
  • Input validation: validate file format, size, and count at the boundary before sending (the same discipline as type safety).
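The input-validation item can be sketched as a pre-flight check that runs before any bytes reach the model. The limits are the ones quoted in this article. Detecting the format from leading bytes instead of trusting the file extension is my assumption of a sensible default, as is reading "MB" as binary megabytes. validate_images and sniff_format are hypothetical names.

```python
MAX_IMAGES = 20                      # per-message limit quoted in this article
MAX_BYTES = int(3.75 * 1024 * 1024)  # 3.75 MB, assuming binary megabytes

# Leading signature bytes for three of the four accepted formats.
_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_format(data: bytes):
    """Return 'png', 'jpeg', 'gif', 'webp', or None if unrecognized."""
    for fmt, prefixes in _SIGNATURES.items():
        if data.startswith(prefixes):
            return fmt
    # WebP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def validate_images(images):
    """Return the detected format of each image, or raise ValueError."""
    if not 1 <= len(images) <= MAX_IMAGES:
        raise ValueError(f"expected 1..{MAX_IMAGES} images, got {len(images)}")
    formats = []
    for index, data in enumerate(images):
        if len(data) > MAX_BYTES:
            raise ValueError(f"image {index} exceeds {MAX_BYTES} bytes")
        fmt = sniff_format(data)
        if fmt is None:
            raise ValueError(f"image {index} is not png/jpeg/gif/webp")
        formats.append(fmt)
    return formats
```

This sketch does not check the 8000 px limit, because that requires decoding the image header. The preprocessing stage that resizes images is the natural place for that check.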

The extracted structured data ultimately reaches a screen that people use. The extraction pipeline becomes a "product" only when masking, labels, and error messages are designed to be accessible.
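As a small illustration of masking that also works for read-aloud, the sketch below produces a visual mask and a separate spoken label that announces only the trailing characters. Both function names, the default of four visible characters, and the wording of the label are assumptions. How the label is attached to the UI (for example as an accessible name) depends on your front end.

```python
def _visible_tail(value: str, visible: int):
    chars = [ch for ch in value if ch.isalnum()]
    # Short values are hidden entirely rather than shown in full.
    if visible <= 0 or len(chars) <= visible:
        return []
    return chars[len(chars) - visible:]


def mask_identifier(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide every letter and digit except the last `visible` ones."""
    keep = len(_visible_tail(value, visible))
    to_hide = sum(ch.isalnum() for ch in value) - keep
    out = []
    for ch in value:
        if ch.isalnum() and to_hide > 0:
            out.append(mask_char)
            to_hide -= 1
        else:
            out.append(ch)  # separators and the visible tail pass through
    return "".join(out)


def spoken_label(value: str, visible: int = 4) -> str:
    """Text for a screen reader: announces only the trailing characters."""
    tail = _visible_tail(value, visible)
    if not tail:
        return "hidden"
    return "ending in " + " ".join(tail)
```

Keeping the two outputs separate avoids a screen reader spelling out a row of asterisks, and it keeps the full value out of the rendered page altogether.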


FAQ

Q. How is it different from a dedicated OCR service? A. Traditional OCR stops at "read the characters." Llama 4 works end to end: read, understand the meaning, and return structured data. It is strong on invoices with varied layouts and on context-dependent field extraction. That said, as with OCR, backing values you can't afford to get wrong with an arithmetic check or master matching remains essential.

Q. Which model should I use? A. Maverick is the solid choice for image understanding. The cost-effective approach is routing: keep costs down by using Scout for high-volume, easy images and send only hard images to the larger model.
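The routing idea in this answer could be sketched as "cheap model first, escalate on low confidence." The extract callable is a stand-in for whatever function performs the extraction for a given model ID and returns a dict with a confidence key. The Scout ID follows the naming pattern of the Maverick ID used in this article but is an assumption to verify for your region, and the 0.9 threshold simply reuses the gate value from the routing code above.

```python
from typing import Callable

# Maverick is the ID used in this article; the Scout ID is an assumption
# that follows the same naming pattern. Verify both for your region.
SCOUT = "us.meta.llama4-scout-17b-instruct-v1:0"
MAVERICK = "us.meta.llama4-maverick-17b-instruct-v1:0"


def extract_with_escalation(
    image: bytes,
    extract: Callable[[str, bytes], dict],
    threshold: float = 0.9,
) -> dict:
    """Try the cheaper model first; escalate only on low reported confidence."""
    first = extract(SCOUT, image)
    if first.get("confidence", 0.0) >= threshold:
        return {**first, "model": SCOUT}
    # The hard case pays for two calls, so track the escalation rate.
    second = extract(MAVERICK, image)
    return {**second, "model": MAVERICK}
```

Whether this saves money depends on how often escalation fires. If most images escalate, calling Maverick directly is cheaper. Recording the chosen model alongside the token usage makes that visible.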

Q. Can it read Japanese invoices? A. It can. But accuracy drops for handwriting, low resolution, and special fonts, so always prepare preprocessing, a confidence gate, and a human-review path.

Q. How many images can I pass at once? A. The official constraint is up to 20 images per message, each at most 3.75 MB and 8000 px. Bring multi-page or high-resolution inputs under these limits with page splitting and resizing.

Q. Can self-hosting do image understanding too? A. It can. Serving Scout or Maverick with vLLM handles it in a similar way through OpenAI-compatible image input. This is the option for KYC and similar cases where data cannot leave your environment.
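For the self-hosted route, the request body differs from Bedrock in one practical way: with OpenAI-style image input you encode the image yourself, typically as a base64 data URL. The sketch below only builds the JSON body. build_vision_request is a hypothetical helper, and the model name passed to it must match whatever name your vLLM server was started with.

```python
import base64
import json


def build_vision_request(image: bytes, fmt: str, instruction: str, model: str) -> dict:
    """Build an OpenAI-style chat completion body with one inline image."""
    encoded = base64.b64encode(image).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,  # same extraction baseline as the Bedrock example
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/{fmt};base64,{encoded}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    }


def to_json(body: dict) -> str:
    # Serialized form, ready to POST to the server's chat completions endpoint.
    return json.dumps(body)
```

The serialized body would be sent to the chat completions endpoint of your own server, so the image never leaves your network. The validation and routing steps described above apply unchanged.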


Summary

Llama 4's multimodality is a practical tool to "turn paper and screenshots into data the downstream can use." The key isn't flashy usage but discipline.

  1. Feed via Converse's image block (format + bytes, temperature 0).
  2. Validate at the boundary with Zod to structurally exclude broken output.
  3. Back up values you can't get wrong with an arithmetic check/master matching, and don't auto-finalize.
  4. Build in PII protection (no logging, least privilege, masking).
  5. Measure accuracy with an evaluation set, and ship after watching for regressions.

If you want to put image understanding of invoices, identity verification, or drawings into production, including arithmetic checks, human review, PII protection, and cost design, see my track record and reach out. I work solo with generative AI: fast, cheap, and safe.

Sources / official resources

  • Models, limits, and pricing change over time. Confirm against primary sources before implementing, and verify accuracy with a labeled evaluation set.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
