Llama 4 multimodal in practice: use image understanding for production-grade 'type-safe structured extraction'

The goal of this article

Llama 4 is natively multimodal: it handles text and images in the same model from the start, rather than having vision bolted on afterwards (early fusion). This pays off most in the task of "look at an image and return structured data." Invoices, receipts, business cards, drawings, screenshots, identity documents… the workplace overflows with "paper and screenshots."

This article takes that image understanding from demo to production: don't let the model guess, bind its output with types, protect the values you can't afford to get wrong, and don't leak PII.

A note on where this comes from: the design principle of "don't let generation guess numbers; match the values you can't afford to get wrong against a master" is the core approach I used in the voice-concierge case to structurally eliminate wrong answers about model numbers and sizes. The image extraction in this article is built with the same discipline: the LLM does the work of reading, and verification establishes the truth.

Where it pays off

Input image	Structure to extract	Value
Invoices, receipts	Counterparty, amount, date, line items	Zero manual entry in accounting
Business cards	Name, company, title, contact	Auto-registration into CRM
Drawings, spec sheets	Model number, dimensions, material	Preprocessing for quotes/orders
Screenshots	Error text, UI state	Auto-classification of support/QA
Identity documents	Name, date of birth, number	KYC (PII protection is a prerequisite)

What these have in common is "unstructured image → structured data the downstream system can use." Cover that step with the LLM's multimodality and manual transcription goes away.

Feeding it in: Bedrock Converse image input

Bedrock's Converse API lets you mix an image block into a message. The block consists of format (png / jpeg / gif / webp) and source.bytes. The AWS SDK handles the base64 encoding, so you pass raw bytes.

# llama_vision.py: pass an image plus an instruction to Llama 4 and get its reading back as text
import boto3
from botocore.config import Config

_bedrock = boto3.client(
    "bedrock-runtime", region_name="us-east-1",
    config=Config(retries={"max_attempts": 4, "mode": "adaptive"}, read_timeout=60),
)
MODEL_ID = "us.meta.llama4-maverick-17b-instruct-v1:0"  # the larger Maverick is the more reliable choice for image understanding

def read_image(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()  # raw bytes are fine; the SDK handles the base64 encoding
    ext = path.rsplit(".", 1)[-1].lower()
    fmt = "jpeg" if ext == "jpg" else ext  # normalize jpg -> jpeg
    return {"image": {"format": fmt, "source": {"bytes": data}}}

def describe(image_path: str, instruction: str) -> str:
    resp = _bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [read_image(image_path), {"text": instruction}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},  # temperature 0 keeps extraction stable
    )
    return resp["output"]["message"]["content"][0]["text"]

📌 Constraints (official): up to 20 images per message, each image at most 3.75 MB and 8000 px in height and width. Split multi-page invoices into pages before feeding them in. For extraction, temperature 0.0 is the baseline: no creativity is needed, and quality means reading the same way every time.
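The constraints above can be enforced before any request is sent. The following is a minimal sketch, not part of the original pipeline: the names check_image and batch_pages are hypothetical, and treating 3.75 MB as 3.75 × 1024 × 1024 bytes is an assumption to confirm against the current Bedrock documentation. It checks only format and byte size; the 8000 px check would need an image library and is left out.

```python
from typing import Iterator, List

# Limits quoted in the article; confirm against the current Bedrock docs.
MAX_IMAGES_PER_MESSAGE = 20
MAX_IMAGE_BYTES = int(3.75 * 1024 * 1024)  # assumption: MB means 1024 * 1024 bytes
ALLOWED_FORMATS = {"png", "jpeg", "gif", "webp"}


def check_image(data: bytes, fmt: str) -> None:
    """Raise ValueError if the image cannot be sent as a Converse image block."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported image format: {fmt!r}")
    if not data:
        raise ValueError("image is empty")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(data)} bytes, limit is {MAX_IMAGE_BYTES}")


def batch_pages(pages: List[bytes], size: int = MAX_IMAGES_PER_MESSAGE) -> Iterator[List[bytes]]:
    """Yield consecutive groups of at most `size` pages, preserving page order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield pages[start:start + size]
```

With these defaults, a 45-page document is split into groups of 20, 20, and 5 pages, so each request stays within the per-message image limit.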

Binding with types: from image to structured data (don't make it guess)

The output of image understanding also sits outside the trust boundary. Demand schema-conformant JSON rather than free text, and pass it downstream only after validating it at the boundary with Zod. With the Vercel AI SDK, you can combine an image part with generateObject.

// lib/extract-receipt.ts: receipt image -> type-safe structured data
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

const Receipt = z.object({
  merchant: z.string().min(1),
  total: z.number().nonnegative(),
  currency: z.enum(["JPY", "USD", "EUR"]),
  purchasedAt: z.string().date(),                 // YYYY-MM-DD
  items: z.array(z.object({ name: z.string(), price: z.number() })).max(100),
  confidence: z.number().min(0).max(1),           // read confidence, self-reported by the model
});

export async function extractReceipt(image: Uint8Array) {
  const { object, usage } = await generateObject({
    model: bedrock("us.meta.llama4-maverick-17b-instruct-v1:0"),
    schema: Receipt,
    messages: [{
      role: "user",
      content: [
        { type: "image", image },
        { type: "text", text: "Extract only structured data from the receipt image. Do not guess unreadable values; lower confidence instead." },
      ],
    }],
  });
  // object has passed validation against Receipt, so it is safe and typed. usage gives observability into image-token billing.
  return { receipt: object, usage };
}

Whatever the model returns, only a value that passed Receipt flows downstream. This structurally excludes "broken JSON" and "unexpected fields."
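If part of the pipeline runs in Python, as in llama_vision.py, the same boundary check can be sketched with the standard library alone. This is an illustrative analogue of the Receipt schema, not the author's implementation: parse_receipt, Item, and the dataclass layout are hypothetical names, and the field rules simply mirror the Zod definition.

```python
import json
import math
import re
from dataclasses import dataclass
from datetime import date
from typing import Tuple

_CURRENCIES = {"JPY", "USD", "EUR"}


@dataclass(frozen=True)
class Item:
    name: str
    price: float


@dataclass(frozen=True)
class Receipt:
    merchant: str
    total: float
    currency: str
    purchased_at: date
    items: Tuple[Item, ...]
    confidence: float


def _number(value: object, field: str) -> float:
    # bool is a subclass of int in Python, so reject it explicitly; also reject NaN/inf.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or not math.isfinite(value):
        raise ValueError(f"{field} must be a finite number")
    return float(value)


def parse_receipt(raw: str) -> Receipt:
    """Parse model output and raise ValueError unless it matches the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")

    merchant = data.get("merchant")
    if not isinstance(merchant, str) or not merchant:
        raise ValueError("merchant must be a non-empty string")

    total = _number(data.get("total"), "total")
    if total < 0:
        raise ValueError("total must be non-negative")

    currency = data.get("currency")
    if not isinstance(currency, str) or currency not in _CURRENCIES:
        raise ValueError("currency must be one of JPY, USD, EUR")

    purchased = data.get("purchasedAt")
    if not isinstance(purchased, str) or not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", purchased):
        raise ValueError("purchasedAt must be YYYY-MM-DD")
    purchased_at = date.fromisoformat(purchased)  # raises ValueError for impossible dates

    raw_items = data.get("items")
    if not isinstance(raw_items, list) or len(raw_items) > 100:
        raise ValueError("items must be a list of at most 100 entries")
    items = []
    for entry in raw_items:
        if not isinstance(entry, dict) or not isinstance(entry.get("name"), str):
            raise ValueError("each item needs a string name")
        items.append(Item(entry["name"], _number(entry.get("price"), "item price")))

    confidence = _number(data.get("confidence"), "confidence")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")

    return Receipt(merchant, total, currency, purchased_at, tuple(items), confidence)
```

The caller treats any ValueError as "this output never reaches downstream," which is the same guarantee the schema gives on the TypeScript side.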

Protecting values you can't get wrong: a confidence gate and division of labor

This is the heart of production. Values that cause an incident if wrong, such as amounts, model numbers, and personal information, must not be finalized from a single model output. Apply to image extraction the same division of labor as in the voice concierge: numbers come from master matching, not from the LLM.

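// Route the extraction result by "confidence" and "verifiability" (nothing is finalized on the model's word alone)
type Receipt = z.infer<typeof Receipt>;          // static type of the Zod schema; assumes the Receipt schema and z are in scope
type Routed =
  | { status: "auto"; receipt: Receipt }       // high confidence and the arithmetic check passes -> accept automatically
  | { status: "review"; receipt: Receipt };     // low confidence or the arithmetic check fails -> human review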

function route(receipt: Receipt): Routed {
  const sumOfItems = receipt.items.reduce((a, b) => a + b.price, 0);
  const arithmeticOk = Math.abs(sumOfItems - receipt.total) < 1; // arithmetic check: sum of line items vs. total
  const confident = receipt.confidence >= 0.9;
  return confident && arithmeticOk ? { status: "auto", receipt } : { status: "review", receipt };
}

The point is to back the result with "model-independent truth," here the arithmetic check (sum of line items = total). The LLM does the work of reading, and code guarantees correctness. This narrows human review from "everything" to "only the suspicious portion," reconciling quality and cost.
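Master matching, the other form of model-independent truth mentioned in this section, can be sketched in the same spirit. This is a minimal Python illustration under stated assumptions: the master is a plain list of canonical model numbers, and normalize_code and match_master are hypothetical names. It deliberately does no fuzzy matching, so a near miss is sent to review instead of being "corrected" to a plausible neighbor.

```python
import unicodedata
from typing import Optional, Sequence


def normalize_code(text: str) -> str:
    """Fold character width and case, then drop spaces, hyphens and other separators."""
    folded = unicodedata.normalize("NFKC", text).upper()
    return "".join(ch for ch in folded if ch.isalnum())


def match_master(extracted: str, master: Sequence[str]) -> Optional[str]:
    """Return the canonical master entry for `extracted`, or None if there is no unique match."""
    key = normalize_code(extracted)
    if not key:
        return None
    hits = [entry for entry in master if normalize_code(entry) == key]
    # Zero hits or an ambiguous master both mean "do not auto-accept".
    return hits[0] if len(hits) == 1 else None
```

A return value of None would map to the "review" status; a hit returns the master's own spelling, so the value stored downstream never comes from the model's characters.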

Pitfalls and countermeasures

Hallucinated fields: the model fills an unreadable field with something plausible. → State "no guessing; unknown is null or low confidence" in the prompt and set temperature=0. Allow nullable fields in the schema so that "empty" can be expressed.
Low resolution, skew, shadows: reading accuracy drops. → Add a preprocessing stage (resize, rotation correction, contrast), and use it to bring images under the 8000px / 3.75MB constraints as well.
Multi-page: cramming too much into one request. → Split per page, extract per page, then merge the results.
Image-token billing: images consume many input tokens. → Lower the resolution to just what is sufficient, and route only hard images to a higher model (cost design).
No evaluation: judging by how the output looks. → Measure field-match rate (precision/recall) against a labeled evaluation set, and ship only after checking for regressions.
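The field-match rate in the last item can be computed with a few lines of Python. This is a sketch under one stated assumption about the definitions: a prediction counts as correct only when it is non-null and exactly equals the label, precision is taken over non-null predictions, and recall over non-null labels. The function name field_scores and the record layout are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_scores(predictions: List[Record], labels: List[Record], field: str) -> Dict[str, float]:
    """Precision and recall for one field over a labeled evaluation set."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = predicted = expected = 0
    for pred, gold in zip(predictions, labels):
        p, g = pred.get(field), gold.get(field)
        if p is not None:
            predicted += 1  # the model committed to a value
        if g is not None:
            expected += 1   # the label says a value exists
        if p is not None and p == g:
            correct += 1
    precision = correct / predicted if predicted else 0.0
    recall = correct / expected if expected else 0.0
    return {"precision": precision, "recall": recall}
```

Running this per field on every model or prompt change, and comparing against the previous run, is one simple way to detect regressions before shipping. A model that returns null instead of guessing lowers recall but keeps precision high, which is the trade-off this article argues for.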

Security: an image is a mass of PII

Invoices, business cards, and identity documents are personal information itself. Build protection into the design.

Don't log the image bytes or the extracted body: metadata alone (model, token count, confidence, processing time) is enough for observability.
Least privilege: keep image storage (S3, etc.) under least privilege, encrypted, and accessed via short-lived URLs. Delete after processing per the retention policy.
Masking: pass number-type values downstream only to the minimum extent necessary. Mask them in the display UI, and do so accessibly: don't expose all digits in read-aloud either.
Input validation: validate file format, size, and count at the boundary before feeding anything in (the same discipline as type safety).
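For the format part of input validation, one option is to look at the file's leading bytes instead of trusting the extension, since read_image above derives the format from the filename. A minimal sketch using only the standard library; sniff_format is a hypothetical helper, and the signatures are the standard magic numbers for the four formats Converse accepts.

```python
from typing import Optional


def sniff_format(data: bytes) -> Optional[str]:
    """Return the Converse format name inferred from the file signature, or None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container: "RIFF", 4 bytes of length, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None
```

A file whose sniffed format is None, or differs from what the extension claims, can be rejected at the boundary before any bytes reach the model.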

The extracted structured data ultimately flows to a screen that people use. The extraction pipeline becomes a "product" only once masking, labels, and error messages are designed to be accessible.
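As one way to picture masking that also covers read-aloud, here is a small Python sketch. The function names, the choice to show the last four characters, and the label wording are all assumptions for illustration; the visual string and the screen-reader text are produced separately so that neither exposes the full number.

```python
def mask_number(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `visible` characters; mask everything if the value is too short."""
    if visible < 0:
        raise ValueError("visible must be non-negative")
    if visible == 0 or len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def spoken_label(value: str, visible: int = 4) -> str:
    """Text for screen readers that announces only the visible tail, never the full number."""
    tail = value[-visible:] if 0 < visible < len(value) else ""
    return f"number ending in {' '.join(tail)}" if tail else "number hidden"
```

For example, mask_number("1234567890") returns "******7890" and spoken_label("1234567890") returns "number ending in 7 8 9 0". The spoken text would go into the UI's accessible label, so a screen reader does not read out a row of asterisks.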

FAQ

Q. How is it different from a dedicated OCR service? A. Traditional OCR stops at "read the characters." Llama 4 goes end to end: read, understand the meaning, and return it structured. It is strong on invoices with diverse layouts and on context-dependent field extraction. That said, just as with OCR, it remains essential to back up the values you can't afford to get wrong with an arithmetic check or master matching.

Q. Which model should I use? A. Maverick is the solid choice for image understanding. The cost-effective approach is routing: keep costs down with Scout for high-volume, easy images and send only hard ones to a higher model.
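The routing idea can be sketched as "try the cheaper model, escalate on low confidence." In the Python sketch below, the two extractors are injected as plain callables so that no model IDs or SDK calls are assumed; extract_with_escalation, the 0.9 threshold, and the dictionary shape are hypothetical.

```python
from typing import Callable, Dict, Tuple

Extraction = Dict[str, object]            # expected to include a numeric "confidence" key
Extractor = Callable[[bytes], Extraction]


def extract_with_escalation(
    image: bytes,
    cheap: Extractor,
    strong: Extractor,
    threshold: float = 0.9,
) -> Tuple[str, Extraction]:
    """Run the cheaper extractor first and call the stronger one only when confidence is low."""
    first = cheap(image)
    confidence = first.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if is_number and confidence >= threshold:
        return "cheap", first
    # Missing, malformed, or low confidence: pay for the stronger model.
    return "strong", strong(image)
```

Self-reported confidence is a weak signal on its own, so the result of either path would still go through the arithmetic check and review gate described earlier.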

Q. Can it read Japanese invoices? A. It can. But accuracy drops for handwriting, low resolution, and special fonts, so always prepare preprocessing, a confidence gate, and a human-review path.

Q. How many images can I pass at once? A. The official constraint is up to 20 images per message, each at most 3.75MB and 8000px. Bring multi-page or high-resolution input within those limits with page splitting and resizing.

Q. Can self-hosting do image understanding too? A. It can. Serving Scout/Maverick with vLLM handles images in a similar way, via OpenAI-compatible image input. This is the option for KYC and other cases where data can't leave your environment.
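For reference, "OpenAI-compatible image input" means the image travels as an image_url content part, typically a base64 data URL. The Python sketch below only builds the request body and sends nothing; build_chat_payload is a hypothetical helper, and the model name and server URL you would use with it are deployment-specific assumptions.

```python
import base64
from typing import Any, Dict


def build_chat_payload(image: bytes, mime: str, instruction: str, model: str) -> Dict[str, Any]:
    """Build an OpenAI-style chat completion request body with one inline image."""
    encoded = base64.b64encode(image).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,  # same baseline as the Bedrock example: no creativity for extraction
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    }
```

The resulting dictionary would be sent as JSON to the server's /v1/chat/completions endpoint, with model set to whatever name the vLLM server was started with. Unlike the Bedrock SDK, the caller does the base64 encoding here.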

Summary

Llama 4's multimodality is a practical tool to "turn paper and screenshots into data the downstream can use." The key isn't flashy usage but discipline.

Feed via Converse's image block (format + bytes, temperature 0).
Validate at the boundary with Zod to structurally exclude broken output.
Back up values you can't get wrong with an arithmetic check/master matching, and don't auto-finalize.
Build in PII protection (no logging, least privilege, masking).
Measure accuracy with an evaluation set, and ship after watching for regressions.

If you want to put image understanding for invoices, identity verification, or drawings into production, including arithmetic checks, human review, PII protection, and cost design, see my track record and get in touch. One person working with generative AI: fast, cheap, and safe.

Sources / official resources

The Llama 4 herd (Meta AI) — native multimodal
Bedrock Converse API (image input) — image-block format and constraints
boto3 converse reference
Vercel AI SDK — generateObject and image parts

Models, constraints, and pricing change over time. Confirm primary sources before implementing, and verify accuracy with a labeled evaluation set.
