# Llama 4 multimodal in practice: use image understanding for production-grade 'type-safe structured extraction'

> Llama 4 is natively multimodal. This article walks through, with real code, a production pipeline that turns images (invoices, receipts, business cards, drawings, screenshots) into structured data without guessing. It covers AWS Bedrock Converse image input, boundary validation with Zod, a confidence gate, human review, and PII protection.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Llama, Multimodal, Generative AI, AWS Bedrock, OCR, TypeScript, Python
- URL: https://tomodahinata.com/en/blog/llama-4-multimodal-vision-image-understanding-production
- Category: Llama & open-weight LLMs
- Pillar guide: https://tomodahinata.com/en/blog/meta-llama-open-weight-llm-production-guide

## Key points

- Llama 4 is natively multimodal through early fusion: one model handles both text and images, so image-to-structured-data extraction reaches accuracy that is usable in practice.
- Feed images through the Bedrock Converse image block: `format` (png/jpeg/gif/webp) plus raw bytes. The limits are up to 20 images per message, each at most 3.75MB and 8000px. The SDK handles the base64 encoding.
- The iron rule is 'don't make it guess.' Validate the output at the boundary with Zod, and route values you can't afford to get wrong (amounts, model numbers) through a confidence gate plus human review or master matching.
- The pitfalls are hallucinated fields, low resolution, multi-page documents, and image-token billing. Secure quality with preprocessing (rotation correction, resizing) and evaluation.
- Images are dense with PII (identity documents and the like). Design for it from the start: never log the raw bytes, apply least privilege, and mask sensitive values.

---

## The goal of this article

[Llama 4 is natively multimodal](/blog/meta-llama-open-weight-llm-production-guide#llama-4-とは何かネイティブマルチモーダル-moe): it handles text and images **in the same model from the start, not bolted on** (early fusion). This pays off most in the task of **"look at an image and return structured data."** Invoices, receipts, business cards, drawings, screenshots, identity documents: day-to-day operations overflow with "paper and screenshots."

This article takes that image understanding **from demo to production**: don't make the model guess, bind its output with types, protect the values you can't afford to get wrong, and don't leak PII.

> **Where this discipline comes from**: the design of "don't let generation guess numbers; match the values you can't afford to get wrong against a master" is the core philosophy I used to structurally eliminate wrong answers about model numbers and sizes in the [voice-concierge case](/blog/production-voice-ai-sales-agent-bedrock-pgvector). The image extraction in this article follows the same discipline: **the LLM does the reading; verification establishes the truth.**

---

## Where it pays off

| Input image | Structure to extract | Value |
| --- | --- | --- |
| Invoices, receipts | Counterparty, amount, date, line items | Zero manual entry in accounting |
| Business cards | Name, company, title, contact | Auto-registration into CRM |
| Drawings, spec sheets | Model number, dimensions, material | Preprocessing for quotes/orders |
| Screenshots | Error text, UI state | Auto-classification of support/QA |
| Identity documents | Name, date of birth, ID number | KYC (PII protection is a prerequisite) |

The common thread is "**unstructured images → structured data the downstream system can use.**" Close that gap with the LLM's multimodality and manual transcription disappears.

---

## Feeding it in: Bedrock Converse image input

[Bedrock's Converse API](/blog/meta-llama-open-weight-llm-production-guide#使い方baws-bedrock本番フルマネージド) lets you mix an **image block** into a message. An image block consists of `format` (`png` / `jpeg` / `gif` / `webp`) plus `source.bytes`. Since the **AWS SDK handles the base64 encoding**, you just pass raw bytes.

```python
# llama_vision.py: pass an image plus an instruction to Llama 4 and get its reading back as text
import boto3
from botocore.config import Config

_bedrock = boto3.client(
    "bedrock-runtime", region_name="us-east-1",
    config=Config(retries={"max_attempts": 4, "mode": "adaptive"}, read_timeout=60),
)
MODEL_ID = "us.meta.llama4-maverick-17b-instruct-v1:0"  # the higher-tier Maverick is the solid choice for image understanding
SUPPORTED_FORMATS = {"png", "jpeg", "gif", "webp"}  # formats the Converse image block accepts

def read_image(path: str) -> dict:
    with open(path, "rb") as f:
        data = f.read()  # raw bytes are fine; the SDK handles base64
    ext = path.rsplit(".", 1)[-1].lower()
    fmt = "jpeg" if ext == "jpg" else ext  # normalize jpg -> jpeg
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image format: {fmt!r}")  # fail before the API call
    return {"image": {"format": fmt, "source": {"bytes": data}}}

def describe(image_path: str, instruction: str) -> str:
    resp = _bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [read_image(image_path), {"text": instruction}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},  # temperature 0 keeps extraction stable
    )
    return resp["output"]["message"]["content"][0]["text"]
```

> 📌 **Constraints (official)**: up to **20 images per message**, each image **3.75MB and 8000px or less.** Split multi-page invoices **into one image per page** before feeding them. For extraction, use temperature `0.0` as the default: no creativity is needed, and **reading the same result every time** is what quality means here.
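The limits above can be checked before any request is sent. The following is a minimal sketch, not an official client: `sniff_format` and `validate_images` are hypothetical helper names, the byte limit assumes "3.75MB" means 3.75 × 1024 × 1024 bytes, and the 8000px dimension check is left out because it requires decoding the image with an imaging library.

```python
# validate_images.py: a sketch of boundary checks before calling Converse
from typing import Optional

MAX_IMAGES = 20                       # per-message limit quoted above
MAX_BYTES = int(3.75 * 1024 * 1024)   # assumption: "3.75MB" interpreted as MiB

# Leading magic bytes for PNG, JPEG, and GIF (WebP is handled separately).
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

def sniff_format(data: bytes) -> Optional[str]:
    """Detect the format from file contents instead of trusting the extension."""
    for signature, fmt in _SIGNATURES.items():
        if data.startswith(signature):
            return fmt
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None

def validate_images(images: list[bytes]) -> list[dict]:
    """Return Converse image blocks, or raise ValueError on the first violation."""
    if not images:
        raise ValueError("no images supplied")
    if len(images) > MAX_IMAGES:
        raise ValueError(f"too many images: {len(images)} > {MAX_IMAGES}")
    blocks = []
    for index, data in enumerate(images):
        if len(data) > MAX_BYTES:
            raise ValueError(f"image {index} exceeds {MAX_BYTES} bytes")
        fmt = sniff_format(data)
        if fmt is None:
            raise ValueError(f"image {index} is not png/jpeg/gif/webp")
        blocks.append({"image": {"format": fmt, "source": {"bytes": data}}})
    return blocks
```

Sniffing the leading bytes rather than the file extension means a renamed PDF or a truncated upload is rejected locally instead of surfacing as an API error.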

---

## Binding with types: from image to structured data (don't make it guess)

The output of image understanding is also **outside the trust boundary.** Demand **schema-conformant JSON** rather than free text, and pass it downstream only after **boundary validation with Zod.** With the Vercel AI SDK, you can combine an image part with `generateObject`.

```ts
// lib/extract-receipt.ts: receipt image -> type-safe structured data
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

export const Receipt = z.object({
  merchant: z.string().min(1),
  total: z.number().nonnegative(),
  currency: z.enum(["JPY", "USD", "EUR"]),
  purchasedAt: z.string().date(),                 // YYYY-MM-DD
  items: z.array(z.object({ name: z.string(), price: z.number() })).max(100),
  confidence: z.number().min(0).max(1),           // reading confidence, self-reported by the model
});
export type Receipt = z.infer<typeof Receipt>;    // static type for downstream code

export async function extractReceipt(image: Uint8Array) {
  const { object, usage } = await generateObject({
    model: bedrock("us.meta.llama4-maverick-17b-instruct-v1:0"),
    schema: Receipt,
    messages: [{
      role: "user",
      content: [
        { type: "image", image },
        { type: "text", text: "Extract only structured data from the receipt image. Do not guess values you cannot read; lower confidence instead." },
      ],
    }],
  });
  // object has been validated against Receipt, so it is safe to use as a typed value.
  // usage gives observability into image-token billing.
  return { receipt: object, usage };
}
```

**No matter what the model returns, only a value that passed `Receipt` flows downstream.** This structurally excludes "broken JSON" and "unexpected fields."

---

## Protecting values you can't get wrong: a confidence gate and division of labor

This is the heart of production. **You must not finalize "incident-if-wrong" values such as amounts, model numbers, and personal information on the strength of a single model output.** Apply to image extraction the same division of labor used in the voice concierge: "numbers come from master matching, not from the LLM."

```ts
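// Route extraction results by confidence and verifiability (nothing is finalized on the model's word alone)
// Receipt here is the type inferred from the Zod schema defined earlier: z.infer<typeof Receipt>
type Routed =
  | { status: "auto"; receipt: Receipt }       // high confidence and arithmetic check passes -> auto-accept
  | { status: "review"; receipt: Receipt };     // low confidence or arithmetic mismatch -> human review

function route(receipt: Receipt): Routed {
  const sumOfItems = receipt.items.reduce((a, b) => a + b.price, 0);
  const arithmeticOk = Math.abs(sumOfItems - receipt.total) < 1; // check the line-item sum against the total
  const confident = receipt.confidence >= 0.9;
  return confident && arithmeticOk ? { status: "auto", receipt } : { status: "review", receipt };
}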
```

The point is **backing the result with a "model-independent truth": the arithmetic check (sum of line items = total).** The LLM does the labor of reading, and **code guarantees the correctness.** This narrows human review from "everything" to "**only the suspicious portion**," balancing quality and cost.
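Master matching, the other model-independent check, can be sketched the same way. This is an illustration under assumptions rather than the implementation from the voice-concierge case: `normalize` and `match_against_master` are hypothetical names, the master is a plain in-memory set, and the 0.8 similarity cutoff is an arbitrary starting point. An extracted model number is accepted only on an exact match after normalization; near misses are never auto-corrected and only become suggestions for the human reviewer.

```python
# master_match.py: a sketch of verifying an extracted model number against a master list
import unicodedata
from difflib import get_close_matches

def normalize(model_no: str) -> str:
    """Fold full-width characters to ASCII (NFKC), trim, uppercase, drop spaces."""
    return unicodedata.normalize("NFKC", model_no).strip().upper().replace(" ", "")

def match_against_master(extracted: str, master: set[str]) -> dict:
    """Accept only exact matches; route everything else to review with candidates."""
    canonical = {normalize(entry): entry for entry in master}
    key = normalize(extracted)
    if key in canonical:
        return {"status": "verified", "value": canonical[key]}
    # Close matches are hints for the reviewer, not automatic corrections.
    close = get_close_matches(key, list(canonical), n=3, cutoff=0.8)
    return {
        "status": "review",
        "value": extracted,
        "candidates": [canonical[c] for c in close],
    }
```

For example, with a master of `{"AB-1200X", "CD-55"}`, a full-width reading such as `ａｂ-1200ｘ` normalizes to an exact hit, while `AB-120OX` (letter O instead of zero) goes to review with `AB-1200X` offered as a candidate.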

---

## Pitfalls and countermeasures

- **Hallucinated fields**: the model fills an unreadable field with something "plausible." → State "**no guessing; unknown is null or low confidence**" in the prompt and set `temperature=0`. Allow `nullable` in the schema so that "empty" can be expressed.
- **Low resolution, skew, shadows**: reading accuracy drops. → Add a **preprocessing** stage (resize, rotation correction, contrast). Use the same stage to bring images under the 8000px / 3.75MB limits.
- **Multi-page**: cramming too much into one request. → **Split per page**, extract per page, then merge the results.
- **Image-token billing**: images consume many input tokens. → **Lower the resolution to just what is needed**, and route only hard images to a higher-tier model ([cost design](/blog/llama-inference-cost-optimization-self-host-vs-api#コスト削減レバー効果の大きい順)).
- **No evaluation**: judging by eye. → Measure field-match rate (precision/recall) on **a labeled evaluation set**, and ship only after checking for regressions.
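The field-match measurement in the last item can be sketched as follows. This is one possible convention, stated as an assumption: a field counts as a true positive only on exact equality with the label, a wrong value counts as both a false positive and a false negative, and `None` means "not extracted" or "not present." The names `field_metrics` and `regressed_fields` are hypothetical.

```python
# eval_fields.py: a sketch of per-field precision/recall on a labeled evaluation set

def field_metrics(predictions: list[dict], labels: list[dict], fields: list[str]) -> dict:
    """Compute precision and recall per field over paired prediction/label records."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    metrics = {}
    for field in fields:
        tp = fp = fn = 0
        for pred, gold in zip(predictions, labels):
            p, g = pred.get(field), gold.get(field)
            if p is not None and p == g:
                tp += 1
                continue
            if p is not None:
                fp += 1  # the model emitted a value that does not match the label
            if g is not None:
                fn += 1  # a labeled value was missed or read incorrectly
        metrics[field] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics

def regressed_fields(current: dict, baseline: dict, tolerance: float = 0.01) -> list[str]:
    """List fields whose precision or recall fell more than `tolerance` below baseline."""
    return [
        field for field, base in baseline.items()
        if current[field]["precision"] < base["precision"] - tolerance
        or current[field]["recall"] < base["recall"] - tolerance
    ]
```

Running `regressed_fields` against the metrics of the last shipped version gives a simple release gate: an empty list means no field got measurably worse.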

---

## Security: an image is a mass of PII

Invoices, business cards, and identity documents are **personal information itself.** Build protection into the design.

- **Don't log the byte string or extracted body**: for observability, **metadata alone (model, token count, confidence, processing time)** is enough.
- **Least privilege**: keep image storage (S3, etc.) at least privilege, encrypted, and accessed through short-lived URLs. Delete after processing per the retention policy.
- **Masking**: pass ID-type numbers downstream only as far as necessary. Mask them in the display UI, and do so accessibly (don't expose all digits in read-aloud).
- **Input validation**: **validate the file format, size, and count at the boundary** before feeding ([the discipline of type safety](/blog/typescript-type-safety-discipline-zod-nevererror-no-any)).
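Two of the points above, metadata-only logging and masking, are small enough to sketch directly. The helper names `log_extraction` and `mask_number` are hypothetical, and the `usage` dictionary is assumed to have the `inputTokens` / `outputTokens` shape that the Converse response returns. Neither function ever receives the image bytes or the extracted text.

```python
# pii_safe_logging.py: a sketch of metadata-only logs and masked display values
import json
import logging
import time

logger = logging.getLogger("extraction")

def mask_number(value: str, visible: int = 4) -> str:
    """Show only the last `visible` characters; mask everything else."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

def log_extraction(model_id: str, usage: dict, confidence: float, started_at: float) -> dict:
    """Log metadata only. `started_at` is a time.monotonic() reading taken before the call."""
    record = {
        "model": model_id,
        "input_tokens": usage.get("inputTokens"),
        "output_tokens": usage.get("outputTokens"),
        "confidence": confidence,
        "elapsed_ms": round((time.monotonic() - started_at) * 1000),
    }
    logger.info(json.dumps(record))
    return record
```

Because the logging function's signature has no parameter for the payload, leaking the image or the extracted body into logs requires a deliberate code change rather than a careless `print`.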

> The extracted structured data ultimately flows to **a screen that people use.** The extraction pipeline becomes a "product" only when masking, labels, and error messages are designed to be **accessible.**

---

## FAQ

**Q. How is it different from a dedicated OCR service?**
A. Traditional OCR stops at "read the characters." Llama 4 goes end to end: "**read, understand the meaning, and return it structured.**" It is strong on invoices with diverse layouts and on context-dependent field extraction. Even so, as with OCR, backing the values you can't afford to get wrong with an arithmetic check or master matching remains essential.

**Q. Which model should I use?**
A. **Maverick** is the solid choice for image understanding. The cost-effective approach is **routing**: keep costs down with Scout for high-volume, easy images and send only hard images to the higher-tier model.

**Q. Can it read Japanese invoices?**
A. Yes. But accuracy drops for **handwriting, low resolution, and special fonts**, so always prepare preprocessing, a confidence gate, and a human-review path.

**Q. How many images can I pass at once?**
A. The official limit is **up to 20 images per message, each at most 3.75MB and 8000px.** Bring multi-page or high-resolution inputs under it with page splitting and resizing.

**Q. Can self-hosting do image understanding too?**
A. Yes. [Serving Scout/Maverick with vLLM](/blog/vllm-llama-self-hosting-production-inference-server) handles it in much the same way through OpenAI-compatible image input. This is the option for KYC and similar cases where data cannot leave your environment.
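As a rough illustration of what "OpenAI-compatible image input" looks like, the sketch below builds a chat-completions request with the image embedded as a base64 data URL. The function names, the server address, and the model name passed in are assumptions for illustration; check the model ID and multimodal settings of your own vLLM deployment.

```python
# vllm_image_request.py: a sketch of an OpenAI-compatible image request for a self-hosted server
import base64
import json
import urllib.request

def build_chat_payload(image_bytes: bytes, mime: str, instruction: str, model: str) -> dict:
    """Build a chat-completions body with one image (as a data URL) and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,  # same rule as on Bedrock: extraction should not vary
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    }

def post_chat(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Unlike the Bedrock SDK path, base64 encoding is your responsibility here, and the image travels inside the JSON body, so the same "don't log the request body" rule applies to any HTTP-level logging.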

---

## Summary

Llama 4's multimodality is a practical tool to "**turn paper and screenshots into data the downstream can use.**" The key isn't flashy usage but **discipline.**

1. Feed via **Converse's image block** (format + bytes, temperature 0).
2. **Validate at the boundary with Zod** to structurally exclude broken output.
3. **Back up values you can't get wrong with an arithmetic check/master matching**, and don't auto-finalize.
4. Build in **PII protection** (no logging, least privilege, masking).
5. Measure accuracy with **an evaluation set**, and ship after watching for regressions.

> If you want to put image understanding for invoices, identity verification, or drawings into production, including arithmetic checks, human review, PII protection, and cost design, see my [track record](/case-studies/ai-voice-chatbot) and reach out. **One person × generative AI**: fast, cheap, and safe.

### Sources / official resources

- [The Llama 4 herd (Meta AI)](https://ai.meta.com/blog/llama-4-multimodal-intelligence/) — native multimodal
- [Bedrock Converse API (image input)](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html) — image-block format and constraints
- [boto3 converse reference](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/converse.html)
- [Vercel AI SDK](https://ai-sdk.dev/) — `generateObject` and image parts

* Models, constraints, and pricing change over time. Confirm primary sources before implementation, and verify accuracy on a labeled evaluation set.
