The goal of this article
Llama 4 is natively multimodal — it handles text and images in the same model from the start, not bolted on (early fusion). Where this pays off is the task of "look at an image and return structured data." Invoices, receipts, business cards, drawings, screenshots, identity documents… the field overflows with "paper and screenshots."
This article takes that image understanding from demo to production: don't make the model guess, constrain its output with types, protect the values you can't afford to get wrong, and don't leak PII.
A note on where this comes from: the design of "don't let generation guess numbers; split the work so that values you can't afford to get wrong are matched against a master" is the core philosophy I used to structurally eliminate wrong answers about model numbers and sizes in the voice-concierge case. This article's image extraction is built with the same discipline: the LLM does the reading, and verification establishes the truth.
Where it pays off
| Input image | Structure to extract | Value |
|---|---|---|
| Invoices, receipts | Counterparty, amount, date, line items | Zero manual entry in accounting |
| Business cards | Name, company, title, contact | Auto-registration into CRM |
| Drawings, spec sheets | Model number, dimensions, material | Preprocessing for quotes/orders |
| Screenshots | Error text, UI state | Auto-classification of support/QA |
| Identity documents | Name, date of birth, number | KYC (* PII protection is a premise) |
What these cases share is "unstructured images → structured data the downstream system can use." Cover that step with the LLM's multimodality and manual transcription disappears.
Feeding it in: Bedrock Converse image input
Bedrock's Converse API can mix an image block into a message. The form is format (png / jpeg / gif / webp) + source.bytes. Since the AWS SDK handles the base64 encoding, you just pass raw bytes.
```python
# llama_vision.py: pass an image plus an instruction to Llama 4 and get its reading back as text
import boto3
from botocore.config import Config

_bedrock = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 4, "mode": "adaptive"}, read_timeout=60),
)
MODEL_ID = "us.meta.llama4-maverick-17b-instruct-v1:0"  # the higher-tier Maverick is solid for image understanding


def read_image(path: str) -> dict:
    """Load an image file and wrap it as a Converse image content block."""
    with open(path, "rb") as f:
        data = f.read()  # raw bytes are fine; the SDK handles the base64 encoding
    ext = path.rsplit(".", 1)[-1].lower()
    fmt = "jpeg" if ext == "jpg" else ext  # normalize jpg -> jpeg
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def describe(image_path: str, instruction: str) -> str:
    """Send one image and one instruction; return the model's text answer."""
    resp = _bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [read_image(image_path), {"text": instruction}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},  # temperature 0 keeps extraction stable
    )
    return resp["output"]["message"]["content"][0]["text"]
```
📌 Constraints (official): up to 20 images per message, each image at most 3.75MB and 8000px in height and width. Split multi-page invoices into one image per page before feeding them in. For extraction, temperature 0.0 is the baseline: no creativity is needed, and reading the same way every time is the quality.
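The constraints above can be enforced at the boundary before any request is sent. The following is a minimal sketch, not an official client feature: `validate_images`, `normalize_format`, and the constant names are hypothetical, and "3.75MB" is assumed to mean 3.75 × 1024 × 1024 bytes. It checks count, format, and byte size only; checking the 8000px limit would need an image library and is left out.

```python
# validate_images.py: boundary checks before calling Converse (illustrative sketch)
MAX_IMAGES_PER_MESSAGE = 20
MAX_IMAGE_BYTES = int(3.75 * 1024 * 1024)  # assumption: "3.75MB" read as MiB
ALLOWED_FORMATS = {"png", "jpeg", "gif", "webp"}


def normalize_format(filename: str) -> str:
    """Map a file extension to a Converse image format, or raise ValueError."""
    if "." not in filename:
        raise ValueError(f"no file extension: {filename!r}")
    ext = filename.rsplit(".", 1)[-1].lower()
    fmt = "jpeg" if ext == "jpg" else ext
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported image format: {ext!r}")
    return fmt


def validate_images(images: list[tuple[str, bytes]]) -> list[dict]:
    """Turn (filename, raw bytes) pairs into image blocks, rejecting bad input."""
    if not images:
        raise ValueError("no images given")
    if len(images) > MAX_IMAGES_PER_MESSAGE:
        raise ValueError(f"too many images: {len(images)} > {MAX_IMAGES_PER_MESSAGE}")
    blocks = []
    for filename, data in images:
        fmt = normalize_format(filename)
        if not data:
            raise ValueError(f"empty image: {filename!r}")
        if len(data) > MAX_IMAGE_BYTES:
            raise ValueError(f"image too large: {filename!r} ({len(data)} bytes)")
        blocks.append({"image": {"format": fmt, "source": {"bytes": data}}})
    return blocks
```

Rejecting oversized or unsupported files here produces a clear local error instead of a failed (and possibly retried) API call.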
Binding with types: from image to structured data (don't make it guess)
The output of image understanding is also outside the trust boundary. Demand schema-conformant JSON rather than free text, and flow it downstream only after boundary validation with Zod. With the Vercel AI SDK, you can combine an image part and generateObject.
```ts
// lib/extract-receipt.ts: receipt image -> type-safe structured data
import { bedrock } from "@ai-sdk/amazon-bedrock";
import { generateObject } from "ai";
import { z } from "zod";

const Receipt = z.object({
  merchant: z.string().min(1),
  total: z.number().nonnegative(),
  currency: z.enum(["JPY", "USD", "EUR"]),
  purchasedAt: z.string().date(), // YYYY-MM-DD
  items: z.array(z.object({ name: z.string(), price: z.number() })).max(100),
  confidence: z.number().min(0).max(1), // reading confidence, self-reported by the model
});

export async function extractReceipt(image: Uint8Array) {
  const { object, usage } = await generateObject({
    model: bedrock("us.meta.llama4-maverick-17b-instruct-v1:0"),
    schema: Receipt,
    messages: [{
      role: "user",
      content: [
        { type: "image", image },
        { type: "text", text: "Extract only structured data from the receipt image. Do not guess values you cannot read; lower confidence instead." },
      ],
    }],
  });
  // object has been validated against Receipt and is typed accordingly, so it is safe to pass on.
  // usage gives observability into image-token billing.
  return { receipt: object, usage };
}
```
No matter what it returns, only a value that passed Receipt flows downstream — this structurally excludes "broken JSON" and "unexpected fields."
Protecting values you can't get wrong: a confidence gate and division of labor
This is the heart of production. Values that cause an incident if wrong, such as amounts, model numbers, and personal information, must not be finalized from a single model output. Apply to image extraction the same discipline as the voice concierge's division of labor: "numbers come from master matching, not from the LLM."
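```ts
// Route extraction results by confidence and verifiability (never finalize on the model's word alone).
// Assumes the Receipt schema from lib/extract-receipt.ts is in scope.
type Receipt = z.infer<typeof Receipt>; // the original used the schema constant as a type; derive the type from it

type Routed =
  | { status: "auto"; receipt: Receipt }    // high confidence and arithmetic matches -> adopt automatically
  | { status: "review"; receipt: Receipt }; // low confidence or arithmetic mismatch -> human review

function route(receipt: Receipt): Routed {
  const sumOfItems = receipt.items.reduce((a, b) => a + b.price, 0);
  const arithmeticOk = Math.abs(sumOfItems - receipt.total) < 1; // check line-item sum against the total
  const confident = receipt.confidence >= 0.9;
  return confident && arithmeticOk ? { status: "auto", receipt } : { status: "review", receipt };
}
```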
The point is to back the result with a model-independent truth: the arithmetic check (sum of line items = total). The LLM does the reading, and code guarantees correctness. This narrows human review from "everything" to "only the suspicious portion," reconciling quality and cost.
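Master matching, the other model-independent check named above, can be sketched in the same spirit. This is an illustration under assumptions, not the implementation used in the voice-concierge case: `MODEL_MASTER`, `normalize_model_number`, and `match_model` are hypothetical names, and the master data here is a made-up in-memory dict standing in for a product database.

```python
# master_match.py: confirm an extracted model number against a master (illustrative sketch)
import re
import unicodedata

# Hypothetical master data; in practice this would come from a product database.
MODEL_MASTER = {
    "AB-1200X": {"width_mm": 1200},
    "CD-0450": {"width_mm": 450},
}


def normalize_model_number(raw: str) -> str:
    """Fold full-width characters and case, and drop whitespace and underscores."""
    text = unicodedata.normalize("NFKC", raw).upper()
    return re.sub(r"[\s_]+", "", text)


def match_model(raw: str, master: dict) -> dict:
    """Accept the value only on an exact master hit; otherwise send it to review."""
    key = normalize_model_number(raw)
    if key in master:
        # Downstream uses the master's spelling and spec, not the model's reading.
        return {"status": "auto", "model": key, "spec": master[key]}
    return {"status": "review", "model": None, "raw": raw}
```

The sketch deliberately uses exact matching after normalization. A near miss is not "corrected" to the closest master entry, because a silently wrong model number is precisely the incident this gate exists to prevent.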
Pitfalls and countermeasures
- Hallucinated fields: the model fills an unreadable field with something plausible. → State in the prompt "no guessing; unknown is null/low confidence" and use temperature=0. Allow nullable in the schema to express "empty."
- Low resolution, skew, shadows: reading accuracy drops. → Insert a preprocessing stage (resize, rotation correction, contrast). Use the same preprocessing to bring images under the 8000px / 3.75MB constraints.
- Multi-page: cramming too much into one request. → Split per page, extract per page, and integrate.
- Image-token billing: images consume many input tokens. → Lower the resolution to just enough, and route only hard images to a higher model (cost design).
- No evaluation: judging by appearance. → Measure field-match rate (precision/recall) with a labeled evaluation set, and ship only after checking for regressions.
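The field-match measurement in the last item can be sketched as follows. The scoring convention is an assumption, one common choice rather than the only one: a field counts as correct only on exact equality with the label, a wrong value counts against both precision and recall, and a null prediction counts only against recall. The names `field_scores` and `regressed_fields` are hypothetical.

```python
# eval_fields.py: per-field precision/recall over a labeled set (illustrative sketch)
from collections import defaultdict


def field_scores(predictions: list[dict], labels: list[dict], fields: list[str]) -> dict:
    """Compare extracted records with labeled records, field by field."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pred, gold in zip(predictions, labels):
        for field in fields:
            p, g = pred.get(field), gold.get(field)
            if p is not None and p == g:
                counts[field]["tp"] += 1
            else:
                if p is not None:  # the model asserted a value that is wrong
                    counts[field]["fp"] += 1
                if g is not None:  # a labeled value was not recovered
                    counts[field]["fn"] += 1
    scores = {}
    for field in fields:
        c = counts[field]
        predicted, expected = c["tp"] + c["fp"], c["tp"] + c["fn"]
        scores[field] = {
            "precision": c["tp"] / predicted if predicted else 0.0,
            "recall": c["tp"] / expected if expected else 0.0,
        }
    return scores


def regressed_fields(current: dict, baseline: dict, tolerance: float = 0.01) -> list[str]:
    """List fields whose precision or recall fell below the baseline by more than tolerance."""
    empty = {"precision": 0.0, "recall": 0.0}
    return [
        field
        for field, base in baseline.items()
        if any(current.get(field, empty)[m] < base[m] - tolerance for m in ("precision", "recall"))
    ]
```

Running `regressed_fields` against the scores of the last shipped version gives a simple release gate: an empty list means no field got measurably worse.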
Security: an image is a mass of PII
Invoices, business cards, and identity documents are personal information itself. Build protection into the design.
- Don't log the byte string or extracted body: metadata alone (model, token count, confidence, processing time) is enough for observability.
- Least privilege: keep image storage (S3, etc.) at least privilege, encrypted, and accessed through short-lived URLs. Delete after processing per the retention policy.
- Masking: pass number-type values downstream only to the minimum extent necessary. Mask them in the display UI, and do so accessibly: don't expose all digits in read-aloud either.
- Input validation: validate the file format, size, and count at the boundary before feeding (the discipline of type safety).
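Two of the items above, metadata-only logging and masking, are small enough to sketch directly. The names `SAFE_LOG_KEYS`, `log_record`, and `mask_number` are hypothetical, and the choice of which keys are safe is an assumption that should follow your own data classification.

```python
# pii_guard.py: metadata-only logging and display masking (illustrative sketch)
import json

# Allowlist, not denylist: a field added to the event later is not logged by accident.
SAFE_LOG_KEYS = ("model", "input_tokens", "output_tokens", "confidence", "latency_ms")


def log_record(event: dict) -> str:
    """Serialize only the allowlisted metadata; image bytes and extracted text never appear."""
    return json.dumps({k: event[k] for k in SAFE_LOG_KEYS if k in event}, sort_keys=True)


def mask_number(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all but the last `visible` letters/digits, keeping separators for readability."""
    total = sum(ch.isalnum() for ch in value)
    # Values no longer than `visible` are masked completely rather than shown whole.
    to_mask = total if total <= visible else total - visible
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)
```

For example, `mask_number("1234-5678-9012")` yields `****-****-9012`. The same masked string should feed both the visual label and the text exposed to screen readers, so read-aloud does not reveal more than the screen does.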
The extracted structured data ultimately flows to a screen people use. Only by designing masking, labels, and error expressions to be accessible does the extraction pipeline become a "product."
FAQ
Q. How is it different from a dedicated OCR service? A. Traditional OCR stops at "read the characters." Llama 4 goes end to end: "read, understand the meaning, and return it structured." It's strong at invoices with diverse layouts and at context-dependent field extraction. That said, just as with OCR, it remains essential to back up values you can't afford to get wrong with an arithmetic check or master matching.
Q. Which model should I use? A. Maverick is solid for image understanding. The cost-effective approach is routing: keep cost down by handling high-volume, easy images with Scout and sending only hard images to a higher model.
Q. Can it read Japanese invoices? A. It can. But accuracy drops for handwriting, low resolution, and special fonts, so always prepare preprocessing, a confidence gate, and a human-review path.
Q. How many images can I pass at once? A. The official constraint is up to 20 images per message, each 3.75MB/8000px. Bring multi-page or high-resolution under it with page splitting and resizing.
Q. Can self-hosting do image understanding too? A. It can. Serving Scout/Maverick with vLLM handles it similarly through OpenAI-compatible image input. This is the option for KYC and other cases where data can't leave your environment.
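For the self-hosted route, the OpenAI-compatible chat format carries an image as an `image_url` content part, which can hold a base64 data URL. The sketch below only builds the request body and sends nothing. The helper name `build_chat_payload` is hypothetical, and the model name in the usage note is a placeholder assumption, so use whatever name your vLLM server was started with.

```python
# vllm_payload.py: request body for an OpenAI-compatible chat endpoint (illustrative sketch)
import base64


def build_chat_payload(model: str, image: bytes, mime_type: str, instruction: str) -> dict:
    """Build a chat-completions body with one inline image and one instruction."""
    encoded = base64.b64encode(image).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "model": model,
        "temperature": 0.0,  # same rule as on Bedrock: extraction should not vary
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ],
        }],
    }
```

A call such as `build_chat_payload("meta-llama/Llama-4-Scout-17B-16E-Instruct", png_bytes, "image/png", "Extract the receipt fields.")` returns a dict that would be POSTed as JSON to the server's `/v1/chat/completions` path. Because the image travels inline, nothing has to be uploaded to external storage.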
Summary
Llama 4's multimodality is a practical tool to "turn paper and screenshots into data the downstream can use." The key isn't flashy usage but discipline.
- Feed via Converse's image block (format + bytes, temperature 0).
- Validate at the boundary with Zod to structurally exclude broken output.
- Back up values you can't get wrong with an arithmetic check/master matching, and don't auto-finalize.
- Build in PII protection (no logging, least privilege, masking).
- Measure accuracy with an evaluation set, and ship after watching for regressions.
If you want to put image understanding of invoices, identity verification, or drawings into production, including arithmetic checks, human review, PII protection, and cost design, see my track record and reach out. One person × generative AI: fast, cheap, and safe.
Sources / official resources
- The Llama 4 herd (Meta AI) — native multimodal
- Bedrock Converse API (image input) — image-block format and constraints
- boto3 converse reference
- Vercel AI SDK — generateObject and image parts
- Models, constraints, and pricing are updated. Confirm primary sources before implementation, and verify accuracy with a labeled evaluation.