Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable

"Process a video," "convert tens of thousands in a batch," "bulk-classify with an LLM" — holding such processing in synchronous HTTP always breaks down. It exceeds the request timeout (Cloud Run is max 60 minutes), processing can't be stopped even when the client disconnects, and retries cause multiple execution. The right answer is to "separate reception (Service) and execution (Job/Workflow)."

On the broadcaster platform, I built a telop-typo-detection pipeline — heavy processing that extracts subtitles from video and matches OCR (image) and speech recognition (utterance) to detect typos — in exactly this form. The HTTP API just starts the job and returns a reception ID, and the heavy processing runs in parallel with Cloud Run Jobs + Cloud Workflows. Parallelizing OCR and speech recognition achieved sequential 18 min → parallel 13 min (about 30% shorter), and I designed the long-running job to be idempotent and resumable by numbering segment IDs deterministically.

This article reproduces that design faithfully to the official documentation. For the production operation of Services themselves, see the Cloud Run production-operations guide.

First, classify: Services / Jobs / Workflows / Worker Pools

Just putting the processing in the right box decides 80% of the design.

Box	Nature	Where to use
Services	Handle synchronous HTTP	API, web app, webhook
Jobs	Run and stop when finished	Batch, DB migration, bulk conversion, long-running processing
Workflows	Orchestration of multiple steps	Bundle jobs/services/APIs by order, parallel, condition
Worker Pools	Resident background	Pub/Sub pull, Kafka consumer

"Heavy → Job," "bundling multiple → Workflow" — do this primary classification first.

Cloud Run Jobs: handle in parallel by task splitting

A job can split one job into up to 10,000 tasks and run them in parallel. The main flags are —

Flag	Meaning	Default/limit
`--tasks`	Number of tasks (how many splits of the work)	Default 1, max 10,000
`--parallelism`	Number of tasks to run simultaneously	Default is as parallel as possible
`--max-retries`	Retry count for a failed task	Default 3, max 10
`--task-timeout`	Max execution time of one task	Default 10 min, max 168 hours (7 days)

Each task knows "which number it is in the whole" via environment variables.

CLOUD_RUN_TASK_INDEX: this task's number (0 to task count - 1)
CLOUD_RUN_TASK_COUNT: the total task count of the whole job
CLOUD_RUN_TASK_ATTEMPT: this task's attempt count (increases on retry)

A task succeeds with exit code 0. On failure it's auto-retried up to --max-retries, and when a task exhausts its retries, the entire job execution fails.

Sharding: decide the assigned range with INDEX

The typical pattern of "process 10,000 items in parallel with 100 tasks." Each task computes its assigned slice deterministically from INDEX.

import os

task_index = int(os.environ["CLOUD_RUN_TASK_INDEX"])   # 0..count-1
task_count = int(os.environ["CLOUD_RUN_TASK_COUNT"])    # 例: 100

items = load_all_item_ids()        # 全件（外部ストアから取得）
# 自分の担当だけを処理（重複なく・漏れなく全体をカバー）
my_items = items[task_index::task_count]

for item_id in my_items:
    process_one(item_id)           # ← この1件が冪等であることが命（後述）

# 100タスク・最大20並列・各タスク最大1時間でデプロイして実行
gcloud run jobs deploy batch-convert \
  --image asia-northeast1-docker.pkg.dev/PROJECT_ID/app/batch:${GIT_SHA} \
  --region asia-northeast1 \
  --tasks 100 --parallelism 20 \
  --max-retries 3 --task-timeout 3600s \
  --service-account batch@PROJECT_ID.iam.gserviceaccount.com
gcloud run jobs execute batch-convert --region asia-northeast1 --wait

Idempotency and resumability: the heart of production quality

Design a job on the premise that it can always be retried (task failure, SIGTERM, re-execution). So unless it's "the result doesn't change even if the same processing runs twice (idempotent)," data breaks. This is the same principle I enforced on the payment platform with 0 double charges in production.

Two standard ways to make it idempotent:

Deterministic ID numbering: globally uniquify the task-local ID using INDEX. The ID is stable across parallel execution and re-execution.

# セグメント単位に「グローバルID = INDEX × stride + local_id」で決定的に採番。
# 並列処理しても後段マージでIDが衝突せず、再実行でも同じIDに収束する。
STRIDE = 100_000
def global_id(local_id: int) -> int:
    return task_index * STRIDE + local_id

Make writes idempotent: make the output a form that can be "overwritten/ignored if it exists." For object storage, overwrite to a deterministic object name; for a DB, UPSERT + a unique constraint.

# 出力先を決定的なパスにする → 再実行は同じ場所を上書きするだけ（重複生成しない）
output_path = f"results/{global_id(local_id)}.json"
storage_client.bucket("outputs").blob(output_path).upload_from_string(payload)

In my telop-typo detection, I numbered, per segment, global telop ID = segment_index × stride + local_id deterministically, so the final result converges uniquely under any of parallel processing, partial re-execution, and retries. Not "carefully controlling retries" but structurally guaranteeing "it doesn't break even when retried" — this is the essence of resilience. For the general theory of idempotent async processing, SQS/Lambda idempotent processing is also a reference.

Triggering: scheduled execution and event-driven

Cloud Scheduler (cron)

Trigger periodic batches from Cloud Scheduler. Since running a job hits the Cloud Run Admin API's jobs:run, use OAuth auth (--oauth-service-account-email) (separate from the OIDC for service calls).

# 毎日午前2時にジョブを起動。SAには roles/run.invoker が必要。
gcloud scheduler jobs create http nightly-batch \
  --location asia-northeast1 \
  --schedule="0 2 * * *" \
  --uri="https://run.googleapis.com/v2/projects/PROJECT_ID/locations/asia-northeast1/jobs/batch-convert:run" \
  --http-method POST \
  --oauth-service-account-email scheduler@PROJECT_ID.iam.gserviceaccount.com

Eventarc (event-driven)

"Process when a file is uploaded" goes to Eventarc. It receives events like GCS finalization to start a service/job.

# GCSへのアップロード（finalized）を受けてスキャナサービスを起動
gcloud eventarc triggers create scan-on-upload \
  --location asia-northeast1 \
  --destination-run-service malware-scanner \
  --event-filters "type=google.cloud.storage.object.v1.finalized" \
  --event-filters "bucket=uploads-raw" \
  --service-account eventarc@PROJECT_ID.iam.gserviceaccount.com

My malware scanner received GCS uploads with Eventarc, passed them to ClamAV (Cloud Run), and stream-inspected up to 10 GiB without buffering, sorting into clean/quarantine buckets. It's idempotent to retries via the atomicity of File.move, and safely ignores zero-length, in-progress, and deleted files. A typical case of event-driven × idempotent.

Cloud Workflows: bundle multiple steps

"After job A, run B in parallel, aggregate the results and call C" — Cloud Workflows handles such orchestration. Declare steps in YAML/JSON, billed per step execution (free when idle). It supports HTTP calls and Google Cloud connectors (including Cloud Run), and can hold state and wait for up to 1 year.

# workflow.yaml — OCRと音声認識を並列実行し、結果を照合する（テロップ誤字検出の骨子）
main:
  params: [input]
  steps:
    - extract_subtitles:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/extract"}
          auth: { type: OIDC }   # 認証付きCloud Runサービスを安全に呼ぶ
          body: { video: ${input.video} }
        result: frames

    # ★ OCRと音声認識は相互依存しない → 並列実行で時間短縮
    - analyze_in_parallel:
        parallel:
          shared: [ocr_result, asr_result]
          branches:
            - ocr_branch:
                steps:
                  - run_ocr:
                      call: http.post
                      args:
                        url: ${"https://ocr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { frames: ${frames.body} }
                      result: ocr_result
            - asr_branch:
                steps:
                  - run_asr:
                      call: http.post
                      args:
                        url: ${"https://asr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { video: ${input.video} }
                      result: asr_result

    - reconcile:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/reconcile"}
          auth: { type: OIDC }
          body: { ocr: ${ocr_result.body}, asr: ${asr_result.body} }
        result: report
    - done:
        return: ${report.body}

Retries and error handling

For unstable external calls, layer retries with exponential backoff declaratively.

    - call_flaky_service:
        try:
          call: http.post
          args:
            url: ${"https://flaky-xxxxx.a.run.app/run"}
            auth: { type: OIDC }
          result: r
        retry:
          predicate: ${http.default_retry_predicate}   # 5xx等で再試行
          max_retries: 5
          backoff: { initial_delay: 1, max_delay: 60, multiplier: 2 }
        except:
          as: e
          steps:
            - handle:
                call: sys.log
                args: { text: ${"failed: " + e.message}, severity: ERROR }

Make it fast with parallel (parallel) and hard to break with try/retry/except — these are Workflows' two big values.

When Jobs / when Workflows / when Worker Pools

Handle a single heavy process in parallel → Jobs (shard with --tasks).
Bundle multiple processes by order, parallel, condition / wait a long time → Workflows (orchestrator).
Continuously pull a queue → Worker Pools (Pub/Sub pull, Kafka).
Many production systems are a combination (my pipeline is "Workflows orchestrates the whole, each heavy process is a Job/Service, the entrance is Eventarc").

Production-rollout checklist

Separated processing unsuited to HTTP from Services into Jobs/Workflows
Tasks split the assigned range deterministically with CLOUD_RUN_TASK_INDEX
Each task's unit processing is idempotent (deterministic ID + UPSERT/overwrite)
Designed on the premise of retries (--max-retries / Workflows' retry)
Set --task-timeout to match the processing (max 7 days)
Trigger with Cloud Scheduler (cron, OAuth) or Eventarc (event)
Multiple steps with Workflows (parallel with parallel, recover with try/except)
Progress/results are observable (structured logs, metrics, a progress store if needed)

Conclusion: heavy processing is "separate, make idempotent, and bundle"

Cloud Run production design starts from "putting processing unsuited to HTTP in the right box." Separate heavy processing into Jobs, handle it in parallel by task splitting, make it idempotent and resumable with deterministic IDs, and bundle order, parallelism, and resilience with Workflows. With this, video processing and large-scale batch alike run without stopping, without breaking, and fast.

I could build "completion, resumption, idempotency" into a broadcaster's production workflow because of this decomposition. For the overall design, the Cloud Run production-operations guide; for CI/CD, the Cloud Run CI/CD guide; for cost, the concurrency/billing guide. If you're troubled by the design of a batch/async platform, I accompany you through implementation.

Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable

First, classify: Services / Jobs / Workflows / Worker Pools

Cloud Run Jobs: handle in parallel by task splitting

Sharding: decide the assigned range with INDEX

Idempotency and resumability: the heart of production quality

Triggering: scheduled execution and event-driven

Cloud Scheduler (cron)

Eventarc (event-driven)

Cloud Workflows: bundle multiple steps

Retries and error handling

When Jobs / when Workflows / when Worker Pools

Production-rollout checklist

Conclusion: heavy processing is "separate, make idempotent, and bundle"

Google Cloud Run Production-Operations Guide: Container Contract, Concurrency, Auto-Scale, Deploy, Cost, and Security in Real Code

Cloud Run concurrency, autoscaling, billing model, and cost optimization: conquering scale-to-zero and cold starts in real code

Cloud Run CI/CD: keyless, Blue/Green, and canary in real code with Cloud Build / GitHub Actions × Workload Identity

Cloud Run networking and security: defense in depth with Ingress control, IAM auth, Direct VPC egress, and Cloud Armor

Also worth reading

AWS Lambda production-operation guide: firm up the execution model, idempotency, observability, security, and cost with the official spec

Azure Container Apps Jobs implementation guide: production design for batch, schedule (cron), and event-driven

The complete Azure Container Apps autoscaling guide: scale-to-zero and event-driven with KEDA (HTTP, queue, CPU)

First, classify: Services / Jobs / Workflows / Worker Pools

Cloud Run Jobs: handle in parallel by task splitting

Sharding: decide the assigned range with INDEX

Idempotency and resumability: the heart of production quality

Triggering: scheduled execution and event-driven

Cloud Scheduler (cron)

Eventarc (event-driven)

Cloud Workflows: bundle multiple steps

Retries and error handling

When Jobs / when Workflows / when Worker Pools

Production-rollout checklist

Conclusion: heavy processing is "separate, make idempotent, and bundle"

Related articles

Google Cloud Run Production-Operations Guide: Container Contract, Concurrency, Auto-Scale, Deploy, Cost, and Security in Real Code

Cloud Run concurrency, autoscaling, billing model, and cost optimization: conquering scale-to-zero and cold starts in real code

Cloud Run CI/CD: keyless, Blue/Green, and canary in real code with Cloud Build / GitHub Actions × Workload Identity

Cloud Run networking and security: defense in depth with Ingress control, IAM auth, Direct VPC egress, and Cloud Armor

Also worth reading

AWS Lambda production-operation guide: firm up the execution model, idempotency, observability, security, and cost with the official spec

Azure Container Apps Jobs implementation guide: production design for batch, schedule (cron), and event-driven

The complete Azure Container Apps autoscaling guide: scale-to-zero and event-driven with KEDA (HTTP, queue, CPU)