Skip to main content
友田 陽大
Google Cloud Run in production
GCP
Cloud Run
Cloud Workflows
バッチ処理
冪等性
サーバーレス
信頼性
回復性

Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable

An implementation guide to building processing unsuited to HTTP (batch, long-running jobs, parallel processing) at production quality with Cloud Run Jobs and Cloud Workflows. It explains, in gcloud/YAML/Python real code: sharding with --tasks/--parallelism, splitting with CLOUD_RUN_TASK_INDEX, idempotent/resumable design with deterministic IDs, cron execution with Cloud Scheduler and event-driven with Eventarc, and Workflows' parallelism, retries, and error handling.

Published
Reading time
8 min read
Author
友田 陽大
Share

"Process a video," "convert tens of thousands in a batch," "bulk-classify with an LLM" — holding such processing in synchronous HTTP always breaks down. It exceeds the request timeout (Cloud Run is max 60 minutes), processing can't be stopped even when the client disconnects, and retries cause multiple execution. The right answer is to "separate reception (Service) and execution (Job/Workflow)."

On the broadcaster platform, I built a telop-typo-detection pipeline — heavy processing that extracts subtitles from video and matches OCR (image) and speech recognition (utterance) to detect typos — in exactly this form. The HTTP API just starts the job and returns a reception ID, and the heavy processing runs in parallel with Cloud Run Jobs + Cloud Workflows. Parallelizing OCR and speech recognition achieved sequential 18 min → parallel 13 min (about 30% shorter), and I designed the long-running job to be idempotent and resumable by numbering segment IDs deterministically.

This article reproduces that design faithfully to the official documentation. For the production operation of Services themselves, see the Cloud Run production-operations guide.


First, classify: Services / Jobs / Workflows / Worker Pools

Just putting the processing in the right box decides 80% of the design.

BoxNatureWhere to use
ServicesHandle synchronous HTTPAPI, web app, webhook
JobsRun and stop when finishedBatch, DB migration, bulk conversion, long-running processing
WorkflowsOrchestration of multiple stepsBundle jobs/services/APIs by order, parallel, condition
Worker PoolsResident backgroundPub/Sub pull, Kafka consumer

"Heavy → Job," "bundling multiple → Workflow" — do this primary classification first.


Cloud Run Jobs: handle in parallel by task splitting

A job can split one job into up to 10,000 tasks and run them in parallel. The main flags are —

FlagMeaningDefault/limit
--tasksNumber of tasks (how many splits of the work)Default 1, max 10,000
--parallelismNumber of tasks to run simultaneouslyDefault is as parallel as possible
--max-retriesRetry count for a failed taskDefault 3, max 10
--task-timeoutMax execution time of one taskDefault 10 min, max 168 hours (7 days)

Each task knows "which number it is in the whole" via environment variables.

  • CLOUD_RUN_TASK_INDEX: this task's number (0 to task count - 1)
  • CLOUD_RUN_TASK_COUNT: the total task count of the whole job
  • CLOUD_RUN_TASK_ATTEMPT: this task's attempt count (increases on retry)

A task succeeds with exit code 0. On failure it's auto-retried up to --max-retries, and when a task exhausts its retries, the entire job execution fails.

Sharding: decide the assigned range with INDEX

The typical pattern of "process 10,000 items in parallel with 100 tasks." Each task computes its assigned slice deterministically from INDEX.

import os

task_index = int(os.environ["CLOUD_RUN_TASK_INDEX"])   # 0..count-1
task_count = int(os.environ["CLOUD_RUN_TASK_COUNT"])    # 例: 100

items = load_all_item_ids()        # 全件(外部ストアから取得)
# 自分の担当だけを処理(重複なく・漏れなく全体をカバー)
my_items = items[task_index::task_count]

for item_id in my_items:
    process_one(item_id)           # ← この1件が冪等であることが命(後述)
# 100タスク・最大20並列・各タスク最大1時間でデプロイして実行
gcloud run jobs deploy batch-convert \
  --image asia-northeast1-docker.pkg.dev/PROJECT_ID/app/batch:${GIT_SHA} \
  --region asia-northeast1 \
  --tasks 100 --parallelism 20 \
  --max-retries 3 --task-timeout 3600s \
  --service-account batch@PROJECT_ID.iam.gserviceaccount.com
gcloud run jobs execute batch-convert --region asia-northeast1 --wait

Idempotency and resumability: the heart of production quality

Design a job on the premise that it can always be retried (task failure, SIGTERM, re-execution). So unless it's "the result doesn't change even if the same processing runs twice (idempotent)," data breaks. This is the same principle I enforced on the payment platform with 0 double charges in production.

Two standard ways to make it idempotent:

  1. Deterministic ID numbering: globally uniquify the task-local ID using INDEX. The ID is stable across parallel execution and re-execution.
# セグメント単位に「グローバルID = INDEX × stride + local_id」で決定的に採番。
# 並列処理しても後段マージでIDが衝突せず、再実行でも同じIDに収束する。
STRIDE = 100_000
def global_id(local_id: int) -> int:
    return task_index * STRIDE + local_id
  1. Make writes idempotent: make the output a form that can be "overwritten/ignored if it exists." For object storage, overwrite to a deterministic object name; for a DB, UPSERT + a unique constraint.
# 出力先を決定的なパスにする → 再実行は同じ場所を上書きするだけ(重複生成しない)
output_path = f"results/{global_id(local_id)}.json"
storage_client.bucket("outputs").blob(output_path).upload_from_string(payload)

In my telop-typo detection, I numbered, per segment, global telop ID = segment_index × stride + local_id deterministically, so the final result converges uniquely under any of parallel processing, partial re-execution, and retries. Not "carefully controlling retries" but structurally guaranteeing "it doesn't break even when retried" — this is the essence of resilience. For the general theory of idempotent async processing, SQS/Lambda idempotent processing is also a reference.


Triggering: scheduled execution and event-driven

Cloud Scheduler (cron)

Trigger periodic batches from Cloud Scheduler. Since running a job hits the Cloud Run Admin API's jobs:run, use OAuth auth (--oauth-service-account-email) (separate from the OIDC for service calls).

# 毎日午前2時にジョブを起動。SAには roles/run.invoker が必要。
gcloud scheduler jobs create http nightly-batch \
  --location asia-northeast1 \
  --schedule="0 2 * * *" \
  --uri="https://run.googleapis.com/v2/projects/PROJECT_ID/locations/asia-northeast1/jobs/batch-convert:run" \
  --http-method POST \
  --oauth-service-account-email scheduler@PROJECT_ID.iam.gserviceaccount.com

Eventarc (event-driven)

"Process when a file is uploaded" goes to Eventarc. It receives events like GCS finalization to start a service/job.

# GCSへのアップロード(finalized)を受けてスキャナサービスを起動
gcloud eventarc triggers create scan-on-upload \
  --location asia-northeast1 \
  --destination-run-service malware-scanner \
  --event-filters "type=google.cloud.storage.object.v1.finalized" \
  --event-filters "bucket=uploads-raw" \
  --service-account eventarc@PROJECT_ID.iam.gserviceaccount.com

My malware scanner received GCS uploads with Eventarc, passed them to ClamAV (Cloud Run), and stream-inspected up to 10 GiB without buffering, sorting into clean/quarantine buckets. It's idempotent to retries via the atomicity of File.move, and safely ignores zero-length, in-progress, and deleted files. A typical case of event-driven × idempotent.


Cloud Workflows: bundle multiple steps

"After job A, run B in parallel, aggregate the results and call C" — Cloud Workflows handles such orchestration. Declare steps in YAML/JSON, billed per step execution (free when idle). It supports HTTP calls and Google Cloud connectors (including Cloud Run), and can hold state and wait for up to 1 year.

# workflow.yaml — OCRと音声認識を並列実行し、結果を照合する(テロップ誤字検出の骨子)
main:
  params: [input]
  steps:
    - extract_subtitles:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/extract"}
          auth: { type: OIDC }   # 認証付きCloud Runサービスを安全に呼ぶ
          body: { video: ${input.video} }
        result: frames

    # ★ OCRと音声認識は相互依存しない → 並列実行で時間短縮
    - analyze_in_parallel:
        parallel:
          shared: [ocr_result, asr_result]
          branches:
            - ocr_branch:
                steps:
                  - run_ocr:
                      call: http.post
                      args:
                        url: ${"https://ocr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { frames: ${frames.body} }
                      result: ocr_result
            - asr_branch:
                steps:
                  - run_asr:
                      call: http.post
                      args:
                        url: ${"https://asr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { video: ${input.video} }
                      result: asr_result

    - reconcile:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/reconcile"}
          auth: { type: OIDC }
          body: { ocr: ${ocr_result.body}, asr: ${asr_result.body} }
        result: report
    - done:
        return: ${report.body}

Retries and error handling

For unstable external calls, layer retries with exponential backoff declaratively.

    - call_flaky_service:
        try:
          call: http.post
          args:
            url: ${"https://flaky-xxxxx.a.run.app/run"}
            auth: { type: OIDC }
          result: r
        retry:
          predicate: ${http.default_retry_predicate}   # 5xx等で再試行
          max_retries: 5
          backoff: { initial_delay: 1, max_delay: 60, multiplier: 2 }
        except:
          as: e
          steps:
            - handle:
                call: sys.log
                args: { text: ${"failed: " + e.message}, severity: ERROR }

Make it fast with parallel (parallel) and hard to break with try/retry/except — these are Workflows' two big values.


When Jobs / when Workflows / when Worker Pools

  • Handle a single heavy process in parallelJobs (shard with --tasks).
  • Bundle multiple processes by order, parallel, condition / wait a long timeWorkflows (orchestrator).
  • Continuously pull a queueWorker Pools (Pub/Sub pull, Kafka).
  • Many production systems are a combination (my pipeline is "Workflows orchestrates the whole, each heavy process is a Job/Service, the entrance is Eventarc").

Production-rollout checklist

  • Separated processing unsuited to HTTP from Services into Jobs/Workflows
  • Tasks split the assigned range deterministically with CLOUD_RUN_TASK_INDEX
  • Each task's unit processing is idempotent (deterministic ID + UPSERT/overwrite)
  • Designed on the premise of retries (--max-retries / Workflows' retry)
  • Set --task-timeout to match the processing (max 7 days)
  • Trigger with Cloud Scheduler (cron, OAuth) or Eventarc (event)
  • Multiple steps with Workflows (parallel with parallel, recover with try/except)
  • Progress/results are observable (structured logs, metrics, a progress store if needed)

Conclusion: heavy processing is "separate, make idempotent, and bundle"

Cloud Run production design starts from "putting processing unsuited to HTTP in the right box." Separate heavy processing into Jobs, handle it in parallel by task splitting, make it idempotent and resumable with deterministic IDs, and bundle order, parallelism, and resilience with Workflows. With this, video processing and large-scale batch alike run without stopping, without breaking, and fast.

I could build "completion, resumption, idempotency" into a broadcaster's production workflow because of this decomposition. For the overall design, the Cloud Run production-operations guide; for CI/CD, the Cloud Run CI/CD guide; for cost, the concurrency/billing guide. If you're troubled by the design of a batch/async platform, I accompany you through implementation.

友田

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.

I can take on the implementation from this article as an engagement

GCP / Cloud Run container platforms, from design to production and cost optimization

Building container platforms on Cloud Run (services + jobs), migration from AWS/on-prem, keyless CI/CD via Workload Identity, defense-in-depth with Cloud Armor and least privilege, and cost optimization of concurrency and the billing model. With experience building and operating a broadcaster platform on GCP with IaC, I deliver fast, cheap, and secure.

Available for both project-based (contract) and advisory engagements. Start with a free 30-minute consult.

Also worth reading