# Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable

> An implementation guide to building processing unsuited to HTTP (batch, long-running jobs, parallel processing) at production quality with Cloud Run Jobs and Cloud Workflows. It explains, in gcloud/YAML/Python real code: sharding with --tasks/--parallelism, splitting with CLOUD_RUN_TASK_INDEX, idempotent/resumable design with deterministic IDs, cron execution with Cloud Scheduler and event-driven with Eventarc, and Workflows' parallelism, retries, and error handling.

- Published: 2026-06-28
- Author: 友田 陽大
- Tags: GCP, Cloud Run, Cloud Workflows, バッチ処理, 冪等性, サーバーレス, 信頼性, 回復性
- URL: https://tomodahinata.com/en/blog/google-cloud-run-jobs-workflows-batch-async-idempotent-guide
- Category: Google Cloud Run in production
- Pillar guide: https://tomodahinata.com/en/blog/google-cloud-run-production-guide

## Key points

- Services handle synchronous HTTP. Separate processing that runs and finishes into Cloud Run Jobs, orchestration of multiple services into Cloud Workflows, and resident background into Worker Pools.
- Jobs can split one job into up to 10,000 tasks and run them in parallel. Each task knows its assigned range via CLOUD_RUN_TASK_INDEX (0 to count-1)/COUNT/ATTEMPT. Exit code 0 is success; failure is retried up to --max-retries (default 3, max 10).
- The key to production quality is idempotency and resumability. Numbering a global ID deterministically from the task ID makes parallel execution, partial re-execution, and retries converge to a unique final result. Design on the premise of retries.
- Trigger with Cloud Scheduler (cron, OAuth auth for jobs:run) or Eventarc (GCS/Pub/Sub event-driven). --task-timeout is 10 minutes by default and extendable up to 168 hours (7 days).
- Cloud Workflows is a multi-step orchestrator. Parallelize with parallel, declare resilience with exponential backoff using try/retry/except, billed per step execution. There's a real example of shortening processing time by about 30% by parallelizing OCR and speech recognition.

---

"Process a video," "convert tens of thousands in a batch," "bulk-classify with an LLM" — **holding such processing in synchronous HTTP** always breaks down. It exceeds the request timeout (Cloud Run is max 60 minutes), processing can't be stopped even when the client disconnects, and retries cause multiple execution. The right answer is to **"separate reception (Service) and execution (Job/Workflow)."**

On the [broadcaster platform](/case-studies/broadcaster-ai-content-platform), I built a telop-typo-detection pipeline — heavy processing that extracts subtitles from video and matches OCR (image) and speech recognition (utterance) to detect typos — in exactly this form. **The HTTP API just starts the job and returns a reception ID**, and the heavy processing runs **in parallel with Cloud Run Jobs + Cloud Workflows.** Parallelizing OCR and speech recognition achieved **sequential 18 min → parallel 13 min (about 30% shorter)**, and I designed the long-running job to be **idempotent and resumable by numbering segment IDs deterministically.**

This article reproduces that design faithfully to the [official documentation](https://docs.cloud.google.com/run/docs/create-jobs). For the production operation of Services themselves, see the [Cloud Run production-operations guide](/blog/google-cloud-run-production-guide).

---

## First, classify: Services / Jobs / Workflows / Worker Pools

Just putting the processing in the right box decides 80% of the design.

| Box | Nature | Where to use |
|----|------|-------|
| **Services** | Handle synchronous HTTP | API, web app, webhook |
| **Jobs** | Run and **stop when finished** | Batch, DB migration, bulk conversion, long-running processing |
| **Workflows** | **Orchestration** of multiple steps | Bundle jobs/services/APIs by order, parallel, condition |
| **Worker Pools** | **Resident** background | Pub/Sub pull, Kafka consumer |

"Heavy → Job," "bundling multiple → Workflow" — do this primary classification first.

---

## Cloud Run Jobs: handle in parallel by task splitting

A job can **split one job into up to 10,000 tasks** and run them in parallel. The main flags are —

| Flag | Meaning | Default/limit |
|--------|------|----------|
| `--tasks` | Number of tasks (how many splits of the work) | Default 1, max 10,000 |
| `--parallelism` | Number of tasks to run simultaneously | Default is as parallel as possible |
| `--max-retries` | Retry count for a failed task | Default 3, max 10 |
| `--task-timeout` | Max execution time of one task | Default 10 min, max 168 hours (7 days) |

Each task knows "which number it is in the whole" via environment variables.

- **`CLOUD_RUN_TASK_INDEX`**: this task's number (`0` to `task count - 1`)
- **`CLOUD_RUN_TASK_COUNT`**: the total task count of the whole job
- **`CLOUD_RUN_TASK_ATTEMPT`**: this task's attempt count (increases on retry)

**A task succeeds with exit code 0.** On failure it's auto-retried up to `--max-retries`, and **when a task exhausts its retries, the entire job execution fails.**

### Sharding: decide the assigned range with INDEX

The typical pattern of "process 10,000 items in parallel with 100 tasks." Each task computes **its assigned slice deterministically** from `INDEX`.

```python
import os

task_index = int(os.environ["CLOUD_RUN_TASK_INDEX"])   # 0..count-1
task_count = int(os.environ["CLOUD_RUN_TASK_COUNT"])    # 例: 100

items = load_all_item_ids()        # 全件（外部ストアから取得）
# 自分の担当だけを処理（重複なく・漏れなく全体をカバー）
my_items = items[task_index::task_count]

for item_id in my_items:
    process_one(item_id)           # ← この1件が冪等であることが命（後述）
```

```bash
# 100タスク・最大20並列・各タスク最大1時間でデプロイして実行
gcloud run jobs deploy batch-convert \
  --image asia-northeast1-docker.pkg.dev/PROJECT_ID/app/batch:${GIT_SHA} \
  --region asia-northeast1 \
  --tasks 100 --parallelism 20 \
  --max-retries 3 --task-timeout 3600s \
  --service-account batch@PROJECT_ID.iam.gserviceaccount.com
gcloud run jobs execute batch-convert --region asia-northeast1 --wait
```

---

## Idempotency and resumability: the heart of production quality

Design a job on the premise that it **can always be retried** (task failure, SIGTERM, re-execution). So unless it's **"the result doesn't change even if the same processing runs twice (idempotent),"** data breaks. This is the same principle I enforced on [the payment platform with 0 double charges in production](/case-studies/payment-platform-reliability).

**Two standard ways to make it idempotent:**

1. **Deterministic ID numbering**: globally uniquify the task-local ID using `INDEX`. The ID is stable across parallel execution and re-execution.

```python
# セグメント単位に「グローバルID = INDEX × stride + local_id」で決定的に採番。
# 並列処理しても後段マージでIDが衝突せず、再実行でも同じIDに収束する。
STRIDE = 100_000
def global_id(local_id: int) -> int:
    return task_index * STRIDE + local_id
```

2. **Make writes idempotent**: make the output a form that can be "overwritten/ignored if it exists." For object storage, **overwrite to a deterministic object name**; for a DB, **UPSERT + a unique constraint.**

```python
# 出力先を決定的なパスにする → 再実行は同じ場所を上書きするだけ（重複生成しない）
output_path = f"results/{global_id(local_id)}.json"
storage_client.bucket("outputs").blob(output_path).upload_from_string(payload)
```

> In my telop-typo detection, I **numbered, per segment, `global telop ID = segment_index × stride + local_id` deterministically**, so the final result converges uniquely under any of parallel processing, partial re-execution, and retries. Not "carefully controlling retries" but **structurally guaranteeing "it doesn't break even when retried"** — this is the essence of resilience. For the general theory of idempotent async processing, [SQS/Lambda idempotent processing](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide) is also a reference.

---

## Triggering: scheduled execution and event-driven

### Cloud Scheduler (cron)

Trigger periodic batches from Cloud Scheduler. Since running a job hits the Cloud Run Admin API's `jobs:run`, use **OAuth auth (`--oauth-service-account-email`)** (separate from the OIDC for service calls).

```bash
# 毎日午前2時にジョブを起動。SAには roles/run.invoker が必要。
gcloud scheduler jobs create http nightly-batch \
  --location asia-northeast1 \
  --schedule="0 2 * * *" \
  --uri="https://run.googleapis.com/v2/projects/PROJECT_ID/locations/asia-northeast1/jobs/batch-convert:run" \
  --http-method POST \
  --oauth-service-account-email scheduler@PROJECT_ID.iam.gserviceaccount.com
```

### Eventarc (event-driven)

"Process when a file is uploaded" goes to Eventarc. It receives events like GCS finalization to start a service/job.

```bash
# GCSへのアップロード（finalized）を受けてスキャナサービスを起動
gcloud eventarc triggers create scan-on-upload \
  --location asia-northeast1 \
  --destination-run-service malware-scanner \
  --event-filters "type=google.cloud.storage.object.v1.finalized" \
  --event-filters "bucket=uploads-raw" \
  --service-account eventarc@PROJECT_ID.iam.gserviceaccount.com
```

> My malware scanner **received GCS uploads with Eventarc, passed them to ClamAV (Cloud Run), and stream-inspected up to 10 GiB without buffering**, sorting into clean/quarantine buckets. It's idempotent to retries via the atomicity of `File.move`, and safely ignores zero-length, in-progress, and deleted files. A typical case of **event-driven × idempotent.**

---

## Cloud Workflows: bundle multiple steps

"After job A, run B in parallel, aggregate the results and call C" — Cloud Workflows handles such **orchestration.** Declare steps in YAML/JSON, **billed per step execution** (free when idle). It supports HTTP calls and Google Cloud connectors (including Cloud Run), and can **hold state and wait for up to 1 year.**

```yaml
# workflow.yaml — OCRと音声認識を並列実行し、結果を照合する（テロップ誤字検出の骨子）
main:
  params: [input]
  steps:
    - extract_subtitles:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/extract"}
          auth: { type: OIDC }   # 認証付きCloud Runサービスを安全に呼ぶ
          body: { video: ${input.video} }
        result: frames

    # ★ OCRと音声認識は相互依存しない → 並列実行で時間短縮
    - analyze_in_parallel:
        parallel:
          shared: [ocr_result, asr_result]
          branches:
            - ocr_branch:
                steps:
                  - run_ocr:
                      call: http.post
                      args:
                        url: ${"https://ocr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { frames: ${frames.body} }
                      result: ocr_result
            - asr_branch:
                steps:
                  - run_asr:
                      call: http.post
                      args:
                        url: ${"https://asr-xxxxx.a.run.app/run"}
                        auth: { type: OIDC }
                        body: { video: ${input.video} }
                      result: asr_result

    - reconcile:
        call: http.post
        args:
          url: ${"https://api-xxxxx.a.run.app/reconcile"}
          auth: { type: OIDC }
          body: { ocr: ${ocr_result.body}, asr: ${asr_result.body} }
        result: report
    - done:
        return: ${report.body}
```

### Retries and error handling

For unstable external calls, layer **retries with exponential backoff** declaratively.

```yaml
    - call_flaky_service:
        try:
          call: http.post
          args:
            url: ${"https://flaky-xxxxx.a.run.app/run"}
            auth: { type: OIDC }
          result: r
        retry:
          predicate: ${http.default_retry_predicate}   # 5xx等で再試行
          max_retries: 5
          backoff: { initial_delay: 1, max_delay: 60, multiplier: 2 }
        except:
          as: e
          steps:
            - handle:
                call: sys.log
                args: { text: ${"failed: " + e.message}, severity: ERROR }
```

**Make it fast with parallel (`parallel`) and hard to break with `try`/`retry`/`except`** — these are Workflows' two big values.

---

## When Jobs / when Workflows / when Worker Pools

- **Handle a single heavy process in parallel** → **Jobs** (shard with `--tasks`).
- **Bundle multiple processes by order, parallel, condition / wait a long time** → **Workflows** (orchestrator).
- **Continuously pull a queue** → **Worker Pools** (Pub/Sub pull, Kafka).
- Many production systems are a **combination** (my pipeline is "Workflows orchestrates the whole, each heavy process is a Job/Service, the entrance is Eventarc").

---

## Production-rollout checklist

- [ ] **Separated processing unsuited to HTTP from Services into Jobs/Workflows**
- [ ] Tasks split the assigned range deterministically with **`CLOUD_RUN_TASK_INDEX`**
- [ ] Each task's unit processing is **idempotent** (deterministic ID + UPSERT/overwrite)
- [ ] Designed **on the premise of retries** (`--max-retries` / Workflows' `retry`)
- [ ] Set `--task-timeout` to match the processing (max 7 days)
- [ ] Trigger with **Cloud Scheduler (cron, OAuth)** or **Eventarc (event)**
- [ ] Multiple steps with **Workflows** (parallel with `parallel`, recover with `try/except`)
- [ ] Progress/results are **observable** (structured logs, metrics, a progress store if needed)

---

## Conclusion: heavy processing is "separate, make idempotent, and bundle"

Cloud Run production design starts from **"putting processing unsuited to HTTP in the right box."** Separate heavy processing into Jobs, handle it in parallel by task splitting, make it idempotent and resumable with deterministic IDs, and bundle order, parallelism, and resilience with Workflows. With this, **video processing and large-scale batch alike run without stopping, without breaking, and fast.**

I could build "completion, resumption, idempotency" into a broadcaster's production workflow because of this decomposition. For the overall design, the [Cloud Run production-operations guide](/blog/google-cloud-run-production-guide); for CI/CD, the [Cloud Run CI/CD guide](/blog/google-cloud-run-cicd-cloud-build-github-actions-workload-identity-blue-green-canary-guide); for cost, the [concurrency/billing guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide). If you're troubled by the design of a batch/async platform, I accompany you through implementation.
