# Celery + Redis Production-Operations Guide — Async Task Design Faithful to the Official Docs (Idempotency, Retries, Observability)

> A practical guide to designing a production-grade async task queue with Celery 5.6 + Redis. Faithful to the official docs, it explains broker/backend configuration, the visibility_timeout traps, idempotency, autoretry, Beat periodic execution, Canvas workflows, prefetch tuning, observability, and security—with real code and 'when to use / when not to use' decision axes.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Python, Celery, Redis, 非同期処理, タスクキュー, アーキテクチャ設計
- URL: https://tomodahinata.com/en/blog/celery-redis-production-async-task-queue-guide
- Category: Reliability, async & real-time
- Pillar guide: https://tomodahinata.com/en/blog/transactional-outbox-pattern-reliable-event-publishing-guide

## Key points

- Celery + Redis is 'at-least-once' delivery. Write tasks assuming idempotency, and absorb double execution as a normal path
- The Redis broker has 3 traps. Re-delivery on visibility_timeout overrun, forbidding far-future ETAs, and the maxmemory-policy noeviction setting
- For retries, use autoretry + exponential backoff + jitter to prevent the thundering herd, and sharply distinguish transient from permanent failures
- Beat's scheduler is always a single instance. Run multiple and periodic tasks get enqueued in duplicate by the number of instances
- The security essentials are json-only (no pickle), Redis auth/TLS, and passing only IDs—not secrets—in task arguments

---

"Email sending takes 3 seconds and clogs the API response," "the post-payment aggregation batch drags in daytime traffic and falls over," "image conversion done synchronously times out the request"—these are all problems solved by **pushing heavy processing outside the request.** The de-facto standard for that in Python is **Celery**, and the easiest message foundation to start with is **Redis.**

But Celery + Redis is not a framework that "just works by calling `.delay()`." **Put it into production without thinking, and double execution of tasks, message loss, infinite re-delivery, and memory exhaustion happen quietly.** These are all known pitfalls the official docs clearly warn about.

This article explains, **faithful to the official docs of the Celery 5.6 line (the latest stable as of March 2026 is 5.6.3, Python 3.9+ / Python 3.14 supported)**, a design that withstands production operation, with code you can actually write and decision axes for "when and how to use it." I have **designed and led the reliability layer of idempotency, retries, and consistency** on a serverless payment foundation handling real money ([related case study](/case-studies/payment-platform-reliability)). This article translates that discipline of "guarding correctness with structure under distributed processing" into the Celery context.

> The code in the body conforms to the Celery 5.6-line API. Config keys and default values are stated based on the official docs ([docs.celeryq.dev](https://docs.celeryq.dev/en/stable/)), but always confirm on the relevant page of your version before going to production.

## What this article covers

- **When you should and shouldn't use** Celery + Redis (hold a decision axis first)
- A **configuration reference** from minimal to production (with official default values)
- The **3 traps** of using Redis as broker / backend (`visibility_timeout`, key eviction, ETA re-delivery)
- **Idempotency**—designing tasks not to break, premised on "at-least-once" delivery
- A retry strategy with **autoretry / backoff / jitter** (the precise meaning of the official parameters)
- **Periodic execution with Beat**, **workflows with Canvas (chain/group/chord)**
- Production tuning of **prefetch, concurrency, and time limit**
- **Observability** (Flower / inspect) and **security** (json-only, Redis auth, handling secrets)

## 1. Should you even use Celery — decide first

The first thing to do in tech selection is to crush "the reasons not to introduce it" rather than "the reasons to introduce it." Celery is powerful but comes with operational cost (worker, broker, monitoring). Judge suitability with the following table.

| Situation | Recommendation |
| --- | --- |
| Light processing that completes within the request (tens of ms) | **Celery unneeded.** Synchronous is enough. A queue is just added complexity |
| Want to decouple heavy/slow processing from the response (email, PDF, image conversion, external API) | **Celery suited ◎** |
| Periodic batches (daily aggregation, cleanup, reminders) | **Celery Beat suited ◎** (or the OS's cron / a cloud scheduler) |
| Want to handle a lot of I/O waits (concurrent web fetches, etc.) | Possible with Celery. But if `asyncio`-native, also consider **arq / Dramatiq** |
| Strict ordering guarantee / exactly-once is a requirement | **Celery + Redis is unsuited.** As described, "at-least-once" delivery. Consider Kafka or a DB outbox |
| AWS-centric, serverless-oriented | **SQS + Lambda** may save you from holding the infra operation |

### Tradeoffs to know before making Redis your broker

For Celery's broker (the path messages travel), you can choose **RabbitMQ** besides Redis. Redis is overwhelmingly simple to introduce and can double as the result backend, but it has **limits due to being in-memory.**

| Aspect | Redis broker | RabbitMQ broker |
| --- | --- | --- |
| Ease of introduction | ◎ Start with one container, can double as backend | △ Requires understanding AMQP |
| Delivery guarantee | ○ But depends on persistence (AOF/RDB) and config. Risk of unprocessed-message loss on a crash | ◎ Acknowledgments and durable queues are its main job |
| Mass messages / high reliability | △ Beware memory limits and eviction policy | ◎ |
| ETA / long-term scheduling | ✕ The `visibility_timeout` constraint is large (below) | ○ |

**Conclusion:** for small-to-medium scale, early startup, and the "just get it running" stage, Redis is the best choice. This article also assumes Redis. But if "you want to handle messages where loss is unacceptable, like money or inventory, with strong delivery guarantees," use RabbitMQ; if "you want to guarantee crash resilience with Redis," operate it with **AOF persistence enabled.** **Whatever you choose, the idempotency design below is mandatory.** Because Redis is "at-least-once" delivery and double execution is premised on being absorbed by design.

## 2. Minimal configuration — the officially-recommended project layout

Following the officially-recommended layout, define the Celery app in a `proj` package.

```text
proj/
├── __init__.py
├── celery.py        # Celery アプリ本体と設定
└── tasks.py         # タスク定義
```

```python
# proj/celery.py
from celery import Celery

app = Celery(
    "proj",
    broker="redis://localhost:6379/0",       # DB 0 = メッセージ用
    backend="redis://localhost:6379/1",       # DB 1 = 結果用（broker と分離）
    include=["proj.tasks"],
)

# 設定は一箇所に集約（SSoT）。詳細は §4 の本番設定で深掘り。
app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],                  # json 以外を受け付けない（セキュリティ §11）
    timezone="Asia/Tokyo",
    enable_utc=True,
    broker_connection_retry_on_startup=True,  # 起動時に broker が未起動でも再接続を試みる
    result_expires=3600,                       # 結果は1時間で失効（既定は1日）
)
```

```python
# proj/tasks.py
from .celery import app


@app.task
def add(x: int, y: int) -> int:
    return x + y
```

Starting the worker and calling it:

```bash
# worker を起動（並行度は既定で CPU コア数）
$ celery -A proj worker --loglevel=INFO
```

```python
>>> from proj.tasks import add
>>> add.delay(2, 3)          # 非同期で投入（fire-and-forget の最短形）
>>> r = add.delay(2, 3)
>>> r.get(timeout=5)          # 結果を取得（backend が必要）
5
```

Up to here is per the official "First steps." The problem is **from here on**—in production you must face the delivery semantics lurking behind the convenience of `.delay()`.

## 3. The "3 traps" of using Redis as broker / backend

If you use Redis, grasp the following 3 points the official docs warn about, **as premises to configure.** Put it into production without knowing them and you'll suffer mysterious double execution and infinite loops.

### Trap ①: `visibility_timeout` — understand the "time it becomes invisible"

The Redis broker doesn't have an explicit acknowledgment queue like AMQP. Instead, when a worker fetches a message, it makes that message "invisible" for a set time, and **if no acknowledgment (ack) returns within that time, it re-delivers to another worker.** That grace is `visibility_timeout`.

```python
app.conf.broker_transport_options = {"visibility_timeout": 3600}  # 既定 1 時間（秒）
```

What this means is—**when one task's execution (including waiting for a retry) exceeds `visibility_timeout`, Redis judges it "dead" and re-delivers the same task to another worker.** That is, with a long-running task or a long `countdown`/`eta`, **the same task can keep running multiply.**

### Trap ②: don't use ETA / countdown for the "far future"

The official docs state clearly—**scheduling the far future with `eta` and `countdown` is not recommended. Keep it to a few minutes at most.**

The reason directly connects to Trap ①. Specify `countdown=7200` (2 hours later), and the moment it exceeds `visibility_timeout` (default 1 hour), Redis re-delivers and the task duplicates. Furthermore, ETA tasks have the nature of piling up in the worker's memory, so stacking many causes OOM (5.6 added a protective setting `worker_eta_task_limit`).

**The correct usage:**

- A delay of "tens of seconds to a few minutes later" → OK with `countdown` / `eta`
- "Hours later, a specific time, daily" → use **Beat (§7) or a DB-based scheduler**

If you really need a long `visibility_timeout`, the official docs instruct setting **all 3 places to the same value** (setting only one causes inconsistency).

```python
# 例：12 時間に伸ばすなら 3 つとも揃える
app.conf.broker_transport_options = {"visibility_timeout": 43200}
app.conf.result_backend_transport_options = {"visibility_timeout": 43200}
app.conf.visibility_timeout = 43200
```

### Trap ③: Redis's memory eviction policy

Redis is an in-memory DB, so when it hits the memory limit it **evicts keys.** If Celery's message or result keys get evicted, an `InconsistencyError` occurs and tasks or results **quietly disappear.**

The official instruction is clear—**set `maxmemory-policy` to `noeviction` or `allkeys-lru`.**

```bash
# redis.conf（または CONFIG SET）
maxmemory 2gb
maxmemory-policy noeviction   # メモリ満杯時は退避せず書き込みを拒否（消失より明示的エラーを選ぶ）
```

`noeviction` is the behavior of "if memory is full, reject new writes and error." Inconvenient at first glance, but **"failing explicitly" is overwhelmingly safer in a distributed system than "silently losing messages."** Failures can be detected, but loss can't. Also consider `worker_soft_shutdown_timeout` (seconds), which improves re-queuing on a worker's graceful shutdown.

## 4. Production configuration reference (with official default values)

Consolidate config into a **dedicated module** to make it the SSoT. I note each key's **official default value** so it's clear at a glance "what you're overriding."

```python
# proj/celeryconfig.py — 本番設定の単一の真実源
from celery.schedules import crontab

# --- シリアライズ（セキュリティの土台） ---
task_serializer = "json"          # 既定: "json"（4.0 以降）。pickle は使わない
result_serializer = "json"        # 既定: "json"
accept_content = ["json"]          # 既定: {"json"}。受け入れる形式を json に固定

# --- broker / backend ---
broker_url = "redis://localhost:6379/0"
result_backend = "redis://localhost:6379/1"
broker_connection_retry_on_startup = True   # 既定: 有効。起動時の接続リトライ
broker_pool_limit = 10                       # 既定: 10。接続プールの上限
broker_transport_options = {"visibility_timeout": 3600}

# --- 結果の寿命（コスト効率） ---
result_expires = 3600              # 既定: 1 日。結果の墓標(tombstone)が消えるまでの秒数
task_ignore_result = False          # 既定: 無効。結果が不要なタスクは個別に ignore_result=True

# --- 信頼性（冪等性とセットで効く。§5・§8 参照） ---
task_acks_late = True               # 既定: 無効。実行「後」に ack（少なくとも1回保証へ）
task_reject_on_worker_lost = True   # 既定: 無効。worker 異常終了時にメッセージを再キュー

# --- タイムアウト（暴走の封じ込め。§8 参照） ---
task_time_limit = 300               # 既定: 無制限。ハード上限（超えると worker ごと kill）
task_soft_time_limit = 270          # 既定: 無制限。ソフト上限（例外で猶予を与える）

# --- worker チューニング（§8 参照） ---
worker_prefetch_multiplier = 4      # 既定: 4。先読みするタスク数 × 並行度
worker_max_tasks_per_child = 1000   # 既定: 無制限。N 件処理ごとに子プロセスを再生成（メモリリーク対策）
worker_send_task_events = True      # 既定: 無効。Flower 等での監視を有効化（§10）
task_track_started = True           # 既定: 無効。"STARTED" 状態を報告（進捗可視化）

# --- ルーティング（§8 参照） ---
task_default_queue = "celery"       # 既定: "celery"
task_routes = {
    "proj.tasks.charge_*": {"queue": "payments"},
    "proj.tasks.send_*": {"queue": "notifications"},
}

# --- 定期実行（§7 参照） ---
timezone = "Asia/Tokyo"
enable_utc = True
beat_schedule = {
    "nightly-cleanup": {
        "task": "proj.tasks.cleanup_expired",
        "schedule": crontab(hour=3, minute=0),   # 毎日 3:00
    },
}
```

```python
# proj/celery.py で読み込む
app = Celery("proj")
app.config_from_object("proj.celeryconfig")
app.autodiscover_tasks()
```

**A design point:** consolidating config into one module without scattering it makes review and diff management easy, and lets you trace "where what was overridden." This is DRY and also the operation of **making the config itself the target of code review.**

## 5. Idempotency — write unbreakable tasks premised on "at-least-once" delivery

This is the **core** of Celery design. The official docs warn repeatedly:

> **Ideally task functions should be idempotent**—calling them multiple times with the same arguments doesn't cause unintended side effects. Celery is "at-least-once" delivery by default, and because messages stay in the queue until acked, **duplicate execution can occur.**

That is, don't write code on the premise "a task is executed exactly once." Setting `task_acks_late = True` strengthens the delivery guarantee, but at the cost, the official docs make clear that **"if a worker crashes mid-execution, the same task is re-executed."** Reliability (doesn't disappear) and idempotency (doesn't break on duplication) are a set.

### Anti-pattern: a non-idempotent task

```python
@app.task
def charge_customer(order_id: str, amount: int) -> None:
    # 再配信されると二重課金になる。決済では致命的。
    payment_gateway.charge(order_id, amount)
    db.mark_paid(order_id)
```

This task crashes mid-execution under `task_acks_late=True` → re-delivery → **double charge.** This is exactly the accident I've crushed most strictly on a payment foundation.

### Solution A: "converge to once" with an idempotency key

Record processed-ness with a unique key issued by the client (or caller). Using Redis's `SET NX` (set if it doesn't exist) lets you lock atomically.

```python
import redis

r = redis.Redis.from_url("redis://localhost:6379/2")


def _claim_once(key: str, ttl: int = 86400) -> bool:
    """このキーで初めての実行なら True。原子的（SET NX）。"""
    return bool(r.set(f"idem:{key}", "1", nx=True, ex=ttl))


@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def charge_customer(self, order_id: str, amount: int, idempotency_key: str) -> None:
    if not _claim_once(idempotency_key):
        # 既に処理済み。再配信なので何もせず正常終了（= ack して再配信を止める）
        return
    payment_gateway.charge(order_id, amount)
    db.mark_paid(order_id)
```

### The pitfall of Solution A: "mark first" loses the processing

Solution A has an **ordering trap.** If you run `_claim_once` first and **then crash**, only the key remains, and even though the charge wasn't executed, it's misjudged as "processed," and **the payment is quietly lost.** This is the very point I actually stepped on in a payment-Webhook implementation.

There are two correct ways of thinking:

1. **Make the downstream operation itself idempotent** — **pass an idempotency key** to the payment gateway (Stripe, etc.). The gateway absorbs duplicates, so even if Celery calls twice, the charge is once. This is the most robust.

   ```python
   @app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
   def charge_customer(self, order_id: str, amount: int, idempotency_key: str) -> None:
       # ゲートウェイに冪等キーを委譲。再実行されても課金は 1 回に収束する。
       payment_gateway.charge(order_id, amount, idempotency_key=idempotency_key)
       db.mark_paid(order_id)  # UPSERT / 条件付き更新で冪等に
   ```

2. **Let the DB's uniqueness constraint guard it** — put a UNIQUE constraint on `idempotency_key`, and with `INSERT ... ON CONFLICT DO NOTHING` (PostgreSQL), **guarantee atomically on the DB side** that "the second insert does nothing." More robust than an app lock, and not a single point of failure.

**The principle: guard correctness with "structure (DB constraints, gateway idempotency keys, atomic operations)," not "operational carefulness."** Correctness guarded by review or a procedure manual eventually breaks; correctness guarded by a uniqueness constraint or a conditional write doesn't.

## 6. Retry strategy — autoretry / backoff / jitter

External APIs, networks, and temporary locks are "**failures that sometimes fail but succeed if you wait a bit.**" What absorbs these is the retry. Celery lets you write it declaratively. Grasp the official parameters' **precise meaning and default values.**

```python
import requests


@app.task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # この例外群なら自動再試行
    retry_backoff=True,        # 指数バックオフ（True なら 1,2,4,8... 秒）
    retry_backoff_max=600,     # 既定 600。バックオフの上限（10 分）
    retry_jitter=True,         # 既定 True。遅延に乱数を混ぜ、同時再試行の雪崩を防ぐ
    max_retries=5,             # 既定 3。None なら無限再試行
    retry_kwargs={"countdown": 5},
)
def fetch_exchange_rate(self, base: str) -> dict:
    resp = requests.get(f"https://api.example.com/rate/{base}", timeout=10)
    resp.raise_for_status()
    return resp.json()
```

The precise official default values:

| Parameter | Default | Meaning |
| --- | --- | --- |
| `max_retries` | `3` | The max retry count before giving up. `None` for infinite |
| `default_retry_delay` | `180` (3 min) | The default wait seconds before a retry |
| `retry_backoff` | `False` | `True` for exponential backoff |
| `retry_backoff_max` | `600` (10 min) | The upper bound of the backoff delay |
| `retry_jitter` | `True` | Adds a random value to the backoff |

### Why jitter prevents the "thundering herd"

When an external API temporarily goes down, **a large number of tasks that failed at the same moment all retry at the same interval simultaneously.** This is the "thundering herd"—a vicious cycle that puts DDoS-grade load on the recovering API and knocks it down again. `retry_jitter=True` mixes a random value into the retry timing to flatten the peak, preventing this. It's **enabled by default** because Celery knows this problem well.

### Manual retry: retry only the "failures that should be retried"

Retrying every exception is wrong. **"Insufficient balance" or "validation error" don't change however many times you retry**—rather, the retry wastes latency and cost. Use `bind=True` and `self.retry()` to select only the failures that should be retried.

```python
@app.task(bind=True, max_retries=5)
def sync_inventory(self, sku: str) -> None:
    try:
        result = external_api.sync(sku)
    except TransientError as exc:
        # 一時的失敗 → 指数バックオフで再試行
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
    except PermanentError:
        # 恒久的失敗 → 再試行せず即失敗（ログ・DLQ 行き）
        logger.error("permanent failure for sku=%s", sku)
        raise
    db.save(sku, result)
```

You can get the current retry count (0-based) with `self.request.retries`. **Not mixing "failures fixed by a retry" with "failures that aren't"** guards all of latency, cost, and observability.

## 7. Periodic execution — Celery Beat

Periodic tasks like "cleanup at 3 a.m. daily" or "aggregate every 5 minutes" are handled by **Celery Beat.** Beat is a resident process that enqueues tasks per a schedule, started separately from the worker.

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    # timedelta 形式：30 秒ごと（前回実行から 30 秒後）
    "ping-every-30s": {
        "task": "proj.tasks.healthcheck",
        "schedule": 30.0,
    },
    # crontab 形式：毎週月曜 7:30
    "weekly-report": {
        "task": "proj.tasks.send_report",
        "schedule": crontab(hour=7, minute=30, day_of_week=1),
        "args": (16, 16),
    },
    # 毎日 3:00 にクリーンアップ
    "nightly-cleanup": {
        "task": "proj.tasks.cleanup_expired",
        "schedule": crontab(hour=3, minute=0),
    },
}
app.conf.timezone = "Asia/Tokyo"
```

```bash
$ celery -A proj beat --loglevel=INFO       # Beat を起動（worker とは別プロセス）
```

### Beat's most important warning: "only one scheduler"

As the official docs warn in bold:

> **For the same schedule, always run only one scheduler. Otherwise tasks duplicate.**

Get this wrong and it becomes a production accident. Making a Pod including Beat **2 replicas** in Kubernetes, or standing up multiple workers with `-B` (embedded beat) attached, **enqueues the same periodic task in duplicate by the number of instances.** The billing batch runs twice, notifications fly twice—a typical failure.

**Countermeasures:**

- Always run Beat as a **single instance** (Deployment `replicas: 1`, or leader election).
- `-B` (embedded beat) is **dev-only.** Officially discouraged for production too. Worker scaling and Beat's singleness can't coexist.
- If you want to add schedules dynamically or edit from an admin screen, in a Django environment use **`django-celery-beat`**'s `DatabaseScheduler`. Put the schedule in the DB and share a single source of truth even with multiple workers.

```bash
$ celery -A proj beat -l INFO \
    --scheduler django_celery_beat.schedulers:DatabaseScheduler
```

## 8. Workflows — Canvas (chain / group / chord)

Compositions like "multiple tasks in order," "in parallel," or "aggregate after parallel execution" are handled by **Canvas.** The foundation is the **signature**—an object that lets you carry around a task invocation as "data."

```python
from celery import chain, group, chord

# シグネチャ：引数を部分適用した「呼び出しの設計図」
add.s(2, 2)            # add(2, 2) のシグネチャ
add.si(2, 2)           # immutable（不変）。前段の結果を受け取らない（コールバック向け）
```

```python
# chain：直列。前段の結果が次段の第1引数に渡る → (2+2)+4)+8 = 16
chain(add.s(2, 2) | add.s(4) | add.s(8))().get()   # 16

# group：並列。全タスクを同時に投入
group(add.s(i, i) for i in range(10))()

# chord：group の全完了後にコールバックを1回実行（並列 → 集約）
chord(add.s(i, i) for i in range(100))(tsum.s()).get()
```

### A practical example: parallel fetch → aggregate

"Fetch data from multiple sources in parallel, and aggregate once all are in" is a typical chord.

```python
from celery import chord


@app.task
def fetch_source(source_id: str) -> dict:
    return external_api.fetch(source_id)


@app.task
def aggregate(results: list[dict]) -> dict:
    return {"total": sum(r["value"] for r in results)}


# 5 ソースを並列取得し、全完了後に aggregate へまとめて渡す
workflow = chord(
    (fetch_source.s(sid) for sid in ["a", "b", "c", "d", "e"]),
    aggregate.s(),
)
result = workflow()
```

### Canvas pitfalls

- **A chord's tasks must not discard results.** As the official docs warn, if any of a chord's constituent tasks (header/body) has `ignore_result=True`, the callback can't judge "all complete" and breaks. If you use a chord, enable `result_backend` and have the related tasks save results.
- **Error handling is `link_error` / `on_error`.** When a chain fails partway, make explicit what to do with the rest.

  ```python
  add.apply_async((2, 2), link=mul.s(16), link_error=log_error.s())
  add.s(2, 2).on_error(log_error.s()).delay()
  ```

- **Use the immutable signature `.si()` appropriately.** When you don't want to pass the previous result to a callback (call with fixed arguments), use `.si()`. `.s()` **prepends** the previous result to the front, so for non-commutative tasks you get an unintended argument order.

## 9. Invocation options — delay() and apply_async()

`delay()` is a convenient shortcut for `apply_async()`. **When you want to attach execution options, use `apply_async()`.**

```python
# 等価
add.delay(2, 2)
add.apply_async(args=(2, 2))

# 実行オプションは apply_async でのみ指定可能
add.apply_async(
    args=(2, 2),
    countdown=10,            # 10 秒後に実行（§3 の通り「数分」までに）
    expires=60,              # 60 秒以内に実行されなければ破棄
    queue="payments",        # ルーティング先キュー
    priority=5,              # 優先度（broker 依存）
    retry=True,
    retry_policy={"max_retries": 3, "interval_start": 0, "interval_step": 0.2, "interval_max": 0.2},
)
```

### Catch connection errors at send time

If the broker is down, **the task send itself fails.** Ignore this and you get the accident "I thought I enqueued it but it was gone." The official docs show catching `OperationalError`.

```python
try:
    charge_customer.delay(order_id, amount, idem_key)
except charge_customer.OperationalError as exc:
    # broker 接続失敗。アプリ側でフォールバック（DB アウトボックスに退避 等）
    logger.exception("failed to enqueue task: %r", exc)
    outbox.save_for_later(order_id, amount, idem_key)
```

`expires` is important too. For example, an "inventory notification" is meaningless 5 minutes later, so attaching `expires=300` lets you **auto-discard the old, meaningless task** even if the worker is clogged and delayed, preventing wasted processing and cost on recovery.

## 10. Production tuning — prefetch, concurrency, time limit

### prefetch — change the value by workload

For efficiency, a worker **prefetches (reserves)** more tasks than the number of execution slots. That amount is `worker_prefetch_multiplier` (default **4**) × concurrency. The optimal value is the exact opposite depending on the workload.

| Workload | Recommendation | Reason |
| --- | --- | --- |
| Long-running tasks (minutes) | `worker_prefetch_multiplier = 1` | Prefetch makes one worker hoard tasks while others idle. The official docs also state "1 for long tasks" |
| Short, mass tasks (ms to s) | A large value like `50`–`128` | Earn throughput with prefetch. The official docs say "64 or 128 is reasonable in some cases" |
| Mixed long and short | **Split workers** | Officially recommended: stand up worker nodes for long and short tasks with separate configs and route them |

```bash
# 決済（長め・厳格）：先読み 1、遅延 ack、低並行
$ celery -A proj worker -Q payments --prefetch-multiplier=1 --concurrency=4

# 通知（短く大量）：先読み多め、高並行
$ celery -A proj worker -Q notifications --prefetch-multiplier=64 --concurrency=8
```

Combining `task_acks_late=True` and `worker_prefetch_multiplier=1` reduces hoarding from prefetch while raising reliability with "ack after execution" (**premised on idempotency**). Redis-only, `worker_disable_prefetch`, which fetches only when a slot is free, is also an option.

### Memory countermeasures — periodically regenerate child processes

Run it long and the worker bloats from memory leaks in dependency libraries. The countermeasure is periodically regenerating child processes.

```python
worker_max_tasks_per_child = 1000     # 1000 タスクごとに子を作り直す
worker_max_memory_per_child = 300000  # 300MB 超で子を作り直す（KB 単位）
```

**But the official docs caution against excessive settings.** For example, regenerating a child that takes 1 second to start "every task" handles only 60 tasks a minute. Measure the **tradeoff between regeneration cost and memory stability** and decide.

### time limit — contain runaways

Prevents an infinite loop or hung task from continuing to occupy a worker.

```python
task_soft_time_limit = 270   # ソフト：270 秒で SoftTimeLimitExceeded を送出（後始末の猶予）
task_time_limit = 300        # ハード：300 秒で worker ごと強制終了
```

The soft limit can be caught as an exception, allowing **cleanup** like rolling back a transaction or deleting temp files. The hard limit kills immediately without that grace. The standard is to set both, with soft smaller than hard.

```python
from celery.exceptions import SoftTimeLimitExceeded


@app.task(soft_time_limit=270, time_limit=300)
def render_pdf(doc_id: str) -> None:
    try:
        do_render(doc_id)
    except SoftTimeLimitExceeded:
        cleanup_temp_files(doc_id)   # 後始末してから諦める
        raise
```

## 11. Observability — you can't operate what you can't see

Distributed tasks make "where it's clogged" hard to see. Celery comes with monitoring means as standard.

### Flower — realtime web monitoring

```bash
$ pip install flower
$ celery -A proj flower --port=5555     # http://localhost:5555
```

You can check task progress, history, statistics, and worker state in a browser, and it provides remote control and an HTTP API. To use Flower, enable `worker_send_task_events = True` (set in §4).

### inspect / status — peek into the cluster from the CLI

```bash
$ celery -A proj status                 # 稼働中ノード一覧
$ celery -A proj inspect active         # 実行中タスク
$ celery -A proj inspect scheduled      # ETA 待ちタスク
$ celery -A proj inspect reserved       # 先読み済み・実行待ちタスク
$ celery -A proj inspect stats          # worker 統計
$ celery -A proj inspect ping           # 死活確認
```

**An operational tip:** if the count of `inspect reserved` is swelling, it's a sign that prefetch is excessive and tasks are hoarded in the worker (§8). If `scheduled` is piling up, suspect overuse of `eta`/`countdown` (§3). **Make metrics a substitute for intuition**—this is the essence of observability.

In production, don't rely on Flower alone; visualize progress with `task_track_started=True` and combine with a Prometheus exporter and structured logs to continuously monitor latency, failure rate, and queue backlog—that's ideal.

## 12. Security — a task queue is an attack surface

A task queue is "a path for data crossing a trust boundary." Misdesign it and it becomes a serious vulnerability.

### ① Fix the serializer to json. Don't use pickle.

Historically, Celery's biggest vulnerability was **arbitrary code execution (RCE) via pickle deserialization.** Because pickle restores arbitrary Python objects, **if an attacker who can break into the broker sends a malicious payload, code executes on the worker.** Celery 4.0+ defaults to `json`; **explicitly fix `accept_content=["json"]`** to reject pickle (set in §4).

### ② Don't expose Redis externally / add auth and TLS

Redis has no auth by default. Unauthenticated Redis exposed to the internet has been a breeding ground for large-scale breaches in the past. **Isolate it on the network** (VPC / security group) and add auth and TLS.

```python
# パスワード認証 + TLS（rediss://）
broker_url = "rediss://:STRONG_PASSWORD@redis.internal:6380/0"
broker_use_ssl = {"ssl_cert_reqs": "required"}
```

Don't write passwords or connection strings directly in the code; **read them from environment variables or a secrets manager** (it's CLAUDE.md's iron rule too).

### ③ Don't put secrets in task arguments

This is the most important point that tends to be overlooked. **Task arguments are serialized and stored in the broker (Redis) and result backend, and displayed on Flower's screen too.** Pass a plaintext password, API key, card number, or personal info as an argument, and it remains in Redis, shows in the monitoring screen, and consequently **leaks unintentionally.**

```python
# アンチパターン：秘密情報が Redis と Flower に残る
send_email.delay(to="user@example.com", api_key="sk_live_xxx", ...)

# 正しい：ID だけを渡し、秘密情報は worker 側で安全に取得する
send_email.delay(notification_id="ntf_123")  # worker が ID から DB/Secrets を引く
```

**The principle: what flows through the queue is a "reference (ID)," not the "contents (secret)."** Also narrow the result lifetime with `result_expires` (§4) so PII-containing results don't linger.

## 13. Common pitfalls (FAQ)

**Q. A task executes twice. Is it a bug?**
A. It's the spec. Celery is "at-least-once" delivery, and duplicates can occur via re-delivery, worker crashes, and `acks_late`. The right answer is to **design tasks to be idempotent** (§5).

**Q. A task specified with `eta`/`countdown` runs many times.**
A. A delay exceeding `visibility_timeout` (default 1 hour) is the cause (§3). For reservations beyond a few minutes, use Beat or a DB scheduler.

**Q. Periodic tasks execute twice each.**
A. Multiple Beats are running (§7). Limit Beat to a single instance.

**Q. Memory usage keeps growing over time.**
A. Periodically regenerate child processes with `worker_max_tasks_per_child` / `worker_max_memory_per_child` (§8).

**Q. Throwing a long task delays other tasks.**
A. Hoarding from prefetch. Set `worker_prefetch_multiplier=1` for long-running tasks and, if possible, split workers and route (§8).

**Q. I can't get results / they're gone.**
A. Confirm the result backend is unset, `ignore_result=True`, expiry via `result_expires`, or Redis's key eviction (§3).

**Q. Sending fails when the broker is temporarily down.**
A. Catch `OperationalError` and fall back to a DB outbox, etc. (§9). Treat sending as a "failable operation."

## Summary — the design principles running through Celery + Redis

Celery + Redis is not a framework that "works with `.delay()`"; it demands a **design that faces the delivery semantics of a distributed system head-on.** Let me organize the principles running through this article.

| Principle | How it appears in this guide |
| --- | --- |
| **Guard correctness with structure** | Delegate idempotency to DB uniqueness constraints, gateway idempotency keys, atomic operations (§5) |
| **Know delivery is "at-least-once"** | Design that absorbs double execution as a normal path, not an anomaly (§5) |
| **Classify failures** | Sharply distinguish failures to retry (transient) from failures not to (permanent) (§6) |
| **Prevent the herd** | Avoid the thundering herd with jittered exponential backoff (§6) |
| **Single source of truth (SSoT/DRY)** | Config in one module, periodic execution in a single Beat, schedules consolidated in the DB (§4, §7) |
| **Contain runaways** | Protect workers with time limit, prefetch, child-process regeneration (§8) |
| **Observability** | Catch "clogs" with Flower, inspect, events as numbers, not intuition (§11) |
| **The queue is an attack surface** | json-only, Redis auth/TLS, no secrets in arguments (§12) |
| **Deciding not to use is also tech selection** | Choose other tools for light processing, strict ordering, far-future reservations (§1) |

The quality of async processing is measured not by flashy features but by **"that nothing happened"—no double charges, no message loss, the nightly batch quietly finishing.** Guaranteeing that by design rather than by chance is the condition for a task queue that withstands production. I have proven this discipline of "dropping correctness into structure under distribution" on a payment foundation handling real money, in the form of **zero double charges in production** ([the payment-reliability case study](/case-studies/payment-platform-reliability)).

If you're stuck on **designing an async foundation with Celery + Redis, improving the reliability of an existing system, or resolving failures around double execution or periodic execution**, I can help from the design stage. From "just get it running" to "withstands production"—I help close that distance by the shortest, most accurate route.

---

### References (official documentation)

- [Celery official documentation (stable)](https://docs.celeryq.dev/en/stable/)
- [Using Redis as broker / result backend](https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/redis.html)
- [Tasks (idempotency, acks_late, retries)](https://docs.celeryq.dev/en/stable/userguide/tasks.html)
- [Configuration (the config reference)](https://docs.celeryq.dev/en/stable/userguide/configuration.html)
- [Periodic Tasks (Beat)](https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html)
- [Canvas (workflows)](https://docs.celeryq.dev/en/stable/userguide/canvas.html)
- [Optimizing (prefetch, memory)](https://docs.celeryq.dev/en/stable/userguide/optimizing.html)
- [Monitoring (Flower, inspect)](https://docs.celeryq.dev/en/stable/userguide/monitoring.html)
