Celery + Redis Production-Operations Guide — Async Task Design Faithful to the Official Docs (Idempotency, Retries, Observability)

"Email sending takes 3 seconds and clogs the API response," "the post-payment aggregation batch drags in daytime traffic and falls over," "image conversion done synchronously times out the request"—these are all problems solved by pushing heavy processing outside the request. The de-facto standard for that in Python is Celery, and the easiest message foundation to start with is Redis.

But Celery + Redis is not a framework that "just works by calling .delay()." Put it into production without thinking, and double execution of tasks, message loss, infinite re-delivery, and memory exhaustion happen quietly. These are all known pitfalls the official docs clearly warn about.

This article explains, faithful to the official docs of the Celery 5.6 line (the latest stable as of March 2026 is 5.6.3, Python 3.9+ / Python 3.14 supported), a design that withstands production operation, with code you can actually write and decision axes for "when and how to use it." I have designed and led the reliability layer of idempotency, retries, and consistency on a serverless payment foundation handling real money (related case study). This article translates that discipline of "guarding correctness with structure under distributed processing" into the Celery context.

The code in the body conforms to the Celery 5.6-line API. Config keys and default values are stated based on the official docs (docs.celeryq.dev), but always confirm on the relevant page of your version before going to production.

What this article covers

When you should and shouldn't use Celery + Redis (hold a decision axis first)
A configuration reference from minimal to production (with official default values)
The 3 traps of using Redis as broker / backend (visibility_timeout, key eviction, ETA re-delivery)
Idempotency—designing tasks not to break, premised on "at-least-once" delivery
A retry strategy with autoretry / backoff / jitter (the precise meaning of the official parameters)
Periodic execution with Beat, workflows with Canvas (chain/group/chord)
Production tuning of prefetch, concurrency, and time limit
Observability (Flower / inspect) and security (json-only, Redis auth, handling secrets)

1. Should you even use Celery — decide first

The first thing to do in tech selection is to crush "the reasons not to introduce it" rather than "the reasons to introduce it." Celery is powerful but comes with operational cost (worker, broker, monitoring). Judge suitability with the following table.

Situation	Recommendation
Light processing that completes within the request (tens of ms)	Celery unneeded. Synchronous is enough. A queue is just added complexity
Want to decouple heavy/slow processing from the response (email, PDF, image conversion, external API)	Celery suited ◎
Periodic batches (daily aggregation, cleanup, reminders)	Celery Beat suited ◎ (or the OS's cron / a cloud scheduler)
Want to handle a lot of I/O waits (concurrent web fetches, etc.)	Possible with Celery. But if `asyncio`-native, also consider arq / Dramatiq
Strict ordering guarantee / exactly-once is a requirement	Celery + Redis is unsuited. As described, "at-least-once" delivery. Consider Kafka or a DB outbox
AWS-centric, serverless-oriented	SQS + Lambda may save you from holding the infra operation

Tradeoffs to know before making Redis your broker

For Celery's broker (the path messages travel), you can choose RabbitMQ besides Redis. Redis is overwhelmingly simple to introduce and can double as the result backend, but it has limits due to being in-memory.

Aspect	Redis broker	RabbitMQ broker
Ease of introduction	◎ Start with one container, can double as backend	△ Requires understanding AMQP
Delivery guarantee	○ But depends on persistence (AOF/RDB) and config. Risk of unprocessed-message loss on a crash	◎ Acknowledgments and durable queues are its main job
Mass messages / high reliability	△ Beware memory limits and eviction policy	◎
ETA / long-term scheduling	✕ The `visibility_timeout` constraint is large (below)	○

Conclusion: for small-to-medium scale, early startup, and the "just get it running" stage, Redis is the best choice. This article also assumes Redis. But if "you want to handle messages where loss is unacceptable, like money or inventory, with strong delivery guarantees," use RabbitMQ; if "you want to guarantee crash resilience with Redis," operate it with AOF persistence enabled. Whatever you choose, the idempotency design below is mandatory. Because Redis is "at-least-once" delivery and double execution is premised on being absorbed by design.

2. Minimal configuration — the officially-recommended project layout

Following the officially-recommended layout, define the Celery app in a proj package.

proj/
├── __init__.py
├── celery.py        # Celery アプリ本体と設定
└── tasks.py         # タスク定義

# proj/celery.py
from celery import Celery

app = Celery(
    "proj",
    broker="redis://localhost:6379/0",       # DB 0 = メッセージ用
    backend="redis://localhost:6379/1",       # DB 1 = 結果用（broker と分離）
    include=["proj.tasks"],
)

# 設定は一箇所に集約（SSoT）。詳細は §4 の本番設定で深掘り。
app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],                  # json 以外を受け付けない（セキュリティ §11）
    timezone="Asia/Tokyo",
    enable_utc=True,
    broker_connection_retry_on_startup=True,  # 起動時に broker が未起動でも再接続を試みる
    result_expires=3600,                       # 結果は1時間で失効（既定は1日）
)

# proj/tasks.py
from .celery import app


@app.task
def add(x: int, y: int) -> int:
    return x + y

Starting the worker and calling it:

# worker を起動（並行度は既定で CPU コア数）
$ celery -A proj worker --loglevel=INFO

>>> from proj.tasks import add
>>> add.delay(2, 3)          # 非同期で投入（fire-and-forget の最短形）
>>> r = add.delay(2, 3)
>>> r.get(timeout=5)          # 結果を取得（backend が必要）
5

Up to here is per the official "First steps." The problem is from here on—in production you must face the delivery semantics lurking behind the convenience of .delay().

3. The "3 traps" of using Redis as broker / backend

If you use Redis, grasp the following 3 points the official docs warn about, as premises to configure. Put it into production without knowing them and you'll suffer mysterious double execution and infinite loops.

Trap ①: `visibility_timeout` — understand the "time it becomes invisible"

The Redis broker doesn't have an explicit acknowledgment queue like AMQP. Instead, when a worker fetches a message, it makes that message "invisible" for a set time, and if no acknowledgment (ack) returns within that time, it re-delivers to another worker. That grace is visibility_timeout.

app.conf.broker_transport_options = {"visibility_timeout": 3600}  # 既定 1 時間（秒）

What this means is—when one task's execution (including waiting for a retry) exceeds visibility_timeout, Redis judges it "dead" and re-delivers the same task to another worker. That is, with a long-running task or a long countdown/eta, the same task can keep running multiply.

Trap ②: don't use ETA / countdown for the "far future"

The official docs state clearly—scheduling the far future with eta and countdown is not recommended. Keep it to a few minutes at most.

The reason directly connects to Trap ①. Specify countdown=7200 (2 hours later), and the moment it exceeds visibility_timeout (default 1 hour), Redis re-delivers and the task duplicates. Furthermore, ETA tasks have the nature of piling up in the worker's memory, so stacking many causes OOM (5.6 added a protective setting worker_eta_task_limit).

The correct usage:

A delay of "tens of seconds to a few minutes later" → OK with countdown / eta
"Hours later, a specific time, daily" → use Beat (§7) or a DB-based scheduler

If you really need a long visibility_timeout, the official docs instruct setting all 3 places to the same value (setting only one causes inconsistency).

# 例：12 時間に伸ばすなら 3 つとも揃える
app.conf.broker_transport_options = {"visibility_timeout": 43200}
app.conf.result_backend_transport_options = {"visibility_timeout": 43200}
app.conf.visibility_timeout = 43200

Trap ③: Redis's memory eviction policy

Redis is an in-memory DB, so when it hits the memory limit it evicts keys. If Celery's message or result keys get evicted, an InconsistencyError occurs and tasks or results quietly disappear.

The official instruction is clear—set maxmemory-policy to noeviction or allkeys-lru.

# redis.conf（または CONFIG SET）
maxmemory 2gb
maxmemory-policy noeviction   # メモリ満杯時は退避せず書き込みを拒否（消失より明示的エラーを選ぶ）

noeviction is the behavior of "if memory is full, reject new writes and error." Inconvenient at first glance, but "failing explicitly" is overwhelmingly safer in a distributed system than "silently losing messages." Failures can be detected, but loss can't. Also consider worker_soft_shutdown_timeout (seconds), which improves re-queuing on a worker's graceful shutdown.

4. Production configuration reference (with official default values)

Consolidate config into a dedicated module to make it the SSoT. I note each key's official default value so it's clear at a glance "what you're overriding."

# proj/celeryconfig.py — 本番設定の単一の真実源
from celery.schedules import crontab

# --- シリアライズ（セキュリティの土台） ---
task_serializer = "json"          # 既定: "json"（4.0 以降）。pickle は使わない
result_serializer = "json"        # 既定: "json"
accept_content = ["json"]          # 既定: {"json"}。受け入れる形式を json に固定

# --- broker / backend ---
broker_url = "redis://localhost:6379/0"
result_backend = "redis://localhost:6379/1"
broker_connection_retry_on_startup = True   # 既定: 有効。起動時の接続リトライ
broker_pool_limit = 10                       # 既定: 10。接続プールの上限
broker_transport_options = {"visibility_timeout": 3600}

# --- 結果の寿命（コスト効率） ---
result_expires = 3600              # 既定: 1 日。結果の墓標(tombstone)が消えるまでの秒数
task_ignore_result = False          # 既定: 無効。結果が不要なタスクは個別に ignore_result=True

# --- 信頼性（冪等性とセットで効く。§5・§8 参照） ---
task_acks_late = True               # 既定: 無効。実行「後」に ack（少なくとも1回保証へ）
task_reject_on_worker_lost = True   # 既定: 無効。worker 異常終了時にメッセージを再キュー

# --- タイムアウト（暴走の封じ込め。§8 参照） ---
task_time_limit = 300               # 既定: 無制限。ハード上限（超えると worker ごと kill）
task_soft_time_limit = 270          # 既定: 無制限。ソフト上限（例外で猶予を与える）

# --- worker チューニング（§8 参照） ---
worker_prefetch_multiplier = 4      # 既定: 4。先読みするタスク数 × 並行度
worker_max_tasks_per_child = 1000   # 既定: 無制限。N 件処理ごとに子プロセスを再生成（メモリリーク対策）
worker_send_task_events = True      # 既定: 無効。Flower 等での監視を有効化（§10）
task_track_started = True           # 既定: 無効。"STARTED" 状態を報告（進捗可視化）

# --- ルーティング（§8 参照） ---
task_default_queue = "celery"       # 既定: "celery"
task_routes = {
    "proj.tasks.charge_*": {"queue": "payments"},
    "proj.tasks.send_*": {"queue": "notifications"},
}

# --- 定期実行（§7 参照） ---
timezone = "Asia/Tokyo"
enable_utc = True
beat_schedule = {
    "nightly-cleanup": {
        "task": "proj.tasks.cleanup_expired",
        "schedule": crontab(hour=3, minute=0),   # 毎日 3:00
    },
}

# proj/celery.py で読み込む
app = Celery("proj")
app.config_from_object("proj.celeryconfig")
app.autodiscover_tasks()

A design point: consolidating config into one module without scattering it makes review and diff management easy, and lets you trace "where what was overridden." This is DRY and also the operation of making the config itself the target of code review.

5. Idempotency — write unbreakable tasks premised on "at-least-once" delivery

This is the core of Celery design. The official docs warn repeatedly:

Ideally task functions should be idempotent—calling them multiple times with the same arguments doesn't cause unintended side effects. Celery is "at-least-once" delivery by default, and because messages stay in the queue until acked, duplicate execution can occur.

That is, don't write code on the premise "a task is executed exactly once." Setting task_acks_late = True strengthens the delivery guarantee, but at the cost, the official docs make clear that "if a worker crashes mid-execution, the same task is re-executed." Reliability (doesn't disappear) and idempotency (doesn't break on duplication) are a set.

Anti-pattern: a non-idempotent task

@app.task
def charge_customer(order_id: str, amount: int) -> None:
    # 再配信されると二重課金になる。決済では致命的。
    payment_gateway.charge(order_id, amount)
    db.mark_paid(order_id)

This task crashes mid-execution under task_acks_late=True → re-delivery → double charge. This is exactly the accident I've crushed most strictly on a payment foundation.

Solution A: "converge to once" with an idempotency key

Record processed-ness with a unique key issued by the client (or caller). Using Redis's SET NX (set if it doesn't exist) lets you lock atomically.

import redis

r = redis.Redis.from_url("redis://localhost:6379/2")


def _claim_once(key: str, ttl: int = 86400) -> bool:
    """このキーで初めての実行なら True。原子的（SET NX）。"""
    return bool(r.set(f"idem:{key}", "1", nx=True, ex=ttl))


@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def charge_customer(self, order_id: str, amount: int, idempotency_key: str) -> None:
    if not _claim_once(idempotency_key):
        # 既に処理済み。再配信なので何もせず正常終了（= ack して再配信を止める）
        return
    payment_gateway.charge(order_id, amount)
    db.mark_paid(order_id)

The pitfall of Solution A: "mark first" loses the processing

Solution A has an ordering trap. If you run _claim_once first and then crash, only the key remains, and even though the charge wasn't executed, it's misjudged as "processed," and the payment is quietly lost. This is the very point I actually stepped on in a payment-Webhook implementation.

There are two correct ways of thinking:

Make the downstream operation itself idempotent — pass an idempotency key to the payment gateway (Stripe, etc.). The gateway absorbs duplicates, so even if Celery calls twice, the charge is once. This is the most robust.

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def charge_customer(self, order_id: str, amount: int, idempotency_key: str) -> None:
    # ゲートウェイに冪等キーを委譲。再実行されても課金は 1 回に収束する。
    payment_gateway.charge(order_id, amount, idempotency_key=idempotency_key)
    db.mark_paid(order_id)  # UPSERT / 条件付き更新で冪等に

Let the DB's uniqueness constraint guard it — put a UNIQUE constraint on idempotency_key, and with INSERT ... ON CONFLICT DO NOTHING (PostgreSQL), guarantee atomically on the DB side that "the second insert does nothing." More robust than an app lock, and not a single point of failure.

The principle: guard correctness with "structure (DB constraints, gateway idempotency keys, atomic operations)," not "operational carefulness." Correctness guarded by review or a procedure manual eventually breaks; correctness guarded by a uniqueness constraint or a conditional write doesn't.

6. Retry strategy — autoretry / backoff / jitter

External APIs, networks, and temporary locks are "failures that sometimes fail but succeed if you wait a bit." What absorbs these is the retry. Celery lets you write it declaratively. Grasp the official parameters' precise meaning and default values.

import requests


@app.task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # この例外群なら自動再試行
    retry_backoff=True,        # 指数バックオフ（True なら 1,2,4,8... 秒）
    retry_backoff_max=600,     # 既定 600。バックオフの上限（10 分）
    retry_jitter=True,         # 既定 True。遅延に乱数を混ぜ、同時再試行の雪崩を防ぐ
    max_retries=5,             # 既定 3。None なら無限再試行
    retry_kwargs={"countdown": 5},
)
def fetch_exchange_rate(self, base: str) -> dict:
    resp = requests.get(f"https://api.example.com/rate/{base}", timeout=10)
    resp.raise_for_status()
    return resp.json()

The precise official default values:

Parameter	Default	Meaning
`max_retries`	`3`	The max retry count before giving up. `None` for infinite
`default_retry_delay`	`180` (3 min)	The default wait seconds before a retry
`retry_backoff`	`False`	`True` for exponential backoff
`retry_backoff_max`	`600` (10 min)	The upper bound of the backoff delay
`retry_jitter`	`True`	Adds a random value to the backoff

Why jitter prevents the "thundering herd"

When an external API temporarily goes down, a large number of tasks that failed at the same moment all retry at the same interval simultaneously. This is the "thundering herd"—a vicious cycle that puts DDoS-grade load on the recovering API and knocks it down again. retry_jitter=True mixes a random value into the retry timing to flatten the peak, preventing this. It's enabled by default because Celery knows this problem well.

Manual retry: retry only the "failures that should be retried"

Retrying every exception is wrong. "Insufficient balance" or "validation error" don't change however many times you retry—rather, the retry wastes latency and cost. Use bind=True and self.retry() to select only the failures that should be retried.

@app.task(bind=True, max_retries=5)
def sync_inventory(self, sku: str) -> None:
    try:
        result = external_api.sync(sku)
    except TransientError as exc:
        # 一時的失敗 → 指数バックオフで再試行
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
    except PermanentError:
        # 恒久的失敗 → 再試行せず即失敗（ログ・DLQ 行き）
        logger.error("permanent failure for sku=%s", sku)
        raise
    db.save(sku, result)

You can get the current retry count (0-based) with self.request.retries. Not mixing "failures fixed by a retry" with "failures that aren't" guards all of latency, cost, and observability.

7. Periodic execution — Celery Beat

Periodic tasks like "cleanup at 3 a.m. daily" or "aggregate every 5 minutes" are handled by Celery Beat. Beat is a resident process that enqueues tasks per a schedule, started separately from the worker.

from celery.schedules import crontab

app.conf.beat_schedule = {
    # timedelta 形式：30 秒ごと（前回実行から 30 秒後）
    "ping-every-30s": {
        "task": "proj.tasks.healthcheck",
        "schedule": 30.0,
    },
    # crontab 形式：毎週月曜 7:30
    "weekly-report": {
        "task": "proj.tasks.send_report",
        "schedule": crontab(hour=7, minute=30, day_of_week=1),
        "args": (16, 16),
    },
    # 毎日 3:00 にクリーンアップ
    "nightly-cleanup": {
        "task": "proj.tasks.cleanup_expired",
        "schedule": crontab(hour=3, minute=0),
    },
}
app.conf.timezone = "Asia/Tokyo"

$ celery -A proj beat --loglevel=INFO       # Beat を起動（worker とは別プロセス）

Beat's most important warning: "only one scheduler"

As the official docs warn in bold:

For the same schedule, always run only one scheduler. Otherwise tasks duplicate.

Get this wrong and it becomes a production accident. Making a Pod including Beat 2 replicas in Kubernetes, or standing up multiple workers with -B (embedded beat) attached, enqueues the same periodic task in duplicate by the number of instances. The billing batch runs twice, notifications fly twice—a typical failure.

Countermeasures:

Always run Beat as a single instance (Deployment replicas: 1, or leader election).
-B (embedded beat) is dev-only. Officially discouraged for production too. Worker scaling and Beat's singleness can't coexist.
If you want to add schedules dynamically or edit from an admin screen, in a Django environment use django-celery-beat's DatabaseScheduler. Put the schedule in the DB and share a single source of truth even with multiple workers.

$ celery -A proj beat -l INFO \
    --scheduler django_celery_beat.schedulers:DatabaseScheduler

8. Workflows — Canvas (chain / group / chord)

Compositions like "multiple tasks in order," "in parallel," or "aggregate after parallel execution" are handled by Canvas. The foundation is the signature—an object that lets you carry around a task invocation as "data."

from celery import chain, group, chord

# シグネチャ：引数を部分適用した「呼び出しの設計図」
add.s(2, 2)            # add(2, 2) のシグネチャ
add.si(2, 2)           # immutable（不変）。前段の結果を受け取らない（コールバック向け）

# chain：直列。前段の結果が次段の第1引数に渡る → (2+2)+4)+8 = 16
chain(add.s(2, 2) | add.s(4) | add.s(8))().get()   # 16

# group：並列。全タスクを同時に投入
group(add.s(i, i) for i in range(10))()

# chord：group の全完了後にコールバックを1回実行（並列 → 集約）
chord(add.s(i, i) for i in range(100))(tsum.s()).get()

A practical example: parallel fetch → aggregate

"Fetch data from multiple sources in parallel, and aggregate once all are in" is a typical chord.

from celery import chord


@app.task
def fetch_source(source_id: str) -> dict:
    return external_api.fetch(source_id)


@app.task
def aggregate(results: list[dict]) -> dict:
    return {"total": sum(r["value"] for r in results)}


# 5 ソースを並列取得し、全完了後に aggregate へまとめて渡す
workflow = chord(
    (fetch_source.s(sid) for sid in ["a", "b", "c", "d", "e"]),
    aggregate.s(),
)
result = workflow()

Canvas pitfalls

A chord's tasks must not discard results. As the official docs warn, if any of a chord's constituent tasks (header/body) has ignore_result=True, the callback can't judge "all complete" and breaks. If you use a chord, enable result_backend and have the related tasks save results.
Error handling is link_error / on_error. When a chain fails partway, make explicit what to do with the rest.
```
add.apply_async((2, 2), link=mul.s(16), link_error=log_error.s())
add.s(2, 2).on_error(log_error.s()).delay()
```
Use the immutable signature .si() appropriately. When you don't want to pass the previous result to a callback (call with fixed arguments), use .si(). .s() prepends the previous result to the front, so for non-commutative tasks you get an unintended argument order.

9. Invocation options — delay() and apply_async()

delay() is a convenient shortcut for apply_async(). When you want to attach execution options, use apply_async().

# 等価
add.delay(2, 2)
add.apply_async(args=(2, 2))

# 実行オプションは apply_async でのみ指定可能
add.apply_async(
    args=(2, 2),
    countdown=10,            # 10 秒後に実行（§3 の通り「数分」までに）
    expires=60,              # 60 秒以内に実行されなければ破棄
    queue="payments",        # ルーティング先キュー
    priority=5,              # 優先度（broker 依存）
    retry=True,
    retry_policy={"max_retries": 3, "interval_start": 0, "interval_step": 0.2, "interval_max": 0.2},
)

Catch connection errors at send time

If the broker is down, the task send itself fails. Ignore this and you get the accident "I thought I enqueued it but it was gone." The official docs show catching OperationalError.

try:
    charge_customer.delay(order_id, amount, idem_key)
except charge_customer.OperationalError as exc:
    # broker 接続失敗。アプリ側でフォールバック（DB アウトボックスに退避 等）
    logger.exception("failed to enqueue task: %r", exc)
    outbox.save_for_later(order_id, amount, idem_key)

expires is important too. For example, an "inventory notification" is meaningless 5 minutes later, so attaching expires=300 lets you auto-discard the old, meaningless task even if the worker is clogged and delayed, preventing wasted processing and cost on recovery.

10. Production tuning — prefetch, concurrency, time limit

prefetch — change the value by workload

For efficiency, a worker prefetches (reserves) more tasks than the number of execution slots. That amount is worker_prefetch_multiplier (default 4) × concurrency. The optimal value is the exact opposite depending on the workload.

Workload	Recommendation	Reason
Long-running tasks (minutes)	`worker_prefetch_multiplier = 1`	Prefetch makes one worker hoard tasks while others idle. The official docs also state "1 for long tasks"
Short, mass tasks (ms to s)	A large value like `50`–`128`	Earn throughput with prefetch. The official docs say "64 or 128 is reasonable in some cases"
Mixed long and short	Split workers	Officially recommended: stand up worker nodes for long and short tasks with separate configs and route them

# 決済（長め・厳格）：先読み 1、遅延 ack、低並行
$ celery -A proj worker -Q payments --prefetch-multiplier=1 --concurrency=4

# 通知（短く大量）：先読み多め、高並行
$ celery -A proj worker -Q notifications --prefetch-multiplier=64 --concurrency=8

Combining task_acks_late=True and worker_prefetch_multiplier=1 reduces hoarding from prefetch while raising reliability with "ack after execution" (premised on idempotency). Redis-only, worker_disable_prefetch, which fetches only when a slot is free, is also an option.

Memory countermeasures — periodically regenerate child processes

Run it long and the worker bloats from memory leaks in dependency libraries. The countermeasure is periodically regenerating child processes.

worker_max_tasks_per_child = 1000     # 1000 タスクごとに子を作り直す
worker_max_memory_per_child = 300000  # 300MB 超で子を作り直す（KB 単位）

But the official docs caution against excessive settings. For example, regenerating a child that takes 1 second to start "every task" handles only 60 tasks a minute. Measure the tradeoff between regeneration cost and memory stability and decide.

time limit — contain runaways

Prevents an infinite loop or hung task from continuing to occupy a worker.

task_soft_time_limit = 270   # ソフト：270 秒で SoftTimeLimitExceeded を送出（後始末の猶予）
task_time_limit = 300        # ハード：300 秒で worker ごと強制終了

The soft limit can be caught as an exception, allowing cleanup like rolling back a transaction or deleting temp files. The hard limit kills immediately without that grace. The standard is to set both, with soft smaller than hard.

from celery.exceptions import SoftTimeLimitExceeded


@app.task(soft_time_limit=270, time_limit=300)
def render_pdf(doc_id: str) -> None:
    try:
        do_render(doc_id)
    except SoftTimeLimitExceeded:
        cleanup_temp_files(doc_id)   # 後始末してから諦める
        raise

11. Observability — you can't operate what you can't see

Distributed tasks make "where it's clogged" hard to see. Celery comes with monitoring means as standard.

Flower — realtime web monitoring

$ pip install flower
$ celery -A proj flower --port=5555     # http://localhost:5555

You can check task progress, history, statistics, and worker state in a browser, and it provides remote control and an HTTP API. To use Flower, enable worker_send_task_events = True (set in §4).

inspect / status — peek into the cluster from the CLI

$ celery -A proj status                 # 稼働中ノード一覧
$ celery -A proj inspect active         # 実行中タスク
$ celery -A proj inspect scheduled      # ETA 待ちタスク
$ celery -A proj inspect reserved       # 先読み済み・実行待ちタスク
$ celery -A proj inspect stats          # worker 統計
$ celery -A proj inspect ping           # 死活確認

An operational tip: if the count of inspect reserved is swelling, it's a sign that prefetch is excessive and tasks are hoarded in the worker (§8). If scheduled is piling up, suspect overuse of eta/countdown (§3). Make metrics a substitute for intuition—this is the essence of observability.

In production, don't rely on Flower alone; visualize progress with task_track_started=True and combine with a Prometheus exporter and structured logs to continuously monitor latency, failure rate, and queue backlog—that's ideal.

12. Security — a task queue is an attack surface

A task queue is "a path for data crossing a trust boundary." Misdesign it and it becomes a serious vulnerability.

① Fix the serializer to json. Don't use pickle.

Historically, Celery's biggest vulnerability was arbitrary code execution (RCE) via pickle deserialization. Because pickle restores arbitrary Python objects, if an attacker who can break into the broker sends a malicious payload, code executes on the worker. Celery 4.0+ defaults to json; explicitly fix accept_content=["json"] to reject pickle (set in §4).

② Don't expose Redis externally / add auth and TLS

Redis has no auth by default. Unauthenticated Redis exposed to the internet has been a breeding ground for large-scale breaches in the past. Isolate it on the network (VPC / security group) and add auth and TLS.

# パスワード認証 + TLS（rediss://）
broker_url = "rediss://:STRONG_PASSWORD@redis.internal:6380/0"
broker_use_ssl = {"ssl_cert_reqs": "required"}

Don't write passwords or connection strings directly in the code; read them from environment variables or a secrets manager (it's CLAUDE.md's iron rule too).

③ Don't put secrets in task arguments

This is the most important point that tends to be overlooked. Task arguments are serialized and stored in the broker (Redis) and result backend, and displayed on Flower's screen too. Pass a plaintext password, API key, card number, or personal info as an argument, and it remains in Redis, shows in the monitoring screen, and consequently leaks unintentionally.

# アンチパターン：秘密情報が Redis と Flower に残る
send_email.delay(to="user@example.com", api_key="sk_live_xxx", ...)

# 正しい：ID だけを渡し、秘密情報は worker 側で安全に取得する
send_email.delay(notification_id="ntf_123")  # worker が ID から DB/Secrets を引く

The principle: what flows through the queue is a "reference (ID)," not the "contents (secret)." Also narrow the result lifetime with result_expires (§4) so PII-containing results don't linger.

13. Common pitfalls (FAQ)

Q. A task executes twice. Is it a bug? A. It's the spec. Celery is "at-least-once" delivery, and duplicates can occur via re-delivery, worker crashes, and acks_late. The right answer is to design tasks to be idempotent (§5).

Q. A task specified with eta/countdown runs many times. A. A delay exceeding visibility_timeout (default 1 hour) is the cause (§3). For reservations beyond a few minutes, use Beat or a DB scheduler.

Q. Periodic tasks execute twice each. A. Multiple Beats are running (§7). Limit Beat to a single instance.

Q. Memory usage keeps growing over time. A. Periodically regenerate child processes with worker_max_tasks_per_child / worker_max_memory_per_child (§8).

Q. Throwing a long task delays other tasks. A. Hoarding from prefetch. Set worker_prefetch_multiplier=1 for long-running tasks and, if possible, split workers and route (§8).

Q. I can't get results / they're gone. A. Confirm the result backend is unset, ignore_result=True, expiry via result_expires, or Redis's key eviction (§3).

Q. Sending fails when the broker is temporarily down. A. Catch OperationalError and fall back to a DB outbox, etc. (§9). Treat sending as a "failable operation."

Summary — the design principles running through Celery + Redis

Celery + Redis is not a framework that "works with .delay()"; it demands a design that faces the delivery semantics of a distributed system head-on. Let me organize the principles running through this article.

Principle	How it appears in this guide
Guard correctness with structure	Delegate idempotency to DB uniqueness constraints, gateway idempotency keys, atomic operations (§5)
Know delivery is "at-least-once"	Design that absorbs double execution as a normal path, not an anomaly (§5)
Classify failures	Sharply distinguish failures to retry (transient) from failures not to (permanent) (§6)
Prevent the herd	Avoid the thundering herd with jittered exponential backoff (§6)
Single source of truth (SSoT/DRY)	Config in one module, periodic execution in a single Beat, schedules consolidated in the DB (§4, §7)
Contain runaways	Protect workers with time limit, prefetch, child-process regeneration (§8)
Observability	Catch "clogs" with Flower, inspect, events as numbers, not intuition (§11)
The queue is an attack surface	json-only, Redis auth/TLS, no secrets in arguments (§12)
Deciding not to use is also tech selection	Choose other tools for light processing, strict ordering, far-future reservations (§1)

The quality of async processing is measured not by flashy features but by "that nothing happened"—no double charges, no message loss, the nightly batch quietly finishing. Guaranteeing that by design rather than by chance is the condition for a task queue that withstands production. I have proven this discipline of "dropping correctness into structure under distribution" on a payment foundation handling real money, in the form of zero double charges in production (the payment-reliability case study).

If you're stuck on designing an async foundation with Celery + Redis, improving the reliability of an existing system, or resolving failures around double execution or periodic execution, I can help from the design stage. From "just get it running" to "withstands production"—I help close that distance by the shortest, most accurate route.