# Flask Production Deployment in Practice: Gunicorn, Choosing a WSGI Server, ProxyFix, Docker, Graceful Shutdown

> An implementation guide to deploying Flask 3.1.x at production quality. It covers why you must abandon the development server, the separation between the WSGI app and the WSGI server, choosing among Gunicorn/Waitress/uWSGI/mod_wsgi, worker counts and gevent workers, correct ProxyFix configuration and operation behind an ALB, multi-stage Docker builds and non-root containers, and graceful shutdown via SIGTERM with zero-downtime deploys on ECS, all explained with real code faithful to the official documentation.

- Published: 2026-06-22
- Author: 友田 陽大
- Tags: Python, Flask, Gunicorn, Docker, WSGI, Production operations, AWS, Backend
- URL: https://tomodahinata.com/en/blog/flask-deployment-gunicorn-docker-production-wsgi-guide
- Category: Flask in production
- Pillar guide: https://tomodahinata.com/en/blog/flask-production-guide

## Key points

- Never use the development server (flask run / app.run()) in production. Flask is a WSGI 'app'; a 'server' like Gunicorn converts HTTP ↔ WSGI and runs it. This separation is the starting point of deployment design
- On Linux, Gunicorn is the standard production WSGI server; on Windows or for cross-platform needs, use Waitress. Start the worker count at CPU × 2; the default of 1 is not enough
- Behind a reverse proxy / ALB, add ProxyFix and set x_for and its sibling arguments to the exact number of trusted proxy hops. Get it wrong and X-Forwarded-For can be spoofed, which is a serious security problem
- Harden the production container with a multi-stage build, a slim base image, a non-root user, a Gunicorn entry point, and a HEALTHCHECK against /health. Inject secrets via environment variables (FLASK_ prefix)
- On SIGTERM, use graceful-timeout to drain in-flight requests before exiting. Separate readiness from liveness and achieve zero downtime with ECS rolling deploys

---

## **Introduction: Most Production Outages Are Decided by "How You Choose the Server"**

No matter how clean your Flask app's code is, **if you get the choice of server or the deployment setup wrong, production breaks silently**. Shipping with `flask run` as-is; leaving the worker count at 1 so the app clogs under load; every `request.remote_addr` showing the proxy's IP behind a load balancer; in-flight requests being cut on every deploy and surfacing as 502s. These outages come not from "app bugs" but from "a lack of deployment design."

This article is a spoke that deep-dives §8 "Deployment" of the [Flask production-operations guide (pillar)](/blog/flask-production-guide) to production quality. It covers only deployment design **faithful to the official documentation of Flask 3.1.x (the current stable)**. Concretely, it runs through, with real code, the choice of WSGI server, Gunicorn's worker design, the correct `ProxyFix` configuration, multi-stage Docker, and graceful shutdown that achieves zero downtime.

The author **designed and implemented, in Python / Flask / SQLAlchemy / PostgreSQL, the backend of a B2B SaaS that won the Minister of Economy, Trade and Industry Award, and ran its 221 endpoints in production on API Gateway → ALB → ECS (Fargate)**. What I show here is the design that this field experience of "keeping Flask running safely behind a proxy chain" required.

> 💡 **The versions covered in this article**: it assumes **Flask 3.1.x**. Flask 3.1 relies on Werkzeug for the WSGI/HTTP layer, and features tied directly to deployment, such as `ProxyFix` and `TRUSTED_HOSTS`, live at this layer. The WSGI server (Gunicorn, etc.) is not a Flask dependency; you install it separately.

---

## **1. The Big Premise: The Development Server Must Not Be Used in Production**

Everything starts from the clear warning in Flask's official documentation.

> "Do not use the development server when deploying to production. It is intended for use only during local development. It is not designed to be particularly secure, stable, or efficient."

The official docs also state flatly that "production means 'not development.'" Even an internal tool counts as production the moment it is exposed externally.

Both `flask run` and `app.run()` are this "development server that must not be used."

```python
# ❌ Never start this in production
if __name__ == "__main__":
    app.run()  # Werkzeug's development server: not secure, stable, or efficient
```

> ⚠️ **Anti-pattern**: making `if __name__ == "__main__": app.run()` the Docker `CMD` or the entry point of a process manager. That is exactly "exposing the development-only server to production traffic." Keep `app.run()` inside `if __name__ == "__main__":` so it runs **only when you execute the script directly on your own machine**, and never call it from a production path (Docker, systemd, ECS). Production startup is handled by the WSGI server, covered from the next section on.

### 1.1 Think of the WSGI "App" and the WSGI "Server" Separately

Why can't Flask alone run production? Because **Flask is a WSGI *application*, not a *server*.** In the official words, "Flask is a WSGI application; a WSGI server runs it."

The two roles are clearly divided.

| Layer | Role | Concrete form |
|---|---|---|
| WSGI **server** | Receives HTTP over TCP, converts the HTTP request into the WSGI `environ` dict and passes it to the app, and turns the returned response back into HTTP | Gunicorn / Waitress / uWSGI / mod_wsgi |
| WSGI **app** | A callable that receives `environ` (plus a `start_response` callback) and returns the response. Handles routing, views, templates | Your Flask `app` |

So production deployment comes down to **"loading your WSGI app (`app`) into a production-grade WSGI server."** You simply replace the development server behind `app.run()` with Gunicorn; the structure is that simple. How to load `app` when you assemble it with an application factory (`create_app`) is covered in §3.1 (for the factory's own design, see the [large-app structure guide](/blog/flask-application-factory-blueprints-large-app-structure-guide)).
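
To make the app/server split concrete, here is a minimal sketch that uses only the standard library. `tiny_app` is a hand-written WSGI callable with the same interface a Flask `app` exposes, and `call_like_a_server` is a hypothetical helper that plays the server's role for a single request. Both names are illustrative and do not come from Flask or Gunicorn.

```python
from wsgiref.util import setup_testing_defaults


def tiny_app(environ, start_response):
    """A hand-written WSGI application: receives environ, returns body chunks."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_like_a_server(app, path):
    """Play the server's role: build environ, call the app, collect the response."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


status, body = call_like_a_server(tiny_app, "/health")
print(status, body.decode())
```

A production WSGI server does the same thing at scale: it parses real HTTP from a socket into `environ`, calls the app, and writes the returned chunks back as an HTTP response.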

---

## **2. Choosing a WSGI Server: Gunicorn / Waitress / uWSGI / mod_wsgi**

The WSGI servers Flask officially lists as "self-hosting" options are **Gunicorn, Waitress, mod_wsgi, uWSGI, and gevent**. The two that become first candidates in practice are Gunicorn and Waitress; the rest are for specific requirements.

| Server | Platform | Characteristics | When to choose |
|---|---|---|---|
| **Gunicorn** | Linux / WSL (**not Windows**) | Pre-fork model. Rich worker types (sync / gevent, etc.). Straightforward config | **The default for Linux production.** Containers, ECS, EC2 |
| **Waitress** | Cross-platform (**Windows OK**) | Pure-Python implementation. Zero deps, thread-based | Windows servers, or when you want a pure-Python stack |
| **uWSGI** | Linux | Feature-rich and fast, but complex to configure (high learning cost) | Large setups needing multi-language support or advanced tuning |
| **mod_wsgi** | Apache httpd (as a module) | Embeds WSGI into Apache httpd | When building on an existing Apache deployment |

The selection guidance is simple.

- **Running in a container/VM on Linux → Gunicorn.** It is the main subject of this article. Configuration is straightforward, there is a wide choice of worker types, and it pairs well with container platforms like ECS/Fargate.
- **If you must run on Windows → Waitress.** The official docs also position Waitress as "a pure-Python option that runs on Windows." Gunicorn does not support Windows (though it works on WSL).
- **uWSGI is powerful but config-heavy.** Defaulting to uWSGI often isn't worth the learning cost; if Gunicorn meets the requirements, choose Gunicorn.
- **mod_wsgi presupposes Apache.** There is little reason to pick it for a greenfield project; it is the option for the special case of embedding into an existing Apache deployment.

> 💡 **The author's choice**: for the lumber-distribution SaaS, Linux containers (ECS/Fargate) were a given, so I adopted Gunicorn without hesitation. Waitress or uWSGI are worth considering only when there is a "Windows constraint" or an "extreme tuning requirement"; for today's typical container × Linux setup, Gunicorn is the de facto standard.

---

## **3. Configuring Gunicorn at Production Quality**

### 3.1 Basic Startup: Understanding the Load Syntax

Gunicorn's app specification takes the form `{module_import}:{app_variable}`. The equivalences shown in the official docs tell you exactly how to read it.

```bash
# Equivalent to 'from hello import app'
gunicorn -w 4 'hello:app'

# Equivalent to 'from hello import create_app; create_app()' (factory)
gunicorn -w 4 'hello:create_app()'
```

`hello` is the module (`hello.py`), and the part after the colon is "the WSGI app to retrieve from that module." As in the second command, **adding parentheses lets you use the result of calling a factory function** as the WSGI app. If you adopt an application factory (`create_app`), use the `'myapp:create_app()'` form.

### 3.2 Worker Count (`-w`): Start From CPU × 2

Gunicorn's default worker count is **1**. In the official words, "The default is only 1 worker, which is probably not what you want." In production, a single worker is a fatal bottleneck: while it processes one request, every other request has to wait.

The starting point is the official **`CPU × 2`** ("a starting value could be CPU * 2"). It is strictly a "starting point," from which you tune with load testing.

```bash
# On a 4-core host, start with 8 workers
gunicorn -w 8 'myapp:create_app()'
```

But the worker count has a **memory trade-off**. Gunicorn's default (sync worker) is pre-fork, where **each worker is a separate process holding the whole app in memory**. With 8 workers, roughly the app's memory footprint × 8 becomes the required memory. Always check whether `CPU × 2` fits within the container's memory limit (the ECS task definition, etc.).

> ⚠️ **"More workers = faster" is wrong**. Worker count plateaus at the CPU-core count and memory. Workers far exceeding the core count only increase context switching and memory consumption; throughput doesn't rise. If you need to scale, increase **the number of containers / tasks (horizontal scaling)**, not the worker count — this is the philosophy of ECS auto-scaling ([ECS Fargate production guide](/blog/aws-ecs-fargate-production-guide)).
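
The CPU and memory constraints above can be combined into one starting-point calculation. This is a rough sketch under stated assumptions: `suggest_workers` is a hypothetical helper, and the per-worker memory figure is something you would measure for your own app rather than a value taken from any documentation.

```python
def suggest_workers(cpu_count, memory_limit_mb, per_worker_mb, reserve_mb=0):
    """Starting worker count: CPU x 2, capped by what fits in memory.

    per_worker_mb is an assumed, measured footprint of one worker process.
    reserve_mb is headroom kept free for the master process and the OS.
    """
    cpu_based = cpu_count * 2
    memory_based = (memory_limit_mb - reserve_mb) // per_worker_mb
    return max(1, min(cpu_based, memory_based))


# 4 cores, 2048 MB container limit, roughly 300 MB per worker:
# CPU x 2 says 8, but only 6 workers fit in memory.
print(suggest_workers(cpu_count=4, memory_limit_mb=2048, per_worker_mb=300))
```

The result is only an initial value to feed into `GUNICORN_WORKERS`; load testing decides the final number.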

### 3.3 sync Workers vs. gevent Workers: Don't Conflate These

The choice of Gunicorn's worker type is the most misunderstood point. The official guidance is clear.

> "The default sync worker is appropriate for most use cases. If you need numerous, long running, concurrent connections, Gunicorn provides an asynchronous worker using gevent."

**First, the default sync worker is enough.** A sync worker is a model where "one worker processes one request to the end," ideal for CPU-bound processing and general APIs with fast responses. The 221 endpoints of the lumber-distribution SaaS were also handled by sync workers without issue for the most part.

The `gevent` worker is effective only when the requirement is **"numerous, long-running, concurrent connections."** Concretely, it's workloads where IO waiting dominates — long waits for external-API responses, maintaining many connections like long-polling or SSE.

```bash
# gevent worker (only when IO wait dominates)
gunicorn -k gevent 'myapp:create_app()'
```

When using `gevent`, you need `greenlet>=1.0` in the dependencies; per the Flask docs, context locals such as `request` will not work as expected without it.

Here is a point you **must absolutely not conflate**. The official docs nail it down explicitly.

> "This is NOT the same as Python's async/await, or the ASGI server spec."

In other words, **the `gevent` worker is neither Python's `async`/`await` nor ASGI**. `gevent` is a mechanism that multiplexes IO waiting via cooperative multitasking with greenlets, and it is a different thing from `async def` views (covered in §7) and from ASGI frameworks like FastAPI. The naive choice of "gevent because I want async" breaks the code's premises. If you bring in `gevent`, do so understanding the premise that blocking calls are cooperatively scheduled via monkey-patching.

> 💡 **On threads and eventlet**: Gunicorn has `--threads` (threads per worker) and an `eventlet` worker too, but **these are not in Flask's official documentation**. They are the domain of Gunicorn's own documentation (docs.gunicorn.org). When this article touches `--threads` or `eventlet`, distinguish them as "Gunicorn features," not "Flask guidance." What Flask official mentions is only sync and gevent.

### 3.4 Bind and "Don't Run as Root"

To make Gunicorn accessible from outside, specify the bind target with `-b` (`--bind`).

```bash
gunicorn -w 4 -b 0.0.0.0 'myapp:create_app()'
```

There are 2 serious warnings here, both noted verbatim by the official docs.

> "Gunicorn should not be run as root..."

**You must not run Gunicorn as root.** If the process is ever hijacked, with root privileges the damage reaches all permissions. As the principle of least privilege, start it as a dedicated unprivileged user (the concrete measure in Docker is §5).

> "Don't [bind to 0.0.0.0] when using a reverse proxy setup, otherwise it will be possible to bypass the proxy."

**Behind a reverse proxy, you must not bind to `0.0.0.0`.** `0.0.0.0` listens on all network interfaces, so Gunicorn can be reached directly without going through the proxy, bypassing all of the authentication, WAF, and header shaping the proxy applies. Behind a proxy, bind to **an address reachable only from the proxy** (a Unix socket, `127.0.0.1`, or a dedicated port on the container-internal network).

```bash
# If on the same host as the proxy, restrict to localhost
gunicorn -w 4 -b 127.0.0.1:8000 'myapp:create_app()'

# Unix socket (typical when co-located with nginx)
gunicorn -w 4 -b unix:/run/myapp.sock 'myapp:create_app()'
```

> ⚠️ **`0.0.0.0` in a container is context-dependent**. As with ECS/Fargate, where "ALB → a specific container port" is the only connectivity path and the container's network is closed by a security group, binding to `0.0.0.0:8000` inside the container is itself common. What's dangerous is the state of "binding to `0.0.0.0` while that port is directly reachable from outside." Whether you can **block direct reach from anywhere but the proxy at the network boundary (SG, VPC)** is the criterion.

### 3.5 Access Logs and Timeouts

Gunicorn's **access logs are off by default**. In container operations the standard is to aggregate logs to standard output, so enable it with `--access-logfile=-` (`-` is stdout).

```bash
gunicorn -w 4 --access-logfile=- 'myapp:create_app()'
```

Distinguish 2 kinds of timeout.

- **`--timeout`** (default 30 seconds): if a worker doesn't respond for this many seconds, the master kills and restarts it. You can extend it for long-running processing, but **carelessly lengthening it lets a stuck worker keep occupying its slot**, so the right move is to offload long processing to an asynchronous job (queue) rather than running it synchronously inside a worker.
- **`--graceful-timeout`** (default 30 seconds): the grace period, on restart/shutdown, for in-flight requests to drain. It's the core parameter of graceful shutdown (§6).

### 3.6 `gunicorn.conf.py`: Manage Config in Code

When the command-line flags grow, consolidate them in a config file `gunicorn.conf.py`. It's a more review-friendly, more reproducible setup than scattering them on the CLI.

```python
# gunicorn.conf.py: manage production settings centrally in code
import multiprocessing
import os

# Bind: container-internal port. External reachability is controlled at the network boundary (ALB/SG)
bind = os.getenv("GUNICORN_BIND", "0.0.0.0:8000")

# Worker count: start from CPU × 2, overridable via environment variable
workers = int(os.getenv("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2))

# Worker class: sync by default. Set "gevent" via environment variable if IO wait dominates
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "sync")

# Send all logs to stdout/stderr (the container's log driver aggregates them)
accesslog = "-"
errorlog = "-"
loglevel = os.getenv("GUNICORN_LOG_LEVEL", "info")

# Timeout: kill stuck workers. Assumes long-running work is offloaded to a queue
timeout = int(os.getenv("GUNICORN_TIMEOUT", "30"))

# Graceful-shutdown grace period (§6). Align with the ALB's drain time
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))

# Process name (easy to spot in ps)
proc_name = "myapp"
```

```bash
# The config file is read automatically (gunicorn.conf.py in the current directory)
gunicorn 'myapp:create_app()'

# To specify it explicitly
gunicorn -c gunicorn.conf.py 'myapp:create_app()'
```

The design of observability (structuring logs and adding request IDs) is split out into the [error-handling / observability guide](/blog/flask-error-handling-logging-observability-guide). Tying Gunicorn's access logs and the Flask app's structured logs together with the same correlation ID is the crux in production.

---

## **4. Running Behind a Reverse Proxy / Load Balancer**

### 4.1 Why Put a Proxy in Front

A WSGI server has an HTTP server built in. But the official docs say:

> "WSGI servers have HTTP servers built-in. However, a dedicated HTTP server may be safer, more efficient, or more capable. Putting an HTTP server in front of the WSGI server is called a 'reverse proxy.'"

**Putting a dedicated HTTP server (reverse proxy) in front makes the setup safer, more efficient, and more capable**: you delegate TLS termination, static-file serving, rate limiting, buffering, and health checks to the proxy layer, so Gunicorn can focus on app processing. As front-end options the official docs list **nginx and Apache httpd**, and note that PaaS offerings (Cloud Run, Elastic Beanstalk, App Engine, Azure, etc.) are likewise proxy setups. They also include one important sentence.

> "You'll probably need to Tell Flask it is Behind a Proxy when using most hosting platforms."

This "telling Flask it is behind a proxy" is the job of `ProxyFix`, covered next.

The author's setup was a multi-stage proxy of **API Gateway → ALB → ECS (Fargate)**. Because there are multiple relays between client and app, the `ProxyFix` configuration required particular care.

### 4.2 ProxyFix: Tell It the Proxy Hop Count Correctly

Behind a proxy, the client's original information (source IP, protocol, host) arrives at the app stored in headers like `X-Forwarded-For` / `X-Forwarded-Proto` / `X-Forwarded-Host`. The middleware that makes Flask (Werkzeug) interpret these correctly is `ProxyFix`. The form the official docs show is this.

```python
from werkzeug.middleware.proxy_fix import ProxyFix

app.wsgi_app = ProxyFix(
    app.wsgi_app, x_for=1, x_proto=1, x_host=1, x_prefix=1
)
```

Each `x_*` argument is **"the COUNT of proxies setting that X-Forwarded-* header."**

| Argument | Corresponding header | Meaning |
|---|---|---|
| `x_for` | `X-Forwarded-For` | The client's source IP (reflected in `request.remote_addr`) |
| `x_proto` | `X-Forwarded-Proto` | The original scheme (`http`/`https`) |
| `x_host` | `X-Forwarded-Host` | The original `Host` header |
| `x_prefix` | `X-Forwarded-Prefix` | The path prefix the proxy stripped |

And the official warning gets at the essence of this configuration.

> "This middleware should only be used if the application is actually behind a proxy, and should be configured with the number of proxies that are chained in front of it. Since incoming headers can be faked, you must set how many proxies are setting each header so the middleware knows what to trust. ... It can be a security issue if you get this configuration wrong."

There are three points.

1. **Use it only when actually behind a proxy.** Put `ProxyFix` in when there's no proxy, and it will believe the fake `X-Forwarded-*` the client sent.
2. **Set the front-line proxy hop count accurately.** Because headers can be spoofed, fix "how many proxies set each header" as a number in `x_for` etc., and trust only that many from the end.
3. **Getting the config wrong becomes a security problem.** Over-estimate the hop count, and you treat a fake IP injected by the client as genuine, deceiving all of IP-based rate limiting, audit logs, and access control.

> ⚠️ **Never miscount the hops**. `x_for=1` means "1 proxy sets `X-Forwarded-For`." If only the ALB is in front, `x_for=1`. If the **2 stages of API Gateway → ALB** both stack `X-Forwarded-For`, measure the setup and set the appropriate hop count (in many cases `x_for=2`). "1 for everything just in case" or "a big number just in case" is forbidden; **decide the number only after confirming, with the actual request headers, how many hops your proxy chain stacks each header in**. The author observed the actual `X-Forwarded-For` value in a production-equivalent environment and fixed the hop count.
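
Why over-estimating the hop count is dangerous is easiest to see on a concrete header. The sketch below is a simplified model of the "trust only N values from the end" rule, not Werkzeug's implementation; `pick_trusted_value` is a hypothetical helper and the IP addresses are made-up documentation examples.

```python
def pick_trusted_value(header_value, trusted_hops):
    """Return the entry `trusted_hops` positions from the right, or None.

    None models "not enough values to trust", in which case the app
    would keep the socket peer address instead.
    """
    values = [part.strip() for part in header_value.split(",")]
    if trusted_hops < 1 or len(values) < trusted_hops:
        return None
    return values[-trusted_hops]


# A client at 203.0.113.7 sends a forged "X-Forwarded-For: 1.2.3.4".
# The single ALB in front appends the address it actually saw.
header_at_app = "1.2.3.4, 203.0.113.7"

correct = pick_trusted_value(header_at_app, trusted_hops=1)
overestimated = pick_trusted_value(header_at_app, trusted_hops=2)
print(correct)        # 203.0.113.7 (the real client)
print(overestimated)  # 1.2.3.4 (the value the attacker injected)
```

With one proxy and a hop count of 2, the trusted position falls on a value the client controls, which is exactly the spoofing case described above.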

### 4.3 Mapping to ALB / nginx

The `ProxyFix` configuration corresponds directly to the front-line setup.

- **Behind an AWS ALB (ECS/Fargate)**: the ALB terminates TLS and adds `X-Forwarded-For` / `X-Forwarded-Proto`. If there's an API Gateway or CloudFront in front of the ALB and they too stack `X-Forwarded-For`, set their total hop count in `x_for`. Because TLS is terminated at the ALB, it arrives at Gunicorn/Flask as plaintext (HTTP). That's exactly why you need to tell Flask, via `ProxyFix`, with `x_proto`, that "it was originally HTTPS."
- **Behind nginx (same host or VM)**: set `proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;` etc. in nginx, have Gunicorn listen on `127.0.0.1` or a Unix socket, and set `ProxyFix` to `x_for=1` (nginx, 1 hop).

### 4.4 Cookie / Host Validation on TLS Termination

In a setup where the proxy terminates TLS, don't forget to harden 2 security settings on the Flask side.

- **`SESSION_COOKIE_SECURE = True`**: since production presupposes HTTPS, attach the `Secure` attribute to the session cookie and don't let it be sent in plaintext. TLS is terminated at the ALB/nginx, but as long as `x_proto` conveys to Flask that the origin was HTTPS, the `Secure` cookie functions correctly.
- **`TRUSTED_HOSTS`** (added in Flask 3.1): validates the `Host` header during routing and prevents Host-header attacks (using a fake `Host` to target link generation or cache poisoning). Behind a proxy, `Host` can be manipulated from outside, so specify the allowed hosts explicitly.

```python
app.config.update(
    SESSION_COOKIE_SECURE=True,    # HTTPS-only cookie
    SESSION_COOKIE_HTTPONLY=True,
    SESSION_COOKIE_SAMESITE="Lax",
    TRUSTED_HOSTS=["api.example.com"],  # Host header validation (3.1+)
)
```

The detailed design of cookie attributes, CSRF, and `TRUSTED_HOSTS` is deep-dived in the [security implementation guide](/blog/flask-security-sessions-csrf-secure-cookies-guide).

> 💡 **A behavior change in `SERVER_NAME` (Flask 3.1)**: in Flask 3.1, setting `SERVER_NAME` no longer restricts requests to that domain when `host_matching=True` or `subdomain_matching=False`. For the production purpose of "accept only a specific host," the correct answer is **`TRUSTED_HOSTS`**, not `SERVER_NAME`.

---

## **5. Docker: Assembling the Production Container Correctly**

If you put it on a container platform like ECS/Fargate, how you build the image governs production quality. The requirements are "small, non-root, doesn't include the development server, has a health check." A multi-stage build satisfies these.

```dockerfile
# syntax=docker/dockerfile:1

# ---- builder: stage that builds and installs dependencies ----
FROM python:3.12-slim AS builder

# Keep build-only tooling and settings confined to this stage
ENV PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /app

# Install dependencies into a virtual environment (copied wholesale to the next stage)
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn

# ---- runtime: final stage containing only what is needed to run ----
FROM python:3.12-slim AS runtime

# Create a non-root user (satisfies "don't run as root" from §3.4)
RUN useradd --create-home --uid 10001 appuser

# Bring over only the installed virtual environment from builder.
# PYTHONPATH assumes the package lives at src/myapp; adjust it to your layout
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONPATH="/app/src" \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

WORKDIR /app
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser gunicorn.conf.py ./

# Drop root here. Every process from this point runs as the unprivileged user
USER appuser

EXPOSE 8000

# Hit /health to confirm the process is alive (ties in with readiness/liveness in §6)
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8000/health').status==200 else 1)"

# Never use the development server. Start the factory with Gunicorn
CMD ["gunicorn", "-c", "gunicorn.conf.py", "myapp:create_app()"]
```

To summarize the key points:

- **Multi-stage**: use compilers and build tools in `builder`, and bring only the virtual environment (`/opt/venv`) into `runtime`. No build tools remain in the final image, shrinking size and attack surface.
- **slim base**: keep the foundation small with `python:3.12-slim`. `alpine` can struggle to build psycopg etc. due to its musl libc origin, so for Python `slim` is the safe bet.
- **Non-root `USER`**: switch to the `appuser` created by `useradd` with `USER`, satisfying §3.4's "don't run as root" at the container level.
- **Doesn't include the development server**: `CMD` is Gunicorn. Neither `flask run` nor `app.run()` appears.
- **`HEALTHCHECK`**: hit the `/health` endpoint (the `@app.route("/health")` defined in pillar §3) and confirm a 200. On ECS you configure a health check separately in the task definition, but having one at the Docker level as well means it also works locally and under Compose.

### 5.1 `.dockerignore`

To keep the build context clean and to make sure secrets and unneeded files never enter the image, always add a `.dockerignore`.

```text
.git
.gitignore
__pycache__/
*.pyc
.venv/
venv/
.env
.env.*
instance/
tests/
.pytest_cache/
*.md
Dockerfile
.dockerignore
```

> ⚠️ **Always exclude `.env` and `instance/`**. If `.env` (secrets) or `instance/` (local config, SQLite) gets baked into the image, secrets leak to the registry. **Don't put secrets in the image; inject them at runtime via environment variables**.

### 5.2 Inject Config via `FLASK_`-Prefixed Environment Variables

Don't bake secrets and environment differences into the image; **pass them at runtime via environment variables**, as the 12-factor principle prescribes. With `from_prefixed_env()` (available since Flask 2.1), you can auto-import environment variables starting with `FLASK_` into `app.config` (values are parsed with `json.loads`, so they arrive typed). For the full picture of config management, see the pillar [Flask production-operations guide §4](/blog/flask-production-guide).
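
To show what "typed via `json.loads`" means in practice, here is a simplified model of the prefix-and-parse behaviour. It is not Flask's implementation (for example, it ignores nested keys), and `load_prefixed_env` plus the sample values are hypothetical.

```python
import json


def load_prefixed_env(environ, prefix="FLASK"):
    """Collect PREFIX_* variables, parsing each value as JSON when possible."""
    prefix = f"{prefix}_"
    config = {}
    for key in sorted(environ):
        if not key.startswith(prefix):
            continue
        raw = environ[key]
        try:
            value = json.loads(raw)   # "false" -> False, "1048576" -> 1048576
        except json.JSONDecodeError:
            value = raw               # anything that is not valid JSON stays a string
        config[key[len(prefix):]] = value
    return config


sample_env = {
    "FLASK_DEBUG": "false",
    "FLASK_MAX_CONTENT_LENGTH": "1048576",
    "FLASK_SECRET_KEY": "abc123-not-json",
    "GUNICORN_WORKERS": "8",  # no FLASK_ prefix, so it is ignored
}
print(load_prefixed_env(sample_env))
```

One consequence worth keeping in mind: a value that happens to be valid JSON, such as a purely numeric string, is converted to that JSON type rather than kept as text.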

```bash
# Assumes injection from the ECS task definition / Secrets Manager
docker run --rm -p 8000:8000 \
  -e FLASK_SECRET_KEY="$(python -c 'import secrets; print(secrets.token_hex())')" \
  -e FLASK_SQLALCHEMY_DATABASE_URI="postgresql+psycopg://..." \
  -e GUNICORN_WORKERS=8 \
  myapp:latest
```

The concrete steps to run this container as a Fargate task and register it as an ALB target are consolidated in the [ECS Fargate production guide](/blog/aws-ecs-fargate-production-guide).

> 💡 **Beware the multiplication of DB connection pool × worker count**: because Gunicorn has **each worker as an independent process**, SQLAlchemy's connection pool is also **held separately per worker**. `pool_size=5` × 8 workers × 4 tasks = up to 160 connections, eating up PostgreSQL's `max_connections` — this is an accident that frequently occurs in production. In a multi-worker × multi-task setup, the standard is to interpose a connection pooler like PgBouncer. Details are compiled in the [PostgreSQL connection-pooling guide](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide).
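
The multiplication is easy to check before a deploy. The sketch below is a back-of-the-envelope helper (`max_db_connections` is a hypothetical name). It also accepts SQLAlchemy's `max_overflow`, which defaults to 10 for the standard queue pool and raises the ceiling beyond the `pool_size`-only figure quoted above.

```python
POSTGRES_DEFAULT_MAX_CONNECTIONS = 100  # PostgreSQL's default max_connections


def max_db_connections(pool_size, workers_per_task, tasks, max_overflow=0):
    """Worst-case connection count: every worker process owns its own pool."""
    per_worker = pool_size + max_overflow
    return per_worker * workers_per_task * tasks


base = max_db_connections(pool_size=5, workers_per_task=8, tasks=4)
with_overflow = max_db_connections(pool_size=5, workers_per_task=8, tasks=4,
                                   max_overflow=10)
print(base, base > POSTGRES_DEFAULT_MAX_CONNECTIONS)
print(with_overflow, with_overflow > POSTGRES_DEFAULT_MAX_CONNECTIONS)
```

Running this kind of calculation against the database's actual `max_connections` is a quick way to decide whether a pooler is needed before scaling out tasks.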

---

## **6. Graceful Shutdown: The Core of Zero Downtime**

If you get 502/504 on every deploy, the cause is almost always "bringing the container down while requests are still in progress." Zero downtime is achieved when **"the new container becomes ready to accept traffic, and only then does the old container drain its in-flight requests and exit quietly."**

### 6.1 SIGTERM and graceful-timeout

A container orchestrator (ECS, Kubernetes) first sends **`SIGTERM`** when stopping a container. On receiving this, Gunicorn

1. stops accepting new requests,
2. **drains the in-progress (in-flight) requests within the grace of `--graceful-timeout`**,
3. terminates the process once all workers are cleaned up,

going down **gracefully** in this order. Workers that don't finish past the grace are force-killed.
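
The effect of the grace period can be modelled in a few lines. This is a deliberately simplified simulation of the sequence above, not Gunicorn's code: `simulate_sigterm`, `ShutdownReport`, and the request IDs are hypothetical, and each request is reduced to "seconds still needed when SIGTERM arrives."

```python
from dataclasses import dataclass, field


@dataclass
class ShutdownReport:
    drained: list = field(default_factory=list)       # finished within the grace period
    force_killed: list = field(default_factory=list)  # still running when it expired


def simulate_sigterm(remaining_seconds_by_request, graceful_timeout):
    """Classify in-flight requests by whether they finish before the grace ends."""
    report = ShutdownReport()
    for request_id, remaining in remaining_seconds_by_request.items():
        if remaining <= graceful_timeout:
            report.drained.append(request_id)
        else:
            report.force_killed.append(request_id)
    return report


in_flight = {"req-a": 0.2, "req-b": 12, "req-c": 45}
report = simulate_sigterm(in_flight, graceful_timeout=30)
print(report.drained)       # ['req-a', 'req-b']
print(report.force_killed)  # ['req-c']
```

Anything that routinely lands in the force-killed bucket is a candidate for a longer grace period or, better, for moving to a background queue.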

```python
# gunicorn.conf.py (excerpt, shown again)
import os

# Align with the ALB's deregistration delay
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))
```

What matters here is to **align Gunicorn's `graceful_timeout` with the load balancer's drain time (the ALB's deregistration delay)**. If the grace from the ALB stopping new routing to a target until in-flight is drained diverges from Gunicorn's grace, you get a mismatch — the process vanishing before draining, or the ALB still routing.

> 💡 **Make PID 1 Gunicorn so it can be brought down by `SIGTERM`**. With Docker's `CMD ["gunicorn", ...]` (exec form), Gunicorn becomes PID 1 and can receive `SIGTERM` directly. The shell form (`CMD gunicorn ...`) makes the shell PID 1, the signal isn't conveyed to Gunicorn, and **graceful shutdown stops working** — a common container pitfall. This article's Dockerfile uses the exec form (JSON array) for this reason.

### 6.2 When You Do Cleanup After SIGTERM on the App Side

Normally, leaving it to Gunicorn's graceful processing is enough, but if explicit cleanup at worker exit (closing connections, etc.) is needed, close resources with a `gunicorn.conf.py` hook (`worker_exit`, etc.) or, on the app side, with `teardown_appcontext` (see pillar §5). For resources like DB connections that "should be bound to the request's lifetime," if you manage them with `g` and `teardown_appcontext` in the first place, you don't need to handle them individually at shutdown.
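
A hook-based cleanup might look like the sketch below, written as an excerpt for `gunicorn.conf.py`. `worker_exit(server, worker)` is one of Gunicorn's server hooks; `release_resources` and the `released` list are hypothetical placeholders standing in for whatever your app actually needs to close.

```python
# gunicorn.conf.py (excerpt): a sketch of an explicit cleanup hook
released = []  # illustration only: records what was cleaned up


def release_resources():
    """Hypothetical placeholder: close clients, flush buffers, and so on."""
    released.append("resources")


def worker_exit(server, worker):
    """Gunicorn server hook, called in the worker process just after it exits."""
    worker.log.info("worker %s exiting; releasing resources", worker.pid)
    release_resources()
```

Keeping the hook small and side-effect free makes it less likely that cleanup itself delays shutdown past the grace period.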

### 6.3 Separate readiness and liveness

Health checks are split into 2 kinds by purpose. Conflate them and you get traffic flowing to a still-starting container, or a merely temporarily-slow container being killed.

| Kind | Question | Behavior on failure | What it checks |
|---|---|---|---|
| **readiness** | "Can it receive requests now?" | The load balancer **doesn't send** traffic | Confirms up to DB connection and connectivity of required deps |
| **liveness** | "Is the process alive?" | **Restarts** the container | A lightweight check of whether the process responds |

```python
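from sqlalchemy import text  # `db` is assumed to be the app's Flask-SQLAlchemy instance

# liveness: lightweight. Returns 200 as long as the process responds
@app.route("/health")
def liveness():
    return {"status": "ok"}


# readiness: also checks connectivity to dependencies. Returns 503 right after startup or when the DB is down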
@app.route("/ready")
def readiness():
    try:
        db.session.execute(text("SELECT 1"))
    except Exception:
        return {"status": "not ready"}, 503
    return {"status": "ready"}
```

> ⚠️ **Don't put a heavy dependency check in liveness**. Put DB connectivity into liveness, and a merely temporarily-slow DB makes the container judged "dead" and falls into a **restart loop**. Handle a DB outage with readiness (stop traffic), and have liveness look only at the process's life-or-death — this separation is the crux of stable operation. Health-check design is also covered in the [error-handling / observability guide](/blog/flask-error-handling-logging-observability-guide).

### 6.4 Tying It to a Rolling Deploy (ECS)

Applying everything so far to an ECS rolling deploy, the zero-downtime flow becomes this.

1. ECS starts a new task (the new image's container).
2. Once the ALB confirms "acceptable" via **readiness (`/ready`)**, it registers the new task as a target and begins flowing traffic.
3. **`SIGTERM`** to the old task. The ALB begins deregistration and stops sending new requests to the old task.
4. The old task's Gunicorn **drains in-flight within the grace of `graceful_timeout`** and quietly terminates.

When these 4 steps mesh, users see no errors during the deploy. The concrete task definition / service / deploy settings on Fargate are covered in the [ECS Fargate production guide](/blog/aws-ecs-fargate-production-guide).

---

## **7. `async def` Views: When You May and May Not Use Them**

Flask supports `async def` views from 2.0 (`pip install flask[async]` is needed). But misunderstand this as a deployment-performance improvement and you'll get hurt. The official caution is decisive.

**Even with an `async` view, each request still monopolizes one worker.** Going `async` **does not increase the number of requests you can handle concurrently**. `async` is effective only when "within a single view, you run multiple IOs (external API calls, etc.) in parallel" — it's not an improvement to throughput (concurrent request count).

```python
import asyncio
import httpx


@app.route("/aggregate")
async def aggregate():
    # Call three external APIs concurrently within one view; this is where async pays off
    async with httpx.AsyncClient() as client:
        a, b, c = await asyncio.gather(
            client.get("https://api.example.com/a"),
            client.get("https://api.example.com/b"),
            client.get("https://api.example.com/c"),
        )
    return {"a": a.json(), "b": b.json(), "c": c.json()}
```

There's a further constraint. **Even if you spawn a background task with `asyncio.create_task()`, it is canceled the moment the view returns.** Flask's `async` view is for "IO parallelism during request processing," not a place for fire-and-forget background processing.
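
The cancellation can be reproduced without Flask. The sketch below is an analogy under stated assumptions: it uses plain `asyncio.run()` to stand in for the short-lived event loop that runs one async view, and `view_like` / `background_job` are hypothetical names. Flask's internal mechanism differs in detail, but the effect described above is the same: when the loop that ran the view shuts down, unfinished tasks are cancelled.

```python
import asyncio

events = []  # records what happened to the background job


async def background_job():
    try:
        await asyncio.sleep(0.2)  # pretend to do slow work after the response
        events.append("finished")
    except asyncio.CancelledError:
        events.append("cancelled")
        raise


async def view_like():
    # Fire-and-forget: nothing awaits this task before the "view" returns.
    asyncio.create_task(background_job())
    await asyncio.sleep(0)  # yield once so the task actually starts
    return {"status": "accepted"}


result = asyncio.run(view_like())
print(result, events)
```

The job never reaches `"finished"`. For work that must outlive the response, hand it to a task queue instead of the view's event loop.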

The judgment is simple.

- **May use**: when you want to parallelize multiple independent external IOs within a single view to shrink that view's latency.
- **Must not use / should take another approach**: when you want the whole app to be async-first, to handle a large number of concurrent connections, or to run background jobs.

If you want the whole app to be async-first, the official docs recommend **Quart**, which offers a Flask-compatible API and is built on ASGI. Alternatively, to run an existing Flask app under an ASGI server, you can use `asgiref`'s `WsgiToAsgi` adapter. For a large number of concurrent connections there's also the option of Gunicorn's `gevent` worker, but that's a different thing from `async`/`await` (§3.3), so don't conflate them. The use cases of Flask / FastAPI / Django themselves are compiled in the [technology-selection guide](/blog/flask-vs-fastapi-vs-django-comparison-guide).

---

## **8. Production Checklist**

Here is the design so far as a checklist to confirm, without exception, before every deploy.

| Category | Check item | Reference (section) |
|---|---|---|
| Server | Excluded `flask run` / `app.run()` from the production path | §1 |
| Server | Chose a WSGI server (Gunicorn on Linux) | §2 |
| Gunicorn | Set worker count from `CPU × 2`, fits within the memory limit | §3.2 |
| Gunicorn | Chose the worker type by requirement (default sync, gevent if IO-heavy) | §3.3 |
| Gunicorn | Not conflating gevent with `async`/`await` / ASGI | §3.3 |
| Gunicorn | Not running as root (non-root user) | §3.4 / §5 |
| Gunicorn | Not directly exposed on `0.0.0.0` behind a proxy | §3.4 |
| Gunicorn | Emitting access logs to stdout (`--access-logfile=-`) | §3.5 |
| Proxy | Put in `ProxyFix` and set `x_for` etc. correctly to the proxy hop count | §4.2 |
| Proxy | Verified `ProxyFix`'s hop count with actual request headers | §4.2 |
| Proxy | Set `SESSION_COOKIE_SECURE=True` / `TRUSTED_HOSTS` | §4.4 |
| Docker | Multi-stage, slim base, non-root `USER` | §5 |
| Docker | Doesn't include the development server; `CMD` is Gunicorn (exec form) | §5 / §6.1 |
| Docker | Excluded `.env` / `instance/` in `.dockerignore` | §5.1 |
| Docker | Secrets not baked into the image; injected via env vars (`FLASK_` prefix) | §5.2 |
| DB | Worker count × task count × pool size within `max_connections` | §5.2 |
| Stop | Goes down gracefully on `SIGTERM`, PID 1 is Gunicorn | §6.1 |
| Stop | Aligned `graceful_timeout` with the ALB's drain time | §6.1 |
| Stop | Separated readiness and liveness | §6.3 |
| async | Understand the `async def` view's "1-worker monopoly" constraint | §7 |
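
The DB row is the one item in this table that comes down to arithmetic, so here is a minimal sketch of it. `max_db_connections` is a hypothetical helper, the numbers are made up, and `max_overflow` assumes a pool that can grow beyond `pool_size` (as SQLAlchemy's `QueuePool` does). It also assumes each Gunicorn worker is a separate process holding its own pool.

```python
def max_db_connections(
    workers_per_task: int,
    tasks: int,
    pool_size: int,
    max_overflow: int = 0,
) -> int:
    """Worst-case connection count: every worker fills its own pool."""
    return workers_per_task * tasks * (pool_size + max_overflow)


# Hypothetical deployment: 4 workers per task, 3 tasks, pool of 5 per worker.
steady = max_db_connections(workers_per_task=4, tasks=3, pool_size=5)

# During a rolling deploy old and new tasks overlap, so budget for more tasks.
deploying = max_db_connections(workers_per_task=4, tasks=6, pool_size=5)

print(steady, deploying)  # 60 120
```

Compare the larger figure, not the steady-state one, against the database's `max_connections`.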

---

## **Summary: Deployment Is the Design of "Outside the App"**

Most outages in Flask production deployment are born not from the app's code but from **"the design of the environment that runs the app."** Let me summarize the article's discipline one line at a time.

1. **Abandon the development server**. `flask run` / `app.run()` is forbidden in production. Flask is a WSGI *app*; a WSGI *server* like Gunicorn runs it.
2. **Configure Gunicorn at production quality**. Workers from `CPU × 2`, default sync is enough, gevent only when IO-heavy. Don't run as root, don't directly expose behind a proxy.
3. **Tell Flask the proxy hop count correctly with `ProxyFix`**. `x_for` etc. is "the number of trusted proxies." Get it wrong and `X-Forwarded-For` can be spoofed, collapsing IP-based control.
4. **Assemble the container correctly**. Multi-stage, slim, non-root, development server excluded, `/health` health check. Don't bake secrets; inject via env vars.
5. **Go down gracefully**. With `SIGTERM` and `graceful_timeout`, drain in-flight, separate readiness/liveness, and achieve zero downtime with ECS rolling deploys.
6. **Don't over-trust `async` views**. Understand the 1-worker-monopoly constraint, and if async-first is the requirement, consider Quart/ASGI.

The reason I could stably operate 221 endpoints on API Gateway → ALB → ECS (Fargate) is that I designed this "outside the app" one piece at a time. Invest in deployment quality as much as in code quality. The overall picture is in the [Flask production-operations guide (pillar)](/blog/flask-production-guide), and the specifics are in the spoke articles linked from there.