友田 陽大
Flask in production
Python
Flask
Gunicorn
Docker
WSGI
Production operations
AWS
Backend

Flask Production Deployment in Practice: Gunicorn, Choosing a WSGI Server, ProxyFix, Docker, Graceful Shutdown

An implementation guide to deploying Flask 3.1.x at production quality. It covers why you abandon the development server, the separation of the WSGI app from the server, choosing among Gunicorn/Waitress/uWSGI/mod_wsgi, worker counts and gevent workers, correct ProxyFix configuration and operation behind an ALB, multi-stage Docker builds and non-root containers, and graceful shutdown via SIGTERM with zero-downtime deploys on ECS, all explained with real code faithful to the official documentation.

Reading time: 25 min read
Author: 友田 陽大

Introduction: Most Production Outages Are Decided by "How You Choose the Server"

No matter how clean your Flask app's code is, production breaks silently if you get the choice of server or the deployment setup wrong. Shipping with flask run as-is; leaving the worker count at 1 so requests back up under load; request.remote_addr showing the proxy's IP for every request behind a load balancer; in-progress requests being cut off on every deploy and producing 502s. These outages come not from "app bugs" but from "a lack of deployment design."

This article is a spoke that deep-dives §8 "Deployment" of the Flask production-operations guide (pillar) to production quality. It covers only deployment design faithful to the official documentation of Flask 3.1.x (the current stable). Concretely, it runs through, with real code, the choice of WSGI server, Gunicorn's worker design, the correct ProxyFix configuration, multi-stage Docker, and graceful shutdown that achieves zero downtime.

The author designed and implemented, in Python / Flask / SQLAlchemy / PostgreSQL, the backend of a B2B SaaS that won the Minister of Economy, Trade and Industry Award, and ran its 221 endpoints in production on API Gateway → ALB → ECS (Fargate). What follows is the design that this field experience required for "keeping Flask running safely behind a proxy chain."

💡 The versions covered in this article: it assumes Flask 3.1.x. Flask 3.1 has Werkzeug at the WSGI/HTTP layer, and features directly tied to deployment, like ProxyFix and TRUSTED_HOSTS, live here. The WSGI server (Gunicorn, etc.) is not a Flask dependency; it's installed separately.


1. The Big Premise: The Development Server Must Not Be Used in Production

Everything starts from Flask's official clear warning.

"Do not use the development server when deploying to production. It is intended for use only during local development. It is not designed to be particularly secure, stable, or efficient."

The official docs also state flatly that production means "not development." Even an internal tool becomes production the moment it is exposed externally.

Both flask run and app.run() are this "development server that must not be used."

# ❌ Do not start this in production
if __name__ == "__main__":
    app.run()  # This is Werkzeug's development server. It is neither secure, stable, nor efficient

⚠️ Anti-pattern: making if __name__ == "__main__": app.run() the Docker CMD or the entry point of a process manager. That is exactly "exposing the development-only server to production traffic." Keep app.run() inside if __name__ == "__main__": so it runs only when you execute the script directly on your local machine, and never call it from a production path (Docker, systemd, ECS). Production startup is handled by the WSGI server, covered from the next section on.

1.1 Think of the WSGI "App" and the WSGI "Server" Separately

Why can't Flask alone run production? Because Flask is a WSGI application, not a server. In the official words, "Flask is a WSGI application; a WSGI server runs it."

The two roles are clearly divided.

Layer | Role | Concrete form
WSGI server | Receives HTTP over TCP, converts the HTTP request into the WSGI environ dict and passes it to the app, and turns the returned response back into HTTP | Gunicorn / Waitress / uWSGI / mod_wsgi
WSGI app | A pure function that receives environ and returns a response. Routing, views, templates | Your Flask app

So production deployment comes down to "loading your WSGI app (app) into a production-grade WSGI server." You simply replace app.run()'s development server with Gunicorn; the structure is that simple. How to load app when you assemble it with an application factory (create_app) is covered in §3.1 (for the factory's own design, see the large-app structure guide).
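
To make the division of roles concrete, here is a minimal sketch of the WSGI contract itself, using only the standard library. The names tiny_app and call_wsgi_app are hypothetical and exist only for this illustration. A real server such as Gunicorn does far more (sockets, worker management, error handling), and a Flask app object plays the role of tiny_app.

```python
from io import BytesIO


def tiny_app(environ, start_response):
    """The app side: takes the environ dict and a callback, returns body chunks."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_wsgi_app(app, path):
    """Stand-in for the server side: build environ, call the app, collect output."""
    environ = {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": path,
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "8000",
        "wsgi.url_scheme": "http",
        "wsgi.input": BytesIO(b""),
    }
    captured = {}

    def start_response(status, headers, exc_info=None):
        # A real server would write the status line and headers to the socket
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi_app(tiny_app, "/ping")
print(status, body)
```

Everything inside call_wsgi_app is the server's job; everything inside tiny_app is the application's job. Swapping the development server for Gunicorn changes only the former.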


2. Choosing a WSGI Server: Gunicorn / Waitress / uWSGI / mod_wsgi

The WSGI servers Flask officially lists as "self-hosting" options are Gunicorn, Waitress, mod_wsgi, uWSGI, and gevent. The two that become first candidates in practice are Gunicorn and Waitress; the rest are for specific requirements.

Server | Platform | Characteristics | When to choose
Gunicorn | Linux / WSL (not Windows) | Pre-fork model. Rich worker types (sync / gevent, etc.). Straightforward config | The default for Linux production. Containers, ECS, EC2
Waitress | Cross-platform (Windows OK) | Pure-Python implementation. Zero deps, thread-based | Windows servers, when you want to complete in pure Python
uWSGI | Linux | High-functionality, high-performance, but complex config (high learning cost) | Large setups needing multi-language or advanced tuning
mod_wsgi | Bundled with Apache | Embeds WSGI into Apache httpd | When riding along on existing Apache assets

The selection guidance is simple.

  • Putting it on a container/VM on Linux → Gunicorn. This is the protagonist of this article. Config is straightforward, the choice of worker types is wide, and it pairs well with container platforms like ECS/Fargate.
  • If you must run on Windows → Waitress. The official docs too position Waitress as "a pure-Python option that runs on Windows." Gunicorn does not support Windows (it works on WSL).
  • uWSGI is powerful but config-heavy. "uWSGI for now" often isn't worth the learning cost, and if Gunicorn can meet the requirements, you should choose Gunicorn.
  • mod_wsgi presupposes Apache. There's little reason to deliberately choose it for a greenfield; it's the option when you have a special circumstance of embedding into existing Apache.

💡 The author's choice: in the lumber-distribution SaaS, Linux containers (ECS/Fargate) were the premise, so I adopted Gunicorn without hesitation. Waitress or uWSGI is worth considering only when there is a "Windows constraint" or an "extreme tuning requirement"; for the now-typical setup of containers on Linux, Gunicorn is the de facto standard.


3. Configuring Gunicorn at Production Quality

3.1 Basic Startup: Understanding the Load Syntax

Gunicorn's app specification is the form {module_import}:{app_variable}. The equivalences the official docs show are exactly how to read it.

# Equivalent to 'from hello import app'
gunicorn -w 4 'hello:app'

# Equivalent to 'from hello import create_app; create_app()' (factory)
gunicorn -w 4 'hello:create_app()'

hello is the module (hello.py), and after the colon is "the WSGI app to retrieve from that module." Like the latter, adding parentheses lets you use the result of calling a factory function as the WSGI app. If you adopt an application factory (create_app), use the latter 'myapp:create_app()' form.
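
As a rough illustration of how such a spec string can be read, the following sketch resolves both forms with importlib. This is a simplified stand-in, not Gunicorn's actual loader (which also accepts literal arguments inside the parentheses); load_wsgi_app and the in-memory hello module are hypothetical names used only here.

```python
import importlib
import sys
import types


def load_wsgi_app(spec):
    """Resolve 'module:attr' or 'module:factory()' to a WSGI callable (simplified)."""
    module_name, _, target = spec.partition(":")
    target = target or "application"  # fallback name when no ':' part is given
    module = importlib.import_module(module_name)
    if target.endswith("()"):
        # Factory form: call the function and use its return value as the app
        return getattr(module, target[:-2])()
    # Plain form: the attribute itself is the app
    return getattr(module, target)


# A throwaway in-memory module standing in for hello.py (hypothetical contents)
hello = types.ModuleType("hello")
hello.app = lambda environ, start_response: [b"module-level app"]
hello.create_app = lambda: (lambda environ, start_response: [b"factory app"])
sys.modules["hello"] = hello

direct = load_wsgi_app("hello:app")
built = load_wsgi_app("hello:create_app()")
print(direct({}, None), built({}, None))
```

The only difference between the two forms is whether the name after the colon is used as-is or called first.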

3.2 Worker Count (-w): Start From CPU × 2

Gunicorn's default worker count is 1. In the official words, "The default is only 1 worker, which is probably not what you want." In production, 1 worker is a fatal bottleneck where, while processing one request, other requests are made to wait.

The starting point is the official CPU × 2 ("a starting value could be CPU * 2"). It is strictly a "starting point," from which you tune with load testing.

# On a 4-core host, start from 8 workers
gunicorn -w 8 'myapp:create_app()'

But the worker count has a memory trade-off. Gunicorn's default (sync worker) is pre-fork, where each worker is a separate process holding the whole app in memory. With 8 workers, roughly the app's memory footprint × 8 becomes the required memory. Always check whether CPU × 2 fits within the container's memory limit (the ECS task definition, etc.).

⚠️ "More workers = faster" is wrong. Throughput plateaus at the limits set by the CPU-core count and memory. Workers far exceeding the core count only increase context switching and memory consumption; throughput doesn't rise. If you need to scale, increase the number of containers / tasks (horizontal scaling), not the worker count. This is the philosophy of ECS auto-scaling (ECS Fargate production guide).
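
The CPU-versus-memory trade-off can be sketched as a small calculation. This is only a starting-point heuristic under assumed numbers: suggest_workers, the per-worker footprint, and the reserve_mib headroom are all hypothetical values you would replace with measurements from your own app.

```python
def suggest_workers(cpu_count, memory_limit_mib, per_worker_mib, reserve_mib=128):
    """Start from CPU x 2, then cap by how many workers fit in the memory limit."""
    by_cpu = cpu_count * 2
    # Leave some headroom for the master process and the OS (assumed value)
    usable_mib = max(memory_limit_mib - reserve_mib, 0)
    by_memory = usable_mib // per_worker_mib
    # Never go below 1 worker, even if the memory budget looks too small
    return max(1, min(by_cpu, by_memory))


# 4 vCPU, 2048 MiB task, roughly 150 MiB per worker (illustrative numbers)
print(suggest_workers(cpu_count=4, memory_limit_mib=2048, per_worker_mib=150))
# 4 vCPU but only 512 MiB: memory, not CPU, becomes the limit
print(suggest_workers(cpu_count=4, memory_limit_mib=512, per_worker_mib=150))
```

In the second case the memory cap wins, which is the situation where adding tasks rather than workers is the better lever.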

3.3 sync Workers vs. gevent Workers: Don't Conflate These

The choice of Gunicorn's worker type is the most misunderstood point. The official guidance is clear.

"The default sync worker is appropriate for most use cases. If you need numerous, long running, concurrent connections, Gunicorn provides an asynchronous worker using gevent."

First, the default sync worker is enough. A sync worker is a model where "one worker processes one request to the end," ideal for CPU-bound processing and general APIs with fast responses. The 221 endpoints of the lumber-distribution SaaS were also handled by sync workers without issue for the most part.

The gevent worker is effective only when the requirement is "numerous, long-running, concurrent connections." Concretely, it's workloads where IO waiting dominates — long waits for external-API responses, maintaining many connections like long-polling or SSE.

# gevent worker (only when IO waiting dominates)
gunicorn -k gevent 'myapp:create_app()'

When using gevent, you need greenlet>=1.0 in the dependencies; per the Flask docs, context locals such as request will otherwise not work as expected.

Here is a point you must absolutely not conflate. The official docs nail it down explicitly.

"This is NOT the same as Python's async/await, or the ASGI server spec."

In other words, the gevent worker is neither Python's async/await nor ASGI. gevent is a mechanism that multiplexes IO waiting via cooperative multitasking with greenlets, and it is a different thing from async def views (covered in §7) and from ASGI frameworks like FastAPI. The naive choice of "gevent because I want async" breaks the code's premises. If you bring in gevent, do so understanding the premise that blocking calls are cooperatively scheduled via monkey-patching.

💡 On threads and eventlet: Gunicorn has --threads (threads per worker) and an eventlet worker too, but these are not in Flask's official documentation. They are the domain of Gunicorn's own documentation (docs.gunicorn.org). When this article touches --threads or eventlet, distinguish them as "Gunicorn features," not "Flask guidance." What Flask official mentions is only sync and gevent.

3.4 Bind and "Don't Run as Root"

To make Gunicorn accessible from outside, specify the bind target with -b (--bind).

gunicorn -w 4 -b 0.0.0.0 'myapp:create_app()'

There are 2 serious warnings here, both noted verbatim by the official docs.

"Gunicorn should not be run as root..."

You must not run Gunicorn as root. If the process is ever hijacked, with root privileges the damage reaches all permissions. As the principle of least privilege, start it as a dedicated unprivileged user (the concrete measure in Docker is §5).

"Don't [bind to 0.0.0.0] when using a reverse proxy setup, otherwise it will be possible to bypass the proxy."

Behind a reverse proxy, you must not bind to 0.0.0.0. 0.0.0.0 listens on all network interfaces, so Gunicorn can be reached directly without going through the proxy, bypassing all of the authentication, WAF, and header shaping the proxy applies. Behind a proxy, bind to an address reachable only from the proxy (a Unix socket, 127.0.0.1, or a dedicated port on the container-internal network).

# If on the same host as the proxy, restrict to localhost
gunicorn -w 4 -b 127.0.0.1:8000 'myapp:create_app()'

# Unix socket (typical when co-located with nginx)
gunicorn -w 4 -b unix:/run/myapp.sock 'myapp:create_app()'

⚠️ 0.0.0.0 in a container is context-dependent. As with ECS/Fargate, where "ALB → a specific container port" is the only connectivity path and the container's network is closed by a security group, binding to 0.0.0.0:8000 inside the container is itself common. What's dangerous is the state of "binding to 0.0.0.0 while that port is directly reachable from outside." Whether you can block direct reach from anywhere but the proxy at the network boundary (SG, VPC) is the criterion.

3.5 Access Logs and Timeouts

Gunicorn's access logs are off by default. In container operations the standard is to aggregate logs to standard output, so enable it with --access-logfile=- (- is stdout).

gunicorn -w 4 --access-logfile=- 'myapp:create_app()'

Distinguish 2 kinds of timeout.

  • --timeout (default 30 seconds): if a worker doesn't respond for this many seconds, the master kills and restarts it. You can extend it for long-running processing, but a carelessly long timeout lets a stuck worker keep occupying its slot. The right move is to offload long processing to an asynchronous job queue rather than running it synchronously inside a worker.
  • --graceful-timeout (default 30 seconds): the grace, on restart/shutdown, to wait for in-progress requests to be drained. It's the core parameter of graceful shutdown (§6).

3.6 gunicorn.conf.py: Manage Config in Code

When the command-line flags grow, consolidate them in a config file gunicorn.conf.py. It's a more review-friendly, more reproducible setup than scattering them on the CLI.

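# gunicorn.conf.py: manage production settings centrally in code
import multiprocessing
import os

# Bind: the container-internal port. External reachability is controlled at the network boundary (ALB/SG)
bind = os.getenv("GUNICORN_BIND", "0.0.0.0:8000")

# Worker count: start from CPU x 2, overridable via environment variable
# (cpu_count() may report the host's CPUs rather than the container's CPU limit, so prefer setting it explicitly)
workers = int(os.getenv("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2))

# Worker type: sync by default. Set "gevent" via environment variable if IO waiting dominates
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "sync")

# Send all logs to stdout/stderr (the container's log driver aggregates them)
accesslog = "-"
errorlog = "-"
loglevel = os.getenv("GUNICORN_LOG_LEVEL", "info")

# Timeout: kill stuck workers. Assumes long-running work is offloaded to a queue
timeout = int(os.getenv("GUNICORN_TIMEOUT", "30"))

# Graceful-shutdown grace period (§6). Keep it consistent with the ALB's drain time
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))

# Process name (easy to spot in ps)
proc_name = "myapp"

# The config file is read automatically (gunicorn.conf.py in the current directory)
gunicorn 'myapp:create_app()'

# To specify it explicitly
gunicorn -c gunicorn.conf.py 'myapp:create_app()'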

The design of observability (structuring logs and adding request IDs) is split out into the error-handling / observability guide. Tying Gunicorn's access logs and the Flask app's structured logs together with the same correlation ID is the crux in production.


4. Running Behind a Reverse Proxy / Load Balancer

4.1 Why Put a Proxy in Front

A WSGI server has an HTTP server built in. But the official docs say:

"WSGI servers have HTTP servers built-in. However, a dedicated HTTP server may be safer, more efficient, or more capable. Putting an HTTP server in front of the WSGI server is called a 'reverse proxy.'"

Putting a dedicated HTTP server (reverse proxy) in front makes it safer, more efficient, and more capable — you delegate TLS termination, static-file serving, rate limiting, buffering, and health checks to the proxy layer, and Gunicorn can focus on app processing. As front-line options the official docs list nginx and Apache httpd, and state that PaaS (Cloud Run, Elastic Beanstalk, App Engine, Azure, etc.) are similarly proxy setups. And there's an important sentence.

"You'll probably need to Tell Flask it is Behind a Proxy when using most hosting platforms."

This "tell Flask it is behind a proxy" step is ProxyFix, covered next.

The author's setup was a multi-stage proxy of API Gateway → ALB → ECS (Fargate). Because there are multiple relays between client and app, the ProxyFix configuration required particular care.

4.2 ProxyFix: Tell It the Proxy Hop Count Correctly

Behind a proxy, the client's original information (source IP, protocol, host) arrives at the app stored in headers like X-Forwarded-For / X-Forwarded-Proto / X-Forwarded-Host. The middleware that makes Flask (Werkzeug) interpret these correctly is ProxyFix. The form the official docs show is this.

from werkzeug.middleware.proxy_fix import ProxyFix

app.wsgi_app = ProxyFix(
    app.wsgi_app, x_for=1, x_proto=1, x_host=1, x_prefix=1
)

Each x_* argument is "the COUNT of proxies setting that X-Forwarded-* header."

Argument | Corresponding header | Meaning
x_for | X-Forwarded-For | The client's source IP (reflected in request.remote_addr)
x_proto | X-Forwarded-Proto | The original scheme (http/https)
x_host | X-Forwarded-Host | The original Host header
x_prefix | X-Forwarded-Prefix | The path prefix the proxy stripped

And the official warning gets at the essence of this configuration.

"This middleware should only be used if the application is actually behind a proxy, and should be configured with the number of proxies that are chained in front of it. Since incoming headers can be faked, you must set how many proxies are setting each header so the middleware knows what to trust. ... It can be a security issue if you get this configuration wrong."

There are three points.

  1. Use it only when actually behind a proxy. Put ProxyFix in when there's no proxy, and it will believe the fake X-Forwarded-* the client sent.
  2. Set the front-line proxy hop count accurately. Because headers can be spoofed, fix "how many proxies set each header" as a number in x_for etc., and trust only that many from the end.
  3. Getting the config wrong becomes a security problem. Over-estimate the hop count, and you treat a fake IP injected by the client as genuine, deceiving all of IP-based rate limiting, audit logs, and access control.

⚠️ Never miscount the hops. x_for=1 means "1 proxy sets X-Forwarded-For." If only the ALB is in front, x_for=1. If the 2 stages of API Gateway → ALB both stack X-Forwarded-For, measure the setup and set the appropriate hop count (in many cases x_for=2). "1 for everything just in case" or "a big number just in case" is forbidden; decide the number only after confirming, with the actual request headers, how many hops your proxy chain stacks each header in. The author observed the actual X-Forwarded-For value in a production-equivalent environment and fixed the hop count.
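
Why the hop count matters can be shown with a simplified model of the selection rule: trust only the last N entries of X-Forwarded-For and take the Nth from the right. This is a sketch of the idea, not Werkzeug's source; trusted_client_ip and the sample addresses are hypothetical.

```python
def trusted_client_ip(x_forwarded_for, trusted_hops, peer_addr):
    """Pick the value appended by the outermost *trusted* proxy.

    Each proxy appends the address it received the connection from, so the
    rightmost entries are the trustworthy ones. Anything further left was
    supplied by the client and can be forged.
    """
    values = [v.strip() for v in x_forwarded_for.split(",") if v.strip()]
    if trusted_hops and len(values) >= trusted_hops:
        return values[-trusted_hops]
    # Header missing or shorter than expected: fall back to the socket peer
    return peer_addr


# The client forges "6.6.6.6"; the single real proxy appends the true address.
header = "6.6.6.6, 203.0.113.7"

correct = trusted_client_ip(header, trusted_hops=1, peer_addr="10.0.0.5")
overestimated = trusted_client_ip(header, trusted_hops=2, peer_addr="10.0.0.5")
print(correct, overestimated)
```

With one real proxy and a hop count of 1, the forged value is ignored. Over-estimating the count as 2 makes the forged value the "client IP," which is exactly the failure described above.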

4.3 Mapping to ALB / nginx

The ProxyFix configuration corresponds directly to the front-line setup.

  • Behind an AWS ALB (ECS/Fargate): the ALB terminates TLS and adds X-Forwarded-For / X-Forwarded-Proto. If there's an API Gateway or CloudFront in front of the ALB and they too stack X-Forwarded-For, set their total hop count in x_for. Because TLS is terminated at the ALB, it arrives at Gunicorn/Flask as plaintext (HTTP). That's exactly why you need to tell Flask, via ProxyFix, with x_proto, that "it was originally HTTPS."
  • Behind nginx (same host or VM): set proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; etc. in nginx, have Gunicorn listen on 127.0.0.1 or a Unix socket, and set ProxyFix to x_for=1 (nginx, 1 hop).

In a setup where the proxy terminates TLS, don't forget to harden 2 security settings on the Flask side.

  • SESSION_COOKIE_SECURE = True: since production presupposes HTTPS, attach the Secure attribute to the session cookie and don't let it be sent in plaintext. TLS is terminated at the ALB/nginx, but as long as x_proto conveys to Flask that the origin was HTTPS, the Secure cookie functions correctly.
  • TRUSTED_HOSTS (added in Flask 3.1): validates the Host header during routing and prevents Host-header attacks (using a fake Host to target link generation or cache poisoning). Behind a proxy, Host can be manipulated from outside, so specify the allowed hosts explicitly.

app.config.update(
    SESSION_COOKIE_SECURE=True,    # HTTPS-only cookie
    SESSION_COOKIE_HTTPONLY=True,
    SESSION_COOKIE_SAMESITE="Lax",
    TRUSTED_HOSTS=["api.example.com"],  # Host header validation (3.1+)
)

The detailed design of cookie attributes, CSRF, and TRUSTED_HOSTS is deep-dived in the security implementation guide.

💡 A behavior change in SERVER_NAME (Flask 3.1): as of Flask 3.1, setting SERVER_NAME no longer restricts requests to only that domain, for both subdomain_matching and host_matching. For the production purpose of "accept only a specific host," the correct answer is TRUSTED_HOSTS, not SERVER_NAME.


5. Docker: Assembling the Production Container Correctly

If you put it on a container platform like ECS/Fargate, how you build the image governs production quality. The requirements are "small, non-root, doesn't include the development server, has a health check." A multi-stage build satisfies these.

# syntax=docker/dockerfile:1

# ---- builder: the stage that builds and installs dependencies ----
FROM python:3.12-slim AS builder

# Keep tools needed only at build time confined to this stage
ENV PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /app

# Install dependencies into a virtual environment (copied wholesale to the next stage)
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn

# ---- runtime: the final stage, containing only what is needed to run ----
FROM python:3.12-slim AS runtime

# Create a non-root user (satisfies "don't run as root" from §3.4)
RUN useradd --create-home --uid 10001 appuser

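# Bring in only the installed virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
# PYTHONPATH assumes the package lives at src/myapp, so that 'myapp:create_app()' is importable
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONPATH="/app/src" \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1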

WORKDIR /app
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser gunicorn.conf.py ./

# Drop root here. Every process from this point on runs as the unprivileged user
USER appuser

EXPOSE 8000

# Hit /health to confirm liveness (ties in with §6's readiness/liveness)
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8000/health').status==200 else 1)"

# Never use the development server. Start the factory with Gunicorn
CMD ["gunicorn", "-c", "gunicorn.conf.py", "myapp:create_app()"]

The key points:

  • Multi-stage: use compilers and build tools in builder, and bring only the virtual environment (/opt/venv) into runtime. No build tools remain in the final image, shrinking size and attack surface.
  • slim base: keep the foundation small with python:3.12-slim. alpine uses musl libc, which can make packages such as psycopg hard to build, so for Python slim is the safe bet.
  • Non-root USER: switch to the appuser created by useradd with USER, satisfying §3.4's "don't run as root" at the container level.
  • Doesn't include the development server: CMD is Gunicorn. Neither flask run nor app.run() appears.
  • HEALTHCHECK: hit the /health endpoint (the @app.route("/health") defined in pillar §3) and confirm 200. On ECS you separately set a task-definition-side health check, but having one at the Docker level too makes it effective in local/compose.
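
If the inline one-liner in HEALTHCHECK becomes hard to read, one option is to move the probe into a small script. The following is a sketch under the same assumptions as the Dockerfile (port 8000, a /health route); healthcheck_exit_code is a hypothetical name, and the explicit timeout is an addition that the one-liner does not have.

```python
import urllib.error
import urllib.request


def healthcheck_exit_code(url="http://127.0.0.1:8000/health", timeout=2.0):
    """Return 0 if the endpoint answers 200, otherwise 1 (Docker's convention)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if response.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: unhealthy
        return 1


# In a container the script would end with:
#     raise SystemExit(healthcheck_exit_code())
# Here we only probe a port where nothing is expected to be listening.
print(healthcheck_exit_code("http://127.0.0.1:1/health", timeout=0.5))
```

The Dockerfile line would then call the script (for example with CMD ["python", "healthcheck.py"], where the filename is an assumption) instead of embedding the logic in a string.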

5.1 .dockerignore

To not pollute the build context and to absolutely keep secrets and unnecessary things out of the image, always place a .dockerignore.

.git
.gitignore
__pycache__/
*.pyc
.venv/
venv/
.env
.env.*
instance/
tests/
.pytest_cache/
*.md
Dockerfile
.dockerignore

⚠️ Always exclude .env and instance/. If .env (secrets) or instance/ (local config, SQLite) gets baked into the image, secrets leak to the registry. Don't put secrets in the image; inject them at runtime via environment variables.

5.2 Inject Config via FLASK_-Prefixed Environment Variables

Don't bake secrets and environment differences into the image; pass them at runtime via environment variables, which is the 12-factor principle. With from_prefixed_env() (available since Flask 2.1), you can auto-import environment variables starting with FLASK_ into app.config (values are parsed with json.loads and kept as plain strings when they aren't valid JSON). For the full picture of config management, see the pillar Flask production-operations guide §4.

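# In production, assume injection from the ECS task definition / Secrets Manager
# (a key generated inline like this changes on every run, so it suits local trials only)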
docker run --rm -p 8000:8000 \
  -e FLASK_SECRET_KEY="$(python -c 'import secrets; print(secrets.token_hex())')" \
  -e FLASK_SQLALCHEMY_DATABASE_URI="postgresql+psycopg://..." \
  -e GUNICORN_WORKERS=8 \
  myapp:latest
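
How prefixed variables end up as typed config values can be pictured with a simplified imitation. This is not Flask's implementation (which also handles nested keys via double underscores); load_prefixed_env and the sample values are hypothetical.

```python
import json


def load_prefixed_env(environ, prefix="FLASK_"):
    """Collect PREFIX_* variables, strip the prefix, and JSON-decode the values."""
    config = {}
    for key in sorted(environ):
        if not key.startswith(prefix):
            continue
        raw = environ[key]
        try:
            value = json.loads(raw)   # "8" -> 8, "true" -> True
        except ValueError:
            value = raw               # not valid JSON: keep the plain string
        config[key[len(prefix):]] = value
    return config


fake_env = {
    "FLASK_SECRET_KEY": "not-a-real-secret",
    "FLASK_MAX_CONTENT_LENGTH": "1048576",
    "FLASK_SESSION_COOKIE_SECURE": "true",
    "GUNICORN_WORKERS": "8",  # no FLASK_ prefix, so it is ignored
}
config = load_prefixed_env(fake_env)
print(config)
```

One consequence worth keeping in mind: a value that happens to look like a number or boolean is converted, so a setting that must stay a string should not be purely numeric.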

The concrete steps to run this container as a Fargate task and register it as an ALB target are consolidated in the ECS Fargate production guide.

💡 Beware the multiplication of DB connection pool × worker count: because each Gunicorn worker is an independent process, SQLAlchemy's connection pool is also held separately per worker. pool_size=5 × 8 workers × 4 tasks = 160 pooled connections (and more if max_overflow allows bursts beyond pool_size), which can exhaust PostgreSQL's max_connections. This accident occurs frequently in production. In a multi-worker × multi-task setup, the standard is to interpose a connection pooler like PgBouncer. Details are compiled in the PostgreSQL connection-pooling guide.
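
The multiplication is easy to check before a deploy. A minimal sketch, where max_db_connections is a hypothetical helper and the max_overflow figure of 10 is SQLAlchemy's default for its queue pool:

```python
def max_db_connections(pool_size, max_overflow, workers, tasks):
    """Upper bound on connections opened by all workers across all tasks."""
    per_process = pool_size + max_overflow
    return per_process * workers * tasks


# Pooled connections only: pool_size=5, 8 workers, 4 tasks
steady = max_db_connections(pool_size=5, max_overflow=0, workers=8, tasks=4)
# Including overflow connections, the burst ceiling is higher
burst = max_db_connections(pool_size=5, max_overflow=10, workers=8, tasks=4)
print(steady, burst)
```

Comparing the burst figure, not just the steady one, against max_connections gives the safer margin.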


6. Graceful Shutdown: The Core of Zero Downtime

If you get 502/504 on every deploy, the cause is almost always "cutting in-progress requests and bringing the container down." Zero downtime is achieved when the new container becomes ready to accept traffic first, and only then does the old container drain its in-progress requests and exit quietly.

6.1 SIGTERM and graceful-timeout

A container orchestrator (ECS, Kubernetes) first sends SIGTERM when stopping a container. On receiving this, Gunicorn

  1. stops accepting new requests,
  2. drains the in-progress (in-flight) requests within the grace of --graceful-timeout,
  3. terminates the process once all workers are cleaned up,

going down gracefully in this order. Workers that don't finish past the grace are force-killed.
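
The three steps can be modeled in a few lines to make the ordering tangible. This is a toy model, not Gunicorn's implementation: DrainingWorker and its methods are hypothetical, and the handler is invoked directly here instead of being registered for a real signal.

```python
import signal
import threading
import time


class DrainingWorker:
    """Toy model of graceful shutdown: stop accepting, then drain in-flight work."""

    def __init__(self, graceful_timeout):
        self.graceful_timeout = graceful_timeout
        self.accepting = True
        self.in_flight = 0
        self._lock = threading.Lock()

    def handle_sigterm(self, signum, frame):
        # Step 1: refuse new work. Only a flag is set here, because taking
        # a lock inside a signal handler could deadlock the main thread.
        self.accepting = False

    def try_accept(self):
        with self._lock:
            if not self.accepting:
                return False
            self.in_flight += 1
            return True

    def finish(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self):
        # Step 2: wait for in-flight requests, but no longer than the grace period
        deadline = time.monotonic() + self.graceful_timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self.in_flight == 0:
                    return True   # Step 3: everything finished, exit cleanly
            time.sleep(0.01)
        return False              # Grace exceeded: the caller would force-kill


worker = DrainingWorker(graceful_timeout=1.0)
worker.try_accept()                               # one request is in flight
worker.handle_sigterm(signal.SIGTERM, None)       # what SIGTERM would trigger
threading.Timer(0.1, worker.finish).start()       # the request completes shortly
accepted_after_term = worker.try_accept()
drained = worker.drain()
print(accepted_after_term, drained)
```

In a real process the handler would be registered with signal.signal(signal.SIGTERM, worker.handle_sigterm). The False return from drain corresponds to the force-kill case when the grace period runs out.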

# gunicorn.conf.py (repeated excerpt)
# Keep consistent with the ALB's deregistration delay
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))

What matters here is to align Gunicorn's graceful_timeout with the load balancer's drain time (the ALB's deregistration delay). If the window between the ALB stopping new routing to a target and the in-flight requests being drained diverges from Gunicorn's grace period, you get a mismatch: either the process vanishes before draining finishes, or the ALB is still routing to it.

💡 Make PID 1 Gunicorn so it can be brought down by SIGTERM. With Docker's CMD ["gunicorn", ...] (exec form), Gunicorn becomes PID 1 and can receive SIGTERM directly. The shell form (CMD gunicorn ...) makes the shell PID 1, the signal isn't conveyed to Gunicorn, and graceful shutdown stops working — a common container pitfall. This article's Dockerfile uses the exec form (JSON array) for this reason.

6.2 When You Do Cleanup After SIGTERM on the App Side

Normally, leaving it to Gunicorn's graceful processing is enough, but if explicit cleanup at worker exit (closing connections, etc.) is needed, close resources with a gunicorn.conf.py hook (worker_exit, etc.) or, on the app side, with teardown_appcontext (see pillar §5). For resources like DB connections that "should be bound to the request's lifetime," if you manage them with g and teardown_appcontext in the first place, you don't need to handle them individually at shutdown.

6.3 Separate readiness and liveness

Health checks are split into 2 kinds by purpose. Conflate them and you get traffic flowing to a still-starting container, or a merely temporarily-slow container being killed.

Kind | Question | Behavior on failure | What it checks
readiness | "Can it receive requests now?" | The load balancer doesn't send traffic | Confirms up to DB connection and connectivity of required deps
liveness | "Is the process alive?" | Restarts the container | A lightweight check of whether the process responds
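
# Assumes `app` is the Flask app and `db` is a Flask-SQLAlchemy instance defined elsewhere
from sqlalchemy import text


# liveness: lightweight. Returns 200 as long as the process responds
@app.route("/health")
def liveness():
    return {"status": "ok"}


# readiness: also checks connectivity to dependencies. Returns 503 right after startup or when the DB is down
@app.route("/ready")
def readiness():
    try:
        db.session.execute(text("SELECT 1"))
    except Exception:
        return {"status": "not ready"}, 503
    return {"status": "ready"}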

⚠️ Don't put a heavy dependency check in liveness. Put DB connectivity into liveness, and a merely temporarily-slow DB makes the container judged "dead" and falls into a restart loop. Handle a DB outage with readiness (stop traffic), and have liveness look only at the process's life-or-death — this separation is the crux of stable operation. Health-check design is also covered in the error-handling / observability guide.

6.4 Tying It to a Rolling Deploy (ECS)

Applying everything so far to an ECS rolling deploy, the zero-downtime flow becomes this.

  1. ECS starts a new task (the new image's container).
  2. Once the ALB confirms "acceptable" via readiness (/ready), it registers the new task as a target and begins flowing traffic.
  3. SIGTERM to the old task. The ALB begins deregistration and stops sending new requests to the old task.
  4. The old task's Gunicorn drains in-flight within the grace of graceful_timeout and quietly terminates.

When these 4 steps mesh, users see no errors during the deploy. The concrete task definition / service / deploy settings on Fargate are covered in the ECS Fargate production guide.


7. async def Views: When You May and May Not Use Them

Flask supports async def views from 2.0 (pip install flask[async] is needed). But misunderstand this as a deployment-performance improvement and you'll get hurt. The official caution is decisive.

Even with an async view, each request still monopolizes one worker. Going async does not increase the number of requests you can handle concurrently. async is effective only when "within a single view, you run multiple IOs (external API calls, etc.) in parallel" — it's not an improvement to throughput (concurrent request count).

import asyncio
import httpx


@app.route("/aggregate")
async def aggregate():
    # Call 3 external APIs concurrently within one view → this is where async helps
    async with httpx.AsyncClient() as client:
        a, b, c = await asyncio.gather(
            client.get("https://api.example.com/a"),
            client.get("https://api.example.com/b"),
            client.get("https://api.example.com/c"),
        )
    return {"a": a.json(), "b": b.json(), "c": c.json()}

There's a further constraint. Even if you spawn a background task with asyncio.create_task(), it is canceled the moment the view returns. Flask's async view is for "IO parallelism during request processing," not a place for fire-and-forget background processing.
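
The cancellation behavior can be reproduced with plain asyncio. In this sketch asyncio.run() stands in for "run the view's coroutine to completion and then stop the loop"; Flask's own bridging code differs, so treat it as an analogy rather than a trace of Flask internals. The names view and background_job are hypothetical.

```python
import asyncio

completed = []


async def background_job():
    await asyncio.sleep(0.2)
    completed.append("done")      # never reached in this example


async def view():
    # Fire-and-forget: the task is scheduled but nothing awaits it
    asyncio.create_task(background_job())
    return {"status": "accepted"}


# When view() returns, tasks still pending on the loop are cancelled
response = asyncio.run(view())
print(response, completed)
```

The response comes back, but the background job never records its result. Work that must outlive the request belongs in a job queue instead.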

The judgment is simple.

  • May use: when you want to parallelize multiple independent external IOs within a single view to shrink that view's latency.
  • Must not use / should take another approach: when you want the whole app to be async-first, to handle a large number of concurrent connections, or to run background jobs.

If you want the whole app to be async-first, the official docs recommend Quart (which has a Flask-compatible API) on an ASGI premise. Alternatively, to run an existing Flask app under an ASGI server, you can use asgiref's WsgiToAsgi adapter. For a large number of concurrent connections there's also the option of Gunicorn's gevent worker (§3.3), but that's a different thing from async/await (§3.3), so don't conflate them. The use-cases of Flask / FastAPI / Django themselves are compiled in the technology-selection guide.


8. Production Checklist

Here is the design so far as a checklist to confirm without fail before deploying.

Category | Check item | Basis (body)
Server | Excluded flask run / app.run() from the production path | §1
Server | Chose a WSGI server (Gunicorn on Linux) | §2
Gunicorn | Set worker count from CPU × 2, fits within the memory limit | §3.2
Gunicorn | Chose the worker type by requirement (default sync, gevent if IO-heavy) | §3.3
Gunicorn | Not conflating gevent with async/await / ASGI | §3.3
Gunicorn | Not running as root (non-root user) | §3.4 / §5
Gunicorn | Not directly exposed on 0.0.0.0 behind a proxy | §3.4
Gunicorn | Emitting access logs to stdout (--access-logfile=-) | §3.5
Proxy | Put in ProxyFix and set x_for etc. correctly to the proxy hop count | §4.2
Proxy | Verified ProxyFix's hop count with actual request headers | §4.2
Proxy | Set SESSION_COOKIE_SECURE=True / TRUSTED_HOSTS | §4.3
Docker | Multi-stage, slim base, non-root USER | §5
Docker | Doesn't include the development server; CMD is Gunicorn (exec form) | §5 / §6.1
Docker | Excluded .env / instance/ in .dockerignore | §5.1
Docker | Secrets not baked into the image; injected via env vars (FLASK_ prefix) | §5.2
DB | Worker count × task count × pool size within max_connections | §5.2
Stop | Goes down gracefully on SIGTERM, PID 1 is Gunicorn | §6.1
Stop | Aligned graceful_timeout with the ALB's drain time | §6.1
Stop | Separated readiness and liveness | §6.3
async | Understand the async def view's "1-worker monopoly" constraint | §7

Summary: Deployment Is the Design of "Outside the App"

Most outages in Flask production deployment are born not from the app's code but from "the design of the environment that runs the app." Let me summarize the article's discipline one line at a time.

  1. Abandon the development server. flask run / app.run() is forbidden in production. Flask is a WSGI app; a WSGI server like Gunicorn runs it.
  2. Configure Gunicorn at production quality. Workers from CPU × 2, default sync is enough, gevent only when IO-heavy. Don't run as root, don't directly expose behind a proxy.
  3. Tell Flask the proxy hop count correctly with ProxyFix. x_for etc. is "the number of trusted proxies." Get it wrong and X-Forwarded-For can be spoofed, collapsing IP-based control.
  4. Assemble the container correctly. Multi-stage, slim, non-root, development server excluded, /health health check. Don't bake secrets; inject via env vars.
  5. Go down gracefully. With SIGTERM and graceful_timeout, drain in-flight, separate readiness/liveness, and achieve zero downtime with ECS rolling deploys.
  6. Don't over-trust async views. Understand the 1-worker-monopoly constraint, and if async-first is the requirement, consider Quart/ASGI.

The reason the author could stably operate 221 endpoints on API Gateway → ALB → ECS (Fargate) is that I designed this "outside the app" one piece at a time. Invest in deployment quality as much as in code quality. The overall picture is in the Flask production-operations guide (pillar), and the specifics are in the spoke articles linked from there.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
