Introduction: Most Production Outages Are Decided by "How You Choose the Server"
No matter how clean your Flask app's code is, production silently breaks if you get the server choice or the deployment setup wrong. Shipping with flask run as-is; leaving the worker count at 1 so the app clogs under load; request.remote_addr showing the proxy's IP for every request behind a load balancer; in-flight requests being cut on every deploy and surfacing as 502s: these outages come not from "app bugs" but from "a lack of deployment design."
This article is a spoke that deep-dives §8 "Deployment" of the Flask production-operations guide (pillar) to production quality. It covers only deployment design faithful to the official documentation of Flask 3.1.x (the current stable). Concretely, it runs through, with real code, the choice of WSGI server, Gunicorn's worker design, the correct ProxyFix configuration, multi-stage Docker, and graceful shutdown that achieves zero downtime.
The author designed and implemented, in Python / Flask / SQLAlchemy / PostgreSQL, the backend of a B2B SaaS that won the Minister of Economy, Trade and Industry Award, and ran 221 endpoints in production on API Gateway → ALB → ECS (Fargate). What I show here is the design that this field experience of "keeping Flask running safely behind a proxy chain" required.
💡 The versions covered in this article: it assumes Flask 3.1.x. In Flask 3.1, Werkzeug sits at the WSGI/HTTP layer, and features directly tied to deployment, like ProxyFix and TRUSTED_HOSTS, live there. The WSGI server (Gunicorn, etc.) is not a Flask dependency; it's installed separately.
1. The Big Premise: The Development Server Must Not Be Used in Production
Everything starts from Flask's official clear warning.
"Do not use the development server when deploying to production. It is intended for use only during local development. It is not designed to be particularly secure, stable, or efficient."
The official docs also state flatly that "production" means "not development." Even an internal tool becomes production the moment it is exposed externally.
Both flask run and app.run() are this "development server that must not be used."
# ❌ Do not start this in production
if __name__ == "__main__":
    app.run()  # This is Werkzeug's development server: not secure, stable, or efficient
⚠️ Anti-pattern: making if __name__ == "__main__": app.run() the CMD of Docker or the entry point of a process manager. That is exactly "exposing the development-only server to production traffic." Isolate app.run() in if __name__ == "__main__": so it runs only when you execute the script directly on your local machine, and never call it from a production path (Docker, systemd, ECS). Production startup is handled by the WSGI server, covered from the next section on.
1.1 Think of the WSGI "App" and the WSGI "Server" Separately
Why can't Flask alone run production? Because Flask is a WSGI application, not a server. In the official words, "Flask is a WSGI application; a WSGI server runs it."
The two roles are clearly divided.
| Layer | Role | Concrete form |
|---|---|---|
| WSGI server | Receives HTTP over TCP, converts the HTTP request into the WSGI environ dict and passes it to the app, and turns the returned response back into HTTP | Gunicorn / Waitress / uWSGI / mod_wsgi |
| WSGI app | A pure function that receives environ and returns a response. Routing, views, templates | Your Flask app |
So production deployment comes down to "loading your WSGI app (app) into a production-grade WSGI server." You just replace app.run()'s development server with Gunicorn; the structure is that simple. How to load app when you assemble it with an application factory (create_app) is covered in §3.1 (for the factory's own design, see the large-app structure guide).
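To make the app/server split concrete, here is a minimal sketch using only the standard library. The names tiny_app and call_wsgi_app are hypothetical; call_wsgi_app plays the server's role in miniature (a real server such as Gunicorn also handles sockets, HTTP parsing, and worker processes), and a Flask app object is a WSGI callable that is invoked in the same way.

```python
from wsgiref.util import setup_testing_defaults


def tiny_app(environ, start_response):
    """WSGI app side: receives environ, returns an iterable of bytes."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "text/plain; charset=utf-8"),
         ("Content-Length", str(len(body)))],
    )
    return [body]


def call_wsgi_app(app, path):
    """WSGI server side, in miniature: build environ, call the app, collect the response."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_wsgi_app(tiny_app, "/ping")
print(status, body.decode("utf-8"))
```

Swapping the development server for Gunicorn changes only who plays the call_wsgi_app role; the app callable itself stays the same.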
2. Choosing a WSGI Server: Gunicorn / Waitress / uWSGI / mod_wsgi
The WSGI servers Flask officially lists as "self-hosting" options are Gunicorn, Waitress, mod_wsgi, uWSGI, and gevent. The two that become first candidates in practice are Gunicorn and Waitress; the rest are for specific requirements.
| Server | Platform | Characteristics | When to choose |
|---|---|---|---|
| Gunicorn | Linux / WSL (not Windows) | Pre-fork model. Rich worker types (sync / gevent, etc.). Straightforward config | The default for Linux production. Containers, ECS, EC2 |
| Waitress | Cross-platform (Windows OK) | Pure-Python implementation. Zero deps, thread-based | Windows servers, when you want to complete in pure Python |
| uWSGI | Linux | High-functionality, high-performance, but complex config (high learning cost) | Large setups needing multi-language or advanced tuning |
| mod_wsgi | Bundled with Apache | Embeds WSGI into Apache httpd | When riding along on existing Apache assets |
The selection guidance is simple.
- Putting it on a container/VM on Linux → Gunicorn. This is the protagonist of this article. Config is straightforward, the choice of worker types is wide, and it pairs well with container platforms like ECS/Fargate.
- If you must run on Windows → Waitress. The official docs too position Waitress as "a pure-Python option that runs on Windows." Gunicorn does not support Windows (it works on WSL).
- uWSGI is powerful but config-heavy. "uWSGI for now" often isn't worth the learning cost, and if Gunicorn can meet the requirements, you should choose Gunicorn.
- mod_wsgi presupposes Apache. There's little reason to deliberately choose it for a greenfield; it's the option when you have a special circumstance of embedding into existing Apache.
💡 The author's choice: in the lumber-distribution SaaS, Linux containers (ECS/Fargate) were the premise, so I adopted Gunicorn without hesitation. Waitress or uWSGI is worth considering only when there's a "Windows constraint" or an "extreme tuning requirement"; for today's typical setup of container × Linux, Gunicorn is the de facto standard.
3. Configuring Gunicorn at Production Quality
3.1 Basic Startup: Understanding the Load Syntax
Gunicorn's app specification is the form {module_import}:{app_variable}. The equivalences the official docs show are exactly how to read it.
# Equivalent to 'from hello import app'
gunicorn -w 4 'hello:app'
# Equivalent to 'from hello import create_app; create_app()' (factory)
gunicorn -w 4 'hello:create_app()'
hello is the module (hello.py), and after the colon is "the WSGI app to retrieve from that module." Like the latter, adding parentheses lets you use the result of calling a factory function as the WSGI app. If you adopt an application factory (create_app), use the latter 'myapp:create_app()' form.
3.2 Worker Count (-w): Start From CPU × 2
Gunicorn's default worker count is 1. In the official words, "The default is only 1 worker, which is probably not what you want." In production, 1 worker is a fatal bottleneck where, while processing one request, other requests are made to wait.
The starting point is the official CPU × 2 ("a starting value could be CPU * 2"). It is strictly a "starting point," from which you tune with load testing.
# On a 4-core host, start from 8 workers
gunicorn -w 8 'myapp:create_app()'
But the worker count has a memory trade-off. Gunicorn's default (sync worker) is pre-fork, where each worker is a separate process holding the whole app in memory. With 8 workers, roughly the app's memory footprint × 8 becomes the required memory. Always check whether CPU × 2 fits within the container's memory limit (the ECS task definition, etc.).
⚠️ "More workers = faster" is wrong. Throughput plateaus at the limits set by the CPU-core count and memory. Workers far exceeding the core count only increase context switching and memory consumption; throughput doesn't rise. If you need to scale, increase the number of containers / tasks (horizontal scaling), not the worker count. This is the philosophy of ECS auto-scaling (ECS Fargate production guide).
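As a rough illustration of the CPU × 2 starting point and the memory cap described above, the following sketch computes a worker count that respects both. The function name suggest_workers and the figures (2048 MB limit, 300 MB per worker) are hypothetical; measure your own app's per-worker footprint before relying on any number.

```python
def suggest_workers(cpu_cores, memory_limit_mb, per_worker_mb, reserve_mb=0):
    """Start from CPU x 2, then cap by how many workers fit in the memory limit."""
    if cpu_cores < 1 or per_worker_mb <= 0:
        raise ValueError("cpu_cores must be >= 1 and per_worker_mb must be > 0")
    cpu_based = cpu_cores * 2
    memory_based = (memory_limit_mb - reserve_mb) // per_worker_mb
    return max(1, min(cpu_based, memory_based))


# 4 cores, 2048 MB limit, roughly 300 MB per worker (hypothetical figures):
# CPU x 2 suggests 8, but only 6 workers fit in memory.
print(suggest_workers(cpu_cores=4, memory_limit_mb=2048, per_worker_mb=300))
```

The result is only a starting value for load testing, in the same spirit as the CPU × 2 guidance.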
3.3 sync Workers vs. gevent Workers: Don't Conflate These
The choice of Gunicorn's worker type is the most misunderstood point. The official guidance is clear.
"The default sync worker is appropriate for most use cases. If you need numerous, long running, concurrent connections, Gunicorn provides an asynchronous worker using gevent."
First, the default sync worker is enough. A sync worker is a model where "one worker processes one request to the end," ideal for CPU-bound processing and general APIs with fast responses. The 221 endpoints of the lumber-distribution SaaS were also handled by sync workers without issue for the most part.
The gevent worker is effective only when the requirement is "numerous, long-running, concurrent connections." Concretely, it's workloads where IO waiting dominates — long waits for external-API responses, maintaining many connections like long-polling or SSE.
# gevent worker (only when IO waiting dominates)
gunicorn -k gevent 'myapp:create_app()'
When using gevent, you need greenlet>=1.0 in the dependencies.
Here is a point you must absolutely not conflate. The official docs nail it down explicitly.
"This is NOT the same as Python's async/await, or the ASGI server spec."
In other words, the gevent worker is neither Python's async/await nor ASGI. gevent is a mechanism that multiplexes IO waiting via cooperative multitasking with greenlets, and it is a different thing from async def views (covered in §7) and from ASGI frameworks like FastAPI. The naive choice of "gevent because I want async" breaks the code's premises. If you bring in gevent, do so understanding the premise that blocking calls are cooperatively scheduled via monkey-patching.
💡 On threads and eventlet: Gunicorn has --threads (threads per worker) and an eventlet worker too, but these are not in Flask's official documentation. They are the domain of Gunicorn's own documentation (docs.gunicorn.org). When this article touches --threads or eventlet, distinguish them as "Gunicorn features," not "Flask guidance." What the official Flask docs mention is only sync and gevent.
3.4 Bind and "Don't Run as Root"
To make Gunicorn accessible from outside, specify the bind target with -b (--bind).
gunicorn -w 4 -b 0.0.0.0 'myapp:create_app()'
There are 2 serious warnings here, both noted verbatim by the official docs.
"Gunicorn should not be run as root..."
You must not run Gunicorn as root. If the process is ever hijacked, with root privileges the damage reaches all permissions. As the principle of least privilege, start it as a dedicated unprivileged user (the concrete measure in Docker is §5).
"Don't [bind to 0.0.0.0] when using a reverse proxy setup, otherwise it will be possible to bypass the proxy."
Behind a reverse proxy, you must not bind to 0.0.0.0. 0.0.0.0 listens on all network interfaces, so Gunicorn can be reached directly without going through the proxy, bypassing all of the authentication, WAF, and header shaping the proxy applies. Behind a proxy, bind to an address reachable only from the proxy (a Unix socket, 127.0.0.1, or a dedicated port on the container-internal network).
# If the proxy is on the same host, confine Gunicorn to localhost
gunicorn -w 4 -b 127.0.0.1:8000 'myapp:create_app()'
# Unix socket (typical when co-located with nginx)
gunicorn -w 4 -b unix:/run/myapp.sock 'myapp:create_app()'
⚠️ 0.0.0.0 in a container is context-dependent. As with ECS/Fargate, where "ALB → a specific container port" is the only connectivity path and the container's network is closed by a security group, binding to 0.0.0.0:8000 inside the container is itself common. What's dangerous is the state of "binding to 0.0.0.0 while that port is directly reachable from outside." Whether you can block direct reach from anywhere but the proxy at the network boundary (SG, VPC) is the criterion.
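One way to keep this distinction visible is a small pre-deploy check that classifies a bind string. This is a sketch under assumptions: bind_exposure and its labels are hypothetical, it is not part of Gunicorn, and it only reports what the address listens on; whether the port is actually reachable from outside still depends on the network boundary.

```python
import ipaddress


def _parse_ip(text):
    """Return an ip_address object, or None if text is not an IP literal."""
    try:
        return ipaddress.ip_address(text.strip("[]"))
    except ValueError:
        return None


def bind_exposure(bind):
    """Classify a Gunicorn-style bind string by what it listens on."""
    if bind.startswith("unix:"):
        return "unix-socket"          # reachable only through the filesystem
    addr = _parse_ip(bind)            # bare address without a port
    if addr is None:
        addr = _parse_ip(bind.rsplit(":", 1)[0])  # HOST:PORT or [IPv6]:PORT
    if addr is None:
        return "hostname"             # e.g. "localhost:8000"; depends on resolution
    if addr.is_unspecified:
        return "all-interfaces"       # 0.0.0.0 or ::
    if addr.is_loopback:
        return "loopback"
    return "specific-interface"


for bind in ("0.0.0.0:8000", "127.0.0.1:8000", "unix:/run/myapp.sock"):
    print(bind, "->", bind_exposure(bind))
```

An "all-interfaces" result is not automatically wrong; it is a prompt to confirm that a security group or equivalent closes off direct access.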
3.5 Access Logs and Timeouts
Gunicorn's access logs are off by default. In container operations the standard is to aggregate logs to standard output, so enable it with --access-logfile=- (- is stdout).
gunicorn -w 4 --access-logfile=- 'myapp:create_app()'
Distinguish 2 kinds of timeout.
- --timeout (default 30 seconds): if a worker doesn't respond for this many seconds, the master kills and restarts it. If you have long processing, extend it, but carelessly lengthening it lets a clogged worker linger, so the right move is to offload long processing to an async job (queue) rather than running it synchronously inside a worker.
- --graceful-timeout (default 30 seconds): the grace period, on restart/shutdown, for in-progress requests to drain. It's the core parameter of graceful shutdown (§6).
3.6 gunicorn.conf.py: Manage Config in Code
When the command-line flags grow, consolidate them in a config file gunicorn.conf.py. It's a more review-friendly, more reproducible setup than scattering them on the CLI.
# gunicorn.conf.py: manage production settings centrally in code
import multiprocessing
import os
# Bind: the container-internal port. External reachability is controlled at the network boundary (ALB/SG)
bind = os.getenv("GUNICORN_BIND", "0.0.0.0:8000")
# Worker count: start from CPU × 2, overridable via environment variable
workers = int(os.getenv("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2))
# Worker type: sync by default. If IO waiting dominates, set "gevent" via environment variable
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "sync")
# Send all logs to stdout/stderr (the container's log driver aggregates them)
accesslog = "-"
errorlog = "-"
loglevel = os.getenv("GUNICORN_LOG_LEVEL", "info")
# Timeout: kill a stuck worker. Assumes long-running work is offloaded to a queue
timeout = int(os.getenv("GUNICORN_TIMEOUT", "30"))
# Graceful-shutdown grace period (§6). Align it with the ALB's drain time
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))
# Process name (easy to spot in ps)
proc_name = "myapp"
# The config file is read automatically (gunicorn.conf.py in the current directory)
gunicorn 'myapp:create_app()'
# To specify it explicitly
gunicorn -c gunicorn.conf.py 'myapp:create_app()'
The design of observability (structuring logs and adding request IDs) is split out into the error-handling / observability guide. Tying Gunicorn's access logs and the Flask app's structured logs together with the same correlation ID is the production crux.
4. Running Behind a Reverse Proxy / Load Balancer
4.1 Why Put a Proxy in Front
A WSGI server has an HTTP server built in. But the official docs say:
"WSGI servers have HTTP servers built-in. However, a dedicated HTTP server may be safer, more efficient, or more capable. Putting an HTTP server in front of the WSGI server is called a 'reverse proxy.'"
Putting a dedicated HTTP server (reverse proxy) in front makes the setup safer, more efficient, and more capable: you delegate TLS termination, static-file serving, rate limiting, buffering, and health checks to the proxy layer, and Gunicorn can focus on app processing. As front-line options the official docs list nginx and Apache httpd, and state that PaaS offerings (Cloud Run, Elastic Beanstalk, App Engine, Azure, etc.) are similarly proxy setups. And there's an important sentence.
"You'll probably need to Tell Flask it is Behind a Proxy when using most hosting platforms."
The way to tell Flask this is ProxyFix, covered next.
The author's setup was a multi-stage proxy of API Gateway → ALB → ECS (Fargate). Because there are multiple relays between client and app, the ProxyFix configuration required particular care.
4.2 ProxyFix: Tell It the Proxy Hop Count Correctly
Behind a proxy, the client's original information (source IP, protocol, host) arrives at the app stored in headers like X-Forwarded-For / X-Forwarded-Proto / X-Forwarded-Host. The middleware that makes Flask (Werkzeug) interpret these correctly is ProxyFix. The form the official docs show is this.
from werkzeug.middleware.proxy_fix import ProxyFix
app.wsgi_app = ProxyFix(
    app.wsgi_app, x_for=1, x_proto=1, x_host=1, x_prefix=1
)
Each x_* argument is "the COUNT of proxies setting that X-Forwarded-* header."
| Argument | Corresponding header | Meaning |
|---|---|---|
| x_for | X-Forwarded-For | The client's source IP (reflected in request.remote_addr) |
| x_proto | X-Forwarded-Proto | The original scheme (http/https) |
| x_host | X-Forwarded-Host | The original Host header |
| x_prefix | X-Forwarded-Prefix | The path prefix the proxy stripped |
And the official warning gets at the essence of this configuration.
"This middleware should only be used if the application is actually behind a proxy, and should be configured with the number of proxies that are chained in front of it. Since incoming headers can be faked, you must set how many proxies are setting each header so the middleware knows what to trust. ... It can be a security issue if you get this configuration wrong."
The points are 3.
- Use it only when actually behind a proxy. Put ProxyFix in when there's no proxy, and it will believe the fake X-Forwarded-* the client sent.
- Set the front-line proxy hop count accurately. Because headers can be spoofed, fix "how many proxies set each header" as a number in x_for etc., and trust only that many from the end.
- Getting the config wrong becomes a security problem. Over-estimate the hop count, and you treat a fake IP injected by the client as genuine, deceiving all of IP-based rate limiting, audit logs, and access control.
⚠️ Never miscount the hops. x_for=1 means "1 proxy sets X-Forwarded-For." If only the ALB is in front, x_for=1. If the 2 stages of API Gateway → ALB both stack X-Forwarded-For, measure the setup and set the appropriate hop count (in many cases x_for=2). "1 for everything just in case" or "a big number just in case" is forbidden; decide the number only after confirming, with the actual request headers, how many hops your proxy chain stacks each header in. The author observed the actual X-Forwarded-For value in a production-equivalent environment and fixed the hop count.
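The "trust only that many from the end" rule can be sketched in a few lines. This is a simplified illustration of the selection logic, not Werkzeug's actual code; the function name trusted_forwarded_value and the sample addresses are hypothetical.

```python
def trusted_forwarded_value(header_value, trusted_hops):
    """Pick the value added by the outermost trusted proxy.

    Each proxy appends to the header, so with N trusted proxies the
    N-th value from the right is the one the outermost trusted proxy added.
    Returns None when trust is disabled or the header has fewer values
    than trusted_hops.
    """
    if not trusted_hops or not header_value:
        return None
    values = [v.strip() for v in header_value.split(",")]
    if trusted_hops <= len(values):
        return values[-trusted_hops]
    return None


# The client sends a forged header; one real proxy (the ALB) appends the real peer IP.
header = "6.6.6.6, 203.0.113.7"
print(trusted_forwarded_value(header, 1))  # 203.0.113.7 (value appended by the ALB)
print(trusted_forwarded_value(header, 2))  # 6.6.6.6 (forged value trusted by mistake)
```

With the hop count over-estimated by one, the forged leftmost value is treated as the client address, which is exactly the failure the warning above describes.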
4.3 Mapping to ALB / nginx
The ProxyFix configuration corresponds directly to the front-line setup.
- Behind an AWS ALB (ECS/Fargate): the ALB terminates TLS and adds X-Forwarded-For / X-Forwarded-Proto. If there's an API Gateway or CloudFront in front of the ALB and they too stack X-Forwarded-For, set their total hop count in x_for. Because TLS is terminated at the ALB, traffic arrives at Gunicorn/Flask as plaintext (HTTP). That's exactly why you need to tell Flask, via ProxyFix, with x_proto, that "it was originally HTTPS."
- Behind nginx (same host or VM): set proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; etc. in nginx, have Gunicorn listen on 127.0.0.1 or a Unix socket, and set ProxyFix to x_for=1 (nginx, 1 hop).
4.4 Cookie / Host Validation on TLS Termination
In a setup where the proxy terminates TLS, don't forget to harden 2 security settings on the Flask side.
- SESSION_COOKIE_SECURE = True: since production presupposes HTTPS, attach the Secure attribute to the session cookie and don't let it be sent in plaintext. TLS is terminated at the ALB/nginx, but as long as x_proto conveys to Flask that the origin was HTTPS, the Secure cookie functions correctly.
- TRUSTED_HOSTS (added in Flask 3.1): validates the Host header during routing and prevents Host-header attacks (using a fake Host to target link generation or cache poisoning). Behind a proxy, Host can be manipulated from outside, so specify the allowed hosts explicitly.
app.config.update(
    SESSION_COOKIE_SECURE=True,  # HTTPS-only cookie
    SESSION_COOKIE_HTTPONLY=True,
    SESSION_COOKIE_SAMESITE="Lax",
    TRUSTED_HOSTS=["api.example.com"],  # Host header validation (3.1+)
)
The detailed design of cookie attributes, CSRF, and TRUSTED_HOSTS is deep-dived in the security implementation guide.
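To show what Host validation amounts to, here is a simplified sketch of the matching rule: an entry is either an exact match or, when it starts with a dot, matches any subdomain. The function host_is_allowed is hypothetical and is not the framework's implementation; it assumes plain names or IPv4 addresses and ignores IPv6 literals and internationalized domain names.

```python
def host_is_allowed(host_header, trusted_hosts):
    """Return True if the Host header matches one of trusted_hosts.

    Simplification: handles names and IPv4 only (no IPv6 literals, no IDNA).
    """
    hostname = host_header.split(":", 1)[0].strip().lower()  # drop any :port
    for entry in trusted_hosts:
        entry = entry.lower()
        if entry.startswith("."):
            base = entry[1:]
            # ".example.com" matches example.com and any subdomain of it
            if hostname == base or hostname.endswith("." + base):
                return True
        elif hostname == entry:
            return True
    return False


print(host_is_allowed("api.example.com", ["api.example.com"]))             # True
print(host_is_allowed("api.example.com.evil.test", ["api.example.com"]))   # False
```

A request whose Host fails this kind of check is rejected before a view can use the value for link generation.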
💡 A behavior change in SERVER_NAME (Flask 3.1): Flask 3.1 fixed how host_matching=True and subdomain_matching=False interact with SERVER_NAME, and setting SERVER_NAME no longer restricts requests to only that domain. For the production purpose of "accept only a specific host," the correct answer is TRUSTED_HOSTS, not SERVER_NAME.
5. Docker: Assembling the Production Container Correctly
If you put it on a container platform like ECS/Fargate, how you build the image governs production quality. The requirements are "small, non-root, doesn't include the development server, has a health check." A multi-stage build satisfies these.
# syntax=docker/dockerfile:1
# ---- builder: stage that builds and installs dependencies ----
FROM python:3.12-slim AS builder
# Confine tools needed only at build time to this stage
ENV PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1
WORKDIR /app
# Install dependencies into a virtual environment (copied whole into the next stage)
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install -r requirements.txt gunicorn
# ---- runtime: final stage with only what is needed to run ----
FROM python:3.12-slim AS runtime
# Prepare a non-root user (satisfies §3.4's "don't run as root")
RUN useradd --create-home --uid 10001 appuser
# Bring in only the installed virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
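# PYTHONPATH assumes the package lives at src/myapp, so that 'myapp:create_app()' is importable
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONPATH=/app/src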
WORKDIR /app
COPY --chown=appuser:appuser src/ ./src/
COPY --chown=appuser:appuser gunicorn.conf.py ./
# Drop root here. Every process from this point on runs as the unprivileged user
USER appuser
EXPOSE 8000
# Hit /health to confirm the container is alive (ties into §6's readiness/liveness)
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://127.0.0.1:8000/health').status==200 else 1)"
# Never use the development server. Start the factory with Gunicorn
CMD ["gunicorn", "-c", "gunicorn.conf.py", "myapp:create_app()"]
Let me organize the points.
- Multi-stage: use compilers and build tools in builder, and bring only the virtual environment (/opt/venv) into runtime. No build tools remain in the final image, shrinking size and attack surface.
- slim base: keep the foundation small with python:3.12-slim. alpine can struggle to build psycopg etc. because it is based on musl libc, so for Python slim is the safe bet.
- Non-root USER: switch to the appuser created by useradd with USER, satisfying §3.4's "don't run as root" at the container level.
- Doesn't include the development server: CMD is Gunicorn. Neither flask run nor app.run() appears.
- HEALTHCHECK: hit the /health endpoint (the @app.route("/health") defined in pillar §3) and confirm 200. On ECS you separately set a task-definition-side health check, but having one at the Docker level too makes it effective in local/compose.
5.1 .dockerignore
To not pollute the build context and to absolutely keep secrets and unnecessary things out of the image, always place a .dockerignore.
.git
.gitignore
__pycache__/
*.pyc
.venv/
venv/
.env
.env.*
instance/
tests/
.pytest_cache/
*.md
Dockerfile
.dockerignore
⚠️ Always exclude .env and instance/. If .env (secrets) or instance/ (local config, SQLite) gets baked into the image, secrets leak to the registry. Don't put secrets in the image; inject them at runtime via environment variables.
5.2 Inject Config via FLASK_-Prefixed Environment Variables
Don't bake secrets and environment differences into the image; pass them at runtime via environment variables. That's the 12-factor principle. With app.config.from_prefixed_env() (available since Flask 2.1), you can load environment variables starting with FLASK_ into app.config; each value is parsed with json.loads and kept as a plain string if parsing fails. For the full picture of config management, see the pillar Flask production-operations guide §4.
# Assumes injection from the ECS task definition / Secrets Manager
docker run --rm -p 8000:8000 \
    -e FLASK_SECRET_KEY="$(python -c 'import secrets; print(secrets.token_hex())')" \
    -e FLASK_SQLALCHEMY_DATABASE_URI="postgresql+psycopg://..." \
    -e GUNICORN_WORKERS=8 \
    myapp:latest
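The prefix-and-JSON behavior can be illustrated with a standard-library sketch. This is a simplified stand-in, not Flask's implementation (for instance, it omits nested keys); the function load_prefixed_env and the sample values are hypothetical.

```python
import json


def load_prefixed_env(environ, prefix="FLASK_"):
    """Collect prefixed variables, decoding each value as JSON when possible."""
    config = {}
    for key in sorted(environ):
        if not key.startswith(prefix):
            continue
        raw = environ[key]
        try:
            value = json.loads(raw)   # "8" -> 8, "true" -> True
        except ValueError:
            value = raw               # not valid JSON: keep the plain string
        config[key[len(prefix):]] = value
    return config


env = {
    "FLASK_SECRET_KEY": "s3cr3t-value",
    "FLASK_SESSION_COOKIE_SECURE": "true",
    "FLASK_MAX_CONTENT_LENGTH": "1048576",
    "GUNICORN_WORKERS": "8",          # no FLASK_ prefix, so it is ignored
}
print(load_prefixed_env(env))
```

One consequence worth noting in this sketch: a value made only of digits is decoded as a number, so a string setting that happens to look numeric would need JSON quoting to stay a string.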
The concrete steps to run this container as a Fargate task and register it as an ALB target are consolidated in the ECS Fargate production guide.
💡 Beware the multiplication of DB connection pool × worker count: because Gunicorn runs each worker as an independent process, SQLAlchemy's connection pool is also held separately per worker. pool_size=5 × 8 workers × 4 tasks = up to 160 connections (more if max_overflow allows extra connections), eating up PostgreSQL's max_connections. This is an accident that frequently occurs in production. In a multi-worker × multi-task setup, the standard is to interpose a connection pooler like PgBouncer. Details are compiled in the PostgreSQL connection-pooling guide.
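The arithmetic is simple enough to check in code before a deploy. The helper max_db_connections below is a hypothetical sketch; it assumes every worker can fill its pool at the same time, and it includes max_overflow as an optional term because overflow connections also count against the server limit.

```python
def max_db_connections(pool_size, workers, tasks, max_overflow=0):
    """Worst-case connections when every worker in every task fills its pool."""
    return (pool_size + max_overflow) * workers * tasks


# The figures from the text: pool_size=5, 8 workers, 4 tasks.
print(max_db_connections(pool_size=5, workers=8, tasks=4))                   # 160
# With 10 overflow connections allowed per pool, the worst case triples.
print(max_db_connections(pool_size=5, workers=8, tasks=4, max_overflow=10))  # 480
```

Comparing this number against the database's max_connections (minus whatever other clients need) shows early whether a pooler is required.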
6. Graceful Shutdown: The Core of Zero Downtime
If you get 502/504 on every deploy, the cause is almost always "cutting in-progress requests and bringing the container down." Zero downtime is achieved by "the new container becoming acceptable, then the old container draining its in-progress requests and quietly exiting."
6.1 SIGTERM and graceful-timeout
A container orchestrator (ECS, Kubernetes) first sends SIGTERM when stopping a container. On receiving this, Gunicorn goes down gracefully in this order:
- stops accepting new requests,
- drains the in-progress (in-flight) requests within the grace of --graceful-timeout,
- terminates the process once all workers are cleaned up.
Workers that don't finish within the grace period are force-killed.
# gunicorn.conf.py (repeated excerpt)
# Align with the ALB's deregistration delay
graceful_timeout = int(os.getenv("GUNICORN_GRACEFUL_TIMEOUT", "30"))
What matters here is to align Gunicorn's graceful_timeout with the load balancer's drain time (the ALB's deregistration delay). If the window in which the ALB stops routing new requests to a target and lets in-flight requests drain diverges from Gunicorn's grace period, you get a mismatch: the process vanishes before draining finishes, or the ALB keeps routing to a process that is already shutting down.
💡 Make PID 1 Gunicorn so it can be brought down by SIGTERM. With Docker's CMD ["gunicorn", ...] (exec form), Gunicorn becomes PID 1 and can receive SIGTERM directly. The shell form (CMD gunicorn ...) makes the shell PID 1, the signal isn't conveyed to Gunicorn, and graceful shutdown stops working, which is a common container pitfall. This article's Dockerfile uses the exec form (JSON array) for this reason.
6.2 When You Do Cleanup After SIGTERM on the App Side
Normally, leaving it to Gunicorn's graceful processing is enough, but if explicit cleanup at worker exit (closing connections, etc.) is needed, close resources with a gunicorn.conf.py hook (worker_exit, etc.) or, on the app side, with teardown_appcontext (see pillar §5). For resources like DB connections that "should be bound to the request's lifetime," if you manage them with g and teardown_appcontext in the first place, you don't need to handle them individually at shutdown.
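As a minimal sketch of the hook approach, the excerpt below defines Gunicorn's worker_exit server hook in gunicorn.conf.py and runs a list of cleanup steps. The CLEANUP_STEPS list is a hypothetical convention for this illustration; in a real project the steps would more likely be imported from the application package.

```python
# gunicorn.conf.py (excerpt)
import logging

log = logging.getLogger("gunicorn.error")

# Hypothetical cleanup steps: zero-argument callables that release
# worker-level resources (connection pools, HTTP clients, and so on).
CLEANUP_STEPS = []


def worker_exit(server, worker):
    """Gunicorn hook: runs in the worker process just after the worker has exited."""
    for step in CLEANUP_STEPS:
        try:
            step()
        except Exception:
            # One failing step should not prevent the remaining steps from running.
            log.exception("cleanup step %r failed (worker pid %s)", step, worker.pid)
```

Request-scoped resources still belong in teardown_appcontext; a hook like this is only for things that live as long as the worker itself.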
6.3 Separate readiness and liveness
Health checks are split into 2 kinds by purpose. Conflate them and you get traffic flowing to a still-starting container, or a merely temporarily-slow container being killed.
| Kind | Question | Behavior on failure | What it checks |
|---|---|---|---|
| readiness | "Can it receive requests now?" | The load balancer doesn't send traffic | Confirms up to DB connection and connectivity of required deps |
| liveness | "Is the process alive?" | Restarts the container | A lightweight check of whether the process responds |
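from sqlalchemy import text  # needed for the raw SELECT 1 below
# liveness: lightweight. Returns 200 as long as the process responds
@app.route("/health")
def liveness():
    return {"status": "ok"}
# readiness: checks connectivity to dependencies. Returns 503 right after startup or when the DB is down
@app.route("/ready")
def readiness():
    try:
        db.session.execute(text("SELECT 1"))
    except Exception:
        return {"status": "not ready"}, 503
    return {"status": "ready"}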
⚠️ Don't put a heavy dependency check in liveness. Put DB connectivity into liveness, and a merely temporarily-slow DB makes the container judged "dead" and falls into a restart loop. Handle a DB outage with readiness (stop traffic), and have liveness look only at the process's life-or-death — this separation is the crux of stable operation. Health-check design is also covered in the error-handling / observability guide.
6.4 Tying It to a Rolling Deploy (ECS)
Applying everything so far to an ECS rolling deploy, the zero-downtime flow becomes this.
- ECS starts a new task (the new image's container).
- Once the ALB confirms "acceptable" via readiness (/ready), it registers the new task as a target and begins flowing traffic.
- SIGTERM to the old task. The ALB begins deregistration and stops sending new requests to the old task.
- The old task's Gunicorn drains in-flight within the grace of graceful_timeout and quietly terminates.
When these 4 steps mesh, users see no errors during the deploy. The concrete task definition / service / deploy settings on Fargate are covered in the ECS Fargate production guide.
7. async def Views: When You May and May Not Use Them
Flask supports async def views from 2.0 (pip install flask[async] is needed). But misunderstand this as a deployment-performance improvement and you'll get hurt. The official caution is decisive.
Even with an async view, each request still monopolizes one worker. Going async does not increase the number of requests you can handle concurrently. async is effective only when "within a single view, you run multiple IOs (external API calls, etc.) in parallel" — it's not an improvement to throughput (concurrent request count).
import asyncio
import httpx
@app.route("/aggregate")
async def aggregate():
    # Call 3 external APIs concurrently within 1 view: this is where async pays off
    async with httpx.AsyncClient() as client:
        a, b, c = await asyncio.gather(
            client.get("https://api.example.com/a"),
            client.get("https://api.example.com/b"),
            client.get("https://api.example.com/c"),
        )
    return {"a": a.json(), "b": b.json(), "c": c.json()}
There's a further constraint. Even if you spawn a background task with asyncio.create_task(), it is canceled the moment the view returns. Flask's async view is for "IO parallelism during request processing," not a place for fire-and-forget background processing.
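The cancellation behavior can be reproduced with the standard library alone. In this sketch asyncio.run stands in for the event loop that runs a single async view; Flask's own machinery differs in detail, and the names view and background_job are hypothetical.

```python
import asyncio

finished = []


async def background_job():
    await asyncio.sleep(0.2)            # pretend to do slow follow-up work
    finished.append("background_job")   # never reached in this demo


async def view():
    asyncio.create_task(background_job())  # fire-and-forget attempt
    await asyncio.sleep(0)                 # yield once so the task gets started
    return {"status": "accepted"}


# The loop shuts down as soon as view() returns, cancelling the pending task.
response = asyncio.run(view())
print(response, finished)  # {'status': 'accepted'} []
```

The response comes back, but the follow-up work never completes, which is why background jobs belong in a queue rather than in a spawned task.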
The judgment is simple.
- May use: when you want to parallelize multiple independent external IOs within a single view to shrink that view's latency.
- Must not use / should take another approach: when you want the whole app to be async-first, to handle a large number of concurrent connections, or to run background jobs.
If you want the whole app to be async-first, the official docs recommend Quart (which has a Flask-compatible API) on an ASGI premise. Alternatively, to run an existing Flask app under an ASGI server, you can use asgiref's WsgiToAsgi adapter. For a large number of concurrent connections there's also the option of Gunicorn's gevent worker, but as §3.3 explains, that's a different thing from async/await, so don't conflate them. The use-cases of Flask / FastAPI / Django themselves are compiled in the technology-selection guide.
8. Production Checklist
Let me put the design so far into a form you confirm without fail before deploying.
| Category | Check item | Basis (body) |
|---|---|---|
| Server | Excluded flask run / app.run() from the production path | §1 |
| Server | Chose a WSGI server (Gunicorn on Linux) | §2 |
| Gunicorn | Set worker count from CPU × 2, fits within the memory limit | §3.2 |
| Gunicorn | Chose the worker type by requirement (default sync, gevent if IO-heavy) | §3.3 |
| Gunicorn | Not conflating gevent with async/await / ASGI | §3.3 |
| Gunicorn | Not running as root (non-root user) | §3.4 / §5 |
| Gunicorn | Not directly exposed on 0.0.0.0 behind a proxy | §3.4 |
| Gunicorn | Emitting access logs to stdout (--access-logfile=-) | §3.5 |
| Proxy | Put in ProxyFix and set x_for etc. correctly to the proxy hop count | §4.2 |
| Proxy | Verified ProxyFix's hop count with actual request headers | §4.2 |
| Proxy | Set SESSION_COOKIE_SECURE=True / TRUSTED_HOSTS | §4.4 |
| Docker | Multi-stage, slim base, non-root USER | §5 |
| Docker | Doesn't include the development server; CMD is Gunicorn (exec form) | §5 / §6.1 |
| Docker | Excluded .env / instance/ in .dockerignore | §5.1 |
| Docker | Secrets not baked into the image; injected via env vars (FLASK_ prefix) | §5.2 |
| DB | Worker count × task count × pool size within max_connections | §5.2 |
| Stop | Goes down gracefully on SIGTERM, PID 1 is Gunicorn | §6.1 |
| Stop | Aligned graceful_timeout with the ALB's drain time | §6.1 |
| Stop | Separated readiness and liveness | §6.3 |
| async | Understand the async def view's "1-worker monopoly" constraint | §7 |
Summary: Deployment Is the Design of "Outside the App"
Most outages in Flask production deployment are born not from the app's code but from "the design of the environment that runs the app." Let me summarize the article's discipline one line at a time.
- Abandon the development server. flask run / app.run() is forbidden in production. Flask is a WSGI app; a WSGI server like Gunicorn runs it.
- Configure Gunicorn at production quality. Workers from CPU × 2, default sync is enough, gevent only when IO-heavy. Don't run as root, don't directly expose behind a proxy.
- Tell Flask the proxy hop count correctly with ProxyFix. x_for etc. is "the number of trusted proxies." Get it wrong and X-Forwarded-For is spoofed, collapsing IP-based control.
- Assemble the container correctly. Multi-stage, slim, non-root, development server excluded, /health health check. Don't bake secrets; inject via env vars.
- Go down gracefully. With SIGTERM and graceful_timeout, drain in-flight, separate readiness/liveness, and achieve zero downtime with ECS rolling deploys.
- Don't over-trust async views. Understand the 1-worker-monopoly constraint, and if async-first is the requirement, consider Quart/ASGI.
The reason the author could stably operate 221 endpoints on API Gateway → ALB → ECS (Fargate) is that I designed this "outside the app" one piece at a time. Invest in deployment quality as much as in code quality. The overall picture is in the Flask production-operations guide (pillar), and the specifics are in the spoke articles linked from there.