Introduction: performance optimization begins not with "making it fast" but with "measuring"
When a Flask app is called "slow," the most common failure is to start changing code without measuring. Rewriting a list comprehension as map, or replacing string concatenation with join: such micro-optimizations are noise in most web apps. What decides perceived speed in a production Flask app is, almost without exception, the database (N+1, missing indexes, connection-pool exhaustion), the presence or absence of caching, and worker design. Polishing small tricks while missing these won't move p95 at all.
This article is a spoke of the Flask production-operations guide (pillar) that takes its performance section down to production quality. It covers only optimization design that is faithful to the latest official documentation, assuming the Flask 3.1 line (the current stable version). Specifically, it runs through, in real code: (1) the measurement mindset, (2) caching with Flask-Caching (Redis), (3) rate limiting with Flask-Limiter, (4) DB performance (N+1, pools, pagination), (5) the response layer (gzip, async), and (6) the cost efficiency that follows directly from these.
I have designed and implemented the backend of a Minister of Economy, Trade and Industry Award-winning B2B SaaS in Python / Flask / SQLAlchemy / PostgreSQL, and operated 221 endpoints in production on API Gateway → ALB → ECS (Fargate). To balance "speed" and "cost" in a read-dominant business SaaS, what was needed wasn't flashy algorithms but keeping the plain order measure → DB → cache → rate limit → workers. This article traces that order as-is.
💡 Versions covered in this article: assuming the Flask 3.1 line, I use Flask-Caching 2.4.0 and Flask-Limiter 4.1.1. Both are extensions installed separately, not dependencies of Flask (pip install Flask-Caching==2.4.0 Flask-Limiter==4.1.1 redis). Redis is assumed for storage. All numbers (latency, connection counts) appearing here are examples for explanation, not benchmark values for a specific environment.
1. Measurement first: know where it's slow before anything else
1.1 Internalize the premise that Flask is synchronous
Before optimizing, grasp Flask's execution model precisely. Flask is a WSGI (synchronous) framework, and each request occupies one Gunicorn worker to the end. The official docs also state clearly about async def views that "each request still occupies one worker; making it async doesn't change the number of requests that can be handled simultaneously."
From here, you see what decides Flask's throughput.
- Requests handled simultaneously ≒ number of workers (for Gunicorn's sync workers).
- The longer one request's processing time (= waiting on the DB or an external API), the longer its worker is occupied and the longer other requests wait.
- So "shrinking one request's processing time" (caching, DB optimization) and "lining up workers appropriately" (deployment design) are two sides of the same coin.
Worker-count design (starting from CPU × 2, sync or gevent, the trade-off with memory) is consolidated in the production-deployment guide. This article handles the "make one request fast / reject wasteful requests" side. Both take effect together.
1.2 What to measure: not the average but p95 / p99
Do not judge latency by its average. The average blends the slow tail into the fast majority and hides the state of "most users are fast, but some are fatally slow." What you should look at are percentiles.
| Metric | Meaning | Why it matters |
|---|---|---|
| p50 (median) | Half the requests are faster than this | "The typical experience" |
| p95 | 95% are faster than this (the slower 5%) | A realistic target line for an SLO |
| p99 | 99% are faster than this (the slowest 1%) | Tail latency. When this clogs, the clog cascades |
An endpoint with a bad p99 occupies its worker for a long time. In synchronous Flask, one slow endpoint ripples into the wait time of all other requests, so crushing "the slowest part" pays off across overall throughput.
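To make the gap between the average and the percentiles concrete, here is a minimal sketch using only the standard library. The sample latencies are made up for illustration (90 fast requests, 10 very slow ones), and latency_percentiles is a hypothetical helper name, not something Flask provides.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 for a list of latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Made-up sample: 90 fast requests (20 ms) and 10 very slow ones (2000 ms)
samples = [20.0] * 90 + [2000.0] * 10
stats = latency_percentiles(samples)
mean = statistics.mean(samples)
print(f"mean={mean:.0f} ms  p50={stats['p50']:.0f} ms  "
      f"p95={stats['p95']:.0f} ms  p99={stats['p99']:.0f} ms")
```

In this sample the mean is 218 ms, a value no user actually experienced: the typical request takes 20 ms and the tail takes 2000 ms. Only the percentiles show both facts.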
1.3 How to measure: three granularities
Measurement is enough in three stages, from coarse to fine.
(a) Per-endpoint duration log—the first step. Record each request's duration with before_request / after_request.
import time
from flask import g, request, current_app

@app.before_request
def _start_timer():
    g._t0 = time.perf_counter()

@app.after_request
def _log_latency(response):
    elapsed_ms = (time.perf_counter() - g._t0) * 1000
    # Attach to the structured log (see the observability guide for adding request IDs)
    current_app.logger.info(
        "request_latency",
        extra={"path": request.path, "method": request.method,
               "status": response.status_code, "elapsed_ms": round(elapsed_ms, 1)},
    )
    return response
Aggregating this log tells you "which path is pulling p95/p99." Structuring logs and correlating request IDs are left to the error-handling / observability guide. You can also emit duration in Gunicorn's access log.
(b) Visualizing SQL queries—the majority of slowness is the DB. Turning on SQLAlchemy's engine log only during development reveals how many queries fly per request (N+1 surfaces here). For details, see §4 and the Flask-SQLAlchemy practical guide.
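One way to turn the engine log into a rough per-request number is a counting log handler. This is a sketch under assumptions: it relies on SQLAlchemy emitting its SQL log at INFO on the sqlalchemy.engine logger hierarchy, QueryCounter is a hypothetical class name, and the count is only a rough signal because the engine may log statements and their parameters as separate records. In Flask-SQLAlchemy, setting SQLALCHEMY_ECHO = True in a development config is the simpler way to just see the SQL.

```python
import logging

class QueryCounter(logging.Handler):
    """Counts log records emitted under the 'sqlalchemy.engine' logger."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.count = 0

    def emit(self, record):
        self.count += 1

    def reset(self):
        self.count = 0

# Development only: enable INFO on the engine logger and attach the counter
engine_logger = logging.getLogger("sqlalchemy.engine")
engine_logger.setLevel(logging.INFO)
counter = QueryCounter()
engine_logger.addHandler(counter)
```

Calling counter.reset() in before_request and logging counter.count in after_request would put the number next to the duration log. A count that grows with the number of rows in a list is the N+1 signature.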
(c) Identifying hotspots with a profiler—only when a CPU-bound process is genuinely suspect, identify hotspots inside the view with cProfile or Werkzeug's ProfilerMiddleware. But the order is to suspect the DB and cache before you get here.
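For the CPU-bound case, a small wrapper around cProfile is often enough before reaching for middleware. The sketch below uses only the standard library; profile_call and build_report are hypothetical names, and build_report merely stands in for a suspect function inside a view.

```python
import cProfile
import io
import pstats

def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report of the top 10 entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()

def build_report(n):
    # Hypothetical CPU-bound helper standing in for real view logic
    return sum(i * i for i in range(n))

result, report = profile_call(build_report, 100_000)
print(report)
```

Sorting by cumulative time shows which call tree dominates, which is usually what you want when deciding where to look first.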
⚠️ Don't optimize on guesses. Rewriting on "this is probably slow" is a gamble that only increases technical debt. Knuth's "premature optimization is the root of all evil" is still true. Always identify the bottleneck by measurement and confirm the effect by measurement—prepare the mechanism that lets you spin this loop (duration log, SQL log) before you start optimizing. This is the highest-ROI move.
💡 My experience: when I got a report that "the list screen is slow" in the lumber-distribution SaaS, the first thing I looked at was the SQL log. The cause wasn't the algorithm but an N+1 pulling line items one at a time (§4.1) and reading master data from the DB every time (the cache target in §2). Had I jumped to "Python is slow" without measuring, it would never have been fixed.
2. Flask-Caching: bring reads to memory speed
2.1 What to cache (and what not to)
Caching is no panacea. Caching the wrong target becomes a failure that keeps serving stale data. Data suited to caching satisfies the following three conditions.
- Read-heavy: the same thing is requested over and over.
- Expensive to produce: heavy DB aggregations, external API calls, complex rendering.
- Tolerant of staleness: being tens of seconds to minutes old causes no business problem.
Conversely, data that must be instantly accurate right after a write (payment balances, confirmed inventory) must not be casually cached. And cases of "offloading slow processing asynchronously"—such as heavy report generation or sending mail—are the domain of background jobs, not caching. Cache is for read-heavy, Celery is for write/slow work—divide them (the Flask + Celery + Redis background-tasks guide).
2.2 Setup: bundle RedisCache in the factory
Since Flask-Caching is an extension, follow the init_app pattern of pillar §3. Create Cache() bare and bind it with init_app.
# extensions.py: a "bare" extension not bound to any app
from flask_caching import Cache

cache = Cache()

# __init__.py (inside create_app, excerpt)
from flask import Flask
from .extensions import cache

def create_app(test_config=None):
    app = Flask(__name__, instance_relative_config=True)
    # ...load configuration...
    cache.init_app(app, config={
        "CACHE_TYPE": "RedisCache",
        "CACHE_REDIS_URL": "redis://localhost:6379/0",
    })
    return app
You can also pass it at creation as Cache(app), but in a factory configuration init_app is the only choice. CACHE_TYPE has several values by use.
| CACHE_TYPE | Backend | Use |
|---|---|---|
| NullCache (default) | Does nothing | Cache disabled. In effect until you explicitly enable a backend |
| SimpleCache | An in-process memory dict | Single-process dev/test. Not shared in production multi-worker |
| FileSystemCache | Local files | Single host. Disappears in containers |
| RedisCache | Redis | The standard for production multi-worker/multi-container |
| RedisSentinelCache | Redis Sentinel | Highly-available Redis (failover) |
| RedisClusterCache | Redis Cluster | Large-scale, sharding |
| MemcachedCache | Memcached | If you've already adopted Memcached |
⚠️ Don't use SimpleCache in production. SimpleCache is a process-local memory dict. Because Gunicorn stands up multiple workers (= multiple processes), a value cached by worker A isn't visible from worker B. And if a worker restarts, it's gone. This is exactly the same trap as Flask-Limiter's memory:// problem (§3.3). In production multi-worker/multi-container, always use a shared backend like Redis.
2.3 @cache.cached: cache per view
The simplest is @cache.cached, which caches a view function's entire response. The official minimal form is this.
@app.route("/")
@cache.cached(timeout=50)
def index():
    return render_template('index.html')
timeout=50 is a 50-second TTL (time to live). Second and subsequent requests to the same path return from Redis without executing the view.
Note the decorator order: @app.route is outermost, @cache.cached is innermost (closer to the view function). Reversing them won't function correctly.
The main parameters of @cache.cached are as follows.
| Parameter | Meaning |
|---|---|
| timeout | Cache lifetime in seconds (TTL) |
| key_prefix | The cache key's prefix. Default is 'view/%(request.path)s' (= request.path-based) |
| unless | Pass a callable, and when it returns True, bypass the cache (neither read nor write) |
| query_string | Set True to hash the order-normalized query parameters into the key |
query_string=True is quietly important. By default only request.path is the key, so /search?q=flask and /search?q=django share the same cache. Setting query_string=True makes a separate cache per query parameter, and moreover normalizes the parameter order (?a=1&b=2 and ?b=2&a=1) to treat them as the same, so the cache doesn't fragment needlessly.
from flask import jsonify, request

@app.route("/search")
@cache.cached(timeout=60, query_string=True)
def search():
    q = request.args.get("q", "")
    return jsonify(results=run_expensive_search(q))
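To show what "order-normalized" means for the key, here is a conceptual sketch using only the standard library. It is not Flask-Caching's actual implementation; cache_key, the key layout, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl

def cache_key(path: str, query_string: str) -> str:
    """Build a cache key that ignores the order of query parameters."""
    # Sort the (name, value) pairs so ?a=1&b=2 and ?b=2&a=1 normalize identically
    pairs = sorted(parse_qsl(query_string, keep_blank_values=True))
    digest = hashlib.sha256(repr(pairs).encode("utf-8")).hexdigest()
    return f"view/{path}/{digest}"

same_a = cache_key("/search", "a=1&b=2")
same_b = cache_key("/search", "b=2&a=1")
flask_key = cache_key("/search", "q=flask")
django_key = cache_key("/search", "q=django")
```

Here same_a and same_b are identical, while flask_key and django_key differ: reordering parameters hits the same entry, and different parameter values get separate entries.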
⚠️ Don't cache user-specific responses with @cache.cached. The default key is only request.path (+ query) and doesn't distinguish the logged-in user. Caching a "different per person" response like an authenticated user's dashboard with bare @cache.cached becomes a serious information leak where A's data is shown to B. To make it per-user, make key_prefix a callable and weave the user ID into the key, or simply decide not to cache at this layer. "Bypass if authenticated" with unless is also effective.
2.4 @cache.memoize: cache a function "per argument"
If you want to cache not a view but the return value of a heavy internal function, use @cache.memoize. The decisive difference from @cache.cached is that it caches separately per argument value.
@cache.memoize(50)
def get_product_summary(product_id: int) -> dict:
    # Heavy aggregate query. For the same product_id, no recomputation for 50 seconds
    return expensive_aggregate(product_id)
get_product_summary(1) and get_product_summary(2) become separate cache entries. Whereas @cache.cached looks at the request path, @cache.memoize looks at the function's arguments, so it works for the typical duplicate work of "doing the same computation with the same arguments over and over." Referencing master data, resolving config values, and transforming external-API responses are good examples.
| | @cache.cached | @cache.memoize |
|---|---|---|
| Cache unit | Request path (+ query) | The combination of function arguments |
| Main target | A view function's entire response | The return value of an internal helper function/method |
| User distinction | Doesn't by default (needs key_prefix) | Naturally separated if you include the user ID in the arguments |
2.5 Cache invalidation: this is what's truly hard
As the famous adage "the two hard problems in computer science are cache invalidation and naming" goes, the real difficulty of caching is when and how to discard stale values. There are broadly two strategies.
(a) TTL (time-based) auto-expiry—the simplest method, making it "disappear on its own after N seconds" with timeout. Best for read data whose staleness is business-acceptable. Its advantages: simple implementation, and accidents of missed invalidation rarely happen. The drawback is "even after updating, the stale value can appear for up to N seconds."
(b) Explicit invalidation (event-based)—the method of deleting the corresponding cache yourself at the timing of updating data. A @cache.memoized function can be invalidated by specifying the function and key arguments. cache.clear() clears the entire cache (run within the app context).
def update_product(product_id: int, payload: dict) -> None:
    db.session.execute(...)  # update the actual data
    db.session.commit()
    # Invalidate only the cache entry that corresponds to what was updated
    cache.delete_memoized(get_product_summary, product_id)

# Clear the entire cache at once (ops batch, after a deploy, etc.)
with app.app_context():
    cache.clear()
💡 How to choose an invalidation strategy: when in doubt, make TTL the default. Explicit invalidation breeds the hard-to-find bug of "forget one update site and you keep serving stale data." If "updates aren't frequent and a few tens of seconds of staleness is acceptable," operation runs on just a short TTL (e.g., 30–60 seconds). Only when "updates must be reflected instantly, and the update sites can be clearly identified" should you add explicit invalidation—this is the accident-resistant order. TTL and explicit invalidation can also be combined (a short TTL as insurance, with explicit deletion on update).
2.6 Cache stampede: the rush at the moment of expiry
TTL caches have an easily-overlooked pitfall. The very moment a popular key expires, the large number of requests waiting for it all at once become "cache misses," and everyone hits the same heavy query and rushes the DB—this is a cache stampede (thundering herd / dog-piling). A cache meant to lighten reads produces the worst load spike at the moment of expiry.
There are three directions for countermeasures.
- Stagger the TTL (jitter): add a random fluctuation to the TTL per entry to disperse the expiry timing. The simplest way to prevent simultaneous expiry.
- Pre-regeneration (cache warming): recompute in advance via a background job (Celery) before expiry so user requests don't hit a miss.
- Lock (single-flight): on expiry, only the first request recomputes and the others wait for it. The implementation gets complex, so limit it to genuinely heavy keys.
⚠️ Stampedes "surface the more you succeed." While traffic is small they don't surface, and the more access grows and the higher the cache hit rate, the sharper the spike at expiry. For the most important, highest-cost keys, build in a design where expiries don't come simultaneously (jitter or pre-regeneration) from the start—that's safe.
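Two of these countermeasures, jitter and single-flight, can be sketched with the standard library alone. This is an in-process illustration only: jittered_ttl and SingleFlightCache are hypothetical names, the 10% spread is an arbitrary choice, and the lock works across threads of one process. Across Gunicorn workers or containers the same idea would need a shared lock (for example in Redis), which this sketch does not implement.

```python
import random
import threading
import time

def jittered_ttl(base_seconds: int, spread: float = 0.1) -> int:
    """Return the base TTL +/- spread so entries do not all expire together."""
    delta = base_seconds * spread
    return max(1, round(base_seconds + random.uniform(-delta, delta)))

class SingleFlightCache:
    """On a miss, only one thread recomputes a key; the others wait and reuse it."""

    def __init__(self):
        self._values = {}   # key -> (expires_at, value)
        self._locks = {}    # key -> lock guarding recomputation of that key
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get_or_compute(self, key, ttl, compute):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refilled the key while we waited
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = compute()
            self._values[key] = (time.monotonic() + jittered_ttl(ttl), value)
            return value
```

With Flask-Caching, the jitter half is the cheap one to adopt: wherever you set a timeout explicitly, you can pass a jittered value instead of a constant.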
2.7 HTTP caching: save one more step "outside" the app
Whereas Flask-Caching is server-side (in-app) caching, HTTP caching is a complementary layer that tells clients, CDNs, and proxies "you don't need to refetch." Combining both reduces the very number of times a request reaches the app.
- Cache-Control: tell the client/CDN "no need to refetch for 300 seconds" with a value like public, max-age=300. Effective for near-static responses.
- ETag / Last-Modified and conditional requests: attach an ETag (a content hash, etc.) to the response, and the client next inquires with If-None-Match; if the content is unchanged, the server can return 304 Not Modified (no body). It saves bandwidth and serialization cost.
from flask import jsonify, make_response, request

@app.route("/api/catalog")
def catalog():
    data = get_catalog()  # heavy aggregation (may be further protected with Flask-Caching)
    resp = make_response(jsonify(data))
    resp.headers["Cache-Control"] = "public, max-age=300"
    resp.set_etag(compute_etag(data))  # content-based ETag
    return resp.make_conditional(request)  # returns 304 when If-None-Match matches
make_conditional(request) looks at If-None-Match / If-Modified-Since and, if they match, automatically converts to a 304. If you place a CDN (CloudFront, etc.) in front, emitting Cache-Control correctly from the origin (Flask) lets the CDN cache at the edge and further lower the app's load.
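The compute_etag helper called in the example is not defined in this article. One possible shape for it, assuming the data is JSON-serializable, is a hash over a canonical JSON encoding. The choice of SHA-256 and of sorted keys is my assumption, not something the article's setup prescribes.

```python
import hashlib
import json

def compute_etag(data) -> str:
    """Return a content-based ETag value for JSON-serializable data."""
    # sort_keys + fixed separators give the same bytes for the same content,
    # regardless of dict insertion order
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

etag_one = compute_etag({"id": 1, "name": "cedar"})
etag_two = compute_etag({"name": "cedar", "id": 1})
etag_changed = compute_etag({"id": 1, "name": "cypress"})
```

The canonical encoding matters: without it, two responses with the same content but different key order would get different ETags and the 304 path would never trigger.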
3. Flask-Limiter: protect, be fair, and curb costs with rate limiting
3.1 Why rate limiting is needed
Rate limiting is a mechanism that "sets an upper limit on the number of requests per unit time," effective for both performance and security. There are four motivations.
- Abuse / scraping countermeasure: prevent one client from hitting the API continuously and monopolizing resources.
- Cost control: protect usage-billed downstreams (external APIs, LLMs, DBs) from excessive calls. Rate limiting directly affects cost.
- Fairness: prevent some heavy clients from degrading the experience of all other users.
- Brute-force defense: cap brute-force attacks on login and OTP-verification endpoints by attempt count (auth is the Flask auth guide).
And rate limiting is also the first app layer of DoS/DDoS defense. Combining it with the defense-in-depth idea of placing a WAF (Web Application Firewall) in front of the app is the right line (the Flask security implementation guide).
3.2 Setup: default_limits and per-route
Use Flask-Limiter's setup in the official form as-is.
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(
    get_remote_address,
    app=app,
    default_limits=["200 per day", "50 per hour"],
    storage_uri="redis://localhost:6379",
)
In a factory configuration, bundle it with init_app (the practice of pillar §3).
# extensions.py
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address)

# inside create_app
limiter.init_app(app)
- default_limits: the limit applied by default to all routes. You can write it in human-readable strings like "200 per day", "50 per hour".
- per-route @limiter.limit: override/add a limit on individual routes.
- @limiter.exempt: exclude a specific route (such as health checks) from rate limiting.
@app.route("/api/login", methods=["POST"])
@limiter.limit("5 per minute")  # stricter, as a brute-force countermeasure
def login():
    ...

@app.route("/api/export")
@limiter.limit("1 per day")  # restrict heavy processing to once a day
def export():
    ...

@app.route("/health")
@limiter.exempt  # do not limit health checks
def health():
    return {"status": "ok"}
When the limit is exceeded, the view function is not called. Flask-Limiter emits 429 Too Many Requests.
3.3 The most important pitfall: don't use memory:// in production
Flask-Limiter's default/easy choice is memory storage (memory://), but the official docs warn clearly about production use.
"should be used with caution in any production setup since: Each application process will have it's own storage [and] The state of the rate limits will not persist beyond the process' life-time."
In other words: each application process keeps its own storage, and the state of the rate limits does not survive beyond the process's lifetime.
This is fatal because Gunicorn stands up multiple worker processes. With memory://, each worker has an independent counter, so:
- If there are 8 workers, a 50 per hour limit effectively lets through up to 50 × 8 = 400 per hour (a separate count per worker).
- If a worker restarts, the counter is reset.
- If the load balancer routes to a different worker, it counts anew with yet another counter.
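The 50 × 8 arithmetic can be checked with a toy simulation. This is not Flask-Limiter code: allowed_requests is a hypothetical function, and it assumes requests are spread round-robin over the workers and all arrive within the same hour.

```python
from collections import Counter

LIMIT = 50     # "50 per hour"
WORKERS = 8    # Gunicorn worker processes

def allowed_requests(total_requests: int, shared: bool) -> int:
    """Count how many requests pass the limit within one window."""
    counters = Counter()
    allowed = 0
    for i in range(total_requests):
        # Shared storage: one counter. In-memory storage: one counter per worker.
        slot = 0 if shared else i % WORKERS
        if counters[slot] < LIMIT:
            counters[slot] += 1
            allowed += 1
    return allowed

per_worker_total = allowed_requests(1000, shared=False)
shared_total = allowed_requests(1000, shared=True)
```

Out of 1000 requests from one client, the per-worker counters let 400 through, while a single shared counter lets exactly 50 through.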
In other words, with memory:// the rate limit "looks like it's working but isn't working at all." This is exactly the same structure of trap as Flask-Caching's SimpleCache (§2.2). In Gunicorn operations assuming multi-worker (the production-deployment guide), always specify a shared backend in storage_uri. The choices the official docs list are Redis (redis://host:port), Memcached, and MongoDB.
limiter = Limiter(
    get_remote_address,
    app=app,
    default_limits=["200 per day", "50 per hour"],
    storage_uri="redis://localhost:6379",  # shared by all workers / all containers
)
⚠️ "It worked locally so it works in production" doesn't hold. During development it's a single process (flask run), so the rate limit looks like it works even with memory://. But production Gunicorn is multi-worker, multi-container. Unless you make the storage shared Redis, the effective limit is multiplied by the number of workers and containers. Remember Flask-Caching's SimpleCache and Flask-Limiter's memory:// as twin traps that break the moment you forget multi-process.
3.4 The key function: who do you count as a unit
What decides what "1 unit" is is the key function. get_remote_address (per client IP) is the default-ish choice, but change it per requirement.
| Key function | Unit | Suited scene |
|---|---|---|
| get_remote_address | Client IP | Anonymous APIs, pre-login endpoints |
| A function returning a user ID | Authenticated user | Post-login APIs (fairer/more accurate than IP) |
| A function returning an API key | API key/tenant | B2B APIs. Can vary the limit per plan |
from flask_login import current_user
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

def user_or_ip():
    # Per user when authenticated, per IP otherwise
    if current_user.is_authenticated:
        return str(current_user.id)
    return get_remote_address()

limiter = Limiter(user_or_ip, app=app, storage_uri="redis://localhost:6379")
💡 In B2B SaaS, per-tenant/per-plan works. In a B2B like the lumber-distribution SaaS, rate-limiting per tenant (contracting company) or per API key, rather than per IP, gives fair control matched to the plan. You can express product design like "the free plan is 60 req/min, paid is 600 req/min" with a key function and a dynamic limit.
3.5 Behind a proxy, take the real IP with ProxyFix
When you cut keys with get_remote_address, there's a fatal premise. When Flask is behind a reverse proxy like ALB / nginx, request.remote_addr is the proxy's IP, not the client's. Do nothing, and all clients share one limit slot as the same "proxy IP," making rate limiting meaningless (when one person uses up the slot, everyone gets 429).
The solution is Werkzeug's ProxyFix. The official form is this.
from werkzeug.middleware.proxy_fix import ProxyFix

app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)  # number of proxy hops in front of the app
limiter = Limiter(get_remote_address, app=app)
x_for is the number of trusted proxy hops that append to X-Forwarded-For. Get this wrong and it becomes a security problem: a client can inject a fake X-Forwarded-For to bypass rate limiting. How to count the ProxyFix hops (1 for an ALB alone; for a two-hop setup such as API Gateway → ALB, measure and confirm the value, e.g. 2) is detailed in the production-deployment guide. IP-based rate limiting, audit logs, and access control all presuppose that the real client IP is obtained.
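Why the hop count is security-relevant becomes clearer with a small model of "take the Nth address from the right." This mirrors my understanding of how a trusted-hop count is applied to X-Forwarded-For; it is a sketch, not Werkzeug's code, and client_ip plus the sample addresses are made up.

```python
def client_ip(remote_addr: str, x_forwarded_for: str, trusted_hops: int) -> str:
    """Pick the client address from X-Forwarded-For given the number of trusted proxies."""
    values = [v.strip() for v in x_forwarded_for.split(",") if v.strip()]
    # Each trusted proxy appends one address on the right, so the real client
    # is the entry 'trusted_hops' positions from the end.
    if trusted_hops > 0 and len(values) >= trusted_hops:
        return values[-trusted_hops]
    return remote_addr

# The client sent a forged "X-Forwarded-For: 6.6.6.6"; the single ALB appended the real address.
header = "6.6.6.6, 203.0.113.7"
correct = client_ip("10.0.0.5", header, trusted_hops=1)
spoofed = client_ip("10.0.0.5", header, trusted_hops=2)
```

With the right hop count (1) the key is the real address 203.0.113.7. With one hop too many (2) the key becomes the forged 6.6.6.6, so an attacker can rotate that value and get a fresh rate-limit slot on every request.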
3.6 429 handling and Retry-After
Make the 429 raised on limit exceedance friendly to API clients. The standard is to tell the client "when they may retry" with the Retry-After header.
from flask import jsonify

@app.errorhandler(429)
def ratelimit_handler(e):
    resp = jsonify(error="rate limit exceeded", detail=str(e.description))
    resp.status_code = 429
    # Flask-Limiter can attach Retry-After to the over-limit response (config/headers)
    return resp
Flask-Limiter can attach standard rate-limit headers (remaining count, reset time, Retry-After) to the response. They are off by default and are turned on with headers_enabled=True on the Limiter (or RATELIMIT_HEADERS_ENABLED in the app config). Enabling these lets the client side back off autonomously, breaking the vicious cycle of mass-producing 429s with wasteful retries. The design of unifying error responses as JSON is aligned in the error-handling / observability guide.
4. Database: the real bottleneck is almost always here
Do measurement (§1) honestly, and the identity of slowness usually arrives at the DB. Let me crush it in four points. Detailed ORM design is in the Flask-SQLAlchemy practical guide, and a deep dive on connection pooling is in the PostgreSQL connection-pooling guide.
4.1 The N+1 problem: detection and eager loading
The N+1 problem is the ORM's biggest performance bug. After taking a list (1 query), you pull each row's related data one at a time inside a loop, and 1 + N queries fly. For a 100-item list, just pulling one relation is 101 queries—this is the typical identity of "the list is slow."
# ❌ N+1: fetch orders once, then fetch each order.customer inside the loop
orders = db.session.execute(db.select(Order)).scalars().all()
for order in orders:
    print(order.customer.name)  # ← a SELECT is issued here for every order
Detection is most reliable via the SQL log of §1.3. If dozens of similar queries line up in one request, suspect N+1. The solution is eager loading: fetch the relations together in the first 1–2 queries.
from sqlalchemy.orm import selectinload

# ✅ selectinload: prefetch customer in one additional query (2 queries total)
orders = db.session.execute(
    db.select(Order).options(selectinload(Order.customer))
).scalars().all()
for order in orders:
    print(order.customer.name)  # no additional query is issued
The choice between selectinload (batch-fetch with an IN clause) and joinedload (one-query-ify with a JOIN), and their suitability for collection/scalar relations, is ORM-side knowledge. For details, see the Flask-SQLAlchemy practical guide. Also always confirm whether indexes are placed on the columns used in WHERE or JOIN. A full scan from a missing index can't be saved even by eager loading.
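The query-count difference can be reproduced without an ORM. The sketch below uses the standard-library sqlite3 module with made-up tables and a hypothetical run() wrapper that counts queries. The batched version imitates the IN-clause idea behind selectinload; it is not SQLAlchemy's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(1, 11)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, (i % 10) + 1) for i in range(1, 101)])

query_stats = {"count": 0}

def run(sql, params=()):
    """Execute one query and count it."""
    query_stats["count"] += 1
    return conn.execute(sql, params).fetchall()

def names_n_plus_one():
    orders = run("SELECT id, customer_id FROM orders ORDER BY id")
    # One extra SELECT per order: 1 + N queries in total
    return [run("SELECT name FROM customers WHERE id = ?", (cid,))[0][0]
            for _, cid in orders]

def names_batched():
    orders = run("SELECT id, customer_id FROM orders ORDER BY id")
    ids = sorted({cid for _, cid in orders})
    placeholders = ",".join("?" for _ in ids)
    # One additional query fetches every needed customer at once
    rows = run(f"SELECT id, name FROM customers WHERE id IN ({placeholders})", ids)
    by_id = dict(rows)  # id -> name
    return [by_id[cid] for _, cid in orders]
```

For the 100 orders here, names_n_plus_one issues 101 queries and names_batched issues 2, with identical results.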
4.2 The connection pool: the multiplication with worker count has accidents
A DB connection is an "expensive to create" resource, so SQLAlchemy reuses them with a connection pool. The problem is that each Gunicorn worker is an independent process, and the pool is also separate per worker.
Here a multiplication accident occurs.
max connections ≈ pool_size × number of workers × number of tasks (containers)
Example: pool_size=5 × 8 workers × 4 tasks = 160 connections. It easily exceeds PostgreSQL's default max_connections (typically 100), and new connections are rejected with an error such as FATAL: sorry, too many clients already. This is an accident that happens frequently in production.
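# SQLAlchemy pool settings (via Flask-SQLAlchemy)
app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {
    "pool_size": 5,         # connections kept open at all times
    "max_overflow": 5,      # extra connections temporarily allowed beyond pool_size
    "pool_pre_ping": True,  # liveness check before using a connection (avoid grabbing a dead one)
    "pool_recycle": 1800,   # recycle after 30 minutes (guards against DB/proxy idle disconnects)
}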
- pool_pre_ping=True: send a light ping before using a connection taken from the pool, preventing the error of grabbing a connection that was severed on the DB/proxy side. A near-mandatory setting.
- pool_recycle: recreate connections after a set number of seconds, avoiding connections silently severed by the DB's or proxy's idle timeout.
⚠️ Design pool size together with worker count. If you decide looking only at pool_size, the moment you multiply by worker count and container count you break through max_connections. Before deploying, always compute the maximum including overflow, (pool_size + max_overflow) × number of workers × number of tasks, and confirm it fits within the DB's max_connections. This calculation is also in the production-deployment guide checklist.
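The pre-deploy check is small enough to keep as a script. A minimal sketch, where max_db_connections is a hypothetical helper and the inputs are the example values used in this section:

```python
def max_db_connections(pool_size: int, max_overflow: int, workers: int, tasks: int) -> int:
    """Worst-case connections opened against the database by the whole fleet."""
    # Each worker process owns its own pool, which can grow to pool_size + max_overflow
    return (pool_size + max_overflow) * workers * tasks

steady = max_db_connections(pool_size=5, max_overflow=0, workers=8, tasks=4)
peak = max_db_connections(pool_size=5, max_overflow=5, workers=8, tasks=4)
PG_MAX_CONNECTIONS = 100  # assumed value; check your own server's setting
fits = peak <= PG_MAX_CONNECTIONS
```

With these inputs, steady is 160 and peak is 320, so both already exceed a max_connections of 100. Leave headroom as well for migrations, admin sessions, and other clients of the same database.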
4.3 PgBouncer: absorb the connection explosion of many workers / serverless
As worker count × container count grows, the §4.2 multiplication inevitably balloons. Furthermore, in serverless like Lambda, connections open instantaneously by the number of concurrent executions and exhaust PostgreSQL in one strike. What absorbs this "connection explosion" is an external connection pooler like PgBouncer.
Inserting PgBouncer between the app and PostgreSQL:
- The app (many workers/Lambdas) may open a large number of connections to PgBouncer.
- PgBouncer multiplexes them onto a small number of real connections to bridge to PostgreSQL.
This lets you handle high concurrency without eating up max_connections. How to choose PgBouncer's pool mode (session / transaction / statement) and serverless caveats (compatibility with prepared statements, etc.) are consolidated in the PostgreSQL connection-pooling guide. If you use many workers × many containers, or serverless, PgBouncer is effectively a mandatory component.
4.4 Pagination: stop loading everything
"The list API loads the entire table into memory before returning it"—this is a typical performance bomb that quietly bloats. While the data is 100 rows it's fine, but the moment it becomes 100,000 rows, memory and the DB scream. Paginate from the start and take only what you need. Flask-SQLAlchemy has db.paginate.
from flask import jsonify, request

@app.route("/api/orders")
def list_orders():
    page = request.args.get("page", 1, type=int)
    # Fetch page by page instead of loading everything (LIMIT/OFFSET is applied automatically)
    pagination = db.paginate(
        db.select(Order).order_by(Order.created_at.desc()),
        page=page, per_page=50, error_out=False,
    )
    return jsonify(
        items=[o.to_dict() for o in pagination.items],
        page=pagination.page, pages=pagination.pages, total=pagination.total,
    )
For large tables, OFFSET gets slow on deep pages, so put cursor (keyset) pagination in your field of view as an evolution, but first, just "eradicating full loads and capping with db.paginate" makes most of the problem disappear. For details, see the Flask-SQLAlchemy practical guide.
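For reference, the cursor (keyset) idea looks roughly like this. The sketch uses the standard-library sqlite3 module with a made-up orders table; fetch_page and the (created_at, id) cursor shape are my assumptions, and the row-value comparison in the WHERE clause needs a reasonably recent SQLite (PostgreSQL supports the same syntax).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders (id, created_at) VALUES (?, ?)",
    [(i, f"2024-01-{(i % 28) + 1:02d}") for i in range(1, 201)],
)

def fetch_page(cursor=None, per_page=50):
    """Return (rows, next_cursor); cursor is the (created_at, id) of the last row seen."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at FROM orders "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (per_page,),
        ).fetchall()
    else:
        # Continue strictly after the last row of the previous page: no OFFSET scan
        rows = conn.execute(
            "SELECT id, created_at FROM orders "
            "WHERE (created_at, id) < (?, ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (cursor[0], cursor[1], per_page),
        ).fetchall()
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == per_page else None
    return rows, next_cursor
```

The id column is included in the ordering as a tie-breaker, because created_at alone is not unique and rows sharing a timestamp would otherwise be skipped or repeated at page boundaries. An index on (created_at, id) is what makes the WHERE clause cheap on a real table.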
5. The response layer: trim transfer volume and serialization
After solidifying the DB and cache, the "last push" is optimizing the response itself. The effect isn't as large as the DB, but it pays off at low cost.
5.1 Compression (gzip / brotli)
Text (JSON / HTML) responses can have their transfer volume greatly trimmed by compression. Compression is first choice to do at the reverse proxy (nginx), CDN (CloudFront), or ALB. It avoids using the app's CPU. Only in configurations where the front layer can't compress (no proxy, or special requirements) should you enable app-side gzip/brotli with Flask-Compress.
# Fallback for when the front proxy cannot compress
from flask_compress import Compress

Compress(app)  # compresses automatically with gzip/brotli according to Accept-Encoding
💡 Push compression to the infra layer. Flask-Compress consumes the app's CPU. If ALB/CloudFront/nginx is in front, leave compression to them, and Flask using its CPU for its actual work handles more requests with the same worker count. Separating "what the app can do" from "what the infra should do" is cost-efficient design.
5.2 The cost of JSON serialization
For large responses, JSON serialization itself becomes a non-negligible CPU cost. The countermeasures are plain.
- Don't return unnecessary fields: narrow the API schema to "columns actually used." Trim them with a serializer's (marshmallow's, etc.) only/exclude (the marshmallow + Flask-SQLAlchemy guide).
- Eliminate N+1 (§4.1): pulling relations during serialization makes serialization and N+1 compound into slowness. Take everything in advance with eager loading.
- Pagination (§4.4): cap one response's size in the first place.
5.3 The "limited" effect of streaming and async
A huge response (a large CSV export, etc.) eats memory if you build it all in memory before returning it, and TTFB (time to first byte) also worsens. Flask can make a streaming response by returning a generator, flowing rows while generating them.
from flask import Response

@app.route("/api/export.csv")
def export_csv():
    def generate():
        yield "id,name\n"
        for row in iter_rows():  # stream rows while fetching them from the DB a few at a time
            yield f"{row.id},{row.name}\n"
    return Response(generate(), mimetype="text/csv")
And as for async def views, as touched on in §1.1, they are not a throughput-improvement measure. Each request still occupies one worker and the number of simultaneous processings doesn't increase. async takes effect only when "parallelizing multiple independent external IOs within one view to shrink that view's latency." If you want the whole app to be async-premised / to handle massive concurrent connections, consider a different path: ASGI (Quart, etc.) or gevent workers. This line is detailed in production-deployment guide §7 and the pillar.
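The one case where async pays off, several independent IO waits inside a single view, can be sketched with asyncio alone. The fetch coroutine and its delays are stand-ins for external API calls, and the names are hypothetical. In Flask the body of dashboard() would live in an async def view, which requires installing Flask with the async extra.

```python
import asyncio
import time

async def fetch(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for an external API call
    return name

async def dashboard() -> list:
    # The three waits overlap, so the total is close to the slowest one, not the sum
    return await asyncio.gather(
        fetch("profile", 0.2),
        fetch("orders", 0.2),
        fetch("stock", 0.2),
    )

start = time.perf_counter()
results = asyncio.run(dashboard())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run sequentially, the three calls would take about 0.6 seconds; gathered, they take about 0.2. The worker is still occupied for that whole time, which is why this shortens one view's latency without raising the number of requests handled concurrently.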
6. Cost efficiency: optimization affects the bill directly
The optimizations so far affect not only latency but infrastructure cost directly. This is a theme I argue consistently across the whole site—balancing "fast, cheap, safe" with one person × generative AI. What supports that "cheap" is this chapter's thinking.
The causality of optimization → cost reduction is clear.
| Measure | Performance effect | Cost effect |
|---|---|---|
| Caching (§2) | Reduces the number of DB/external-API calls | DB load and usage-billed API costs drop |
| Rate limiting (§3) | Caps abuse/runaway | Protects usage-billed downstreams from excessive calls |
| N+1 elimination / indexes (§4) | Shortens one request's DB time | Worker occupancy shrinks, so you handle more with the same worker count |
| Pool/PgBouncer (§4) | Avoids connection exhaustion | A smaller DB instance suffices |
| Compression (§5) | Reduces transfer volume | Data transfer charges (egress) drop |
The core is the following chain: one request's processing time shrinks, so worker occupancy shrinks, so the same number of containers handles more requests, so fewer or smaller containers suffice. Because Flask is synchronous and 1 worker = 1 request (§1.1), shortening processing time reduces the number of containers needed, which in turn pushes down ECS/Fargate billing (vCPU and memory × runtime).
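The chain can be put into rough arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a sizing formula from any official documentation: it assumes synchronous workers (1 worker = 1 request), uses Little's law (busy workers = arrival rate × average processing time), and leaves headroom through a target utilization. The function name and all input values are hypothetical and for illustration only.

```python
import math


def containers_needed(
    requests_per_second: float,
    avg_seconds_per_request: float,
    workers_per_container: int,
    target_utilization: float = 0.7,
) -> int:
    """Estimate the container count for synchronous workers (1 worker = 1 request)."""
    # Average number of workers busy at any moment (Little's law).
    busy_workers = requests_per_second * avg_seconds_per_request
    # Keep headroom so that bursts do not start queueing immediately.
    workers_required = busy_workers / target_utilization
    return math.ceil(workers_required / workers_per_container)


if __name__ == "__main__":
    # Illustrative inputs: 100 req/s, 4 workers per container.
    before = containers_needed(100, 0.20, workers_per_container=4)
    after = containers_needed(100, 0.10, workers_per_container=4)
    print(f"200 ms per request: {before} containers")
    print(f"100 ms per request: {after} containers")
```

With these illustrative inputs, halving the average processing time from 200 ms to 100 ms takes the estimate from 8 containers to 4, which is the "lighter request means fewer containers" relationship in numbers.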
💡 From my own experience: when operating the lumber-distribution SaaS on ECS (Fargate), what moved cost was how well I avoided hitting the DB through caching, lightened each request by eliminating N+1, and lined up workers with neither excess nor shortage. In a read-dominant business SaaS, just caching master data and lightening list endpoints with eager loading + pagination visibly changes the number of tasks needed (= the monthly bill). Optimization is a story about "speed" and, unmistakably, a story about "money." Scale horizontally (more tasks), but only after lightening each request; this order decides cost efficiency.
7. Flask performance / cost optimization checklist
Finally, let me fold this article's flow into a practical checklist in the order "measure → DB → cache → rate limit → workers → compression." Items nearer the top have the larger effect, so work through them top-down.
| Stage | Check item | Rationale (section in the body) |
|---|---|---|
| Measure | p95/p99 are captured via a duration log (`before_request` / `after_request`) | §1.2 / §1.3 |
| Measure | N+1 is detected from the per-request query count in the SQL log | §1.3 / §4.1 |
| Measure | Work starts from a bottleneck identified by measurement, not by guesswork | §1.3 |
| DB | N+1 is resolved with eager loading (`selectinload` / `joinedload`) | §4.1 |
| DB | WHERE/JOIN columns are indexed; no full scans | §4.1 |
| DB | `pool_size` × number of workers × number of tasks stays within `max_connections` | §4.2 |
| DB | `pool_pre_ping=True` and `pool_recycle` are set | §4.2 |
| DB | PgBouncer is in place for many workers / serverless | §4.3 |
| DB | List APIs do not load everything; they are capped with `db.paginate` | §4.4 |
| Cache | Only read-heavy, high-cost, staleness-tolerant data is cached | §2.1 |
| Cache | `CACHE_TYPE` is `RedisCache` (`SimpleCache` is avoided in production) | §2.2 |
| Cache | `@cache.cached` is used for views, `@cache.memoize` for functions | §2.3 / §2.4 |
| Cache | User-specific responses are not served through a bare `@cache.cached` | §2.3 |
| Cache | An invalidation strategy is decided (TTL by default, plus explicit deletion if needed) | §2.5 |
| Cache | Stampedes on high-load keys are considered (jitter / pre-regeneration) | §2.6 |
| Cache | HTTP caching (Cache-Control / ETag / 304) is added as a complement | §2.7 |
| Rate limit | `storage_uri` points at shared Redis (`memory://` is avoided in production) | §3.3 |
| Rate limit | `default_limits` is set, plus `@limiter.limit` on important routes | §3.2 |
| Rate limit | Brute-force targets such as login are tightened | §3.1 / §3.2 |
| Rate limit | `ProxyFix` is in place behind a proxy so that the real client IP is used | §3.5 |
| Rate limit | 429 responses are returned helpfully, as JSON with `Retry-After` | §3.6 |
| Workers | Worker count is tuned by measurement, starting from CPU × 2 | §1.1 (→ deployment) |
| Workers | Horizontal scaling (more tasks) comes after lightening each request | §6 |
| Compression | gzip/brotli is pushed to the infra layer (ALB/nginx/CDN) | §5.1 |
| Transfer | Unnecessary fields are not returned; size is capped with pagination | §5.2 |
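The connection-budget row of the checklist is simple arithmetic, so it can be checked mechanically, for example in a deployment script. The sketch below is one possible way to do that under stated assumptions: each worker is a separate process holding its own SQLAlchemy pool, `max_overflow` is included as a peak figure on top of the article's `pool_size` × workers × tasks product, and `reserved` stands for connections kept aside for migrations or admin sessions. `ConnectionBudget`, `connection_budget`, and every number used are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionBudget:
    steady: int     # pool_size × workers × tasks
    peak: int       # (pool_size + max_overflow) × workers × tasks
    available: int  # max_connections minus reserved connections

    @property
    def fits_steady(self) -> bool:
        return self.steady <= self.available

    @property
    def fits_peak(self) -> bool:
        return self.peak <= self.available


def connection_budget(
    pool_size: int,
    max_overflow: int,
    workers_per_task: int,
    tasks: int,
    max_connections: int,
    reserved: int = 0,
) -> ConnectionBudget:
    # Assumption: one pool per worker process, so pools multiply with processes.
    processes = workers_per_task * tasks
    return ConnectionBudget(
        steady=pool_size * processes,
        peak=(pool_size + max_overflow) * processes,
        available=max_connections - reserved,
    )


if __name__ == "__main__":
    budget = connection_budget(
        pool_size=5, max_overflow=10, workers_per_task=4, tasks=3,
        max_connections=100, reserved=5,
    )
    print(budget, budget.fits_steady, budget.fits_peak)
```

In this illustrative configuration the steady-state product (60) fits within the 95 available connections, but the peak with overflow (180) does not, which is the kind of gap that either a smaller `max_overflow` or PgBouncer would have to close.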
Summary: speed is decided by the "order of measurement," and that directly becomes cost
Flask performance optimization is not a flashy technique but the discipline of a plain order. Let me summarize this article's points one phrase at a time.
- Move only after measuring. Identify p95/p99 and the bottleneck with a duration log and the SQL log. Don't micro-optimize on guesses. Flask is synchronous and 1 worker = 1 request, so the essence of speed lies in worker design and DB wait time.
- Caching (Flask-Caching / Redis) brings read-heavy, high-cost reads to memory speed. Use `@cache.cached` (views) and `@cache.memoize` (functions, per argument) appropriately, and don't use `SimpleCache` in production. The truly hard part is invalidation: make TTL the default, add explicit deletion if needed, and stay aware of stampedes.
- Rate limiting (Flask-Limiter) controls abuse, cost, fairness, and brute force. `memory://` lets requests slip through under multiple workers, the same per-process trap as `SimpleCache`, so always point `storage_uri` at shared Redis. Behind a proxy, take the real IP with `ProxyFix`.
- The database is the real bottleneck. Crush N+1 with eager loading, place indexes, keep `pool_size` × workers × tasks within `max_connections`, insert PgBouncer for many workers / serverless, and replace full loads with `db.paginate`.
- The response layer is the last push. Push compression to the infra layer, trim unnecessary fields, and stream huge responses. `async` is not a throughput measure (each request still occupies one worker).
- All of this turns directly into cost. When one request gets lighter, the same workers handle more and fewer or smaller containers suffice. Optimization is a story about "speed" and at the same time a story about "money."
I was able to operate the Minister of Economy, Trade and Industry Award-winning B2B SaaS backend on ECS (Fargate) fast and cheaply because I kept to this order: "measure → DB → cache → rate limit → workers → compression." The whole picture is in the Flask production-operations guide (pillar), the deployment-side design is in the production-deployment guide, and the deep dives on the ORM and connections are in the Flask-SQLAlchemy practical guide and the PostgreSQL connection-pooling guide. Speed is not talent but the discipline of keeping the order of measurement.