# Flask Performance Optimization in Practice: Caching with Flask-Caching (Redis), Rate Limiting with Flask-Limiter, and N+1 and Connection Pools

> An implementation guide to Flask performance optimization and cost reduction. It covers measurement first (p95/p99), Redis caching with Flask-Caching 2.4.0 (@cache.cached/@cache.memoize, invalidation), rate limiting with Flask-Limiter and the trap of not using shared Redis storage, and N+1 queries, connection pools, and PgBouncer, all with working code that follows the official documentation.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: Python, Flask, Performance, Redis, Rate limiting, Production operations, Backend, Cost optimization
- URL: https://tomodahinata.com/en/blog/flask-performance-caching-rate-limiting-flask-caching-limiter-guide
- Category: Flask in production
- Pillar guide: https://tomodahinata.com/en/blog/flask-production-guide

## Key points

- The iron rule of optimization is "measure first." In Flask, the biggest wins usually come from the DB (N+1, missing indexes, pool exhaustion) and caching, not from micro-optimization. Flask is synchronous, so one worker handles one request at a time, and throughput is decided by worker design and DB wait time
- Use RedisCache with Flask-Caching 2.4.0, choosing between @cache.cached (views) and @cache.memoize (functions, per argument) as appropriate. The hard part is invalidation: design for either TTL or explicit cache.delete/clear, and watch out for stampedes (simultaneous expiry)
- Flask-Limiter 4.1.1's memory:// must not be used in production. Gunicorn runs one process per worker, so memory storage isn't shared across workers and the limit breaks down. Specify shared Redis in storage_uri, and behind a proxy obtain the real client IP with ProxyFix
- Eliminate N+1 queries with eager loading, and size pools so that pool_size × number of workers × number of tasks doesn't exceed max_connections. With many workers or serverless, put PgBouncer in between. Replace full-table loads with db.paginate
- Caching plus appropriately sized workers and pools lets you serve the same load with fewer, smaller containers, which reduces cost. The guide draws on experience operating the Flask/SQLAlchemy/PostgreSQL backend of a Minister of Economy, Trade and Industry Award-winning B2B SaaS on ECS (Fargate)

---

## **Introduction: performance optimization begins not with "making it fast" but with "measuring"**

When a Flask app is called "slow," the most common failure is **changing code without measuring.** Rewriting a list comprehension to `map` or changing string concatenation to `join` is the kind of micro-optimization that is noise in most web apps. What decides perceived speed in a production Flask app is, almost without exception, **the database (N+1, missing indexes, connection-pool exhaustion)**, **the presence or absence of caching**, and **worker design.** Polishing small tricks while missing these won't move p95 at all.

This article is a spoke of the [Flask production-operations guide (pillar)](/blog/flask-production-guide) that takes the performance area down to production quality. It covers only optimization design that is **faithful to the latest official documentation, assuming the Flask 3.1 line (the current stable version).** Specifically, it runs through, in real code: (1) the measurement mindset, (2) caching with Flask-Caching (Redis), (3) rate limiting with Flask-Limiter, (4) DB performance (N+1, pools, pagination), (5) the response layer (gzip, async), and (6) the cost efficiency that follows directly from these.

I have **designed and implemented the backend of a Minister of Economy, Trade and Industry Award-winning B2B SaaS in Python / Flask / SQLAlchemy / PostgreSQL, and operated 221 endpoints in production on API Gateway → ALB → ECS (Fargate).** To balance "speed" and "cost" in a read-dominant business SaaS, what was needed wasn't flashy algorithms but keeping the plain order **measure → DB → cache → rate limit → workers.** This article traces that order as-is.

> 💡 **Versions covered in this article**: assuming the **Flask 3.1 line**, I use **Flask-Caching 2.4.0** and **Flask-Limiter 4.1.1**. Both are extensions installed separately, not dependencies of Flask (`pip install Flask-Caching==2.4.0 Flask-Limiter==4.1.1 redis`). Redis is assumed for storage. All numbers (latency, connection counts) appearing here are **examples for explanation**, not benchmark values for a specific environment.

---

## **1. Measurement first: know where it's slow before anything else**

### 1.1 Internalize the premise that Flask is synchronous

Before optimizing, grasp Flask's execution model precisely. Flask is a **WSGI (synchronous) framework**, and each request **occupies one Gunicorn worker until it finishes.** The official docs say the same about `async def` views: each request still ties up one worker, so making a view async *doesn't change the number of requests that can be handled simultaneously*.

From here, you see what decides Flask's throughput.

- **Requests handled simultaneously ≒ number of workers** (for Gunicorn's sync workers).
- The longer one request's processing time (= waiting on the DB or an external API), the longer its worker is occupied and the longer other requests wait.
- So **"shrinking one request's processing time" (caching, DB optimization)** and **"lining up workers appropriately" (deployment design)** are two sides of the same coin.
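
The relationship in the list above can be sketched as simple arithmetic. This is a rough model under stated assumptions (sync workers, no queueing effects, illustrative numbers only), and the function name `max_throughput_rps` is hypothetical:

```python
def max_throughput_rps(workers: int, avg_request_ms: float) -> float:
    """Upper bound on requests/second when each request occupies one sync worker."""
    if workers <= 0 or avg_request_ms <= 0:
        raise ValueError("workers and avg_request_ms must be positive")
    # Each worker can complete 1000 / avg_request_ms requests per second.
    return workers * 1000 / avg_request_ms


# Illustrative numbers, not benchmarks:
slow = max_throughput_rps(workers=8, avg_request_ms=200)  # e.g. a DB-bound view
fast = max_throughput_rps(workers=8, avg_request_ms=50)   # e.g. the same view served from cache
```

With the same eight workers, cutting the time one request holds a worker from 200 ms to 50 ms raises the ceiling from 40 to 160 requests per second in this model, which is why shortening requests and sizing workers belong together.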

Worker-count design (starting from `CPU × 2`, sync or gevent, the trade-off with memory) is consolidated in the [production-deployment guide](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide). This article handles the "make one request fast / reject wasteful requests" side. Both take effect together.

### 1.2 What to measure: not the average but p95 / p99

**Don't judge latency by its average.** An average blends fast and slow requests into a single number, so it hides the state of "most users are fast, but some are fatally slow." What you should look at are percentiles.

| Metric | Meaning | Why it matters |
|---|---|---|
| **p50 (median)** | Half the requests are faster than this | "The typical experience" |
| **p95** | 95% are faster than this (the slower 5%) | A realistic target line for an SLO |
| **p99** | 99% are faster than this (the slowest 1%) | Tail latency. When this clogs, the clog cascades |

An endpoint with a bad p99 **occupies its worker for a long time.** In synchronous Flask, one slow endpoint ripples into the wait time of all other requests, so crushing "the slowest part" pays off across overall throughput.
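
To make the difference between the average and the tail concrete, here is a minimal sketch using only the standard library. The latency samples are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition (monitoring tools may interpolate differently):

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 very slow ones.
latencies_ms = [20.0] * 97 + [2000.0] * 3

mean_ms = statistics.mean(latencies_ms)   # blends the tail into one number
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

In this made-up sample the mean is about 79 ms and both p50 and p95 are 20 ms, yet p99 is 2000 ms. Only the tail percentile exposes the requests that hold a worker for two seconds.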

### 1.3 How to measure: three granularities

Measurement is enough in three stages, from coarse to fine.

**(a) Per-endpoint duration log**—the first step. Record each request's duration with `before_request` / `after_request`.

```python
import time

from flask import g, request, current_app


@app.before_request
def _start_timer():
    g._t0 = time.perf_counter()


@app.after_request
def _log_latency(response):
    elapsed_ms = (time.perf_counter() - g._t0) * 1000
    # Attach to the structured log (see the observability guide for adding request IDs)
    current_app.logger.info(
        "request_latency",
        extra={"path": request.path, "method": request.method,
               "status": response.status_code, "elapsed_ms": round(elapsed_ms, 1)},
    )
    return response
```

Aggregating this log tells you "which path is pulling p95/p99." Structuring logs and correlating request IDs are left to the [error-handling / observability guide](/blog/flask-error-handling-logging-observability-guide). You can also emit duration in Gunicorn's access log.

**(b) Visualizing SQL queries**—the majority of slowness is the DB. Turning on SQLAlchemy's engine log only during development reveals how many queries fly per request (N+1 surfaces here). For details, see §4 and the [Flask-SQLAlchemy practical guide](/blog/flask-sqlalchemy-flask-migrate-database-production-guide).

**(c) Identifying hotspots with a profiler**—only when a CPU-bound process is genuinely suspect, identify hotspots inside the view with `cProfile` or Werkzeug's `ProfilerMiddleware`. But **the order is to suspect the DB and cache before you get here.**

> ⚠️ **Don't optimize on guesses.** Rewriting on "this is probably slow" is a gamble that only increases technical debt. Knuth's "premature optimization is the root of all evil" is still true. **Always identify the bottleneck by measurement and confirm the effect by measurement**—prepare the mechanism that lets you spin this loop (duration log, SQL log) *before* you start optimizing. This is the highest-ROI move.

> 💡 **My experience**: when I got a report that "the list screen is slow" in the lumber-distribution SaaS, the first thing I looked at was the SQL log. The cause wasn't the algorithm but an N+1 pulling line items one at a time (§4.1) and reading master data from the DB every time (the cache target in §2). Had I jumped to "Python is slow" without measuring, it would never have been fixed.

---

## **2. Flask-Caching: bring reads to memory speed**

### 2.1 What to cache (and what not to)

Caching is no panacea. **Caching the wrong target becomes a failure that keeps serving stale data.** Data suited to caching satisfies the following three conditions.

1. **Read-heavy**: the same thing is requested over and over.
2. **Expensive to produce**: heavy DB aggregations, external API calls, complex rendering.
3. **Tolerant of staleness**: being tens of seconds to minutes old causes no business problem.

Conversely, **data that must be instantly accurate right after a write** (payment balances, confirmed inventory) must not be casually cached. And cases of "offloading slow processing asynchronously"—such as heavy report generation or sending mail—are the domain of **background jobs**, not caching. **Cache is for read-heavy, Celery is for write/slow work**—divide them (the [Flask + Celery + Redis background-tasks guide](/blog/flask-celery-redis-background-tasks-production-guide)).

### 2.2 Setup: bundle RedisCache in the factory

Since Flask-Caching is an extension, follow the `init_app` pattern of pillar §3. Create `Cache()` bare and bind it with `init_app`.

```python
# extensions.py: a "bare" extension that is not bound to any app yet
from flask_caching import Cache

cache = Cache()
```

```python
# __init__.py (excerpt from inside create_app)
from flask import Flask

from .extensions import cache


def create_app(test_config=None):
    app = Flask(__name__, instance_relative_config=True)
    # ...load configuration...
    cache.init_app(app, config={
        "CACHE_TYPE": "RedisCache",
        "CACHE_REDIS_URL": "redis://localhost:6379/0",
    })
    return app
```

You can also pass the app at creation as `Cache(app)`, but in a factory configuration `init_app` is the pattern to use. `CACHE_TYPE` takes several values depending on the use case.

| `CACHE_TYPE` | Backend | Use |
|---|---|---|
| `NullCache` (**default**) | Does nothing | Cache disabled. In effect until you explicitly configure a backend |
| `SimpleCache` | An in-process memory dict | Single-process dev/test. **Not shared in production multi-worker** |
| `FileSystemCache` | Local files | Single host. Disappears in containers |
| `RedisCache` | Redis | **The standard for production multi-worker/multi-container** |
| `RedisSentinelCache` | Redis Sentinel | Highly-available Redis (failover) |
| `RedisClusterCache` | Redis Cluster | Large-scale, sharding |
| `MemcachedCache` | Memcached | If you've already adopted Memcached |

> ⚠️ **Don't use `SimpleCache` in production.** `SimpleCache` is a **process-local memory dict.** Because Gunicorn stands up multiple workers (= multiple processes), a value cached by worker A isn't visible from worker B. And if a worker restarts, it's gone. This is **exactly the same trap as Flask-Limiter's `memory://` problem (§3.3).** In production multi-worker/multi-container, always use a **shared backend** like Redis.

### 2.3 `@cache.cached`: cache per view

The simplest is `@cache.cached`, which caches a view function's entire response. The official minimal form is this.

```python
@app.route("/")
@cache.cached(timeout=50)
def index():
    return render_template('index.html')
```

`timeout=50` is a **50-second** TTL (time to live). Second and subsequent requests to the same path return from Redis without executing the view.

Note the decorator order: `@app.route` is **outermost**, `@cache.cached` is **innermost** (closer to the view function). If you reverse them, the route registers the uncached function and the cache never takes effect.

The main parameters of `@cache.cached` are as follows.

| Parameter | Meaning |
|---|---|
| `timeout` | Cache lifetime in seconds (TTL) |
| `key_prefix` | The cache key's prefix. **Default is `'view/%(request.path)s'`** (= `request.path`-based) |
| `unless` | Pass a callable, and when it returns `True`, **bypass** the cache (neither read nor write) |
| `query_string` | Set `True` to **hash the order-normalized query parameters** into the key |

`query_string=True` is quietly important. By default only `request.path` is the key, so `/search?q=flask` and `/search?q=django` **share the same cache.** Setting `query_string=True` makes a separate cache per query parameter, and moreover **normalizes the parameter order (`?a=1&b=2` and `?b=2&a=1`)** to treat them as the same, so the cache doesn't fragment needlessly.

```python
@app.route("/search")
@cache.cached(timeout=60, query_string=True)
def search():
    q = request.args.get("q", "")
    return jsonify(results=run_expensive_search(q))
```
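
The idea behind order-normalized query keys can be sketched with the standard library. This is a conceptual illustration, not Flask-Caching's actual key-building code; the function name `query_cache_key` and the key format are assumptions:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode


def query_cache_key(path: str, query_string: str) -> str:
    """Build a cache key from the path plus an order-normalized query string."""
    # Sort the (name, value) pairs so parameter order does not change the key.
    pairs = sorted(parse_qsl(query_string, keep_blank_values=True))
    digest = hashlib.sha256(urlencode(pairs).encode("utf-8")).hexdigest()
    return f"view/{path}?{digest}"
```

Here `query_cache_key("/search", "a=1&b=2")` and `query_cache_key("/search", "b=2&a=1")` produce the same key, while `q=flask` and `q=django` produce different ones.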

> ⚠️ **Don't cache user-specific responses with `@cache.cached`.** The default key is only `request.path` (+ query) and **doesn't distinguish the logged-in user.** Caching a "different per person" response like an authenticated user's dashboard with bare `@cache.cached` becomes a serious information leak where **A's data is shown to B.** To make it per-user, make `key_prefix` a callable and weave the user ID into the key, or simply decide not to cache at this layer. "Bypass if authenticated" with `unless` is also effective.

### 2.4 `@cache.memoize`: cache a function "per argument"

If you want to cache not a view but the return value of a heavy **internal function**, use `@cache.memoize`. The decisive difference from `@cache.cached` is that it **caches separately per argument value.**

```python
@cache.memoize(50)
def get_product_summary(product_id: int) -> dict:
    # Heavy aggregate query. For the same product_id, nothing is recomputed for 50 seconds
    return expensive_aggregate(product_id)
```

`get_product_summary(1)` and `get_product_summary(2)` become separate cache entries. Whereas `@cache.cached` looks at the request path, `@cache.memoize` looks at the **function's arguments**, so it works for the typical duplicate work of "doing the same computation with the same arguments over and over." Referencing master data, resolving config values, and transforming external-API responses are good examples.

| | `@cache.cached` | `@cache.memoize` |
|---|---|---|
| Cache unit | Request path (+ query) | **The combination of function arguments** |
| Main target | A view function's entire response | The return value of an internal helper function/method |
| User distinction | Doesn't by default (needs `key_prefix`) | Naturally separated if you include the user ID in the arguments |

### 2.5 Cache invalidation: this is what's truly hard

As the famous adage "the two hard problems in computer science are cache invalidation and naming" goes, **the real difficulty of caching is when and how to discard stale values.** There are broadly two strategies.

**(a) TTL (time-based) auto-expiry**—the simplest method, making it "disappear on its own after N seconds" with `timeout`. Best for read data whose staleness is business-acceptable. Its advantages: simple implementation, and accidents of missed invalidation rarely happen. The drawback is "even after updating, the stale value can appear for up to N seconds."

**(b) Explicit invalidation (event-based)**: the method of **deleting the corresponding cache yourself** at the moment you update data. A `@cache.memoize`d function can be invalidated with `cache.delete_memoized`, passing the function and the arguments that identify the entry. `cache.clear()` **clears the entire cache** (run within the app context).

```python
def update_product(product_id: int, payload: dict) -> None:
    db.session.execute(...)  # update the actual data
    db.session.commit()
    # Invalidate only the cache entry for the item that was updated
    cache.delete_memoized(get_product_summary, product_id)
```

```python
# Clear the entire cache at once (maintenance batch, after a deploy, etc.)
with app.app_context():
    cache.clear()
```

> 💡 **How to choose an invalidation strategy**: when in doubt, **make TTL the default.** Explicit invalidation breeds the hard-to-find bug of "forget one update site and you keep serving stale data." **If "updates aren't frequent and a few tens of seconds of staleness is acceptable," operation runs on just a short TTL (e.g., 30–60 seconds).** Only when "updates must be reflected instantly, and the update sites can be clearly identified" should you add explicit invalidation—this is the accident-resistant order. TTL and explicit invalidation can also be combined (a short TTL as insurance, with explicit deletion on update).

### 2.6 Cache stampede: the rush at the moment of expiry

TTL caches have an easily-overlooked pitfall. **The very moment** a popular key expires, the large number of requests waiting for it **all at once** become "cache misses," and everyone hits the same heavy query and rushes the DB—this is a **cache stampede (thundering herd / dog-piling).** A cache meant to lighten reads produces the worst load spike at the moment of expiry.

There are three directions for countermeasures.

- **Stagger the TTL (jitter)**: add a random fluctuation to the TTL per entry to disperse the expiry timing. The simplest way to prevent simultaneous expiry.
- **Pre-regeneration (cache warming)**: recompute in advance via a background job ([Celery](/blog/flask-celery-redis-background-tasks-production-guide)) before expiry so user requests don't hit a miss.
- **Lock (single-flight)**: on expiry, only the first request recomputes and the others wait for it. The implementation gets complex, so limit it to genuinely heavy keys.

> ⚠️ **Stampedes "surface the more you succeed."** While traffic is small they don't surface, and the more access grows and the higher the cache hit rate, the sharper the spike at expiry. **For the most important, highest-cost keys, build in a design where expiries don't come simultaneously (jitter or pre-regeneration)** from the start—that's safe.
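
Two of the countermeasures above, jitter and single-flight, can be sketched with the standard library. This is a simplified in-process illustration under stated assumptions: `jittered_ttl` and `SingleFlight` are hypothetical names, the lock only coordinates threads inside one process, and entries never expire here. Across multiple Gunicorn workers you would need a lock in shared storage instead.

```python
import random
import threading
from typing import Any, Callable


def jittered_ttl(base_seconds: int, spread: float = 0.1) -> int:
    """Return the base TTL randomly shifted by up to +/- spread (10% by default)."""
    delta = base_seconds * spread
    return max(1, round(base_seconds + random.uniform(-delta, delta)))


class SingleFlight:
    """Let only one thread recompute a missing key; the others reuse its result."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        if key in self._values:
            return self._values[key]
        with self._guard:                        # create one lock object per key
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self._values:          # re-check after waiting for the lock
                self._values[key] = compute()
            return self._values[key]
```

A jittered value can be passed wherever a TTL is set when writing an entry, so that entries written at the same moment do not all expire at the same moment.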

### 2.7 HTTP caching: save one more step "outside" the app

Whereas Flask-Caching is **server-side (in-app)** caching, **HTTP caching** is a complementary layer that tells clients, CDNs, and proxies "you don't need to refetch." Combining both reduces the very number of times a request reaches the app.

- **`Cache-Control`**: tell the client/CDN "no need to refetch for 300 seconds" like `public, max-age=300`. Effective for near-static responses.
- **`ETag` / `Last-Modified` and conditional requests**: attach an `ETag` (a content hash, etc.) to the response, and the client next inquires with `If-None-Match`; if the content is unchanged, the server can return **`304 Not Modified`** (no body). It **saves bandwidth and serialization cost.**

```python
from flask import jsonify, make_response, request


@app.route("/api/catalog")
def catalog():
    data = get_catalog()                       # heavy aggregation (can also be protected with Flask-Caching)
    resp = make_response(jsonify(data))
    resp.headers["Cache-Control"] = "public, max-age=300"
    resp.set_etag(compute_etag(data))          # ETag based on the content
    return resp.make_conditional(request)      # returns 304 when If-None-Match matches
```

`make_conditional(request)` looks at `If-None-Match` / `If-Modified-Since` and, if they match, automatically converts to a `304`. If you place a CDN (CloudFront, etc.) in front, emitting `Cache-Control` correctly from the origin (Flask) lets the CDN cache at the edge and further lower the app's load.
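
The `compute_etag` helper in the example above is not defined in this article. One possible implementation, as a sketch that assumes the data is JSON-serializable, hashes a canonical JSON form so that equal content always yields the same tag:

```python
import hashlib
import json


def compute_etag(data) -> str:
    """Derive a stable ETag value from JSON-serializable data (hypothetical helper)."""
    # sort_keys and fixed separators make the bytes independent of dict ordering.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing still requires producing the data first, so this saves bandwidth and client work rather than the aggregation itself. Combining it with server-side caching covers both.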

---

## **3. Flask-Limiter: protect, be fair, and curb costs with rate limiting**

### 3.1 Why rate limiting is needed

Rate limiting is a mechanism that "sets an upper limit on the number of requests per unit time," effective for both performance and security. There are four motivations.

1. **Abuse / scraping countermeasure**: prevent one client from hitting the API continuously and monopolizing resources.
2. **Cost control**: protect usage-billed downstreams (external APIs, LLMs, DBs) from excessive calls. Rate limiting **directly affects cost.**
3. **Fairness**: prevent some heavy clients from degrading the experience of all other users.
4. **Brute-force defense**: cap brute-force attacks on login and OTP-verification endpoints by attempt count (auth is the [Flask auth guide](/blog/flask-authentication-flask-login-jwt-extended-guide)).

And rate limiting is also **the first app layer of DoS/DDoS defense.** Combining it with the defense-in-depth idea of placing a WAF (Web Application Firewall) in front of the app is the right line (the [Flask security implementation guide](/blog/flask-security-sessions-csrf-secure-cookies-guide)).

### 3.2 Setup: default_limits and per-route

Use Flask-Limiter's setup in the official form as-is.

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(
    get_remote_address,
    app=app,
    default_limits=["200 per day", "50 per hour"],
    storage_uri="redis://localhost:6379",
)
```

In a factory configuration, bundle it with `init_app` (the practice of pillar §3).

```python
# extensions.py
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address)

# Inside create_app (the storage URI can come from app config, e.g. RATELIMIT_STORAGE_URI)
limiter.init_app(app)
```

- **`default_limits`**: the limit applied by default to all routes. You can write it in human-readable strings like `"200 per day"`, `"50 per hour"`.
- **per-route `@limiter.limit`**: override/add a limit on individual routes.
- **`@limiter.exempt`**: exclude a specific route (such as health checks) from rate limiting.

```python
@app.route("/api/login", methods=["POST"])
@limiter.limit("5 per minute")            # strict, as a brute-force countermeasure
def login():
    ...


@app.route("/api/export")
@limiter.limit("1 per day")               # restrict heavy processing to once a day
def export():
    ...


@app.route("/health")
@limiter.exempt                           # do not rate-limit health checks
def health():
    return {"status": "ok"}
```

**When the limit is exceeded, the view function is not called.** Flask-Limiter emits **429 Too Many Requests.**

### 3.3 The most important pitfall: don't use `memory://` in production

Flask-Limiter's default/easy choice is **memory storage (`memory://`)**, but the official docs warn clearly about production use.

> "should be used with caution in any production setup since: Each application process will have it's own storage [and] The state of the rate limits will not persist beyond the process' life-time."

In short: **each application process has *its own* storage, and the state of the rate limits doesn't outlive the process.**

This is fatal because **Gunicorn stands up multiple worker processes.** With `memory://`, each worker has an **independent counter**, so:

- If there are 8 workers, a `50 per hour` limit passes through up to **effectively `50 × 8 = 400 per hour`** (a separate count per worker).
- If a worker restarts, the counter is reset.
- Each time a request lands on a different worker or container, it is counted against yet another counter.

In other words, **with `memory://` the rate limit "looks like it's working but isn't working at all."** This is **exactly the same structure of trap as Flask-Caching's `SimpleCache` (§2.2).** In Gunicorn operations assuming multi-worker (the [production-deployment guide](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide)), always specify a **shared backend** in `storage_uri`. The choices the official docs list are **Redis (`redis://host:port`), Memcached, and MongoDB.**

```python
limiter = Limiter(
    get_remote_address,
    app=app,
    default_limits=["200 per day", "50 per hour"],
    storage_uri="redis://localhost:6379",   # ← shared by all workers / all containers
)
```

> ⚠️ **"It worked locally so it works in production" doesn't hold.** During development it's a single process (`flask run`), so the rate limit looks like it works even with `memory://`. But production Gunicorn is multi-worker, multi-container. **Unless you make the storage shared Redis, the limit passes through by the number of workers and containers.** Remember Flask-Caching's `SimpleCache` and Flask-Limiter's `memory://` as **twin traps that break the moment you forget multi-process.**
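
The arithmetic of this trap can be reproduced with a toy simulation. This is not Flask-Limiter code: `FixedWindowCounter` is a hypothetical, simplified stand-in for a limiter's storage, and the round-robin dispatch is an assumption about how requests are spread across workers.

```python
from collections import Counter
from itertools import cycle


class FixedWindowCounter:
    """A toy limiter storage: allow at most `limit` hits per key in one window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.hits: Counter = Counter()

    def allow(self, key: str) -> bool:
        if self.hits[key] >= self.limit:
            return False
        self.hits[key] += 1
        return True


def allowed_requests(stores: list, total: int, key: str = "203.0.113.7") -> int:
    """Send `total` requests round-robin across workers and count how many pass."""
    workers = cycle(stores)
    return sum(next(workers).allow(key) for _ in range(total))


LIMIT, WORKERS, REQUESTS = 50, 8, 1000

# memory://-like: every worker process owns a private counter.
per_worker = allowed_requests([FixedWindowCounter(LIMIT) for _ in range(WORKERS)], REQUESTS)

# Shared-storage-like: all workers consult the same counter.
shared_store = FixedWindowCounter(LIMIT)
shared = allowed_requests([shared_store] * WORKERS, REQUESTS)
```

With private counters, 400 of the 1000 requests from one client pass (50 per worker × 8 workers). With one shared counter, only 50 do.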

### 3.4 The key function: who do you count as a unit

The **key function** decides what counts as "one unit." `get_remote_address` (per client IP) is the usual starting point, but change it to fit your requirements.

| Key function | Unit | Suited scene |
|---|---|---|
| `get_remote_address` | Client IP | Anonymous APIs, pre-login endpoints |
| A function returning a user ID | Authenticated user | Post-login APIs (fairer/more accurate than IP) |
| A function returning an API key | API key/tenant | B2B APIs. Can vary the limit per plan |

```python
from flask_login import current_user

def user_or_ip():
    # Per user when authenticated, per IP otherwise
    if current_user.is_authenticated:
        return str(current_user.id)
    return get_remote_address()

limiter = Limiter(user_or_ip, app=app, storage_uri="redis://localhost:6379")
```

> 💡 **In B2B SaaS, per-tenant/per-plan works.** In a B2B like the lumber-distribution SaaS, rate-limiting per **tenant (contracting company) or per API key**, rather than per IP, gives fair control matched to the plan. You can express product design like "the free plan is 60 req/min, paid is 600 req/min" with a key function and a dynamic limit.

### 3.5 Behind a proxy, take the real IP with ProxyFix

When you key limits on `get_remote_address`, there's a **critical premise.** When Flask is **behind a reverse proxy** like ALB / nginx, `request.remote_addr` is **the proxy's IP, not the client's.** Do nothing, and **all clients share one limit slot as the same "proxy IP,"** making rate limiting meaningless (when one person uses up the slot, everyone gets 429).

The solution is Werkzeug's `ProxyFix`. The official form is this.

```python
from werkzeug.middleware.proxy_fix import ProxyFix

app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)   # number of proxy hops in front of the app
limiter = Limiter(get_remote_address, app=app)
```

`x_for` is the **number of proxy hops that set `X-Forwarded-For`.** Get this wrong and it now becomes a **security problem** where the client can inject a fake `X-Forwarded-For` to bypass rate limiting. How to count the `ProxyFix` hops (1 for ALB alone, measure and find 2 for the API Gateway → ALB two-hop case, etc.) is detailed in the [production-deployment guide](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide). IP-based rate limiting, audit logs, and access control all presuppose that "the real client IP is obtained."
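
Why the hop count matters can be seen in a small standalone sketch of the selection rule: count the trusted number of entries from the right of `X-Forwarded-For`. This is an illustration of the idea, not Werkzeug's source; the function name `client_ip` and the sample addresses are assumptions.

```python
from typing import Optional


def client_ip(remote_addr: str, x_forwarded_for: Optional[str], trusted_hops: int) -> str:
    """Pick the client IP by counting `trusted_hops` entries from the right of X-Forwarded-For."""
    if trusted_hops <= 0 or not x_forwarded_for:
        return remote_addr                      # no trusted proxy: use the socket peer
    hops = [part.strip() for part in x_forwarded_for.split(",") if part.strip()]
    if len(hops) < trusted_hops:
        return remote_addr                      # header shorter than expected: do not trust it
    return hops[-trusted_hops]


# A client sends a forged "X-Forwarded-For: 1.2.3.4"; the single proxy appends the real peer address.
header = "1.2.3.4, 198.51.100.9"
correct = client_ip("10.0.0.5", header, trusted_hops=1)    # one real proxy, configured as 1
too_many = client_ip("10.0.0.5", header, trusted_hops=2)   # misconfigured: trusts a client-supplied entry
```

With the correct hop count the forged entry is ignored and the address appended by the proxy is used. With a hop count that is too high, the forged value becomes the rate-limit key.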

### 3.6 429 handling and Retry-After

When a limit is exceeded, return the resulting **429** in an API-friendly way. The standard is to tell the client "when they may retry" with the `Retry-After` header.

```python
from flask import jsonify


@app.errorhandler(429)
def ratelimit_handler(e):
    resp = jsonify(error="rate limit exceeded", detail=str(e.description))
    resp.status_code = 429
    # Flask-Limiter can add Retry-After to over-limit responses (via settings/headers)
    return resp
```

Flask-Limiter can attach standard rate-limit headers (remaining count, reset time, `Retry-After`) to the response. Enabling these lets the client side **back off autonomously**, breaking the vicious cycle of mass-producing 429s with wasteful retries. The design of unifying error responses as JSON is aligned in the [error-handling / observability guide](/blog/flask-error-handling-logging-observability-guide).
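
On the client side, autonomous back-off might look like the following sketch. It is an assumption-laden illustration: `retry_delay_seconds` is a hypothetical helper, it only understands the delta-seconds form of `Retry-After` (not the HTTP-date form), and the fallback is exponential backoff with full jitter.

```python
import random
from typing import Optional


def retry_delay_seconds(retry_after: Optional[str], attempt: int,
                        base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds a client should wait before retrying after a 429."""
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())       # the server gave an explicit delay
    # No usable header: exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A client that honors the header waits exactly as long as the server asks, and one that receives no header still spreads its retries out instead of hammering the endpoint.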

---

## **4. Database: the real bottleneck is almost always here**

Do the measurement (§1) honestly, and the source of slowness usually turns out to be the DB. This section tackles it in four points. Detailed ORM design is in the [Flask-SQLAlchemy practical guide](/blog/flask-sqlalchemy-flask-migrate-database-production-guide), and a deep dive on connection pooling is in the [PostgreSQL connection-pooling guide](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide).

### 4.1 The N+1 problem: detection and eager loading

The **N+1 problem** is the ORM's biggest performance bug. After taking a list (1 query), you pull each row's related data **one at a time inside a loop**, and `1 + N` queries fly. For a 100-item list, just pulling one relation is 101 queries—this is the typical identity of "the list is slow."

```python
# ❌ N+1: fetch orders once, then load each order.customer inside the loop
orders = db.session.execute(db.select(Order)).scalars().all()
for order in orders:
    print(order.customer.name)   # ← a SELECT is issued here for every order
```

**Detection** is most reliable via the SQL log of §1.3 (b). If dozens of similar queries line up in one request, suspect N+1. **The solution** is **eager loading**: fetch the relations together in the first one or two queries.

```python
from sqlalchemy.orm import selectinload

# ✅ selectinload: prefetch customers in one additional query (2 queries total)
orders = db.session.execute(
    db.select(Order).options(selectinload(Order.customer))
).scalars().all()
for order in orders:
    print(order.customer.name)   # no additional queries are issued
```

The choice between `selectinload` (batch-fetch with an IN clause) and `joinedload` (fetch everything in one query with a JOIN), and their suitability for collection/scalar relations, is ORM-side knowledge. For details, see the [Flask-SQLAlchemy practical guide](/blog/flask-sqlalchemy-flask-migrate-database-production-guide). Also always confirm that **indexes exist** on the columns used in `WHERE` or `JOIN`. A full scan caused by a missing index can't be rescued even by eager loading.
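
The query-count difference can be demonstrated without an ORM. The following sketch uses the standard library's `sqlite3` with an in-memory database and a hypothetical `run` wrapper that records every statement. It mirrors the shape of the problem (one query per row versus one batched `IN` query) and is not what SQLAlchemy emits verbatim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(i, f"c{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in range(1, 101)])

executed: list[str] = []   # every statement issued through run()


def run(sql: str, params=()) -> list:
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def names_n_plus_one() -> list[str]:
    orders = run("SELECT id, customer_id FROM orders")
    # One extra SELECT per order: this is the "+N".
    return [run("SELECT name FROM customers WHERE id = ?", (cid,))[0][0] for _, cid in orders]


def names_batched() -> list[str]:
    orders = run("SELECT id, customer_id FROM orders")
    ids = sorted({cid for _, cid in orders})
    marks = ",".join("?" * len(ids))
    by_id = dict(run(f"SELECT id, name FROM customers WHERE id IN ({marks})", ids))
    return [by_id[cid] for _, cid in orders]


executed.clear()
slow_names = names_n_plus_one()
n_plus_one_count = len(executed)

executed.clear()
fast_names = names_batched()
batched_count = len(executed)
```

Both functions return the same 100 names, but the first issues 101 statements and the second issues 2.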

### 4.2 The connection pool: the multiplication with worker count has accidents

A DB connection is an "expensive to create" resource, so SQLAlchemy reuses them with a **connection pool.** The problem is that **each Gunicorn worker is an independent process, and the pool is also separate per worker.**

Here a multiplication accident occurs.

```
max connections ≈ pool_size × workers × tasks (containers)
```

Example: `pool_size=5` × 8 workers × 4 tasks = **160 connections.** It easily exceeds PostgreSQL's default `max_connections` (100 unless changed), and new connections are rejected with `FATAL: sorry, too many clients already`. This is an accident that happens frequently in production.

```python
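# SQLAlchemy pool settings (via Flask-SQLAlchemy)
app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {
    "pool_size": 5,         # connections kept open at all times
    "max_overflow": 5,      # extra connections allowed temporarily
    "pool_pre_ping": True,  # check liveness before use (avoid grabbing a dead connection)
    "pool_recycle": 1800,   # recycle after 30 minutes (guards against idle disconnects by DB/proxy)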
}
```

- **`pool_pre_ping=True`**: send a light ping before using a connection taken from the pool, preventing the error of grabbing a connection that was severed on the DB/proxy side. A **near-mandatory** setting.
- **`pool_recycle`**: recreate connections after a set number of seconds, avoiding connections silently severed by the DB's or proxy's idle timeout.

> ⚠️ **Design pool size together with worker count.** If you decide looking only at `pool_size`, the moment you multiply by worker count and container count you break through `max_connections`. **Before deploying, always compute `(pool_size + max_overflow)` × `number of workers` × `number of tasks` and confirm it fits within the DB's `max_connections`.** This calculation is also in the [production-deployment guide](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide) checklist.
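
The pre-deploy check above can be written down as a few lines of arithmetic. This is a sketch with hypothetical helper names, and the `reserved` headroom for migrations and admin sessions is an assumption, not a PostgreSQL rule.

```python
def max_db_connections(pool_size: int, max_overflow: int, workers: int, tasks: int) -> int:
    """Worst-case connections opened by all workers across all tasks (containers)."""
    per_process = pool_size + max_overflow       # each worker process owns its own pool
    return per_process * workers * tasks


def fits(total: int, max_connections: int, reserved: int = 10) -> bool:
    """True if `total` still leaves `reserved` slots free for admin sessions, migrations, etc."""
    return total <= max_connections - reserved


# The pool settings from this section, with 8 workers and 4 tasks:
worst_case = max_db_connections(pool_size=5, max_overflow=5, workers=8, tasks=4)
```

With these settings the worst case is 320 connections, more than three times a `max_connections` of 100, so either the pool, the worker count, or the architecture (an external pooler) has to change.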

### 4.3 PgBouncer: absorb the connection explosion of many workers / serverless

As worker count × container count grows, the §4.2 multiplication inevitably balloons. Furthermore, in **serverless** like Lambda, connections open instantaneously by the number of concurrent executions and exhaust PostgreSQL in one strike. What absorbs this "connection explosion" is an **external connection pooler** like **PgBouncer.**

Inserting PgBouncer between the app and PostgreSQL:

- The app (many workers/Lambdas) may open a large number of connections to PgBouncer.
- PgBouncer **multiplexes them onto a small number of real connections** to bridge to PostgreSQL.

This lets you handle high concurrency without eating up `max_connections`. How to choose PgBouncer's pool mode (session / transaction / statement) and serverless caveats (compatibility with prepared statements, etc.) are consolidated in the [PostgreSQL connection-pooling guide](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide). **If you use many workers × many containers, or serverless, PgBouncer is effectively a mandatory component.**

### 4.4 Pagination: stop loading everything

"The list API loads the entire table into memory before returning it"—this is a typical performance bomb that quietly bloats. While the data is 100 rows it's fine, but the moment it becomes 100,000 rows, memory and the DB scream. **Paginate from the start** and take only what you need. Flask-SQLAlchemy has `db.paginate`.

```python
@app.route("/api/orders")
def list_orders():
    page = request.args.get("page", 1, type=int)
    # Fetch page by page instead of loading everything (LIMIT/OFFSET is added automatically)
    pagination = db.paginate(
        db.select(Order).order_by(Order.created_at.desc()),
        page=page, per_page=50, error_out=False,
    )
    return jsonify(
        items=[o.to_dict() for o in pagination.items],
        page=pagination.page, pages=pagination.pages, total=pagination.total,
    )
```

For large tables, `OFFSET` gets slow on deep pages, so keep **cursor (keyset) pagination** in mind as the next step. But first, simply eradicating full loads and capping page size with `db.paginate` makes most of the problem disappear. For details, see the [Flask-SQLAlchemy practical guide](/blog/flask-sqlalchemy-flask-migrate-database-production-guide).
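
To make the keyset idea concrete, here is a minimal sketch that uses the standard library's `sqlite3` as a stand-in for PostgreSQL. The `orders` table, its columns, and the `fetch_page` helper are assumptions for illustration. Instead of skipping rows with `OFFSET`, each page starts right after the last `(created_at, id)` pair the client saw, so deep pages cost the same as the first one when an index on `(created_at, id)` exists.

```python
import sqlite3


def fetch_page(conn, limit, cursor=None):
    """Return (rows, next_cursor), newest first, without using OFFSET."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id, created_at FROM orders "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        created_at, last_id = cursor
        # Continue strictly after the last row seen; id breaks ties on created_at.
        rows = conn.execute(
            "SELECT id, created_at FROM orders "
            "WHERE created_at < ? OR (created_at = ? AND id < ?) "
            "ORDER BY created_at DESC, id DESC LIMIT ?",
            (created_at, created_at, last_id, limit),
        ).fetchall()
    # A full page means there may be more rows; hand back the last row as the cursor.
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
conn.execute("CREATE INDEX ix_orders_created_id ON orders (created_at, id)")
conn.executemany(
    "INSERT INTO orders (id, created_at) VALUES (?, ?)",
    [(i, f"2026-01-{(i + 1) // 2:02d}") for i in range(1, 8)],  # some duplicate dates
)

page_ids, cursor = [], None
while True:
    rows, cursor = fetch_page(conn, limit=3, cursor=cursor)
    page_ids.append([row[0] for row in rows])
    if cursor is None:
        break
print(page_ids)
```

The trade-off is that keyset pagination cannot jump to an arbitrary page number, which is why `db.paginate` remains the simpler starting point.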

---

## **5. The response layer: trim transfer volume and serialization**

After solidifying the DB and cache, the "last push" is optimizing the response itself. The effect isn't as large as the DB, but it pays off at low cost.

### 5.1 Compression (gzip / brotli)

Compression can greatly trim the transfer volume of text (JSON / HTML) responses. **The first choice is to compress at the reverse proxy (nginx), CDN (CloudFront), or ALB**, which avoids spending the app's CPU. Enable app-side gzip/brotli with `Flask-Compress` only in configurations where the front layer can't compress (no proxy, or special requirements).

```python
# Fallback for when the front proxy cannot compress
from flask_compress import Compress

Compress(app)   # compresses automatically with gzip/brotli based on Accept-Encoding
```

> 💡 **Push compression to the infra layer.** Flask-Compress consumes the app's CPU. If ALB/CloudFront/nginx sits in front, leave compression to it; Flask then spends its CPU on its actual work and handles more requests with the same worker count. **Separating "what the app can do" from "what the infra should do"** is cost-efficient design.
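
To get a feel for how much compression can trim a JSON response, a quick measurement with the standard library is enough. The payload below is invented for illustration, and real ratios depend heavily on how repetitive your data is.

```python
import gzip
import json

# A made-up list response: repetitive keys and values, as JSON APIs tend to have
payload = json.dumps(
    [{"id": i, "status": "shipped", "customer": f"customer-{i % 10}"} for i in range(1000)]
).encode("utf-8")

compressed = gzip.compress(payload, compresslevel=6)
ratio = len(compressed) / len(payload)
print(f"raw={len(payload)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")
```

Measuring a few of your own largest responses this way shows whether compression is worth configuring at the front layer at all.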

### 5.2 The cost of JSON serialization

For large responses, **JSON serialization itself becomes a non-negligible CPU cost.** The countermeasures are plain.

- **Don't return unnecessary fields**: narrow the API schema to the columns actually used. Trim them with the `only` / `exclude` options of a serializer such as marshmallow (see the [marshmallow + Flask-SQLAlchemy guide](/blog/marshmallow-flask-sqlalchemy-rest-api-production-guide)).
- **Eliminate N+1 (§4.1)**: pulling relations during serialization makes serialization and N+1 compound into slowness. Take everything in advance with eager loading.
- **Pagination (§4.4)**: cap one response's size in the first place.
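
The first point can be illustrated without any serializer library. The sketch below is a plain-Python stand-in for what a serializer's `only` option does; the field names and the `trim` helper are hypothetical.

```python
import json

PUBLIC_FIELDS = ("id", "status", "total")  # hypothetical allowlist for a list API


def trim(record: dict, fields=PUBLIC_FIELDS) -> dict:
    """Keep only the fields the client actually uses."""
    return {key: record[key] for key in fields if key in record}


order = {
    "id": 1,
    "status": "shipped",
    "total": 1200,
    "internal_note": "x" * 500,                           # never shown to the client
    "audit_log": [{"event": "created"}, {"event": "paid"}],
}

full_size = len(json.dumps(order))
trimmed_size = len(json.dumps(trim(order)))
print(full_size, trimmed_size)
```

An allowlist is the safer default than an exclude list: a column added to the model later stays out of the response until someone decides it belongs there.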

### 5.3 The "limited" effect of streaming and async

A huge response (a large CSV export, etc.) **eats memory if you build it all in memory before returning it**, and TTFB (time to first byte) also worsens. Flask can make a **streaming response** by returning a generator, flowing rows while generating them.

```python
from flask import Response


@app.route("/api/export.csv")
def export_csv():
    def generate():
        yield "id,name\n"
        for row in iter_rows():          # stream while fetching from the DB a little at a time
            yield f"{row.id},{row.name}\n"
    return Response(generate(), mimetype="text/csv")
```
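
One caveat with building CSV lines by string formatting: a value containing a comma or a quote breaks the row. A possible variation, sketched here with only the standard library, lets `csv.writer` handle the escaping while still yielding one line at a time. The `csv_lines` helper is an assumption for this example, not a Flask API.

```python
import csv
import io


def csv_lines(header, rows):
    """Yield properly escaped CSV lines one at a time."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")

    def emit(record):
        writer.writerow(record)
        line = buffer.getvalue()
        # Reset the buffer so memory use stays at one line, not the whole file
        buffer.seek(0)
        buffer.truncate(0)
        return line

    yield emit(header)
    for record in rows:
        yield emit(record)


sample = list(csv_lines(["id", "name"], [(1, "Sato, Inc."), (2, 'The "Best" Co')]))
print(sample)
```

In the view above, this generator could be passed to `Response(...)` in place of `generate()`, fed with `(row.id, row.name)` tuples from `iter_rows()`.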

As for `async def` views, as touched on in §1.1, **they are not a throughput-improvement measure.** Each request still occupies one worker, so the number of requests processed concurrently doesn't increase. async pays off only when you are "**parallelizing multiple independent external I/O calls within one view** to shrink that view's latency." If you want the whole app to be async-first or to handle massive numbers of concurrent connections, consider a different path: ASGI (Quart, etc.) or gevent workers. This is detailed in [production-deployment guide §7](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide) and the [pillar](/blog/flask-production-guide).
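
The one case where async helps, running independent I/O calls side by side inside a single view, can be sketched with plain `asyncio`. Here `asyncio.sleep` stands in for real network calls, and the `fetch` and `load_dashboard` names are hypothetical; in Flask the body of `load_dashboard` would live inside an `async def` view.

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    # asyncio.sleep stands in for a real external call (hypothetical service)
    await asyncio.sleep(delay)
    return f"{name}:done"


async def load_dashboard() -> list:
    # Both calls wait at the same time, so the total is roughly the slowest
    # call rather than the sum of the two.
    return await asyncio.gather(fetch("profile", 0.2), fetch("orders", 0.2))


start = time.perf_counter()
results = asyncio.run(load_dashboard())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

The view's latency shrinks, but the worker is still occupied for the whole duration, which is why this does nothing for throughput.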

---

## **6. Cost efficiency: optimization affects the bill directly**

The optimizations so far affect not only latency but **infrastructure cost directly.** A theme I argue consistently across this whole site is balancing "fast, cheap, safe" as one person working with generative AI, and this chapter's thinking is what underpins the "cheap."

**The causality of optimization → cost reduction** is clear.

| Measure | Performance effect | Cost effect |
|---|---|---|
| **Caching (§2)** | Reduces the number of DB/external-API calls | DB load and usage-billed API costs drop |
| **Rate limiting (§3)** | Caps abuse/runaway | Protects usage-billed downstreams from excessive calls |
| **N+1 elimination / indexes (§4)** | Shortens one request's DB time | Worker occupancy shrinks, so **you handle more with the same worker count** |
| **Pool/PgBouncer (§4)** | Avoids connection exhaustion | A smaller DB instance suffices |
| **Compression (§5)** | Reduces transfer volume | Data transfer charges (egress) drop |

The core is the chain **"one request's processing time shrinks = worker occupancy shrinks = you handle more requests with the same container count = fewer/smaller containers suffice."** Because Flask is synchronous and 1 worker = 1 request (§1.1), **shortening processing time directly reduces the number of containers needed**, directly pushing down ECS/Fargate billing (vCPU, memory × runtime).
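
That chain can be turned into a rough back-of-envelope estimate. The sketch below assumes sync workers (one request each) and applies Little's law: busy workers ≈ requests per second × seconds per request. All traffic numbers and the 70% utilization target are invented for illustration, so treat the output as an order-of-magnitude guide, not a sizing rule.

```python
import math


def containers_needed(
    rps: float,
    seconds_per_request: float,
    workers_per_container: int,
    target_utilization: float = 0.7,  # hypothetical safety margin
) -> int:
    """Rough container count for sync workers handling one request each."""
    busy_workers = rps * seconds_per_request            # Little's law
    workers = math.ceil(busy_workers / target_utilization)
    return math.ceil(workers / workers_per_container)


# Hypothetical service: 200 requests/second, 8 workers per container
before = containers_needed(200, 0.200, 8)   # 200 ms per request
after = containers_needed(200, 0.050, 8)    # 50 ms after caching and N+1 fixes
print(before, after)
```

Under these assumptions, cutting request time to a quarter cuts the container count to a quarter as well, which is the cost effect this chapter describes.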

> 💡 **From my own experience**: in operating the lumber-distribution SaaS on ECS (Fargate), what swayed cost was how well I avoided hitting the DB through caching, lightened each request by eliminating N+1, and sized workers with neither excess nor shortage. In a read-dominant business SaaS, just **caching master data and lightening lists with eager loading + pagination** visibly changes the number of tasks needed (= the monthly fee). Optimization is a story of "speed" and, unmistakably, a story of "money." Scale [horizontally (number of tasks)](/blog/aws-ecs-fargate-production-guide), but only **after lightening one request**. This order decides cost efficiency.

---

## **7. Flask performance / cost optimization checklist**

Finally, let me fold this article's flow into a practical checklist in the order "measure → DB → cache → rate limit → workers → compression." **The effect is largest from the top**, so crush them top-down.

| Stage | Check item | Rationale (body) |
|---|---|---|
| **Measure** | Taking p95/p99 via a duration log (`before/after_request`) | §1.2 / §1.3 |
| **Measure** | Detected N+1 by looking at the query count per request in the SQL log | §1.3 / §4.1 |
| **Measure** | Starting after identifying the bottleneck by measurement, not by guess | §1.3 |
| **DB** | Resolved N+1 with eager loading (`selectinload`/`joinedload`) | §4.1 |
| **DB** | Indexes on `WHERE`/`JOIN` columns, no full scans | §4.1 |
| **DB** | `(pool_size + max_overflow) × number of workers × number of tasks` within `max_connections` | §4.2 |
| **DB** | Set `pool_pre_ping=True` and `pool_recycle` | §4.2 |
| **DB** | Inserted PgBouncer for many workers / serverless | §4.3 |
| **DB** | The list API doesn't full-load and caps with `db.paginate` | §4.4 |
| **Cache** | Cached only read-heavy, high-cost, staleness-tolerant data | §2.1 |
| **Cache** | `CACHE_TYPE` is `RedisCache` (avoid `SimpleCache` in production) | §2.2 |
| **Cache** | Used `@cache.cached` for views, `@cache.memoize` for functions | §2.3 / §2.4 |
| **Cache** | Not mixing user-specific responses into bare `@cache.cached` | §2.3 |
| **Cache** | Decided an invalidation strategy (TTL by default + explicit deletion if needed) | §2.5 |
| **Cache** | Considered stampede on high-load keys (jitter/pre-regeneration) | §2.6 |
| **Cache** | Added HTTP caching (`Cache-Control`/ETag/304) as a complement | §2.7 |
| **Rate limit** | `storage_uri` is shared Redis (avoid `memory://` in production) | §3.3 |
| **Rate limit** | Set `default_limits` + `@limiter.limit` on important routes | §3.2 |
| **Rate limit** | Tightened brute-force targets like login | §3.1 / §3.2 |
| **Rate limit** | Inserted `ProxyFix` behind a proxy and taking the real IP | §3.5 |
| **Rate limit** | Returning 429 kindly with JSON + `Retry-After` | §3.6 |
| **Workers** | Tuned worker count from `CPU × 2` by measurement | §1.1 (→ deployment) |
| **Workers** | Scale horizontally (number of tasks), after lightening one request | §6 |
| **Compression** | Pushed gzip/brotli to the infra layer (ALB/nginx/CDN) | §5.1 |
| **Transfer** | Not returning unnecessary fields, capping size with pagination | §5.2 |

---

## **Summary: speed is decided by the "order of measurement," and that directly becomes cost**

Flask performance optimization is not a flashy technique but **the discipline of a plain order.** Let me summarize this article's points one phrase at a time.

1. **Move after measuring.** Identify p95/p99 and the bottleneck with a duration log and SQL log. Don't micro-optimize on guesses. Flask is synchronous and 1 worker = 1 request—the essence of speed is in worker design and DB wait time.
2. **Caching (Flask-Caching / Redis)** brings read-heavy, high-cost reads to memory speed. Use `@cache.cached` (views) and `@cache.memoize` (functions, per argument) appropriately, and don't use `SimpleCache` in production. The truly hard part is invalidation—make TTL the default, add explicit deletion if needed, and be conscious of stampedes.
3. **Rate limiting (Flask-Limiter)** controls abuse, cost, fairness, and brute force. With multiple workers, `memory://` storage isn't shared, so requests slip past the limit (the twin of the `SimpleCache` trap); always make `storage_uri` a shared Redis. Behind a proxy, take the real IP with `ProxyFix`.
4. **The database** is the real bottleneck. Crush N+1 with eager loading, place indexes, keep `pool_size × workers × tasks` within `max_connections`, insert PgBouncer for many workers / serverless, and replace full loads with `db.paginate`.
5. **The response layer** is the last push. Push compression to the infra layer, trim unnecessary fields, and stream huge responses. `async` is not a throughput measure (the 1-worker-occupancy constraint).
6. **These directly become cost.** When one request gets lighter, you handle more with the same workers and fewer/smaller containers suffice. Optimization is a story of "speed" and at the same time a story of "money."

I could operate the Minister of Economy, Trade and Industry Award-winning B2B SaaS backend on ECS (Fargate) fast and cheaply because I kept this order, "measure → DB → cache → rate limit → workers → compression." The whole picture is in the [Flask production-operations guide (pillar)](/blog/flask-production-guide), the deployment-side design in the [production-deployment guide](/blog/flask-deployment-gunicorn-docker-production-wsgi-guide), and the deep dive on the ORM and connections in the [Flask-SQLAlchemy practical guide](/blog/flask-sqlalchemy-flask-migrate-database-production-guide) and the [PostgreSQL connection-pooling guide](/blog/postgresql-connection-pooling-pgbouncer-serverless-guide). Speed is not talent but the discipline of keeping the order of measurement.