友田 陽大
B2B SaaS & DX strategy
Tags: B2B SaaS, Architecture design, AWS, Cognito, Terraform, Python, Security, Stripe

Dissecting the Architecture of a METI Minister's Award-Winning B2B SaaS: Multi-Tenant Authorization, Idempotent Payments, and 4 Rounds of Security Audit

This article dissects a B2B SaaS that brought DX (digital transformation) to the lumber supply chain, with the real code as the single source of truth. It explains, at the implementation level, industry-based multi-tenant authorization, Cognito RS256 and JWKS caching, Stripe Connect's two-layer idempotency, parallel document generation with ThreadPoolExecutor, 4 rounds of security audit, Slack-alarm spoofing countermeasures, and cost optimization.

Reading time: 17 min read
Author: 友田 陽大

In the past two articles, I wrote about decision-making: the "7 lessons" learned from the DX of the lumber-distribution industry, and the "technology-selection framework." This article is the sequel. It shifts the viewpoint from "what I learned" to "how I actually built it, and how I raised it to a quality that withstands production operation."

The subject is a B2B subscription SaaS that won the METI Minister's Award and also obtained Kyoto Prefecture certification. It's a marketplace-type product where companies place/receive orders, exchange documents, settle payments, and rate each other, spanning the multi-stage commerce flow of forestry, market, sawmill, precut, builder, and manufacturer.

This article has one rule: the single source of truth is the real code, not an architecture diagram or slides. From the backend (Python / Flask), frontend (React / TypeScript), and infrastructure (Terraform / AWS) that actually run in production, I extract and explain only the design decisions an enterprise would use to judge whether "it's safe to entrust this to this person."

Premise on the numbers: the quantitative values in the text (221 endpoints, 204 migrations, 2,153 tests, 48 FK indexes, 17 Terraform modules, 12 Lambdas, etc.) are all actual measurements mechanically counted from the repository. On the other hand, "business ROI (% of effort reduced, amount of cost reduced)" requires the client's actual data, so this article doesn't deal with it. The policy is: don't fabricate.


1. The whole-system picture: the production request path

First, the whole picture. It's a configuration one level deeper than the common "CloudFront → ALB → server."

[Browser] ── CloudFront ──> S3 (React SPA / immutable assets)

[Browser] ── API Gateway (Cognito authorizer)
                  │  ※ Authentication terminates before entering the VPC
                  └─ VPC Link ─> NLB ─> ALB (internal) ─> ECS Fargate ─> RDS PostgreSQL 16

[Stripe] ──> HTTP API ─> Lambda (webhook ×3) ─> RDS / DynamoDB
[S3 upload] ──> Lambda (Excel/CSV import) ─> RDS
[EventBridge scheduled] ──> Lambda (billing outbox send / reconciliation)
[CloudWatch alarm] ──> SNS ─> Lambda (Slack notification)

Why stack 2 LBs as API Gateway → NLB → ALB? This is not "redundancy" but a necessity derived from AWS constraints.

  • API Gateway's private integration (VPC Link v1) can only point to an NLB. So an NLB is needed at the entrance.
  • L7 routing, ingress control via security groups, and desync countermeasures are needed, so an ALB is placed behind it. The ALB is operated with desync_mitigation_mode = "defensive" and drop_invalid_header_fields = true.
  • Placing the Cognito authorizer at the edge of API Gateway terminates authentication before entering the VPC, so unauthenticated requests never reach ECS.

Heavy processing and async processing are all separated onto the Lambda side. Webhooks, Excel import, billing reconciliation, Slack notifications — by separating these from the Flask app proper, the API server can focus on the single responsibility of "processing synchronous requests," and the deploy unit becomes independent too (SRP).


2. Multi-tenant authorization: industry-based access control

The hardest part of this product's design is that its users are highly diverse. On top of the 7 industries (forestry / market / sawmill / precut / sawmill-and-precut / builder / manufacturer) there are the viewer and administrator roles. Which operations a user can execute and which information they can view differ completely per industry.

Industry is an IntEnum, authorization is a frozenset whitelist

Passing the industry around as raw strings turns typos and missed checks into incidents. So the industry is defined centrally as an IntEnum, and the industries permitted for each feature are held declaratively as a frozenset whitelist.

from enum import IntEnum

from werkzeug.exceptions import Forbidden  # assumed: Werkzeug's 403 exception


class IndustryCode(IntEnum):
    FORESTRY = 0        # forestry
    MARKET = 1          # market
    SAWMILL = 2         # sawmill
    PRECUT = 3          # precut
    SAWMILL_PRECUT = 4  # sawmill and precut
    BUILDER = 5         # builder
    MANUFACTURER = 6    # manufacturer
    VIEWER = 99
    ADMINISTRATOR = 100

# Declare "who can order logs" declaratively as a frozenset
WOOD_ORDER_INDUSTRIES = frozenset({IndustryCode.PRECUT, IndustryCode.SAWMILL_PRECUT})
LOG_RECEIVER_INDUSTRIES = frozenset({IndustryCode.FORESTRY, IndustryCode.MARKET})


def check_industry(user: User, allowed: frozenset[IndustryCode]) -> None:
    # Administrators always pass. Anyone else outside the allowed industries gets 403.
    if user.industry == IndustryCode.ADMINISTRATOR:
        return
    if user.industry not in allowed:
        # Return 403, not 404: the check never touches the resource, so the
        # response does not leak whether it exists (enumeration countermeasure).
        raise Forbidden()

There are 3 points.

  1. Consolidate the authorization decision into the router layer. The UseCase / Repository are written on the premise of receiving an "already-authorized User," so authorization if statements don't scatter through the business logic.
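  2. A mismatch returns 403, not 404. The industry check depends only on the caller, not on the requested resource, so the response is the same whether or not an ID exists. This is an explicit design decision against the enumeration attack of "brute-forcing IDs to guess existence."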
  3. The administrator bypass is in just one place. Confining the exception to a single point keeps authorization loopholes from proliferating.
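
The three points above can be sketched as a route decorator. This is a minimal illustration, not the product's code: the article shows only check_industry, so the decorator name require_industries, the stand-in User and Forbidden classes, and the create_wood_order handler are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from functools import wraps


class IndustryCode(IntEnum):
    PRECUT = 3
    SAWMILL_PRECUT = 4
    BUILDER = 5
    ADMINISTRATOR = 100


class Forbidden(Exception):
    """Stand-in for the web framework's 403 exception."""


@dataclass(frozen=True)
class User:
    industry: IndustryCode


WOOD_ORDER_INDUSTRIES = frozenset({IndustryCode.PRECUT, IndustryCode.SAWMILL_PRECUT})


def require_industries(allowed):
    """Hypothetical decorator: authorize in the router layer, before the handler runs."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(user, *args, **kwargs):
            # The only administrator bypass; everyone else must be on the whitelist.
            is_admin = user.industry == IndustryCode.ADMINISTRATOR
            if not is_admin and user.industry not in allowed:
                # Raised before any resource lookup, so existence is never revealed.
                raise Forbidden()
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator


@require_industries(WOOD_ORDER_INDUSTRIES)
def create_wood_order(user, order_id):
    # The handler only ever receives an already-authorized user.
    return f"order {order_id} accepted"
```

With this shape, the set passed to the decorator is the whole authorization rule for the route, and the business logic below it contains no authorization branches.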

A "two-layer schema boundary" that doesn't leak PII

In a marketplace where you search for trading partners across companies, a requirement arises: "I want to show the counterpart company's overview, but not its email, phone number, or corporate number." This is solved structurally by schema separation.

class UserDumpSchema(BaseSchema):
    """Used only for counterparties with a mutual trading relationship. Full representation, including PII."""
    email = fields.Email()
    phone_number = fields.String()
    corporate_number = fields.String()
    # ... all fields, including PII

class UserPublicSchema(BaseSchema):
    """Public schema for cross-company search and browsing. PII is structurally excluded by an allowlist."""
    user_id = fields.UUID()
    company_name = fields.String()
    industry = fields.Integer()
    prefecture = fields.String()
    average_rate = fields.Float()      # company rating, 0 to 5
    evaluation_count = fields.Integer()
    # email / phone_number / corporate_number are not defined here, so they can never be emitted

The crux is that it's implemented as a whitelist ("show only these"), not a blacklist ("hide these"). Even if you add a new PII field to User, it cannot appear on the public path unless you explicitly add it to UserPublicSchema. The penetration test described later caught exactly this kind of mix-up (a cross-company API was using UserDumpSchema), and I fixed it the same day.
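
The difference between the two approaches shows up when a field is added later. The following is a framework-free sketch of that situation, not the product's schema code: UserRecord, PUBLIC_FIELDS, HIDDEN_FIELDS, and the representative_name field are hypothetical names chosen for illustration.

```python
from dataclasses import asdict, dataclass

# Hypothetical allowlist, mirroring a few fields of UserPublicSchema.
PUBLIC_FIELDS = ("user_id", "company_name", "industry", "prefecture")
# Hypothetical denylist, shown only for contrast.
HIDDEN_FIELDS = ("email", "phone_number", "corporate_number")


@dataclass
class UserRecord:
    user_id: str
    company_name: str
    industry: int
    prefecture: str
    email: str
    phone_number: str
    corporate_number: str
    # A PII field added later; neither list was updated to mention it.
    representative_name: str = ""


def dump_allowlist(user: UserRecord) -> dict:
    """Emit only the fields that were explicitly declared public."""
    data = asdict(user)
    return {name: data[name] for name in PUBLIC_FIELDS}


def dump_denylist(user: UserRecord) -> dict:
    """Emit everything except the fields someone remembered to hide."""
    return {k: v for k, v in asdict(user).items() if k not in HIDDEN_FIELDS}
```

Here dump_denylist starts leaking representative_name the moment the field exists, while dump_allowlist keeps omitting it until someone deliberately adds it to PUBLIC_FIELDS.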

Frontend: the multi-stage gate ProtectedRoute

Backend authorization is the last bastion, but for UX we gate progressively on the front too. On the React side, ProtectedRoute judges in the order "authentication → profile complete → administrator → subscription active → industry," and if rejected at any of them, returns the appropriate redirect.

function ProtectedRoute({
  requiredIndustries,
  requiresAdmin,
  children,
}: {
  requiredIndustries?: ReadonlySet<Industry>;
  requiresAdmin?: boolean;
  children: React.ReactNode;
}) {
  const { user, isLoading } = useAuth();

  if (isLoading) return <Loading />;
  if (!user) return <Navigate to="/login" replace />;
  if (isProfileIncomplete(user)) return <Navigate to="/onboarding" replace />;
  if (requiresAdmin && !user.is_admin) return <Navigate to="/" replace />;
  if (IS_PROD && user.subscription_status !== "active")
    return <Navigate to="/subscription" replace />;
  if (requiredIndustries && !requiredIndustries.has(user.industry))
    return <Navigate to="/" replace />;

  return <UserContext value={user}>{children}</UserContext>;
}

The per-industry feature boundaries (for market users, for sawmills, for direct-ship senders…) are expressed by routes that thinly wrap this ProtectedRoute. Because "the set of permitted industries" is passed as data, you can survey the feature-to-role correspondence at a glance, and changes stay localized (ETC: easier to change).


3. The authentication foundation: Cognito JWT (RS256) and JWKS caching

Authentication uses Amazon Cognito. The backend verifies the incoming JWT with RS256. What separates production quality here is not stopping at "leave it to a library."

import jwt
from werkzeug.exceptions import Unauthorized  # assumed: Werkzeug's 401 exception


def verify_token(token: str) -> dict:
    signing_key = get_jwks_client().get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=COGNITO_CLIENT_ID,
        issuer=COGNITO_ISSUER,
        # Require exp / iat / iss / aud / token_use to be present
        options={"require": ["exp", "iat", "iss", "aud", "token_use"]},
    )
    # Prevent misuse of access tokens: reject anything that is not an id token
    if claims.get("token_use") != "id":
        raise Unauthorized("invalid token_use")
    return claims

Because algorithms=["RS256"] is explicit, neither the classic alg=none attack nor a downgrade to HS256 works. The token_use == "id" check blocks the confusion of calling an endpoint that expects an id token with an access token.

JWKS is cached with a "double-checked-locking singleton"

Fetching the JWKS (public key set) on every request adds latency and needlessly loads Cognito. On the other hand, each Fargate task runs multiple concurrent workers, so a naively initialized shared variable can race. So we use double-checked locking to create just one client.

import threading

from flask import current_app
from jwt import PyJWKClient

_jwks_lock = threading.Lock()

def get_jwks_client() -> PyJWKClient:
    client = current_app.config.get("COGNITO_JWKS_CLIENT")
    if client is not None:           # 1st check (fast path, no lock)
        return client
    with _jwks_lock:
        client = current_app.config.get("COGNITO_JWKS_CLIENT")
        if client is None:           # 2nd check (re-check inside the lock)
            client = PyJWKClient(COGNITO_JWKS_URI)
            current_app.config["COGNITO_JWKS_CLIENT"] = client
        return client

The JWKS refresh interval was initially 24 hours. Following a security-audit finding, I shortened it to 6 hours and added a pre-warm at startup (one synchronous fetch).

The trade-off of stateless JWTs is also documented honestly. Cognito's GlobalSignOut invalidates the refresh token, but id/access tokens that were already issued stay valid until exp (about 60 minutes). Invalidating them immediately would require a denylist (ElastiCache). Judging that not worth the cost, the latency, and the extra SPOF, I accept the risk and mitigate it with "a short TTL + request logs + security headers." For enterprise-facing work, recording accepted risks in a ledger earns far more trust than "hiding" them.


4. Idempotent payments: Stripe Connect's two-layer idempotency

Because this product has "recurring billing (subscription)" and "per-transaction settlement" coexisting, it adopts Stripe Connect. In payments, the requirement is to absolutely never cause double charges or lost events, even under network outages and retries.

The subscription state lives on User, with ID formats validated by DB CHECK constraints

class User(Base):
    stripe_customer_id: Mapped[str | None]
    stripe_subscription_id: Mapped[str | None]
    stripe_connect_account_id: Mapped[str | None]
    subscription_status: Mapped[SubscriptionStatus]

    __table_args__ = (
        # Validate the Stripe ID format at the DB level (keeps forged or invalid values out).
        # "\_" escapes the underscore, which is otherwise a single-character wildcard in LIKE.
        CheckConstraint(r"stripe_customer_id LIKE 'cus\_%'", name="ck_stripe_customer_id"),
        CheckConstraint(r"stripe_subscription_id LIKE 'sub\_%'", name="ck_stripe_subscription_id"),
        CheckConstraint(r"stripe_connect_account_id LIKE 'acct\_%'", name="ck_stripe_account_id"),
    )

The amount is always resolved on the server side. Passing a client-supplied amount straight to Stripe is a typical "amount tampering" vulnerability, and the early audit detected it. Payments now go through an AmountResolver that recalculates the amount from the order contents.

Layer 1: a content-addressed idempotency key

Stripe API calls carry an idempotency key. A random key, however, can't distinguish "a retry of the same operation" from "a new operation with changed content." So we weave a hash of the content into the key.

import hashlib


def idempotency_key(adjustment_id: str, scope: str, params: dict) -> str:
    digest = hashlib.sha256(canonical_json(params).encode()).hexdigest()[:12]
    # Same content → same key (safe to resend)
    # Changed content → different key (avoids an idempotency-key conflict within Stripe's 24h window)
    return f"adj_{adjustment_id}_{scope}_{digest}"
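
The key only works if "same content" always serializes to the same bytes. The article does not show canonical_json, so the version below is an assumption: sorted keys and fixed separators via the standard json module. The sample parameters are illustrative.

```python
import hashlib
import json


def canonical_json(params: dict) -> str:
    """Assumed helper: serialize so that equal content always yields equal text."""
    return json.dumps(params, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def idempotency_key(adjustment_id: str, scope: str, params: dict) -> str:
    digest = hashlib.sha256(canonical_json(params).encode("utf-8")).hexdigest()[:12]
    return f"adj_{adjustment_id}_{scope}_{digest}"


# A retry with the same content (even with a different key order) reuses the key.
first = idempotency_key("42", "refund", {"amount": 1200, "currency": "jpy"})
retry = idempotency_key("42", "refund", {"currency": "jpy", "amount": 1200})

# A re-operation with changed content gets a new key.
changed = idempotency_key("42", "refund", {"amount": 1500, "currency": "jpy"})
```

In this sketch first and retry are identical while changed differs, which is the property the two code comments above describe.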

Layer 2: deduplicate Webhooks with DynamoDB's conditional write

Stripe Webhooks are delivered "at least once." That is, the same event may arrive more than once. We make handling idempotent with DynamoDB's conditional PutItem.

from botocore.exceptions import ClientError


def already_processed(event_id: str) -> bool:
    try:
        table.put_item(
            Item={"event_id": event_id, "ttl": now() + THIRTY_DAYS},
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return False  # first delivery → process it
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True   # already processed → skip
        # A DynamoDB-side failure fails open (Stripe resends, so we lean on eventual consistency)
        log.warning("idempotency check degraded", exc_info=True)
        return False

There are 2 explicit decisions here about "which way to fall on failure."

  • If the table-name environment variable is unset, raise a RuntimeError at import time and stop startup (fail-closed). This prevents the accident of the idempotency mechanism being silently disabled.
  • When DynamoDB itself is down, fail open (proceed with processing). The judgment is that, since Stripe resends and the system converges to eventual consistency, proceeding does less harm than stopping payments.
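
The fail-closed half of this can be sketched in a few lines. The helper name require_env and the variable name IDEMPOTENCY_TABLE_NAME are hypothetical; the article states only that an unset table name raises RuntimeError at import time.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return a required setting, or refuse to start if it is missing or blank."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # Fail closed: a missing table name must never silently disable deduplication.
        raise RuntimeError(f"{name} is not set; refusing to start without webhook deduplication")
    return value


# At module import time in the webhook Lambda, this would read the real environment:
#   TABLE_NAME = require_env("IDEMPOTENCY_TABLE_NAME")
# Here a dict stands in for the environment so the example runs anywhere.
TABLE_NAME = require_env("IDEMPOTENCY_TABLE_NAME", {"IDEMPOTENCY_TABLE_NAME": "webhook-events"})
```

Because the check runs at import, a misconfigured deployment fails during startup rather than on the first webhook.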

Furthermore, billing reconciliation is implemented with a transactional outbox. It writes the outbox row in the same DB transaction as the business transaction, and a separate Lambda (launched every 5 minutes by EventBridge) sends the unsent portion to Stripe. Even if sending fails, the row remains, so it's reliably delivered. An hourly reconciliation Lambda does the matching and records it to an audit-log table.
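
The outbox pattern described here can be sketched with the standard library. This is an illustration under stated assumptions, not the product's code: it uses sqlite3 in place of PostgreSQL, and the table names, record_adjustment, and send_pending are hypothetical.

```python
import json
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE adjustments (id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            sent_at TEXT
        );
    """)


def record_adjustment(conn: sqlite3.Connection, adjustment_id: str, amount: int) -> None:
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute("INSERT INTO adjustments (id, amount) VALUES (?, ?)", (adjustment_id, amount))
        payload = json.dumps({"adjustment_id": adjustment_id, "amount": amount})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def send_pending(conn: sqlite3.Connection, send) -> int:
    """Deliver unsent rows in order; a row is marked sent only after send() succeeds."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent_at IS NULL ORDER BY id").fetchall()
    delivered = 0
    for outbox_id, payload in rows:
        try:
            send(json.loads(payload))
        except Exception:
            continue  # the row stays unsent and is picked up by the next run
        with conn:
            conn.execute("UPDATE outbox SET sent_at = datetime('now') WHERE id = ?", (outbox_id,))
        delivered += 1
    return delivered
```

A sender like this delivers at least once: if it crashes after send() but before the UPDATE, the row is sent again on the next run, which is why it pairs with the content-addressed idempotency key from Layer 1.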


5. Parallelizing and making heavy processing async

Generating Excel/PDF documents ("quote / delivery note / invoice") and "turning existing Excel into a DB" are the heaviest processing in this product. Running them naively in the request path freezes the admin screen.

Document generation: thread-parallel with ThreadPoolExecutor

For one order, the order form, delivery note, and invoice are generated simultaneously. Flask's app context is thread-local and is not carried over to worker threads, so the point is to explicitly re-establish it in each thread.

from concurrent.futures import ThreadPoolExecutor, as_completed

from sqlalchemy.orm import selectinload


def parallel_create_documents(app, order_id: str) -> None:
    tasks = [create_order_form, create_delivery_note, create_invoice]

    def run(task):
        with app.app_context():                      # set up an app context per thread
            doc = (
                Document.query
                .options(selectinload(Document.lines))  # avoid N+1 with select-in loading
                .filter_by(order_id=order_id)
                .with_for_update()                       # row lock prevents concurrent generation (held until the transaction ends)
                .one()
            )
            return task(doc)

    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(run, t) for t in tasks]
        for f in as_completed(futures):
            f.result()  # propagate the first exception

Excel uses openpyxl, and PDF uses LibreOffice's headless conversion. The judgment is that, rather than Celery or asyncio, a thread pool is straightforward and sufficient for this mix of CPU- and IO-bound document generation.

Excel/CSV import: an S3-event-driven Lambda

The processing that bulk-imports Excel, the industry's common language, is separated from the API server into an S3-upload-triggered Lambda. It opens openpyxl with read_only=True and bulk-INSERTs with psycopg2's execute_values. We put in 2 defenses here too.

  • Uploads are capped at 50MB (pre-checked with HeadObject to prevent OOM).
  • CSV/Excel formula-injection neutralization (CWE-1236): values starting with = + @ - get a leading ' to prevent them from being executed as formulas in the spreadsheet software that opens them.
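
The second defense can be sketched as a per-cell filter. The function name neutralize_cell is hypothetical; the prefix set and the leading apostrophe are the ones listed above.

```python
FORMULA_PREFIXES = ("=", "+", "@", "-")


def neutralize_cell(value):
    """Prefix text that a spreadsheet would evaluate as a formula with an apostrophe."""
    if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
        return "'" + value
    # Numbers, dates, None, and harmless text pass through unchanged.
    return value


row = ["Cedar 105x105", "=HYPERLINK(\"http://example.com\")", -5, "-5", None]
safe_row = [neutralize_cell(cell) for cell in row]
```

Only strings are touched in this sketch, so a genuinely numeric -5 stays a number while the text "-5" is quoted.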

Frontend: polling with exponential backoff + Page Visibility support

When the front waits for heavy processing to complete, a fixed-interval setInterval keeps hitting the API even in a background tab, wasting cost. So we implement a polling hook with exponential backoff + linkage to the screen's visibility state.

usePollingWithBackoff({
  baseIntervalMs: 1000,
  maxIntervalMs: 30_000,           // grows 1 → 2 → 4 → … → 30s
  onTick: async () => {
    const next = await refetch();
    // stop once every item has reached success / failure
    return { shouldStop: next.every(isTerminal) };
  },
});
// while document.hidden is true, requests are suppressed; on re-display the interval resets to base

Since it doesn't waste the API in a background tab, it helps both user experience and cloud cost.
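
The interval logic itself is small. The sketch below models it in Python (the language used for the other added examples) rather than as a React hook; the class name PollingSchedule and its methods are hypothetical, and only the 1 s base, 30 s cap, doubling, and reset-on-visible behavior come from the text.

```python
class PollingSchedule:
    """Exponential backoff with a cap, reset when the page becomes visible again."""

    def __init__(self, base_ms: int = 1000, max_ms: int = 30_000):
        self.base_ms = base_ms
        self.max_ms = max_ms
        self._next_ms = base_ms

    def tick(self) -> int:
        """Return the delay before the next request, then double it up to the cap."""
        delay = self._next_ms
        self._next_ms = min(delay * 2, self.max_ms)
        return delay

    def on_visible(self) -> None:
        """Called when the tab is shown again: start over from the base interval."""
        self._next_ms = self.base_ms


schedule = PollingSchedule()
delays = [schedule.tick() for _ in range(7)]
schedule.on_visible()
after_resume = schedule.tick()
```

With these defaults the delays grow 1, 2, 4, 8, 16 seconds, then hold at the 30-second cap, and the first request after the tab returns goes back to 1 second.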


6. Database efficiency and reliability

On PostgreSQL 16, I stack plain but effective improvements.

| Improvement | Content |
| --- | --- |
| Added 48 FK indexes with zero downtime | Added the missing foreign-key indexes with CREATE INDEX CONCURRENTLY + IF NOT EXISTS. Avoids the write-blocking lock that a plain CREATE INDEX takes in production |
| Budgeting the connection pool | pool_size=5 / max_overflow=5 / pool_recycle=1800 / pool_pre_ping=True. (5+5)×8 tasks = 80 < the limit of db.t4g.micro. Prevents connection exhaustion on scale-out |
| Resolving the N+1 of the daily report | Changed an upsert that flushed once per site into add_all + a single flush. The flush count is now constant regardless of the number of sites |
| Savepoint isolation of tests | Sets a savepoint per test and rolls back. The whole suite runs in about 11 seconds, which keeps CI fast |

By running CREATE INDEX CONCURRENTLY inside an autocommit_block and aligning the naming convention with SQLAlchemy's auto-generated names (ix_<table>_<column>), I keep migration drift at zero. Migrations are version-controlled with Alembic across 204 revisions, and existing migrations are never edited.


7. 4 rounds of security audit

This is the core of winning enterprise trust. This product has gone through 4 rounds of security audit and credential rotation.

| Round | Method | Main findings and response |
| --- | --- | --- |
| R1 | Static source audit | 40 items (Critical 4 / High 17 / Medium 14 / Low 5). Centered on payment integrity: tampering of the client-specified amount, test-charge bypass, the admin session Cookie's httponly=False. Closed 4/4 Critical |
| R3 | Live staging assessment (about 250 requests) | Critical 2: plaintext credentials in Lambda environment variables (→ migrated to Secrets Manager), Webhook idempotency fail-open (→ fail-closed). Path traversal, JWT forgery (alg=none/tampered aud), CORS reflection, and Webhook signature verification all defended successfully |
| R4 | Black-box + white-box penetration | Scanned all 221 endpoints with Cognito users of the actual 15 roles. 0 missing-authorization findings. 1 High (cross-tenant PII exposure on a cross-company API) fixed the same day, confirmed 0 on re-assessment |
| R5 | Category coverage (SSRF/CSRF/XXE/IDOR/SSTI, etc., 14 kinds) | 0 new findings. Added implementations of RDS rds.force_ssl=1 enforcement, notification rate limiting, etc. |

R4's "0 missing-authorization findings across all 221 endpoints" is the result of proving, from an actual attacker's viewpoint, that every route is guarded by API Gateway's Cognito authorizer. Even /health doesn't pass through unauthenticated.

The security headers are in place too.

Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Content-Security-Policy: default-src 'none'; frame-ancestors 'none'

The residual risks that "couldn't be fully fixed" or that I "decided not to fix" are stated explicitly in a ledger (the JWT-invalidation trade-off described earlier, user-existence guessability via Cognito's SignUp, and so on). This practice is close to ADRs (records of design decisions): it lets a third party trace what was known and what was chosen.


8. Observability and recoverability

Countermeasures against Slack-alarm spoofing (log injection)

Operational alerts flow to Slack, but the initial "failure marker" was a position-dependent string matched against the leading token of a log line. Because CloudWatch strips leading whitespace, input with an embedded newline could start a new line with the marker and fire a fake operational alarm. This is classic log injection (CWE-117).

As a countermeasure, I changed the marker to single-line top-level JSON.

import json

# Old, forgeable implementation (position-dependent string):  "[w1=..., FAILED]"
# New, unforgeable implementation (structured):
log.error(json.dumps({"marker": "SLACK_DELIVERY_FAILED", "reason": reason}))

CloudWatch's metric filter matches on { $.marker = "SLACK_DELIVERY_FAILED" }, so whatever is written in the message body, the top-level marker key can't be forged.
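
Why the structured form resists injection can be checked locally. The sketch below is an approximation: top_level_marker only imitates what a { $.marker = ... } filter looks at, and log_line and the REQUEST_REJECTED marker are hypothetical names.

```python
import json

MARKER = "SLACK_DELIVERY_FAILED"


def log_line(marker: str, reason: str) -> str:
    """Serialize one log event as single-line JSON; json.dumps escapes newlines."""
    return json.dumps({"marker": marker, "reason": reason})


def top_level_marker(line: str):
    """Imitate the metric filter: parse the line as JSON and read the top-level key."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None  # not JSON, so a JSON filter pattern cannot match it
    return event.get("marker") if isinstance(event, dict) else None


# Attacker-controlled text that tries to smuggle a fake marker onto a new line.
attack = 'ok\n{"marker": "SLACK_DELIVERY_FAILED"}'
line = log_line("REQUEST_REJECTED", attack)
```

The attacker's newline is emitted as an escaped \n inside the reason string, so the event stays on one line and its top-level marker remains REQUEST_REJECTED.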

"Out-of-band" escalation that notices even if Slack is down

If the notification path itself breaks, alerting defeats its own purpose. So we detect Slack delivery failure with the marker above and escalate through a path that doesn't depend on Slack: CloudWatch metric filter → SNS email. In addition, a contract test mechanically guarantees that the marker string stays in sync across the 3 places that use it (backend / Lambda / Terraform).

The Slack-notification handler for ERROR logs also classifies each response as a "permanent failure (4xx)" or a "temporary failure." It retries only temporary failures with exponential backoff, and on a permanent failure it aborts immediately and emits the marker above. Because the runtime mixes threads and greenlets, it shares one requests.Session and accesses it in a thread-safe way.
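
The classification and retry rule can be sketched without any HTTP library. This is a simplified model, not the handler itself: deliver_with_retry, its parameters, and the send callable are hypothetical, and it treats every 4xx as permanent, as the text describes.

```python
import time


def deliver_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, fails permanently, or attempts run out.

    send() is a hypothetical callable that returns an HTTP status code, or
    raises ConnectionError / TimeoutError on a network-level failure.
    Returns "delivered", "permanent", or "exhausted".
    """
    for attempt in range(max_attempts):
        try:
            status = send()
        except (ConnectionError, TimeoutError):
            status = None  # network failure: treat as temporary
        if status is not None:
            if 200 <= status < 300:
                return "delivered"
            if 400 <= status < 500:
                # Permanent failure: retrying cannot help, so abort immediately.
                return "permanent"
        if attempt < max_attempts - 1:
            # Temporary failure: wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return "exhausted"
```

In this model the caller would emit the SLACK_DELIVERY_FAILED marker for both "permanent" and "exhausted", so the out-of-band path fires whenever Slack delivery ultimately fails.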


9. Cost optimization

"World-class quality" does not mean ignoring cost. Rather, to run a production SaaS solo to small-scale, cost design is exactly where ability shows.

  • Fold the billed resources to zero with a single environment_active. During the stopped period, the NAT, ALB/NLB, VPC Link, RDS (count=0), ECS (desired=0), and the Secrets Manager VPC endpoint all disappear. Restart is from code too.
  • Adopt Graviton (ARM / t4g) for RDS and the bastion, improving cost efficiency vs x86.
  • Staging is 100% Fargate Spot (interruption-tolerant, about 70% reduction). Production is on-demand.
  • A single NAT (drop redundancy for cost optimality). VPC endpoints are production-only.
  • Scheduled scaling: production ECS at min2/max8 during weekday business hours, min1/max4 at night.
  • Terraform state is S3 native lock (use_lockfile = true), eliminating the cost of a DynamoDB lock table.
  • S3 lifecycle (production-only): Standard → IA(30d) → Glacier(90d) → delete(365d).
  • CMK is production-only; staging uses AWS-managed keys.

"Cheap in normal times, scaling only when needed, foldable to zero during unused periods" — the point is that this elasticity is declared in code.


10. CI/CD and quality gates

Finally, the mechanisms that maintain this quality without relying on human review.

Deployment runs via GitHub Actions OIDC, with no long-lived AWS keys at all. The backend pipeline is docker build → ECR push → ecs update-service --force-new-deployment. The frontend pipeline is S3 sync (static assets are cached as immutable, the root document as no-cache) → CloudFront invalidation.

Terraform is also automated with 3 pipelines.

  • plan: run fmt-check / validate / tfsec on a PR and comment the plan.
  • apply: main → staging, the production branch → production. The apply role gets a permissions boundary, structurally forbidding org takeover or stopping the audit foundation.
  • drift: run on a schedule weekday mornings, and when drift is detected, auto-file a GitHub Issue and auto-close on recovery.

Quality gates are two-stage. With pre-commit (changed files only, seconds) and pre-push (a full mirror of CI), we run the following.

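| Layer | Format | Lint | Type | Security | Test |
| --- | --- | --- | --- | --- | --- |
| Backend | Ruff | Ruff / Bandit / Vulture / deptry | mypy | pip-audit | pytest (Docker, 2,153) |
| Frontend | Prettier | ESLint | tsc --noEmit | npm audit | Vitest |
| Infra | terraform fmt | terraform validate | - | tfsec | terraform plan |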

Plus gitleaks (secret scanning), Trivy (image CVE), hadolint, Dependabot (continuously updating 6 ecosystems). Conventional Commits is required, and --no-verify and force push to main are forbidden.


Summary

Behind the result of the METI Minister's Award lies not flashy features but an accumulation of plain, consistent design decisions.

  • Multi-tenant authorization is made structurally leak-proof with a frozenset whitelist + router-layer consolidation + a PII whitelist schema.
  • Payments don't double-charge even under retries, with "server-side amount resolution + content-addressed idempotency key + DynamoDB deduplication + outbox."
  • Heavy processing is separated into thread parallelism and event-driven Lambdas, and the front waits with visibility-linked backoff polling.
  • Reliability is made provable in the form of 4 audit rounds, 0 missing-authorization findings across 221 APIs, log-injection countermeasures, out-of-band escalation, and an accepted-risk ledger.
  • Cost is elastic and cheap with environment_active, Graviton, Spot, and a single NAT.

The difference between "building something that works" and "building a SaaS that withstands production operation and a third party's attacks and audits" lies in exactly these one-by-one judgments. If you're considering DX of a legacy industry, or new development / turnaround of a B2B SaaS, I undertake it one-stop at this level, from requirements definition through infrastructure, security, and operations.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
