"Can you still trace last year's unauthorized access in the GuardDuty console?" — this is a question I often throw out in AWS security audits.
The answer is, in many cases, "no." GuardDuty findings are well suited to telling you about the current threat, but they flow past and eventually drop out of view. They get archived and fall out of the console's default view. Half a year or a year later, when an auditor says "match that finding against CloudTrail events and VPC Flow Logs and produce the blast radius in SQL," the GuardDuty console alone can no longer get you there.
This isn't a defect of GuardDuty. GuardDuty is a detection engine, not a long-term store or an analysis platform. Yet real-world operations have three demands beyond detecting what is happening now: (1) long-term retention for compliance (years), (2) cross-correlation of finding × CloudTrail × VPC Flow in one query, and (3) a supply line that feeds findings to a SIEM. Amazon Security Lake answers all three.
This article is an implementation guide for aggregating GuardDuty findings into Amazon Security Lake and handling long-term retention, cross-analysis, and SIEM integration with OCSF (Open Cybersecurity Schema Framework). As background, I also weave in my experience implementing IAM, observability, and DR together on a serverless payment platform running on multi-account AWS. Because it handled actual money, carbon credits, and local currencies, "who did what, when" had to be reproducible over a span of years and retained in a form that stands up to audit.
The rule of this article: Security Lake's definition, native sources, source identifiers, OCSF conversion, subscriber types, and Terraform arguments are based on the AWS official documentation and the Terraform AWS Provider (as of June 2026). Supported sources, schema versions, and pricing get revised, so always confirm the latest official information before shipping to production. The most important point to get right in this article: there is no "direct GuardDuty source" in Security Lake. GuardDuty findings only enter via the path "GuardDuty → Security Hub (GuardDuty integration) → Security Lake's AWS Security Hub CSPM findings (SH_FINDINGS)." Get this wrong and you end up with "I built a Security Lake but not a single GuardDuty finding is in it." The code is shaped close to real operation, but Security Lake does not replace detection (GuardDuty), investigation (Detective), or defense (WAF, least-privilege IAM).
0. The Mental Model: The 3 Layers of detect → investigate → long-term aggregation & analysis
Before starting the design, let me fix security operations on 3 time axes. This is the foundation of everything in this article.
GuardDuty does DETECT (detection = now), Detective does INVESTIGATE (investigation = interactively, up to 1 year), and Security Lake does AGGREGATE & ANALYZE (aggregation, normalization, long-term retention = years, SQL/SIEM). These 3 layers merely differ in time axis; they're not rivals.
| Layer | Time axis | Service | The question it answers | Data form |
|---|---|---|---|---|
| DETECT | Now (real time) | GuardDuty | "Is something bad happening now?" | finding (GuardDuty's own format) |
| INVESTIGATE | Up to 1 year, interactive | Amazon Detective | "Why did it happen, and how far were we breached?" | A behavior graph (visualization / pivoting) |
| AGGREGATE & ANALYZE | Years, SQL/SIEM | Amazon Security Lake | "What was happening across the past? How do we present it for audit?" | OCSF + Parquet (your own S3) |
Many "we put in GuardDuty" projects stop at DETECT. A somewhat more advanced site reaches INVESTIGATE by adding Detective (the GuardDuty × Detective investigation workflow). But Detective is an interactive investigation tool, and its behavior graph goes back at most 1 year. "JOIN 3 years of findings with CloudTrail and produce a compliance report in SQL" and "flow findings to Splunk/QRadar and run them through an existing SIEM's rules" are outside Detective's remit.
From here, 3 consequences emerge.
- Detective and Security Lake are different layers, answering different questions. Detective is "investigate this one case interactively, with a graph, going back up to 1 year." Security Lake is "normalize all findings to the same schema (OCSF) as CloudTrail and VPC Flow, pile them in your own S3 for years, and cross-analyze with SQL or a SIEM." The former is depth (one case's context), the latter is breadth and permanence (whole, long-term, machine-readable).
- Security Lake neither "detects" nor "investigates." Finding new threats is GuardDuty's job, chasing them interactively is Detective's, stopping them is the job of WAF and least-privilege IAM. What Security Lake handles is only "collect, normalize, hold long, and distribute." So it can be safely placed as the company-wide, all-Region foundation.
- GuardDuty findings don't enter Security Lake "directly." This is this article's biggest pitfall. The next chapter pins down the wiring precisely — get it wrong and you build a data lake with nothing in it.
Pin down these 3 points, and the task of "putting in Security Lake" turns out to be three design steps: ① wire the GuardDuty → Security Hub → Security Lake path correctly, ② design native sources and multi-Region, ③ define subscribers (SQL or SIEM). Let me build them in order.
1. What Is Amazon Security Lake: An "OCSF-Normalized" Security Data Lake Standing on Your Own S3
First, let me fix in one line what Security Lake is and isn't.
Amazon Security Lake = a fully managed service that auto-aggregates security logs/events from AWS, SaaS, on-prem, and third parties into a dedicated data lake on your own AWS account's S3, normalizes them to Apache Parquet + OCSF (the standard schema), and centrally manages long-term retention and delivery to subscribers.
The official definition is this — "Amazon Security Lake is a fully managed security data lake service." And decisively important is ownership — "The data lake is backed by Amazon Simple Storage Service (Amazon S3) buckets, and you retain ownership over your data." That is, the data stays in your S3, and you own it. It's structurally different from being sucked up into a lock-in-laden external SaaS.
1.1 The 5 Core Functions
| Function | The official point | The scene where this works |
|---|---|---|
| Aggregate into your own account | Collect cloud, on-prem, and custom logs across accounts/Regions. S3-backed, ownership is yours | Satisfies the "keep audit data under our own management" requirement directly |
| Normalize (OCSF + Parquet) | Automatically partition, convert to Parquet, and convert native sources to OCSF. No post-processing | finding, CloudTrail, and VPC Flow can be JOINed in the same schema |
| Multi-tier access to subscribers | Grant access per source / per Region. Choose notification of new objects or query | Distribute just enough for SQL analysis or SIEM supply |
| Multi-account, multi-Region | Bulk-enable across all Regions / all accounts. Aggregate with a rollup Region (data-residency support) | The whole org's findings in one place, controlling data location |
| Lifecycle management | Optimize cost with custom retention settings + automatic storage tiering | Run years-long retention cheaply with S3 tiering |
1.2 Why "Normalization (OCSF)" Works — JOIN in the Same Schema
Security Lake's true value lies in normalization to OCSF. The official words — "Security Lake automatically partitions incoming data from natively supported AWS services and converts it to a storage- and query-efficient Parquet format. It also transforms data from natively supported AWS services to the Open Cybersecurity Schema Framework (OCSF) open-source schema."
What's good about it? A GuardDuty finding, a CloudTrail API call, a VPC Flow Logs network record: raw, their formats are all different. To cross-query them, you'd originally have to parse each log for Athena and write the ETL for JOINing (exactly the work Detective touts as "you don't have to do yourself"). Security Lake aligns these to the common OCSF schema in columnar Parquet and stores them. As a result:
-- Illustration: because everything is normalized to OCSF, findings and API activity share one vocabulary.
-- (Chapter 5 shows the concrete table names and columns.)
SELECT ...
FROM security_hub_findings_ocsf AS f   -- GuardDuty-derived findings (SH_FINDINGS)
JOIN cloudtrail_management_ocsf AS c   -- joinable in the same OCSF vocabulary
  ON f.<actor> = c.<actor>
Cross-analysis like "starting from a finding, pull the APIs that its subject was calling at the same time" works with SQL alone, with no ETL to write. This is the decisive difference between "a log pile that's merely collected" and "a normalized data lake."
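To make the point concrete, here is a toy Python sketch (not Security Lake code) of why a shared vocabulary turns correlation into a plain join. The raw field names and the normalized keys `account_uid`, `actor`, `title`, and `operation` are illustrative assumptions, not the real GuardDuty, CloudTrail, or OCSF layouts.

```python
from typing import Dict, List

# Hypothetical raw records: the same concepts appear under different names per source.
RAW_FINDING = {"AccountId": "123456789012", "Principal": "alice", "Type": "UnauthorizedAccess"}
RAW_API_CALL = {"recipientAccountId": "123456789012", "userName": "alice", "eventName": "GetObject"}


def normalize_finding(raw: Dict[str, str]) -> Dict[str, str]:
    """Map the finding's own field names onto the shared vocabulary."""
    return {"account_uid": raw["AccountId"], "actor": raw["Principal"], "title": raw["Type"]}


def normalize_api_call(raw: Dict[str, str]) -> Dict[str, str]:
    """Map the API record's own field names onto the same vocabulary."""
    return {
        "account_uid": raw["recipientAccountId"],
        "actor": raw["userName"],
        "operation": raw["eventName"],
    }


def join_on_actor(findings: List[Dict[str, str]], api_calls: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Pair each finding with the API calls made by the same actor in the same account."""
    return [
        {"actor": f["actor"], "title": f["title"], "operation": c["operation"]}
        for f in findings
        for c in api_calls
        if (f["account_uid"], f["actor"]) == (c["account_uid"], c["actor"])
    ]


findings = [normalize_finding(RAW_FINDING)]
api_calls = [normalize_api_call(RAW_API_CALL)]
print(join_on_actor(findings, api_calls))
```

Without the two `normalize_*` steps, the join condition would need per-source special cases. Security Lake does the equivalent mapping once at ingestion, so the query side only ever sees the shared names.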
1.3 The AWS Services Security Lake Uses Internally (Know the Mechanism and You Can Read Operations)
Security Lake is a managed service assembled atop multiple AWS services. The internal dependencies the official docs list:
| Service | Its role in Security Lake (official) |
|---|---|
| Amazon EventBridge | "to notify subscribers when objects are written to the data lake." (notification of object writes) |
| AWS Glue | A crawler creates/updates Glue Data Catalog tables. Holds the partition metadata of Lake Formation tables |
| AWS Lake Formation | Creates a separate table per source (schema, partitions, data location). Subscribers can query these |
| AWS Lambda | ETL jobs on raw data, and partition registration in Glue |
| Amazon S3 | Stores the data body (storage class / retention follow S3's. S3 Select unsupported) |
| Amazon SQS | Manages event-driven processing and notifications |
"One Lake Formation table is created per source" — this one sentence becomes the foundation when you later write SQL (it splits into a finding table, a CloudTrail table, …).
1.4 What Security Lake "Doesn't Do" (Stating the Boundary)
To prevent confusion, let me state the boundary.
- It doesn't detect new threats. That's the job of GuardDuty, Inspector, and Macie. Security Lake only aggregates and retains logs including the already-detected results (findings).
- It doesn't stop attacks. Containment is EventBridge → automated response or a runbook (the automated response in the GuardDuty main article).
- It doesn't draw an interactive investigation graph. That's Amazon Detective. Security Lake is the storage / distribution of normalized raw data, not a visualized behavior graph.
- The "aggregated, normalized view (single pane)" of findings itself is Security Hub's role. The division of roles among the 5 security services is covered in the comparison article. Security Hub lays them out on the desk, and Security Lake stows those findings in OCSF into the long-term warehouse — that's the before-and-after relationship.
2. [Most Important] The Wiring: GuardDuty Findings Enter Security Lake "Only via Security Hub"
This is the one point in this article you must absolutely not miss.
2.1 There Is No "Direct GuardDuty Source" in Security Lake
The AWS services Security Lake natively supports are only the following 6 systems, officially enumerated.
| Native source (official) | What logs |
|---|---|
| AWS CloudTrail management and data events (S3, Lambda) | Management events + S3 / Lambda data events |
| Amazon EKS Audit Logs | Kubernetes control-plane audit logs |
| Amazon Route 53 resolver query logs | DNS name-resolution queries |
| AWS Security Hub CSPM findings | findings aggregated in Security Hub (← GuardDuty enters here) |
| Amazon VPC Flow Logs | Network traffic |
| AWS WAFv2 logs | WAF request logs |
"GuardDuty" is not in this list. There's no means to add GuardDuty directly as a Security Lake source. So where do findings come from — AWS Security Hub CSPM findings.
2.2 The Correct Path: GuardDuty → Security Hub → Security Lake
GuardDuty is integrated with Security Hub, and GuardDuty findings are ingested into Security Hub. And Security Hub's findings are a native source of Security Lake. Therefore the wiring until a finding reaches Security Lake is the following 3 stages.
[1] Enable GuardDuty (detector ON)
 │ generates findings
 ▼
[2] Enable Security Hub (CSPM) and turn the GuardDuty integration ON
 │ ingests GuardDuty findings as ASFF
 ▼
[3] In Security Lake, add "AWS Security Hub CSPM findings (SH_FINDINGS)" as a source
 │ normalizes Security Hub findings to OCSF + Parquet
 ▼
The data lake on your own S3 (Lake Formation table: Security Hub findings)
 → Athena/Redshift/OpenSearch (query) / SQS notification → SIEM (data)
Only when these 3 stages are in place does a GuardDuty finding land in Security Lake. If any is missing:
- Skip [2] (no Security Hub, or GuardDuty integration OFF) → there's no finding in Security Hub, so none enters Security Lake either.
- Forget to add SH_FINDINGS in [3] → CloudTrail and VPC Flow enter, but findings stay empty forever.
Detecting a wiring mistake: in production, confirm both "after enabling SH_FINDINGS, has collection actually started for that Region according to GetDataLakeSources (aws securitylake get-data-lake-sources)?" and "do GuardDuty findings actually exist on the Security Hub side?" Security Lake only pulls findings that are in Security Hub, so if the upstream (Security Hub's GuardDuty integration) is empty, the downstream is empty too. The surest way is to emit a sample with GuardDuty's create-sample-findings and verify end-to-end that it flows Security Hub → Security Lake.
2.3 How It Looks on OCSF (the class name depends on the normalization version — caution)
After normalization, which OCSF event class do Security Hub CSPM findings map to? Per the official "OCSF source identification" table, the class differs by normalization version.
| Source version | Security Hub CSPM's class_name (official) |
|---|---|
| Version 1 (metadata.version 1.0.0-rc.2) | Security Finding |
| Version 2 (metadata.version 1.1.0) | Vulnerability Finding, Compliance Finding, or Detection Finding (depending on the ASFF value) |
That is, which OCSF class a finding falls into changes by which source_version you ingest with. GuardDuty's threat-detection findings, by nature, hit ASFF's detection family, so under Version 2 it's natural for them to be organized into the Detection Finding class, but the actual class classification depends on the finding's contents (ASFF's ProductFields, etc.).
An honest note (confirm): because the OCSF class changes with the normalization version and the finding's contents, this article treats as confirmed only what the official table says (Version 1 = Security Finding / Version 2 = Vulnerability/Compliance/Detection Finding) and does not assert "GuardDuty findings always become Detection Finding." In production, use Athena to confirm the metadata.version and class_name (class_uid) of the records that actually landed in your lake before writing queries (5.2 shows the confirmation query).
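As a small aid for that confirmation step, the following Python sketch checks (metadata.version, class_name, count) rows against the table above and flags anything the table does not predict. The function name `unexpected_rows` and the sample rows are made up for illustration; only the version-to-class mapping comes from the table.

```python
from typing import Dict, List, Set, Tuple

# Version -> class names, copied from the table above.
EXPECTED_CLASSES: Dict[str, Set[str]] = {
    "1.0.0-rc.2": {"Security Finding"},
    "1.1.0": {"Vulnerability Finding", "Compliance Finding", "Detection Finding"},
}

Row = Tuple[str, str, int]  # (metadata.version, class_name, row count)


def unexpected_rows(rows: List[Row]) -> List[Row]:
    """Return rows whose class_name is not listed for their metadata.version."""
    return [
        (version, class_name, count)
        for version, class_name, count in rows
        if class_name not in EXPECTED_CLASSES.get(version, set())
    ]


# Made-up sample shaped like a GROUP BY metadata.version, class_name result.
sample = [
    ("1.1.0", "Detection Finding", 120),
    ("1.1.0", "Compliance Finding", 4500),
    ("1.1.0", "Security Finding", 3),
]
print(unexpected_rows(sample))
```

A non-empty result does not necessarily mean the lake is wrong. It means the data disagrees with your assumptions, which is the moment to re-read the official table before writing queries that filter on class_name.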
3. Native Sources and Identifiers: The "Exact Names" to Use in Terraform/CLI
When you add a source to Security Lake, the identifier to use in the API / CLI / Terraform is a fixed string. Write it vaguely and it's rejected, so let me fix it here.
3.1 The List of Source Identifiers (Verified with the Terraform AWS Provider)
The values you can pass to source_name of aws_securitylake_aws_log_source (and CreateAwsLogSource) are, per the Terraform AWS Provider docs, the following 8.
| source_name | The corresponding AWS source | Notes |
|---|---|---|
| SH_FINDINGS | AWS Security Hub CSPM findings | ← GuardDuty findings enter via here (Chapter 2) |
| CLOUD_TRAIL_MGMT | CloudTrail management events | The only exception that "needs separate logging config" (a CloudTrail trail) |
| LAMBDA_EXECUTION | CloudTrail's Lambda data events | |
| S3_DATA | CloudTrail's S3 data events | |
| VPC_FLOW | VPC Flow Logs | |
| ROUTE53 | Route 53 resolver query logs | |
| EKS_AUDIT | EKS audit logs | |
| WAF | WAFv2 logs | |
If GuardDuty aggregation is the goal, first put in SH_FINDINGS. If you want cross-correlation to work, also put in CLOUD_TRAIL_MGMT (who hit which API) and VPC_FLOW (which IP they communicated with) — this gets the material to "start from a finding and match against API and network."
3.2 "No Separate Logging Config Needed" — An Independent Duplicated Stream
Here is an important fact that makes operations easier, in the same spirit as GuardDuty itself. The official words: "To add one or more of the preceding services as a log source in Security Lake, you don't need to separately configure logging in these services, except CloudTrail management events. ... Security Lake pulls data directly from these services through an independent and duplicated stream of events."
That is:
- SH_FINDINGS, VPC_FLOW, ROUTE53, EKS_AUDIT, WAF, etc., don't need separate log-output config on each service's side. Security Lake pulls directly via an independent duplicated stream. Even if there's already a logging config, you don't need to change it.
- The sole exception is CLOUD_TRAIL_MGMT: only CloudTrail management events presuppose a separate CloudTrail trail config.
Thanks to this "duplicated stream" design, you can add sources without breaking your existing log operations. It's the same idea as GuardDuty itself "not needing you to enable, store, and pay for CloudTrail / VPC Flow / DNS yourself" (the main guide's mental model).
3.3 Add a Source via the CLI (the finding path in one command)
Before landing it in Terraform, grasp the behavior via the CLI. sourceName and regions are required; accounts and sourceVersion are optional.
# Add Security Hub findings (= the entrance for GuardDuty findings) as a source.
# - sourceName=SH_FINDINGS is the only entrance that ingests GuardDuty findings as OCSF
# - If accounts is omitted, all accounts in the organization are targeted (official behavior)
# - For regions, specify only Regions where Security Lake is already enabled (others cause an error)
aws securitylake create-aws-log-source \
  --sources sourceName=SH_FINDINGS,accounts='["123456789012","111122223333"]',regions='["ap-northeast-1"]',sourceVersion="2.0"
# Also add the material for cross-correlation (CloudTrail management events, VPC Flow).
aws securitylake create-aws-log-source \
  --sources sourceName=CLOUD_TRAIL_MGMT,regions='["ap-northeast-1"]'
aws securitylake create-aws-log-source \
  --sources sourceName=VPC_FLOW,regions='["ap-northeast-1"]'
# Check that collection has started (verify early that findings really have a path in).
aws securitylake get-data-lake-sources --accounts "123456789012"
The trap of omitting accounts: the official docs state plainly, "if you don't provide the accounts parameter, the command applies to the entire set of accounts in your organization." Unspecified = the whole org. To avoid unintentionally spreading to all accounts, specifying accounts is safer in a staged rollout.
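One way to make the "never omit accounts by accident" rule mechanical is to generate the --sources argument from a small helper that refuses unknown identifiers and empty account lists. This is a sketch under assumptions: `build_sources_arg` is a hypothetical helper name, it only builds the string in the shorthand form used in the commands above, and it does not call AWS.

```python
import json
from typing import List, Optional

# The 8 identifiers from the table in 3.1.
VALID_SOURCE_NAMES = {
    "SH_FINDINGS", "CLOUD_TRAIL_MGMT", "LAMBDA_EXECUTION", "S3_DATA",
    "VPC_FLOW", "ROUTE53", "EKS_AUDIT", "WAF",
}


def _compact(values: List[str]) -> str:
    """Render a list as compact JSON, e.g. ["a","b"]."""
    return json.dumps(list(values), separators=(",", ":"))


def build_sources_arg(
    source_name: str,
    regions: List[str],
    accounts: List[str],
    source_version: Optional[str] = None,
) -> str:
    """Build the value for --sources, rejecting the risky or invalid cases."""
    if source_name not in VALID_SOURCE_NAMES:
        raise ValueError(f"unknown source_name: {source_name}")
    if not regions:
        raise ValueError("regions is required")
    if not accounts:
        # Omitting accounts would target every account in the organization.
        raise ValueError("accounts is empty: refusing to target the whole organization")
    arg = f"sourceName={source_name},accounts='{_compact(accounts)}',regions='{_compact(regions)}'"
    if source_version:
        arg += f',sourceVersion="{source_version}"'
    return arg


print(build_sources_arg("SH_FINDINGS", ["ap-northeast-1"], ["123456789012", "111122223333"], "2.0"))
```

The helper deliberately has no "whole organization" mode. If an org-wide rollout is what you want, it is safer to make that an explicit, separately reviewed step than a side effect of a missing argument.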
4. Multi-Account, Multi-Region and Retention: Delegated Administrator, Rollup Region, Lifecycle
If you have multiple accounts, centrally manage Security Lake in the org too. Like GuardDuty and Detective, the standard is to make a dedicated security account the delegated administrator and aggregate there (if you made the security-tooling account the administrator in the GuardDuty delegated-administrator design, aligning Security Lake to the same account makes operations consistent).
4.1 The Rollup Region — Bring Data to One Place, and Protect Residency Too
The core of the multi-Region design is the rollup Region. The official words — "you can also designate rollup Regions to consolidate security log and event data from multiple Regions. This can help you comply with data residency compliance requirements."
- You can aggregate the data of multiple "contributing Regions" into a designated "rollup Region." Instead of querying lakes scattered in each Region separately, you can cross them in one Region.
- At the same time it works for data residency — location control like "bring EU data to a rollup Region within the EU" is possible.
- Subscribers, too, made in the rollup Region can access the data of multiple Regions at once (official: "To give a subscriber access to data from multiple Regions, you can specify the Region where you create the subscriber as a rollup Region and have other Regions contribute data to it.")
GuardDuty was recommended to be "enabled in all Regions to prevent missing global events." Bringing those all-Region findings (= Security Hub findings) to a rollup Region and doing SQL in one place is the straightforward form of multi-Region finding analysis.
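Before declaring rollup Regions, it can help to check the plan against your residency boundaries on paper. The sketch below does that in Python. The boundary names, the Region groupings, and the function `residency_violations` are assumptions for illustration; which Regions count as inside a given boundary is a legal and organizational decision, not something the Region name tells you.

```python
from typing import Dict, List, Set


def residency_violations(rollup_of: Dict[str, str], boundaries: Dict[str, Set[str]]) -> List[str]:
    """Return contributing Regions whose rollup Region lies outside their boundary.

    A contributing Region that belongs to no boundary is also reported,
    so that unclassified Regions are noticed rather than silently allowed.
    """
    violations = []
    for contributing, rollup in sorted(rollup_of.items()):
        allowed = next(
            (regions for regions in boundaries.values() if contributing in regions),
            None,
        )
        if allowed is None or rollup not in allowed:
            violations.append(contributing)
    return violations


# Hypothetical boundaries, defined by you and not derived from Region names.
BOUNDARIES = {
    "eu": {"eu-west-1", "eu-central-1"},
    "jp": {"ap-northeast-1", "ap-northeast-3"},
}
# contributing Region -> rollup Region
PLAN = {"eu-west-1": "eu-central-1", "ap-northeast-3": "ap-northeast-1"}

print(residency_violations(PLAN, BOUNDARIES))
```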
4.2 Lifecycle: Run "Years of Retention" Cheaply with S3 Tiering + Expiration
Cost management of long-term retention is done with S3 storage-class transitions + expiration. Security Lake officially provides "customizable retention settings" and "automated storage tiering." Being normalized to Parquet (columnar, high compression efficiency), it's also more space-efficient than piling raw logs as-is.
The design crux: recent data is queried frequently, so keep it in the standard class, transition to cheaper classes as it ages, and auto-delete at the retention deadline. For example, declare a tiering like "STANDARD_IA at 31 days, ONEZONE_IA at 80 days, delete at 365 days" in the lifecycle_configuration of the Terraform in Chapter 7. If the compliance requirement is "N-year retention," match expiration to that period.
A cost note (reference, confirm): Parquet + S3 tiering is a design that's cost-efficient for years-span retention. However, Security Lake itself, Athena's scan volume, and Glue's crawler/catalog each have their own charges. Even if the "cost to store" can be kept cheap, the "cost to query" rises with the SQL's scan volume, so the iron rule is to narrow with partitions (Region, date) (Chapter 5). Since the specific unit prices vary by region and time, confirm production estimates with the official pricing. The FinOps-viewpoint thinking is the same as the GuardDuty cost-optimization article — not "everything at the longest retention" but "only as much as the requirement needs."
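To sanity-check a tiering plan before writing it into configuration, a simplified model is enough: given an object's age, which class should it be in? The Python sketch below assumes transitions apply once the age reaches the configured day count and ignores that S3 lifecycle actions run asynchronously. The function name `storage_class_at` and the 365-day retention figure are illustrative assumptions.

```python
from typing import List, Tuple


def storage_class_at(age_days: int, transitions: List[Tuple[int, str]], expiration_days: int) -> str:
    """Return the storage class an object of the given age should be in.

    transitions is a list of (days, storage_class) pairs.
    Returns "EXPIRED" once the age reaches expiration_days.
    """
    if age_days >= expiration_days:
        return "EXPIRED"
    current = "STANDARD"
    for days, storage_class in sorted(transitions):
        if age_days >= days:
            current = storage_class
    return current


TRANSITIONS = [(31, "STANDARD_IA"), (80, "ONEZONE_IA")]
RETENTION_DAYS = 365  # illustrative; set this from your compliance requirement

for age in (10, 31, 100, 365):
    print(age, storage_class_at(age, TRANSITIONS, RETENTION_DAYS))
```

Running the plan through a model like this makes mistakes visible early, for example a retention period shorter than the last transition, which would make that transition dead configuration.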
5. In Practice: Pull "High-Severity GuardDuty Findings in the Last 90 Days" with SQL in Athena
This is the climax on the analysis side. SQL the OCSF-normalized Security Hub findings table (= the landing point of GuardDuty findings) in Athena — this is the biggest reward for putting in Security Lake.
5.1 A Text Architecture Diagram (detect → aggregate → analysis/SIEM)
[ DETECT ] GuardDuty (all Regions)
 │ generates findings
 ▼
 Security Hub (CSPM, GuardDuty integration ON)
 │ aggregates findings as ASFF
 ▼
[ AGGREGATE ] Amazon Security Lake (your own S3)
 │ Sources: SH_FINDINGS (+ CLOUD_TRAIL_MGMT, VPC_FLOW ...)
 │ Conversion: OCSF + Apache Parquet (one Lake Formation table per source)
 │ Aggregation: rollup Region / Lifecycle: S3 tiering + expiration
 ▼
┌──────────────────────────────────┬──────────────────────────────────┐
│ ① Query access (query)           │ ② Data access (data)             │
│ Lake Formation tables            │ New S3 objects are announced     │
│ → Athena / Redshift /            │ via SQS (or HTTPS)               │
│   OpenSearch with SQL            │ → consumed by third-party SIEM   │
│ (cross-analysis, audit reports)  │ (Splunk / QRadar / existing)     │
└──────────────────────────────────┴──────────────────────────────────┘
5.2 First "Confirm the Actual Tables and Columns," Then Query
The OCSF table names / columns differ by normalization version and environment (database name), so don't write a production query out of the blue; first confirm the substance. This is the very principle of "build the verification path first."
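-- ① Check the Glue database and tables that Security Lake created.
-- (A table for Security Hub findings is created to correspond to SH_FINDINGS.)
SHOW TABLES IN amazon_security_lake_glue_db_ap_northeast_1;
-- ② Check the finding table's actual columns and which OCSF class/version the records fall into.
-- This pins down, with real data, the point from 2.3 that class_name depends on the version.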
SELECT metadata.product.name AS product,
metadata.version AS ocsf_version,
class_name,
count(*) AS n
FROM amazon_security_lake_glue_db_ap_northeast_1.amazon_security_lake_table_ap_northeast_1_sh_findings_2_0
GROUP BY 1, 2, 3
ORDER BY n DESC;
Table names vary by environment (confirm): the above database name (amazon_security_lake_glue_db_<region>) and table name (..._sh_findings_<version>, etc.) are templates based on a general naming pattern. Confirm the actual names by running SHOW DATABASES / SHOW TABLES in your own lake (they differ by Region and source version). This article strongly recommends the order "first confirm the substance with SHOW / DESCRIBE → peek at one row with SELECT * → write the production query."
5.3 The Production Query: Last 90 Days, High Severity, by Account, by Finding Type
Once you've confirmed the actual columns, write the body of the cross-analysis. Always narrowing by partitions (Region, date) first is the iron rule for cost.
-- Aggregate high-severity GuardDuty-derived findings from the last 90 days, by account and by finding type.
-- - The SH_FINDINGS table = where GuardDuty findings land (Chapter 2)
-- - Filter on partition columns (region / eventday, etc.) first to keep Athena's scan volume (= cost) down
-- - The position of severity and the finding type column varies by OCSF version, so match the columns confirmed in 5.2
SELECT
  cloud.account_uid AS account_id,     -- which account the finding belongs to
  finding_info.title AS finding_title, -- finding kind (roughly the GuardDuty type name)
  severity AS severity,                -- OCSF severity (string label)
  count(*) AS occurrences,
  max(time) AS last_seen               -- most recent occurrence
FROM amazon_security_lake_glue_db_ap_northeast_1.amazon_security_lake_table_ap_northeast_1_sh_findings_2_0
WHERE
  -- ① Filter by partition first (avoid a full scan = high cost)
  region = 'ap-northeast-1'
  AND eventday >= date_format(current_date - interval '90' day, '%Y%m%d')
  -- ② Limit to GuardDuty-derived findings (filter by product name; adapt to your environment's columns)
  AND metadata.product.name = 'GuardDuty'
  -- ③ High severity only (Critical/High; match the labels confirmed in real data)
  AND severity IN ('Critical', 'High')
GROUP BY 1, 2, 3
ORDER BY occurrences DESC, last_seen DESC
LIMIT 100;
Column names vary by the OCSF schema and environment (confirm): the above cloud.account_uid / finding_info.title / severity / time / metadata.product.name are templates based on OCSF's representative fields. In the actual SH_FINDINGS table, because of the conversion from ASFF, the columns' nesting position or names may differ. Always run DESCRIBE / SELECT * per 5.2 first, then rewrite to match the actual columns. Whether a record is GuardDuty-derived is also, depending on the environment, sometimes judged not by metadata.product.name but by a field under finding_info or by metadata.product.feature.name.
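When the query is issued from a script rather than typed into the console, the partition cutoff can be computed client-side and passed in as a literal. The Python sketch below assumes, as the query above does, that eventday is a fixed-width YYYYMMDD string, so plain string comparison orders dates correctly. The helper names are hypothetical, and the values interpolated into SQL are assumed to be trusted (not user input).

```python
from datetime import date, timedelta


def eventday_cutoff(today: date, lookback_days: int) -> str:
    """Return the oldest eventday to include, formatted as YYYYMMDD."""
    return (today - timedelta(days=lookback_days)).strftime("%Y%m%d")


def partition_filter(region: str, today: date, lookback_days: int) -> str:
    """Build the partition part of a WHERE clause (trusted inputs only)."""
    cutoff = eventday_cutoff(today, lookback_days)
    return f"region = '{region}' AND eventday >= '{cutoff}'"


print(partition_filter("ap-northeast-1", date(2026, 6, 30), 90))
```

Passing the cutoff as a literal also makes the scanned range visible in query logs, which helps when reviewing why a particular run cost more than expected.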
5.4 The Power of Cross-Querying: JOIN finding × CloudTrail
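Security Lake's forte shows when you match findings against other sources instead of reading them alone. If you've also put in CLOUD_TRAIL_MGMT, you can pull, in one query, "the APIs that were called in the account of a Critical finding in the same time band." The query below joins at account granularity; to narrow to a specific IAM principal, add the actor column you confirmed in 5.2 to the join condition.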
-- Starting from Critical findings, match the APIs called in the same account within 1 hour before and after.
-- JOIN findings (SH_FINDINGS) and CloudTrail management events (CLOUD_TRAIL_MGMT) in the OCSF vocabulary.
WITH critical_findings AS (
  SELECT cloud.account_uid AS account_id, time AS finding_time
  FROM amazon_security_lake_glue_db_ap_northeast_1.amazon_security_lake_table_ap_northeast_1_sh_findings_2_0
  WHERE region = 'ap-northeast-1'
    AND eventday >= date_format(current_date - interval '7' day, '%Y%m%d')
    AND metadata.product.name = 'GuardDuty'
    AND severity = 'Critical'
)
SELECT f.account_id,
       c.api.operation AS api_called, -- API called in the same time band
       c.actor.user.uid AS actor,
       c.time AS api_time
FROM critical_findings AS f
JOIN amazon_security_lake_glue_db_ap_northeast_1.amazon_security_lake_table_ap_northeast_1_cloud_trail_mgmt_2_0 AS c
  ON c.cloud.account_uid = f.account_id
  -- Narrow to 1 hour before and after the finding (time is assumed to be epoch milliseconds)
  AND c.time BETWEEN f.finding_time - 3600000 AND f.finding_time + 3600000
WHERE c.region = 'ap-northeast-1'
  -- Also prune the CloudTrail side by partition (7 days + 1 day of margin for the 1-hour window)
  AND c.eventday >= date_format(current_date - interval '8' day, '%Y%m%d')
ORDER BY f.account_id, api_time;
This is the analysis that is "absolutely impossible with the GuardDuty console alone": instead of seeing a finding as a single point, you place it within the full set of logs and pull out the context. What Detective does with an interactive graph, Security Lake does with reproducible SQL (= a deliverable you can present at audit).
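The window arithmetic in the JOIN is easy to get wrong by a factor of 1000 if the time column is not in the unit you think it is. The Python sketch below spells out the same ±1 hour window under the assumption that time is epoch milliseconds, which is what the 3600000 constant in the query implies. Confirm the unit on real rows before relying on it. The function names are illustrative.

```python
from typing import Tuple

ONE_HOUR_MS = 60 * 60 * 1000  # 3,600,000 ms, the constant used in the JOIN


def correlation_window(finding_time_ms: int, half_width_ms: int = ONE_HOUR_MS) -> Tuple[int, int]:
    """Return the inclusive (start, end) window around a finding, in epoch ms."""
    return finding_time_ms - half_width_ms, finding_time_ms + half_width_ms


def in_window(api_time_ms: int, finding_time_ms: int, half_width_ms: int = ONE_HOUR_MS) -> bool:
    """True if the API call falls inside the window, matching SQL BETWEEN (inclusive)."""
    start, end = correlation_window(finding_time_ms, half_width_ms)
    return start <= api_time_ms <= end


finding_time = 1_700_000_000_000  # an arbitrary epoch-ms timestamp
print(correlation_window(finding_time))
print(in_window(finding_time + 30 * 60 * 1000, finding_time))  # 30 minutes later
```

If real rows turn out to carry epoch seconds, or a timestamp type, the SQL constant and comparison need to change accordingly. A check like this against one known finding is a cheap way to catch that.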
6. Subscribers: Query Access (SQL) or Data Access (SIEM)
The "consuming side" of Security Lake's data is the subscriber. The official docs define 2 access types — which you choose splits the design.
6.1 The 2 Access Types (the official definitions)
| Access type | Official behavior | What it's used for |
|---|---|---|
| Query access | "Subscribers with query access can query data that Security Lake collects. These subscribers directly query AWS Lake Formation tables in your S3 bucket with services like Amazon Athena." | In-house SQL analysis — query the lake directly with Athena / Redshift / OpenSearch. Audit reports, cross-correlation (Chapter 5) |
| Data access | "Subscribers with data access ... are notified of new objects ... as the data is written to the S3 bucket. By default, ... through an HTTPS endpoint ... Alternatively, ... by polling an Amazon SQS queue." | Supply to an external SIEM — notify new S3 objects via SQS/HTTPS, and a third party (Splunk/QRadar, etc.) consumes them |
The principle of use:
- "We want to analyze with SQL ourselves" → query access. Hit the Lake Formation tables directly with Athena. You query the lake's substance without making a new data copy.
- "We want to flow findings into an existing SIEM" → data access. The SIEM receives an SQS notification that "a new Parquet object was written" and fetches the OCSF records from S3. The straightforward move when you want to put findings on a SIEM's rule/correlation engine.
6.2 Enforce Least Privilege — Distribute Per Source, Per Region
The official docs state plainly — "To control costs and adhere to least privilege access best practices, you provide subscribers access to data on a per-source basis." and "Subscribers only have access to the source data in the AWS Region that you select."
That is, you can narrow access at the granularity of "give the SIEM team just findings (SH_FINDINGS), just ap-northeast-1." You can design least privilege that covers only findings in a specific Region, without passing VPC Flow or CloudTrail. In a high-data-sensitivity environment like a payment platform, this granularity pays off (pass the SIEM vendor only findings, keep the raw VPC Flow in in-house query access, etc.).
A constraint (official): "The maximum number of sources that Security Lake allows to add per subscriber is 10." The sources you can add to one subscriber are at most 10 (the total of AWS sources + custom sources). A design splitting subscribers by use (SIEM use, in-house analysis use) also pairs well with this constraint.
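When subscriber definitions are generated from a list, the 10-source limit is worth checking before anything is applied. A minimal Python sketch follows. The function `validate_subscriber_sources` and the subscriber names are hypothetical, and the only fact taken from the documentation is the limit of 10 quoted above.

```python
from typing import List

MAX_SOURCES_PER_SUBSCRIBER = 10  # the official limit quoted above


def validate_subscriber_sources(subscriber_name: str, sources: List[str]) -> List[str]:
    """Return the de-duplicated, sorted source list, or raise if it exceeds the limit."""
    unique = sorted(set(sources))
    if not unique:
        raise ValueError(f"{subscriber_name}: at least one source is required")
    if len(unique) > MAX_SOURCES_PER_SUBSCRIBER:
        raise ValueError(
            f"{subscriber_name}: {len(unique)} sources exceeds the limit of "
            f"{MAX_SOURCES_PER_SUBSCRIBER}; split into separate subscribers"
        )
    return unique


# Least privilege in practice: the SIEM gets findings only, analytics gets more.
print(validate_subscriber_sources("siem-feed", ["SH_FINDINGS"]))
print(validate_subscriber_sources("internal-sec-analytics", ["SH_FINDINGS", "CLOUD_TRAIL_MGMT", "VPC_FLOW"]))
```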
7. Terraform: Declare data lake → source → subscriber
Let me land the design so far in Terraform. The order is aws_securitylake_data_lake (the lake body) → aws_securitylake_aws_log_source (sources) → aws_securitylake_subscriber (subscribers). The argument names are verified against the Terraform AWS Provider docs, but since they may differ by provider version, confirm with terraform plan before applying.
7.1 The Data Lake Body (encryption, lifecycle, Region)
# ── Run in the delegated administrator (security-tooling) account ──
# The data lake itself. Multiple configuration blocks can be given, one per Region.
resource "aws_securitylake_data_lake" "this" {
  # Role used to create and update the Glue tables (assumed by Security Lake; defined elsewhere).
  meta_store_manager_role_arn = aws_iam_role.meta_store_manager.arn
  configuration {
    region = "ap-northeast-1" # example: make this the rollup Region
    # Encryption. "S3_MANAGED_KEY" gives SSE-S3; specify a KMS key ID for SSE-KMS.
    encryption_configuration {
      kms_key_id = "S3_MANAGED_KEY"
    }
    # Lifecycle: transition old data to cheaper classes and auto-delete at the retention deadline.
    # "Recent in standard, cheaper as it ages, deleted at N days" keeps long-term storage cost down (4.2).
    lifecycle_configuration {
      transition {
        days          = 31
        storage_class = "STANDARD_IA"
      }
      transition {
        days          = 80
        storage_class = "ONEZONE_IA"
      }
      expiration {
        days = 365 # set the retention period to match your compliance requirement
      }
    }
  }
}
7.2 Adding Sources: SH_FINDINGS (= the entrance for GuardDuty findings) + the material for correlation
# Security Hub findings = the only entrance for GuardDuty findings (Chapter 2).
# The official recommendation is to express "all Regions, all accounts" for a source in a single aws_securitylake_aws_log_source.
resource "aws_securitylake_aws_log_source" "sh_findings" {
  source {
    source_name    = "SH_FINDINGS" # there is no direct GuardDuty source; this is the only path
    source_version = "2.0"         # the OCSF class depends on the version (2.3); confirm before applying
    regions        = ["ap-northeast-1"]
    accounts       = ["123456789012"] # omitting this targets the whole organization (the trap in 3.3)
  }
  # The lake itself must exist first.
  depends_on = [aws_securitylake_data_lake.this]
}
# Material for cross-correlation: CloudTrail management events (who called which API).
resource "aws_securitylake_aws_log_source" "cloudtrail_mgmt" {
  source {
    source_name = "CLOUD_TRAIL_MGMT" # the only exception that needs separate logging config (3.2)
    regions     = ["ap-northeast-1"]
  }
  depends_on = [aws_securitylake_data_lake.this]
}
# Material for cross-correlation: VPC Flow Logs (which IPs were communicated with).
resource "aws_securitylake_aws_log_source" "vpc_flow" {
  source {
    source_name = "VPC_FLOW"
    regions     = ["ap-northeast-1"]
  }
  depends_on = [aws_securitylake_data_lake.this]
}
7.3 Subscriber: An Example of In-House SQL Analysis (query access)
# Give the in-house security analytics team permission to "read findings with SQL" (query access).
# With access_type = "LAKEFORMATION", the Lake Formation tables can be queried directly from Athena.
resource "aws_securitylake_subscriber" "internal_analytics" {
  subscriber_name = "internal-sec-analytics"
  access_type     = "LAKEFORMATION" # Query access. For SIEM supply, use "S3" (data access)

  # Narrow the shared sources to findings = least privilege (6.2). At most 10 sources per subscriber.
  source {
    aws_log_source_resource {
      source_name    = "SH_FINDINGS"
      source_version = "2.0"
    }
  }

  # The subscriber's AWS identity (establishes cross-account trust).
  subscriber_identity {
    external_id = "internal-analytics-ext-id" # external_id guards against the confused deputy problem
    principal   = "123456789012"              # The analytics team's account / role
  }

  depends_on = [aws_securitylake_data_lake.this]
}
A note on Terraform arguments (confirm):
access_type, per the docs, represents "the Amazon S3 or Lake Formation access type" (= data access / query access), but the accepted string literals ("S3" / "LAKEFORMATION") may differ in notation by provider version. Including the internal fields of replication_configuration (regions / role_arn), always confirm with terraform plan and that provider version's docs before applying. This article is a template based on the Terraform AWS Provider's description (as of June 2026).
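For the other subscriber type, data access for SIEM supply, the 7.3 example's inline comment only says to use "S3". A minimal sketch of that variant follows. The subscriber name, external ID, and SIEM-side account ID are hypothetical placeholders. The aws_securitylake_subscriber_notification resource and its empty sqs_notification_configuration block are written from my reading of the provider docs, so treat them as assumptions to confirm against your provider version.

```hcl
# Sketch only: a data-access subscriber that feeds findings to a SIEM.
resource "aws_securitylake_subscriber" "siem" {
  subscriber_name = "siem-findings-feed" # hypothetical name
  access_type     = "S3"                 # data access (new-object notifications)

  # Share only findings with the SIEM (least privilege, 6.2).
  source {
    aws_log_source_resource {
      source_name    = "SH_FINDINGS"
      source_version = "2.0"
    }
  }

  subscriber_identity {
    external_id = "siem-ext-id"  # hypothetical; agree on this value with the SIEM side
    principal   = "210987654321" # hypothetical SIEM-side AWS account ID
  }

  depends_on = [aws_securitylake_data_lake.this]
}

# ASSUMPTION: an SQS notification for new S3 objects is configured through
# this separate resource; an HTTPS endpoint variant also exists per the docs.
resource "aws_securitylake_subscriber_notification" "siem" {
  subscriber_id = aws_securitylake_subscriber.siem.id

  configuration {
    sqs_notification_configuration {}
  }
}
```

Keeping the SIEM subscriber separate from the in-house analytics subscriber lets each one be narrowed, rotated, or revoked without touching the other.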
8. Operations: Verification, Cost, and Not Over-Trusting
8.1 Verify End-to-End That "Findings Are Actually In"
The biggest accident is "the lake is built but findings are empty." Because the wiring (Chapter 2) is 3 stages, confirm each stage in order.
- GuardDuty: fire a sample finding with aws guardduty create-sample-findings (the main guide's 5.3).
- Security Hub: confirm that the finding is ingested into Security Hub (is the GuardDuty integration ON?).
- Security Lake: confirm the collection state of SH_FINDINGS with aws securitylake get-data-lake-sources, then SELECT the actual records in Athena (5.2).
Only when these 3 points go green can you say "aggregation is working." The success of a type check or terraform apply is no proof that 'the wiring is correct' — being able to pull the actual data in Athena is the proof.
8.2 The Way to Think About Cost (reference, confirm)
- The cost to store: curb it with Parquet + S3 tiering (4.2). Set expiration exactly to the requirement.
- The cost to query: Athena is charged by scan volume. Always narrow with partitions (region / eventday) (5.3). A SELECT * full-period scan is strictly forbidden.
- Incidental cost: Security Lake itself, Glue crawler/catalog, and Lambda ETL each have charges. "All sources, all Regions, longest retention just in case" swells the bill. As with GuardDuty's FinOps, work backward from the requirement and collect only as much as needed. Since the specific unit prices vary by region and time, confirm with the official pricing.
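The rule against full-period scans can also be backed by a guardrail rather than discipline alone. This is one possible approach and not part of the design above: a dedicated Athena workgroup with a per-query scan cutoff. The workgroup name, results bucket, and the 10 GiB limit are hypothetical values to adjust to your own query patterns.

```hcl
# Sketch only: an Athena workgroup that cancels queries scanning too much data.
resource "aws_athena_workgroup" "security_lake" {
  name = "security-lake-analytics" # hypothetical name

  configuration {
    # Clients cannot override these settings per query.
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    # Cancel any query that scans more than 10 GiB (10 * 1024^3 bytes).
    # The threshold is an illustrative assumption, not a recommendation.
    bytes_scanned_cutoff_per_query = 10737418240

    result_configuration {
      output_location = "s3://example-athena-results/security-lake/" # hypothetical bucket
    }
  }
}
```

A query that forgets the partition filter then fails fast instead of silently billing for a full scan. The limit is a backstop, so the partition discipline still applies.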
8.3 Don't Over-Trust Security Lake — It's the "Aggregation & Analysis" Layer Within the 3 Layers
Finally, back to Chapter 0. Security Lake is neither detection, nor investigation, nor defense.
- Finding new threats is GuardDuty (the main guide). Security Lake doesn't produce findings.
- Chasing "why, how far" interactively is Detective (up to 1 year, graph). Security Lake is years, SQL/SIEM — the role differs.
- Aggregating, normalizing, and laying findings out on the desk is Security Hub. Security Lake stows those findings in the OCSF long-term warehouse and enables cross-querying and SIEM supply.
What Security Lake handles is the layer of "leaving the results of detection and investigation in a machine-readable, cross-able, long-term-retained form, and distributing them." GuardDuty, Detective, Security Hub, and Security Lake are not rivals but a complementary relationship dividing roles by time axis and use — this is the starting point of correct design.
Summary: A GuardDuty × Security Lake Long-Term-Aggregation Cheat Sheet
A quick reference for when you're lost.
- Think in 3 layers: GuardDuty=DETECT (now) → Detective=INVESTIGATE (up to 1 year, interactive graph) → Security Lake=AGGREGATE & ANALYZE (years, OCSF/Parquet, SQL/SIEM). Security Lake is neither detection, nor investigation, nor defense.
- [Most important] Wiring: there is no "direct GuardDuty source" in Security Lake. The path is GuardDuty → Security Hub (GuardDuty integration ON) → Security Lake's SH_FINDINGS only. If any is missing, not a single finding enters.
- Native sources (6 systems): CloudTrail (management + S3/Lambda data), EKS audit, Route 53, Security Hub CSPM findings, VPC Flow, WAFv2. Identifiers are CLOUD_TRAIL_MGMT / S3_DATA / LAMBDA_EXECUTION / EKS_AUDIT / ROUTE53 / SH_FINDINGS / VPC_FLOW / WAF. Except CloudTrail management events, no separate logging config needed (an independent duplicated stream).
- Normalization: convert everything to OCSF + Apache Parquet, a Lake Formation table per source. So you can JOIN finding × CloudTrail × VPC Flow in the same vocabulary. The OCSF class is version-dependent (V1 = Security Finding / V2 = Vulnerability / Compliance / Detection Finding); confirm with actual data before writing queries.
- Multi-account, Region: aggregate to a delegated administrator, bring multiple Regions to one place with a rollup Region (data-residency support).
- Retention: run years of retention cheaply with S3 tiering (transition) + expiration. Parquet is space-efficient too.
- Subscribers (2 types): query access (Lake Formation → SQL with Athena/Redshift/OpenSearch = in-house analysis) and data access (notify new S3 objects via SQS/HTTPS = SIEM supply). Least privilege per source, per Region, at most 10 sources per subscriber.
- Verification: confirm end-to-end that "findings are actually in" via create-sample-findings → Security Hub → get-data-lake-sources → Athena SELECT. A type check or apply success is no proof.
- Cost (confirm): storing is cheap, but querying is charged by scan volume so narrow with partitions. Security Lake/Glue/Athena each charge. Work backward from the requirement.
GuardDuty findings are strong at "glowing red now." But to answer a year-later audit, cross-correlation, and SIEM integration, you need a foundation that normalizes them with OCSF, keeps them in your own S3 for years, and distributes them to SQL and SIEM. The biggest leverage lies not in detection itself but in the aggregation design that turns the results of detection from "a notification that flows by and vanishes" into "a reproducible data asset."
On a multi-account serverless payment platform, I cross-implemented the IAM, observability, and DR of a foundation handling actual money, carbon credits, and local currencies, and made "who did what, when" reproducible through the structure of code and the data pipeline. I won't claim to have operated Security Lake on a named project. But I can design and implement this end-to-end of detect → investigate → long-term aggregation & analysis for your AWS environment — grounded in the real experience of cross-implementing the observability, DR, and data pipelines of multi-account AWS.
"We put in GuardDuty, but have no foundation for cross-past analysis, compliance long-term retention, or SIEM integration" — that aggregation layer, I can accompany you on, fast and safe with one person × generative AI (Claude Code), from the exact wiring of GuardDuty→Security Hub→Security Lake, through OCSF cross-querying, the rollup Region, subscriber design, Terraform implementation, and cost optimization. Feel free to consult me even from the requirements-organizing stage.
References (Official Documentation)
- What is Amazon Security Lake?: a fully managed security data lake, your own S3, retaining ownership, OCSF + Parquet normalization, the rollup Region, the internal dependency services (EventBridge/Glue/Lake Formation/Lambda/S3/SQS)
- Collecting data from AWS services in Security Lake: the 6 native sources, "except CloudTrail management events, no separate logging config needed / an independent duplicated stream," CreateAwsLogSource (sourceName / regions required, accounts / sourceVersion optional)
- Subscriber management in Security Lake: the 2 access types (data access = SQS/HTTPS notification / query access = query Lake Formation tables with Athena etc.), per-source least privilege, at most 10 sources per subscriber
- OCSF in Security Lake: OCSF conversion, the per-source class_name (Security Hub CSPM: V1 = Security Finding / V2 = Vulnerability / Compliance / Detection Finding), the metadata values of source identification
- aws_securitylake_data_lake (Terraform AWS Provider): meta_store_manager_role_arn / configuration {region, encryption_configuration, lifecycle_configuration {transition, expiration}, replication_configuration}
- aws_securitylake_aws_log_source (Terraform AWS Provider): source {source_name, source_version, accounts, regions}, the valid source_name values (ROUTE53 / VPC_FLOW / SH_FINDINGS / CLOUD_TRAIL_MGMT / LAMBDA_EXECUTION / S3_DATA / EKS_AUDIT / WAF)
- aws_securitylake_subscriber (Terraform AWS Provider): subscriber_name / access_type (S3 or Lake Formation) / source {aws_log_source_resource {source_name, source_version}} / subscriber_identity {external_id, principal}
- Amazon Security Lake pricing: ingestion volume, retention, incidental-service charges (the specific unit prices vary by region and time; confirm)