# Scaling audio source separation in production on AWS: a GPU batch-processing platform (SQS × ECS/Batch × S3)

> Taking UVR5/MDX-Net and Demucs source separation from one-file-manual to production scale. It designs an idempotent queue-driven foundation of S3 event → SQS → GPU worker (AWS Batch / ECS) → S3, with concrete boto3 and Terraform code covering a visibility-timeout heartbeat, graceful termination on Spot interruption, DLQ, structured logs, and S3-key idempotency.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: 音源分離, AWS, GPU, MLOps, SQS, バッチ処理, Python, Terraform
- URL: https://tomodahinata.com/en/blog/audio-source-separation-aws-gpu-batch-pipeline
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- The crux of production scale is the queue-driven 'S3 event → SQS → GPU worker → S3.' Since SQS is at-least-once delivery, making the worker idempotent with an S3-key existence check is mandatory.
- Since separation takes minutes, periodically extend the SQS visibility timeout (default 30 seconds, max 12 hours) with ChangeMessageVisibility (heartbeat). Split ultra-long ones with Step Functions.
- For GPU, g4dn (NVIDIA T4 16GB, cheapest for ML inference) is the basic. For queue-driven one-shot batches, AWS Batch (GPU is resourceRequirements type=GPU, with built-in Spot integration and retry) is purpose-built.
- Cost optimization is Spot instances + queue-depth auto-scaling + an idempotent cache. Receive the 2-minute notice of Spot interruption with SIGTERM and safely let go of the in-progress job.
- Isolate poison messages to a DLQ (maxReceiveCount). The model is resident, ffmpeg is bundled, and the model is baked into the image to eliminate cold-start re-download.

---

## The goal of this article

Separating "one file on your machine" with [UVR5/MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) or [Demucs](/blog/demucs-v4-music-source-separation-production-guide) is easy. The problem is the stage of **processing thousands of files a day, without crashing, cheaply, and idempotently.** Here many PoCs hit "the limit of the manual script."

This article shows a **queue-driven batch foundation that scales source separation in production on AWS,** with design philosophy and working code. When you finish reading, you can assemble the following.

1. You can design a **loosely-coupled pipeline of S3 event → SQS → GPU worker → S3.**
2. You can safely process a minutes-long long job with a **visibility-timeout heartbeat.**
3. You can implement an **idempotent, resilient GPU worker** that withstands **Spot interruption, poison messages, and double delivery.**

> 📐 **This article's positioning**: the **general pattern** of turning source separation into a production service (a type-safe API ingress, a job queue, OOM graceful degradation, testability, and other platform-independent design) is handled in the [GPU-worker-foundation article](/blog/music-source-separation-production-api-gpu-worker-queue). This article is the **"platform implementation edition" of concretely implementing it on AWS** — narrowed to **AWS-specific points** like SQS's visibility timeout and message limit, GPU requests in AWS Batch/ECS, S3 event-driven, Spot, Terraform, and g4dn cost. See there for the general design, here for the AWS implementation.

> **About the author (disclosure of credibility)**: I **single-handedly designed, implemented, and run in production an AI-video-localization platform** that fully automates "source separation → transcription → translation → multilingual dubbing → mouth synchronization." I run its first stage, source separation, with exactly this article's **queue-driven GPU batch.** This article's "heartbeat," "Spot graceful termination," and "idempotent cache" aren't demo knowledge but the very design implemented **to not crash jobs in production.** The project overview is in the [track record](/case-studies/ai-video-localization-lipsync), and the idempotency principles cultivated in a payment platform are summarized in the [reliability-cluster article](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide).

---

## 30-second summary (conclusion first)

| Point | Conclusion |
| --- | --- |
| **Overall composition** | **S3 (input) → S3 event → SQS → GPU worker → S3 (output).** Loosely coupled, retryable, idempotent |
| **Why queue-driven** | absorb spikes, auto-scale workers by queue depth. Failures retry, poison goes to the DLQ |
| **Idempotency** | SQS is **at-least-once delivery.** The worker prevents double processing with **an S3-output-key existence check** |
| **Long jobs** | **periodically extend (heartbeat)** the visibility timeout (default 30 seconds, **max 12 hours**) with `ChangeMessageVisibility` |
| **GPU foundation** | for queue-driven batches, **AWS Batch** (GPU = `resourceRequirements type=GPU`, with built-in Spot and retry) is the prime choice. Resident is ECS |
| **GPU type** | **g4dn (NVIDIA T4 16GB) = cheapest for ML inference.** If VRAM is needed for quality/length, g5(A10G 24GB)/g6(L4 24GB) |
| **Cost** | **Spot + auto-scale + idempotent cache.** Receive Spot interruption with SIGTERM and let go safely |
| **Large input** | the SQS message limit is **1 MiB** (was 256 KB). **Always put the audio body in S3 and pass a pointer** |
| **Poison messages** | isolate to the **DLQ** when over `maxReceiveCount`, and don't clog the main flow |

---

## Architecture: a loosely-coupled queue-driven pipeline

```text
            ┌──────────────┐  ObjectCreated   ┌─────────────┐
  upload →  │ S3 input bkt  │ ──event notify─→ │  SQS queue  │ ──┐
            └──────────────┘                  └─────────────┘   │ long poll
                                                    │ fail N×    │
                                                    ▼            ▼
                                             ┌──────────┐  ┌───────────────────┐
                                             │   DLQ    │  │  GPU worker pool   │
                                             └──────────┘  │ (AWS Batch / ECS)  │
                                                           │ separate + idemp.  │
                                                           └───────────────────┘
                                                                    │ stems
                                                                    ▼
                                                           ┌─────────────────┐
                                                           │  S3 output bkt   │
                                                           └─────────────────┘
```

The core of the design is **"each component doesn't know the others."** The upload side doesn't know the worker exists, and the worker, even if it scales, doesn't break consistency. S3 is the source of truth, SQS is the buffer, the worker is an idempotent transformer — this division of roles (SRP) makes a foundation that withstands both spikes and failures.

> ⚠️ **Beware of an infinite loop**: if a worker started by an S3 event writes back to the **same bucket,** the notification recurses and runs away. **Always separate the bucket (or prefix) for input and output.** An S3 notification can go to **SNS / SQS / Lambda / EventBridge,** but note that **SQS FIFO can't be a direct notification destination** (via EventBridge if needed).

---

## Why design on the premise of "at least once"

SQS is **at-least-once delivery.** AWS official clearly states it too.

> *due to the at-least-once delivery model of Amazon SQS, there's no absolute guarantee that a message won't be delivered more than once during the visibility timeout period.*

In other words, **the same audio being processed twice can happen as the "normal path."** Rather than swallowing this, **absorb it with idempotency.** The most robust is **"if the output S3 key already exists, treat it as success without processing"** — since separation is heavy GPU processing, avoiding double execution directly becomes cost reduction. The general principles of idempotency are detailed in [idempotent async-processing design](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide).

```python
# 冪等キー：音声の内容ではなく"入力S3キー + モデル"で出力先を決定論的に決める
def output_prefix(input_key: str, model: str) -> str:
    """同じ入力×同じモデルなら必ず同じ出力prefix。存在すれば再処理しない。"""
    safe_model = model.replace("/", "_")
    return f"stems/{safe_model}/{input_key}"
```

---

## The crux of long jobs: the visibility-timeout heartbeat

Source separation takes **a few dozen seconds to a few minutes per song.** Processing beyond SQS's **visibility timeout (default 30 seconds)** **re-delivers** the message **to another worker** and causes **double processing.**

The countermeasure is a **heartbeat** — during processing, keep **periodically extending** the visibility with `ChangeMessageVisibility`. It's also AWS official's recommendation.

> *Implement a heartbeat mechanism to periodically extend the visibility timeout, ensuring the message remains invisible until processing is complete.*

But **the visibility timeout has an upper limit of "max 12 hours from the first receipt"** (extending doesn't reset this limit). For ultra-long ones over this, the official guidance is to **split with Step Functions.**

```python
# heartbeat.py — 処理中だけ可視性を延長し続けるコンテキストマネージャ
import threading
from contextlib import contextmanager

@contextmanager
def visibility_heartbeat(sqs, queue_url: str, receipt_handle: str,
                         *, extend_to: int = 300, interval: int = 120):
    """interval秒ごとに可視性をextend_to秒へ延長。長時間ジョブの二重配信を防ぐ。"""
    stop = threading.Event()

    def beat() -> None:
        while not stop.wait(interval):           # interval待つ。stopが立てば即終了
            try:
                sqs.change_message_visibility(
                    QueueUrl=queue_url,
                    ReceiptHandle=receipt_handle,
                    VisibilityTimeout=extend_to,  # 最初の受信から最大12hの制約に注意
                )
            except Exception:                     # 延長失敗は致命的でない。次の周期で再試行
                pass

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        yield
    finally:
        stop.set()                                # 処理完了で確実に停止（リーク防止）
        t.join(timeout=1)
```

---

## Choosing the GPU foundation: AWS Batch or ECS

There are mainly 2 places to put a GPU worker. **For queue-driven one-shot batch inference, AWS Batch is purpose-built.**

| Viewpoint | AWS Batch | ECS (resident service) |
| --- | --- | --- |
| Positioning | specialized for batch compute (auto-scales queue + compute environment) | resident operation of containers |
| GPU specification | `resourceRequirements`'s `type=GPU` | task definition's `resourceRequirements` |
| Spot integration | **built-in** (can mix Spot/On-Demand) | configure capacity providers yourself |
| Retry | the job definition's `attempts` (1–10) built-in | implement on the app/service side |
| Suited use | **mass processing of queue-driven one-shot jobs** | low-latency resident, fine-grained control |

AWS Batch's GPU request is written in the job definition like this (the official form).

```json
{
  "resourceRequirements": [
    { "type": "GPU",    "value": "1" },
    { "type": "VCPU",   "value": "4" },
    { "type": "MEMORY", "value": "16384" }
  ]
}
```

> 💡 **How to choose the GPU type**: in AWS official words, **g4dn (NVIDIA T4, 16GB) is "the lowest-cost GPU instance for ML inference."** It's enough for a mass batch of MDX-Net. For VRAM-hungry models like RoFormer or long files, raise to **g5 (A10G, 24GB) / g6 (L4, 24GB) / g6e (L40S, 48GB).** When using a GPU on ECS, the **GPU-optimized AMI** and `ECS_ENABLE_GPU_SUPPORT=true` are needed.

---

## The world's best worker implementation (boto3)

Consolidate the design requirements into one worker — **① model resident, ② S3-key idempotency, ③ visibility heartbeat, ④ graceful termination on Spot interruption, ⑤ leave poison messages to the DLQ, ⑥ structured logs.**

```python
# worker.py — 本番GPU音源分離ワーカー（SQS駆動・冪等・回復性）
from __future__ import annotations

import json
import logging
import os
import signal
import sys
import tempfile
from pathlib import Path
from urllib.parse import unquote_plus

import boto3
from audio_separator.separator import Separator

from heartbeat import visibility_heartbeat   # 前掲

# --- 構造化ログ（相関IDを必ず通す。音声内容＝PIIは出さない）---
logging.basicConfig(level=logging.INFO, format='%(message)s')
log = logging.getLogger("worker")

def emit(event: str, **fields) -> None:
    log.info(json.dumps({"event": event, **fields}, ensure_ascii=False))

QUEUE_URL = os.environ["QUEUE_URL"]
OUT_BUCKET = os.environ["OUTPUT_BUCKET"]
MODEL = os.environ.get("SEPARATION_MODEL", "UVR-MDX-NET-Inst_HQ_3.onnx")
MODEL_CACHE = os.environ.get("MODEL_CACHE_DIR", "/models")   # イメージに焼き込み済み推奨

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

_shutdown = False
def _on_sigterm(*_):                 # Spot中断(2分前通知)/ECS停止で届く
    global _shutdown
    _shutdown = True
    emit("shutdown_requested")       # 新規受信を止め、処理中ジョブを安全に終える
signal.signal(signal.SIGTERM, _on_sigterm)

# モデルは一度だけVRAMへ。リクエスト間で再利用（ロードは高コスト）
_separator: Separator | None = None
def get_separator() -> Separator:
    global _separator
    if _separator is None:
        _separator = Separator(output_format="flac", model_file_dir=MODEL_CACHE)
        _separator.load_model(model_filename=MODEL)
        emit("model_loaded", model=MODEL)
    return _separator


def _already_done(prefix: str) -> bool:
    """出力S3キーが存在するなら処理済み＝冪等にスキップ（二重GPU実行を防ぐ）。"""
    resp = s3.list_objects_v2(Bucket=OUT_BUCKET, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0


def process_message(msg: dict) -> None:
    body = json.loads(msg["Body"])
    rec = body["Records"][0]["s3"]                 # S3イベント通知の形
    in_bucket = rec["bucket"]["name"]
    in_key = unquote_plus(rec["object"]["key"])    # S3通知のkeyはURLエンコード済（空白→'+'・特殊文字→%XX）
    prefix = f"stems/{MODEL.replace('/', '_')}/{in_key}"   # 決定論的な出力先
    cid = msg["MessageId"]

    if _already_done(prefix):
        emit("skip_idempotent", cid=cid, key=in_key)
        return                                     # 既に成果物あり＝成功扱い

    with visibility_heartbeat(sqs, QUEUE_URL, msg["ReceiptHandle"]):
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / Path(in_key).name
            s3.download_file(in_bucket, in_key, str(src))
            sep = get_separator()
            sep.output_dir = tmp
            stems = sep.separate(str(src))         # GPU処理（ハートビートが守る）
            for stem in stems:
                key = f"{prefix}/{Path(stem).name}"
                s3.upload_file(stem, OUT_BUCKET, key)
            emit("separated", cid=cid, key=in_key, stems=len(stems))


def main() -> None:
    get_separator()                                # 起動時ウォームアップ
    while not _shutdown:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
            WaitTimeSeconds=20,                    # ロングポーリング（空ポーリング課金を削減）
        )
        for msg in resp.get("Messages", []):
            try:
                process_message(msg)
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])  # 成功時のみ削除
            except Exception:
                # 失敗は削除しない → 可視性切れで再配信 → maxReceiveCount超でDLQへ
                emit("process_failed", cid=msg.get("MessageId"))
                log.exception("processing error")
    emit("drained_and_exit")
    sys.exit(0)


if __name__ == "__main__":
    main()
```

The reason this worker is production-quality lies in **the handling of failure.**

- **`delete_message` only on success.** A failed message isn't deleted and is **auto-re-delivered** on visibility expiry.
- **When re-delivery exceeds `maxReceiveCount`, it's isolated to the DLQ** and doesn't clog the main flow (config in the next section).
- **On SIGTERM (Spot interruption / ECS stop), stop new receipts and complete only the in-progress.** Since an interrupted job isn't `delete`d, **another worker picks it up again** — zero misses.

---

## Infrastructure with Terraform (queue + DLQ + S3 notification)

Declaratively define the queue, the **dead-letter queue (DLQ),** the redrive policy, and the S3 → SQS notification.

```hcl
# main.tf — SQS + DLQ + S3イベント通知
resource "aws_sqs_queue" "dlq" {
  name                      = "separation-dlq"
  message_retention_seconds = 1209600          # 14日。原因調査の猶予を確保
}

resource "aws_sqs_queue" "jobs" {
  name                       = "separation-jobs"
  visibility_timeout_seconds = 300             # ハートビート前提の初期値
  receive_wait_time_seconds  = 20              # ロングポーリング既定

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5                     # 5回失敗で毒メッセージをDLQへ
  })
}

# S3 → SQS（ObjectCreatedのみ。入力prefixに限定して再帰を防ぐ）
resource "aws_s3_bucket_notification" "ingest" {
  bucket = aws_s3_bucket.input.id
  queue {
    queue_arn     = aws_sqs_queue.jobs.arn
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "uploads/"
  }
}
```

> 📦 **Handling large audio**: the SQS message limit is **1 MiB** (expanded from 256 KB in 2025). Since an audio file isn't small, **always put the body in S3 and put only the S3 key (a pointer) in the message.** With the S3 notification method above, since the message is only event metadata, it naturally takes this form.

---

## Cost optimization: Spot × auto-scale × idempotent cache

GPU is expensive. Carve the production unit price in three stages.

1. **Spot instances**: a GPU batch, if you build in interruption resistance, Spot pays off. AWS Batch can **mix Spot/On-Demand,** and re-submits on interruption. The worker above **safely lets go on SIGTERM (the 2-minute-before notice of interruption),** so a job lost to Spot is **picked up again by re-delivery,** not the DLQ.
2. **Queue-depth auto-scale**: with `ApproximateNumberOfMessagesVisible` as the metric, **increase workers when the queue piles up, and shrink to 0 when empty.** Don't hold an idle GPU.
3. **Idempotent cache**: **zero reprocessing** with the S3-key existence check. Don't pass the same song through the GPU twice on re-upload, re-send, or Spot re-submission.

> 💴 **The way of thinking about the cost model**: unit price ≈ **(processing seconds per song ÷ 3600) × instance hourly rate ÷ Spot discount rate.** g4dn (T4), as AWS official says "**the cheapest GPU for ML inference,**" is the first candidate for a mass batch of MDX-Net. Confirm the latest exact pricing on the [pricing page](https://aws.amazon.com/ec2/pricing/) (this article doesn't assert the absolute value of the unit price).

---

## Observability: a state where you can trace a stalled process at a glance

The scariest thing in a mass batch is "**you can't tell which job got stuck where.**" In addition to the worker's structured logs (`emit`), prepare at least this.

- Put **queue depth and DLQ count** into a CloudWatch alarm (DLQ > 0 is immediate notification = poison-message detection).
- Metric-ize **processing time, success/failure, and idempotent-skip count** to visualize throughput and cost.
- Always pass a **correlation ID (MessageId / S3 key)** through the logs. **Don't emit the audio content (which can be PII).**

The systematic build-out of observability is summarized in the [OpenTelemetry article](/blog/opentelemetry-observability-production-tracing-metrics-logs), and retry/backoff/circuit-breaker in the [resilience-patterns article](/blog/retry-backoff-circuit-breaker-resilience-patterns-guide).

---

## Testability: make a boundary that runs without a GPU

GPU-dependent code is hard to run in CI — that's exactly why **carving the boundary** is the world's best design.

- **Pure logic** like `output_prefix` / `_already_done` **can be unit-tested without I/O.**
- **Mock S3 / SQS with [moto](https://github.com/getmoto/moto)** and **verify the worker's delivery, idempotency, and DLQ route without a GPU.**
- **Swap the separation body (`Separator.separate`) via an interface** and pass the pipeline with dummy output.

```python
# test_idempotency.py — GPU不要。冪等ロジックだけを検証
from worker import output_prefix  # 純関数として切り出しておく

def test_output_prefix_is_deterministic():
    a = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
    b = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
    assert a == b                       # 同じ入力×モデルなら必ず同じ出力先＝冪等の土台

def test_model_namespaced():
    p1 = output_prefix("uploads/x.wav", "model_a")
    p2 = output_prefix("uploads/x.wav", "model_b")
    assert p1 != p2                     # モデルが違えば出力先も分かれる
```

---

## Frequently asked questions (FAQ)

**Q. Why a container (Batch/ECS), not Lambda?**
A. Source separation is heavy processing that needs a **GPU and minutes-long execution time.** Since Lambda doesn't support GPU and has constraints on execution time, **a GPU container (AWS Batch / ECS)** suits. Carving out only light preprocessing (normalization, queue submission) to Lambda is effective.

**Q. Should the visibility timeout just be long from the start?**
A. Too long **delays re-delivery on failure,** too short causes **double processing.** So "**short + extend with a heartbeat**" is the standard. Note that even with extension, **12 hours from the first receipt is the upper limit,** so split ultra-long ones with Step Functions.

**Q. Won't a job disappear on Spot interruption?**
A. It won't. Since the worker **`delete_message`s only on success,** an interrupted job is **re-delivered on visibility expiry** and picked up again by another worker. If you complete only the in-progress on SIGTERM, no misses occur.

**Q. What's the guarantee that the same file isn't processed twice?**
A. Since SQS is **at-least-once delivery,** "absolutely once" isn't guaranteed. So **make it idempotent with an output-S3-key existence check** and absorb double execution as success. This also prevents wasted GPU cost.

**Q. The cold start is slow with the model's first download.**
A. **Bake the model into the container image** (download once at build time), or point `model_file_dir` to a **persistent volume.** Since `/tmp` is volatile, avoid re-downloading per container (details in the [troubleshooting article](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom)).

---

## Summary: from a manual script to a "non-crashing foundation"

The essence of scaling source separation in production lies not in the model but in **foundation design premised on "at least once, fails, gets interrupted."**

1. **Loosely coupled**: S3 → SQS → GPU worker → S3. Each doesn't know the others.
2. **Idempotent**: absorb double processing with an output-S3-key existence check (= cost reduction).
3. **Resilient**: visibility heartbeat, delete only on success, DLQ, Spot graceful termination.
4. **Cost**: Spot × queue-depth auto-scale × idempotent cache.
5. **Observability / testability**: correlation ID, DLQ alarm, a boundary that runs without a GPU.

> I actually run this foundation in production on an **AI-video-localization platform that has audio separation as its first stage.** Please consult me, along with the [track record](/case-studies/ai-video-localization-lipsync), on the design that raises GPU-using audio/video AI from "works in a PoC" to "**doesn't crash while holding down cost in production.**" With **one person × generative AI,** I support end-to-end from design to production operation.

---

## Sources / official resources

- **SQS visibility timeout**: [Amazon SQS visibility timeout](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html) (default 30 seconds, max 12 hours, heartbeat, at-least-once)
- **SQS message limit**: [SQS quotas](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html) / [expanded to 1 MiB (2025)](https://aws.amazon.com/about-aws/whats-new/2025/08/amazon-sqs-max-payload-size-1mib/)
- **AWS Batch (GPU)**: [Job definition parameters](https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html) / [What is AWS Batch](https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html)
- **ECS GPU**: [Working with GPUs on Amazon ECS](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html)
- **GPU instances**: [Amazon EC2 G4 (T4)](https://aws.amazon.com/ec2/instance-types/g4/)
- **S3 event notifications**: [Amazon S3 Event Notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html) (destinations, at-least-once, FIFO unsupported)

※ AWS's specs, limits, and pricing get updated. **Always confirm the primary sources before implementing.** This article doesn't assert the Spot-interruption behavior or the absolute value of pricing, and presupposes confirmation of the latest official documentation / pricing page.
