Scaling audio source separation in production on AWS: a GPU batch-processing platform (SQS × ECS/Batch × S3)

The goal of this article

Separating "one file on your machine" with UVR5/MDX-Net or Demucs is easy. The problem is the stage of processing thousands of files a day, without crashing, cheaply, and idempotently. Here many PoCs hit "the limit of the manual script."

This article shows a queue-driven batch foundation that scales source separation in production on AWS, with design philosophy and working code. When you finish reading, you can assemble the following.

You can design a loosely-coupled pipeline of S3 event → SQS → GPU worker → S3.
You can safely process a minutes-long long job with a visibility-timeout heartbeat.
You can implement an idempotent, resilient GPU worker that withstands Spot interruption, poison messages, and double delivery.

📐 This article's positioning: the general pattern of turning source separation into a production service (a type-safe API ingress, a job queue, OOM graceful degradation, testability, and other platform-independent design) is handled in the GPU-worker-foundation article. This article is the "platform implementation edition" of concretely implementing it on AWS — narrowed to AWS-specific points like SQS's visibility timeout and message limit, GPU requests in AWS Batch/ECS, S3 event-driven, Spot, Terraform, and g4dn cost. See there for the general design, here for the AWS implementation.

About the author (disclosure of credibility): I single-handedly designed, implemented, and run in production an AI-video-localization platform that fully automates "source separation → transcription → translation → multilingual dubbing → mouth synchronization." I run its first stage, source separation, with exactly this article's queue-driven GPU batch. This article's "heartbeat," "Spot graceful termination," and "idempotent cache" aren't demo knowledge but the very design implemented to not crash jobs in production. The project overview is in the track record, and the idempotency principles cultivated in a payment platform are summarized in the reliability-cluster article.

30-second summary (conclusion first)

Point	Conclusion
Overall composition	S3 (input) → S3 event → SQS → GPU worker → S3 (output). Loosely coupled, retryable, idempotent
Why queue-driven	absorb spikes, auto-scale workers by queue depth. Failures retry, poison goes to the DLQ
Idempotency	SQS is at-least-once delivery. The worker prevents double processing with an S3-output-key existence check
Long jobs	periodically extend (heartbeat) the visibility timeout (default 30 seconds, max 12 hours) with `ChangeMessageVisibility`
GPU foundation	for queue-driven batches, AWS Batch (GPU = `resourceRequirements type=GPU`, with built-in Spot and retry) is the prime choice. Resident is ECS
GPU type	g4dn (NVIDIA T4 16GB) = cheapest for ML inference. If VRAM is needed for quality/length, g5(A10G 24GB)/g6(L4 24GB)
Cost	Spot + auto-scale + idempotent cache. Receive Spot interruption with SIGTERM and let go safely
Large input	the SQS message limit is 1 MiB (was 256 KB). Always put the audio body in S3 and pass a pointer
Poison messages	isolate to the DLQ when over `maxReceiveCount`, and don't clog the main flow

Architecture: a loosely-coupled queue-driven pipeline

            ┌──────────────┐  ObjectCreated   ┌─────────────┐
  upload →  │ S3 input bkt  │ ──event notify─→ │  SQS queue  │ ──┐
            └──────────────┘                  └─────────────┘   │ long poll
                                                    │ fail N×    │
                                                    ▼            ▼
                                             ┌──────────┐  ┌───────────────────┐
                                             │   DLQ    │  │  GPU worker pool   │
                                             └──────────┘  │ (AWS Batch / ECS)  │
                                                           │ separate + idemp.  │
                                                           └───────────────────┘
                                                                    │ stems
                                                                    ▼
                                                           ┌─────────────────┐
                                                           │  S3 output bkt   │
                                                           └─────────────────┘

The core of the design is "each component doesn't know the others." The upload side doesn't know the worker exists, and the worker, even if it scales, doesn't break consistency. S3 is the source of truth, SQS is the buffer, the worker is an idempotent transformer — this division of roles (SRP) makes a foundation that withstands both spikes and failures.

⚠️ Beware of an infinite loop: if a worker started by an S3 event writes back to the same bucket, the notification recurses and runs away. Always separate the bucket (or prefix) for input and output. An S3 notification can go to SNS / SQS / Lambda / EventBridge, but note that SQS FIFO can't be a direct notification destination (via EventBridge if needed).

Why design on the premise of "at least once"

SQS is at-least-once delivery. AWS official clearly states it too.

due to the at-least-once delivery model of Amazon SQS, there's no absolute guarantee that a message won't be delivered more than once during the visibility timeout period.

In other words, the same audio being processed twice can happen as the "normal path." Rather than swallowing this, absorb it with idempotency. The most robust is "if the output S3 key already exists, treat it as success without processing" — since separation is heavy GPU processing, avoiding double execution directly becomes cost reduction. The general principles of idempotency are detailed in idempotent async-processing design.

# 冪等キー：音声の内容ではなく"入力S3キー + モデル"で出力先を決定論的に決める
def output_prefix(input_key: str, model: str) -> str:
    """同じ入力×同じモデルなら必ず同じ出力prefix。存在すれば再処理しない。"""
    safe_model = model.replace("/", "_")
    return f"stems/{safe_model}/{input_key}"

The crux of long jobs: the visibility-timeout heartbeat

Source separation takes a few dozen seconds to a few minutes per song. Processing beyond SQS's visibility timeout (default 30 seconds) re-delivers the message to another worker and causes double processing.

The countermeasure is a heartbeat — during processing, keep periodically extending the visibility with ChangeMessageVisibility. It's also AWS official's recommendation.

Implement a heartbeat mechanism to periodically extend the visibility timeout, ensuring the message remains invisible until processing is complete.

But the visibility timeout has an upper limit of "max 12 hours from the first receipt" (extending doesn't reset this limit). For ultra-long ones over this, the official guidance is to split with Step Functions.

# heartbeat.py — 処理中だけ可視性を延長し続けるコンテキストマネージャ
import threading
from contextlib import contextmanager

@contextmanager
def visibility_heartbeat(sqs, queue_url: str, receipt_handle: str,
                         *, extend_to: int = 300, interval: int = 120):
    """interval秒ごとに可視性をextend_to秒へ延長。長時間ジョブの二重配信を防ぐ。"""
    stop = threading.Event()

    def beat() -> None:
        while not stop.wait(interval):           # interval待つ。stopが立てば即終了
            try:
                sqs.change_message_visibility(
                    QueueUrl=queue_url,
                    ReceiptHandle=receipt_handle,
                    VisibilityTimeout=extend_to,  # 最初の受信から最大12hの制約に注意
                )
            except Exception:                     # 延長失敗は致命的でない。次の周期で再試行
                pass

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        yield
    finally:
        stop.set()                                # 処理完了で確実に停止（リーク防止）
        t.join(timeout=1)

Choosing the GPU foundation: AWS Batch or ECS

There are mainly 2 places to put a GPU worker. For queue-driven one-shot batch inference, AWS Batch is purpose-built.

Viewpoint	AWS Batch	ECS (resident service)
Positioning	specialized for batch compute (auto-scales queue + compute environment)	resident operation of containers
GPU specification	`resourceRequirements`'s `type=GPU`	task definition's `resourceRequirements`
Spot integration	built-in (can mix Spot/On-Demand)	configure capacity providers yourself
Retry	the job definition's `attempts` (1–10) built-in	implement on the app/service side
Suited use	mass processing of queue-driven one-shot jobs	low-latency resident, fine-grained control

AWS Batch's GPU request is written in the job definition like this (the official form).

{
  "resourceRequirements": [
    { "type": "GPU",    "value": "1" },
    { "type": "VCPU",   "value": "4" },
    { "type": "MEMORY", "value": "16384" }
  ]
}

💡 How to choose the GPU type: in AWS official words, g4dn (NVIDIA T4, 16GB) is "the lowest-cost GPU instance for ML inference." It's enough for a mass batch of MDX-Net. For VRAM-hungry models like RoFormer or long files, raise to g5 (A10G, 24GB) / g6 (L4, 24GB) / g6e (L40S, 48GB). When using a GPU on ECS, the GPU-optimized AMI and ECS_ENABLE_GPU_SUPPORT=true are needed.

The world's best worker implementation (boto3)

Consolidate the design requirements into one worker — ① model resident, ② S3-key idempotency, ③ visibility heartbeat, ④ graceful termination on Spot interruption, ⑤ leave poison messages to the DLQ, ⑥ structured logs.

# worker.py — 本番GPU音源分離ワーカー（SQS駆動・冪等・回復性）
from __future__ import annotations

import json
import logging
import os
import signal
import sys
import tempfile
from pathlib import Path
from urllib.parse import unquote_plus

import boto3
from audio_separator.separator import Separator

from heartbeat import visibility_heartbeat   # 前掲

# --- 構造化ログ（相関IDを必ず通す。音声内容＝PIIは出さない）---
logging.basicConfig(level=logging.INFO, format='%(message)s')
log = logging.getLogger("worker")

def emit(event: str, **fields) -> None:
    log.info(json.dumps({"event": event, **fields}, ensure_ascii=False))

QUEUE_URL = os.environ["QUEUE_URL"]
OUT_BUCKET = os.environ["OUTPUT_BUCKET"]
MODEL = os.environ.get("SEPARATION_MODEL", "UVR-MDX-NET-Inst_HQ_3.onnx")
MODEL_CACHE = os.environ.get("MODEL_CACHE_DIR", "/models")   # イメージに焼き込み済み推奨

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

_shutdown = False
def _on_sigterm(*_):                 # Spot中断(2分前通知)/ECS停止で届く
    global _shutdown
    _shutdown = True
    emit("shutdown_requested")       # 新規受信を止め、処理中ジョブを安全に終える
signal.signal(signal.SIGTERM, _on_sigterm)

# モデルは一度だけVRAMへ。リクエスト間で再利用（ロードは高コスト）
_separator: Separator | None = None
def get_separator() -> Separator:
    global _separator
    if _separator is None:
        _separator = Separator(output_format="flac", model_file_dir=MODEL_CACHE)
        _separator.load_model(model_filename=MODEL)
        emit("model_loaded", model=MODEL)
    return _separator


def _already_done(prefix: str) -> bool:
    """出力S3キーが存在するなら処理済み＝冪等にスキップ（二重GPU実行を防ぐ）。"""
    resp = s3.list_objects_v2(Bucket=OUT_BUCKET, Prefix=prefix, MaxKeys=1)
    return resp.get("KeyCount", 0) > 0


def process_message(msg: dict) -> None:
    body = json.loads(msg["Body"])
    rec = body["Records"][0]["s3"]                 # S3イベント通知の形
    in_bucket = rec["bucket"]["name"]
    in_key = unquote_plus(rec["object"]["key"])    # S3通知のkeyはURLエンコード済（空白→'+'・特殊文字→%XX）
    prefix = f"stems/{MODEL.replace('/', '_')}/{in_key}"   # 決定論的な出力先
    cid = msg["MessageId"]

    if _already_done(prefix):
        emit("skip_idempotent", cid=cid, key=in_key)
        return                                     # 既に成果物あり＝成功扱い

    with visibility_heartbeat(sqs, QUEUE_URL, msg["ReceiptHandle"]):
        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / Path(in_key).name
            s3.download_file(in_bucket, in_key, str(src))
            sep = get_separator()
            sep.output_dir = tmp
            stems = sep.separate(str(src))         # GPU処理（ハートビートが守る）
            for stem in stems:
                key = f"{prefix}/{Path(stem).name}"
                s3.upload_file(stem, OUT_BUCKET, key)
            emit("separated", cid=cid, key=in_key, stems=len(stems))


def main() -> None:
    get_separator()                                # 起動時ウォームアップ
    while not _shutdown:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
            WaitTimeSeconds=20,                    # ロングポーリング（空ポーリング課金を削減）
        )
        for msg in resp.get("Messages", []):
            try:
                process_message(msg)
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])  # 成功時のみ削除
            except Exception:
                # 失敗は削除しない → 可視性切れで再配信 → maxReceiveCount超でDLQへ
                emit("process_failed", cid=msg.get("MessageId"))
                log.exception("processing error")
    emit("drained_and_exit")
    sys.exit(0)


if __name__ == "__main__":
    main()

The reason this worker is production-quality lies in the handling of failure.

delete_message only on success. A failed message isn't deleted and is auto-re-delivered on visibility expiry.
When re-delivery exceeds maxReceiveCount, it's isolated to the DLQ and doesn't clog the main flow (config in the next section).
On SIGTERM (Spot interruption / ECS stop), stop new receipts and complete only the in-progress. Since an interrupted job isn't deleted, another worker picks it up again — zero misses.

Infrastructure with Terraform (queue + DLQ + S3 notification)

Declaratively define the queue, the dead-letter queue (DLQ), the redrive policy, and the S3 → SQS notification.

# main.tf — SQS + DLQ + S3イベント通知
resource "aws_sqs_queue" "dlq" {
  name                      = "separation-dlq"
  message_retention_seconds = 1209600          # 14日。原因調査の猶予を確保
}

resource "aws_sqs_queue" "jobs" {
  name                       = "separation-jobs"
  visibility_timeout_seconds = 300             # ハートビート前提の初期値
  receive_wait_time_seconds  = 20              # ロングポーリング既定

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5                     # 5回失敗で毒メッセージをDLQへ
  })
}

# S3 → SQS（ObjectCreatedのみ。入力prefixに限定して再帰を防ぐ）
resource "aws_s3_bucket_notification" "ingest" {
  bucket = aws_s3_bucket.input.id
  queue {
    queue_arn     = aws_sqs_queue.jobs.arn
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "uploads/"
  }
}

📦 Handling large audio: the SQS message limit is 1 MiB (expanded from 256 KB in 2025). Since an audio file isn't small, always put the body in S3 and put only the S3 key (a pointer) in the message. With the S3 notification method above, since the message is only event metadata, it naturally takes this form.

Cost optimization: Spot × auto-scale × idempotent cache

GPU is expensive. Carve the production unit price in three stages.

Spot instances: a GPU batch, if you build in interruption resistance, Spot pays off. AWS Batch can mix Spot/On-Demand, and re-submits on interruption. The worker above safely lets go on SIGTERM (the 2-minute-before notice of interruption), so a job lost to Spot is picked up again by re-delivery, not the DLQ.
Queue-depth auto-scale: with ApproximateNumberOfMessagesVisible as the metric, increase workers when the queue piles up, and shrink to 0 when empty. Don't hold an idle GPU.
Idempotent cache: zero reprocessing with the S3-key existence check. Don't pass the same song through the GPU twice on re-upload, re-send, or Spot re-submission.

💴 The way of thinking about the cost model: unit price ≈ (processing seconds per song ÷ 3600) × instance hourly rate ÷ Spot discount rate. g4dn (T4), as AWS official says "the cheapest GPU for ML inference," is the first candidate for a mass batch of MDX-Net. Confirm the latest exact pricing on the pricing page (this article doesn't assert the absolute value of the unit price).

Observability: a state where you can trace a stalled process at a glance

The scariest thing in a mass batch is "you can't tell which job got stuck where." In addition to the worker's structured logs (emit), prepare at least this.

Put queue depth and DLQ count into a CloudWatch alarm (DLQ > 0 is immediate notification = poison-message detection).
Metric-ize processing time, success/failure, and idempotent-skip count to visualize throughput and cost.
Always pass a correlation ID (MessageId / S3 key) through the logs. Don't emit the audio content (which can be PII).

The systematic build-out of observability is summarized in the OpenTelemetry article, and retry/backoff/circuit-breaker in the resilience-patterns article.

Testability: make a boundary that runs without a GPU

GPU-dependent code is hard to run in CI — that's exactly why carving the boundary is the world's best design.

Pure logic like output_prefix / _already_done can be unit-tested without I/O.
Mock S3 / SQS with moto and verify the worker's delivery, idempotency, and DLQ route without a GPU.
Swap the separation body (Separator.separate) via an interface and pass the pipeline with dummy output.

# test_idempotency.py — GPU不要。冪等ロジックだけを検証
from worker import output_prefix  # 純関数として切り出しておく

def test_output_prefix_is_deterministic():
    a = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
    b = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
    assert a == b                       # 同じ入力×モデルなら必ず同じ出力先＝冪等の土台

def test_model_namespaced():
    p1 = output_prefix("uploads/x.wav", "model_a")
    p2 = output_prefix("uploads/x.wav", "model_b")
    assert p1 != p2                     # モデルが違えば出力先も分かれる

Frequently asked questions (FAQ)

Q. Why a container (Batch/ECS), not Lambda? A. Source separation is heavy processing that needs a GPU and minutes-long execution time. Since Lambda doesn't support GPU and has constraints on execution time, a GPU container (AWS Batch / ECS) suits. Carving out only light preprocessing (normalization, queue submission) to Lambda is effective.

Q. Should the visibility timeout just be long from the start? A. Too long delays re-delivery on failure, too short causes double processing. So "short + extend with a heartbeat" is the standard. Note that even with extension, 12 hours from the first receipt is the upper limit, so split ultra-long ones with Step Functions.

Q. Won't a job disappear on Spot interruption? A. It won't. Since the worker delete_messages only on success, an interrupted job is re-delivered on visibility expiry and picked up again by another worker. If you complete only the in-progress on SIGTERM, no misses occur.

Q. What's the guarantee that the same file isn't processed twice? A. Since SQS is at-least-once delivery, "absolutely once" isn't guaranteed. So make it idempotent with an output-S3-key existence check and absorb double execution as success. This also prevents wasted GPU cost.

Q. The cold start is slow with the model's first download. A. Bake the model into the container image (download once at build time), or point model_file_dir to a persistent volume. Since /tmp is volatile, avoid re-downloading per container (details in the troubleshooting article).

Summary: from a manual script to a "non-crashing foundation"

The essence of scaling source separation in production lies not in the model but in foundation design premised on "at least once, fails, gets interrupted."

Loosely coupled: S3 → SQS → GPU worker → S3. Each doesn't know the others.
Idempotent: absorb double processing with an output-S3-key existence check (= cost reduction).
Resilient: visibility heartbeat, delete only on success, DLQ, Spot graceful termination.
Cost: Spot × queue-depth auto-scale × idempotent cache.
Observability / testability: correlation ID, DLQ alarm, a boundary that runs without a GPU.

I actually run this foundation in production on an AI-video-localization platform that has audio separation as its first stage. Please consult me, along with the track record, on the design that raises GPU-using audio/video AI from "works in a PoC" to "doesn't crash while holding down cost in production." With one person × generative AI, I support end-to-end from design to production operation.

Sources / official resources

SQS visibility timeout: Amazon SQS visibility timeout (default 30 seconds, max 12 hours, heartbeat, at-least-once)
SQS message limit: SQS quotas / expanded to 1 MiB (2025)
AWS Batch (GPU): Job definition parameters / What is AWS Batch
ECS GPU: Working with GPUs on Amazon ECS
GPU instances: Amazon EC2 G4 (T4)
S3 event notifications: Amazon S3 Event Notifications (destinations, at-least-once, FIFO unsupported)

※ AWS's specs, limits, and pricing get updated. Always confirm the primary sources before implementing. This article doesn't assert the Spot-interruption behavior or the absolute value of pricing, and presupposes confirmation of the latest official documentation / pricing page.

Scaling audio source separation in production on AWS: a GPU batch-processing platform (SQS × ECS/Batch × S3)

The goal of this article

30-second summary (conclusion first)

Architecture: a loosely-coupled queue-driven pipeline

Why design on the premise of "at least once"

The crux of long jobs: the visibility-timeout heartbeat

Choosing the GPU foundation: AWS Batch or ECS

The world's best worker implementation (boto3)

Infrastructure with Terraform (queue + DLQ + S3 notification)

Cost optimization: Spot × auto-scale × idempotent cache

Observability: a state where you can trace a stalled process at a glance

Testability: make a boundary that runs without a GPU

Frequently asked questions (FAQ)

Summary: from a manual script to a "non-crashing foundation"

Sources / official resources

How to choose a source-separation tool: selecting Demucs / UVR5(MDX-Net) / Spleeter / Open-Unmix by requirements

Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production

Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs

Turning source separation into a production API: the design of GPU worker × job queue × idempotency

Also worth reading

A production-quality AI video-localization platform: designing a long GPU pipeline to run to completion 'without crashing, cheaply, and naturally'

MuseTalk Production Deployment in Practice — Docker, GPU Serving, Autoscaling, Cost Optimization, Observability

Self-hosting Llama in production with vLLM: a high-throughput inference-server operations log

The goal of this article

30-second summary (conclusion first)

Architecture: a loosely-coupled queue-driven pipeline

Why design on the premise of "at least once"

The crux of long jobs: the visibility-timeout heartbeat

Choosing the GPU foundation: AWS Batch or ECS

The world's best worker implementation (boto3)

Infrastructure with Terraform (queue + DLQ + S3 notification)

Cost optimization: Spot × auto-scale × idempotent cache

Observability: a state where you can trace a stalled process at a glance

Testability: make a boundary that runs without a GPU

Frequently asked questions (FAQ)

Summary: from a manual script to a "non-crashing foundation"

Sources / official resources

Related articles

How to choose a source-separation tool: selecting Demucs / UVR5(MDX-Net) / Spleeter / Open-Unmix by requirements

Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production

Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs

Turning source separation into a production API: the design of GPU worker × job queue × idempotency

Also worth reading

A production-quality AI video-localization platform: designing a long GPU pipeline to run to completion 'without crashing, cheaply, and naturally'

MuseTalk Production Deployment in Practice — Docker, GPU Serving, Autoscaling, Cost Optimization, Observability

Self-hosting Llama in production with vLLM: a high-throughput inference-server operations log