The goal of this article
Separating "one file on your machine" with UVR5/MDX-Net or Demucs is easy. The problem is the stage of processing thousands of files a day, without crashing, cheaply, and idempotently. Here many PoCs hit "the limit of the manual script."
This article shows a queue-driven batch foundation that scales source separation in production on AWS, with design philosophy and working code. When you finish reading, you can assemble the following.
- You can design a loosely-coupled pipeline of S3 event → SQS → GPU worker → S3.
- You can safely process a minutes-long long job with a visibility-timeout heartbeat.
- You can implement an idempotent, resilient GPU worker that withstands Spot interruption, poison messages, and double delivery.
📐 This article's positioning: the general pattern of turning source separation into a production service (a type-safe API ingress, a job queue, OOM graceful degradation, testability, and other platform-independent design) is handled in the GPU-worker-foundation article. This article is the "platform implementation edition" of concretely implementing it on AWS — narrowed to AWS-specific points like SQS's visibility timeout and message limit, GPU requests in AWS Batch/ECS, S3 event-driven, Spot, Terraform, and g4dn cost. See there for the general design, here for the AWS implementation.
About the author (disclosure of credibility): I single-handedly designed, implemented, and run in production an AI-video-localization platform that fully automates "source separation → transcription → translation → multilingual dubbing → mouth synchronization." I run its first stage, source separation, with exactly this article's queue-driven GPU batch. This article's "heartbeat," "Spot graceful termination," and "idempotent cache" aren't demo knowledge but the very design implemented to not crash jobs in production. The project overview is in the track record, and the idempotency principles cultivated in a payment platform are summarized in the reliability-cluster article.
30-second summary (conclusion first)
| Point | Conclusion |
|---|---|
| Overall composition | S3 (input) → S3 event → SQS → GPU worker → S3 (output). Loosely coupled, retryable, idempotent |
| Why queue-driven | absorb spikes, auto-scale workers by queue depth. Failures retry, poison goes to the DLQ |
| Idempotency | SQS is at-least-once delivery. The worker prevents double processing with an S3-output-key existence check |
| Long jobs | periodically extend (heartbeat) the visibility timeout (default 30 seconds, max 12 hours) with ChangeMessageVisibility |
| GPU foundation | for queue-driven batches, AWS Batch (GPU = resourceRequirements type=GPU, with built-in Spot and retry) is the prime choice. Resident is ECS |
| GPU type | g4dn (NVIDIA T4 16GB) = cheapest for ML inference. If VRAM is needed for quality/length, g5(A10G 24GB)/g6(L4 24GB) |
| Cost | Spot + auto-scale + idempotent cache. Receive Spot interruption with SIGTERM and let go safely |
| Large input | the SQS message limit is 1 MiB (was 256 KB). Always put the audio body in S3 and pass a pointer |
| Poison messages | isolate to the DLQ when over maxReceiveCount, and don't clog the main flow |
Architecture: a loosely-coupled queue-driven pipeline
┌──────────────┐ ObjectCreated ┌─────────────┐
upload → │ S3 input bkt │ ──event notify─→ │ SQS queue │ ──┐
└──────────────┘ └─────────────┘ │ long poll
│ fail N× │
▼ ▼
┌──────────┐ ┌───────────────────┐
│ DLQ │ │ GPU worker pool │
└──────────┘ │ (AWS Batch / ECS) │
│ separate + idemp. │
└───────────────────┘
│ stems
▼
┌─────────────────┐
│ S3 output bkt │
└─────────────────┘
The core of the design is "each component doesn't know the others." The upload side doesn't know the worker exists, and the worker, even if it scales, doesn't break consistency. S3 is the source of truth, SQS is the buffer, the worker is an idempotent transformer — this division of roles (SRP) makes a foundation that withstands both spikes and failures.
⚠️ Beware of an infinite loop: if a worker started by an S3 event writes back to the same bucket, the notification recurses and runs away. Always separate the bucket (or prefix) for input and output. An S3 notification can go to SNS / SQS / Lambda / EventBridge, but note that SQS FIFO can't be a direct notification destination (via EventBridge if needed).
Why design on the premise of "at least once"
SQS is at-least-once delivery. AWS official clearly states it too.
due to the at-least-once delivery model of Amazon SQS, there's no absolute guarantee that a message won't be delivered more than once during the visibility timeout period.
In other words, the same audio being processed twice can happen as the "normal path." Rather than swallowing this, absorb it with idempotency. The most robust is "if the output S3 key already exists, treat it as success without processing" — since separation is heavy GPU processing, avoiding double execution directly becomes cost reduction. The general principles of idempotency are detailed in idempotent async-processing design.
# 冪等キー:音声の内容ではなく"入力S3キー + モデル"で出力先を決定論的に決める
def output_prefix(input_key: str, model: str) -> str:
"""同じ入力×同じモデルなら必ず同じ出力prefix。存在すれば再処理しない。"""
safe_model = model.replace("/", "_")
return f"stems/{safe_model}/{input_key}"
The crux of long jobs: the visibility-timeout heartbeat
Source separation takes a few dozen seconds to a few minutes per song. Processing beyond SQS's visibility timeout (default 30 seconds) re-delivers the message to another worker and causes double processing.
The countermeasure is a heartbeat — during processing, keep periodically extending the visibility with ChangeMessageVisibility. It's also AWS official's recommendation.
Implement a heartbeat mechanism to periodically extend the visibility timeout, ensuring the message remains invisible until processing is complete.
But the visibility timeout has an upper limit of "max 12 hours from the first receipt" (extending doesn't reset this limit). For ultra-long ones over this, the official guidance is to split with Step Functions.
# heartbeat.py — 処理中だけ可視性を延長し続けるコンテキストマネージャ
import threading
from contextlib import contextmanager
@contextmanager
def visibility_heartbeat(sqs, queue_url: str, receipt_handle: str,
*, extend_to: int = 300, interval: int = 120):
"""interval秒ごとに可視性をextend_to秒へ延長。長時間ジョブの二重配信を防ぐ。"""
stop = threading.Event()
def beat() -> None:
while not stop.wait(interval): # interval待つ。stopが立てば即終了
try:
sqs.change_message_visibility(
QueueUrl=queue_url,
ReceiptHandle=receipt_handle,
VisibilityTimeout=extend_to, # 最初の受信から最大12hの制約に注意
)
except Exception: # 延長失敗は致命的でない。次の周期で再試行
pass
t = threading.Thread(target=beat, daemon=True)
t.start()
try:
yield
finally:
stop.set() # 処理完了で確実に停止(リーク防止)
t.join(timeout=1)
Choosing the GPU foundation: AWS Batch or ECS
There are mainly 2 places to put a GPU worker. For queue-driven one-shot batch inference, AWS Batch is purpose-built.
| Viewpoint | AWS Batch | ECS (resident service) |
|---|---|---|
| Positioning | specialized for batch compute (auto-scales queue + compute environment) | resident operation of containers |
| GPU specification | resourceRequirements's type=GPU | task definition's resourceRequirements |
| Spot integration | built-in (can mix Spot/On-Demand) | configure capacity providers yourself |
| Retry | the job definition's attempts (1–10) built-in | implement on the app/service side |
| Suited use | mass processing of queue-driven one-shot jobs | low-latency resident, fine-grained control |
AWS Batch's GPU request is written in the job definition like this (the official form).
{
"resourceRequirements": [
{ "type": "GPU", "value": "1" },
{ "type": "VCPU", "value": "4" },
{ "type": "MEMORY", "value": "16384" }
]
}
💡 How to choose the GPU type: in AWS official words, g4dn (NVIDIA T4, 16GB) is "the lowest-cost GPU instance for ML inference." It's enough for a mass batch of MDX-Net. For VRAM-hungry models like RoFormer or long files, raise to g5 (A10G, 24GB) / g6 (L4, 24GB) / g6e (L40S, 48GB). When using a GPU on ECS, the GPU-optimized AMI and
ECS_ENABLE_GPU_SUPPORT=trueare needed.
The world's best worker implementation (boto3)
Consolidate the design requirements into one worker — ① model resident, ② S3-key idempotency, ③ visibility heartbeat, ④ graceful termination on Spot interruption, ⑤ leave poison messages to the DLQ, ⑥ structured logs.
# worker.py — 本番GPU音源分離ワーカー(SQS駆動・冪等・回復性)
from __future__ import annotations
import json
import logging
import os
import signal
import sys
import tempfile
from pathlib import Path
from urllib.parse import unquote_plus
import boto3
from audio_separator.separator import Separator
from heartbeat import visibility_heartbeat # 前掲
# --- 構造化ログ(相関IDを必ず通す。音声内容=PIIは出さない)---
logging.basicConfig(level=logging.INFO, format='%(message)s')
log = logging.getLogger("worker")
def emit(event: str, **fields) -> None:
log.info(json.dumps({"event": event, **fields}, ensure_ascii=False))
QUEUE_URL = os.environ["QUEUE_URL"]
OUT_BUCKET = os.environ["OUTPUT_BUCKET"]
MODEL = os.environ.get("SEPARATION_MODEL", "UVR-MDX-NET-Inst_HQ_3.onnx")
MODEL_CACHE = os.environ.get("MODEL_CACHE_DIR", "/models") # イメージに焼き込み済み推奨
sqs = boto3.client("sqs")
s3 = boto3.client("s3")
_shutdown = False
def _on_sigterm(*_): # Spot中断(2分前通知)/ECS停止で届く
global _shutdown
_shutdown = True
emit("shutdown_requested") # 新規受信を止め、処理中ジョブを安全に終える
signal.signal(signal.SIGTERM, _on_sigterm)
# モデルは一度だけVRAMへ。リクエスト間で再利用(ロードは高コスト)
_separator: Separator | None = None
def get_separator() -> Separator:
global _separator
if _separator is None:
_separator = Separator(output_format="flac", model_file_dir=MODEL_CACHE)
_separator.load_model(model_filename=MODEL)
emit("model_loaded", model=MODEL)
return _separator
def _already_done(prefix: str) -> bool:
"""出力S3キーが存在するなら処理済み=冪等にスキップ(二重GPU実行を防ぐ)。"""
resp = s3.list_objects_v2(Bucket=OUT_BUCKET, Prefix=prefix, MaxKeys=1)
return resp.get("KeyCount", 0) > 0
def process_message(msg: dict) -> None:
body = json.loads(msg["Body"])
rec = body["Records"][0]["s3"] # S3イベント通知の形
in_bucket = rec["bucket"]["name"]
in_key = unquote_plus(rec["object"]["key"]) # S3通知のkeyはURLエンコード済(空白→'+'・特殊文字→%XX)
prefix = f"stems/{MODEL.replace('/', '_')}/{in_key}" # 決定論的な出力先
cid = msg["MessageId"]
if _already_done(prefix):
emit("skip_idempotent", cid=cid, key=in_key)
return # 既に成果物あり=成功扱い
with visibility_heartbeat(sqs, QUEUE_URL, msg["ReceiptHandle"]):
with tempfile.TemporaryDirectory() as tmp:
src = Path(tmp) / Path(in_key).name
s3.download_file(in_bucket, in_key, str(src))
sep = get_separator()
sep.output_dir = tmp
stems = sep.separate(str(src)) # GPU処理(ハートビートが守る)
for stem in stems:
key = f"{prefix}/{Path(stem).name}"
s3.upload_file(stem, OUT_BUCKET, key)
emit("separated", cid=cid, key=in_key, stems=len(stems))
def main() -> None:
get_separator() # 起動時ウォームアップ
while not _shutdown:
resp = sqs.receive_message(
QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
WaitTimeSeconds=20, # ロングポーリング(空ポーリング課金を削減)
)
for msg in resp.get("Messages", []):
try:
process_message(msg)
sqs.delete_message(QueueUrl=QUEUE_URL,
ReceiptHandle=msg["ReceiptHandle"]) # 成功時のみ削除
except Exception:
# 失敗は削除しない → 可視性切れで再配信 → maxReceiveCount超でDLQへ
emit("process_failed", cid=msg.get("MessageId"))
log.exception("processing error")
emit("drained_and_exit")
sys.exit(0)
if __name__ == "__main__":
main()
The reason this worker is production-quality lies in the handling of failure.
delete_messageonly on success. A failed message isn't deleted and is auto-re-delivered on visibility expiry.- When re-delivery exceeds
maxReceiveCount, it's isolated to the DLQ and doesn't clog the main flow (config in the next section). - On SIGTERM (Spot interruption / ECS stop), stop new receipts and complete only the in-progress. Since an interrupted job isn't
deleted, another worker picks it up again — zero misses.
Infrastructure with Terraform (queue + DLQ + S3 notification)
Declaratively define the queue, the dead-letter queue (DLQ), the redrive policy, and the S3 → SQS notification.
# main.tf — SQS + DLQ + S3イベント通知
resource "aws_sqs_queue" "dlq" {
name = "separation-dlq"
message_retention_seconds = 1209600 # 14日。原因調査の猶予を確保
}
resource "aws_sqs_queue" "jobs" {
name = "separation-jobs"
visibility_timeout_seconds = 300 # ハートビート前提の初期値
receive_wait_time_seconds = 20 # ロングポーリング既定
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq.arn
maxReceiveCount = 5 # 5回失敗で毒メッセージをDLQへ
})
}
# S3 → SQS(ObjectCreatedのみ。入力prefixに限定して再帰を防ぐ)
resource "aws_s3_bucket_notification" "ingest" {
bucket = aws_s3_bucket.input.id
queue {
queue_arn = aws_sqs_queue.jobs.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = "uploads/"
}
}
📦 Handling large audio: the SQS message limit is 1 MiB (expanded from 256 KB in 2025). Since an audio file isn't small, always put the body in S3 and put only the S3 key (a pointer) in the message. With the S3 notification method above, since the message is only event metadata, it naturally takes this form.
Cost optimization: Spot × auto-scale × idempotent cache
GPU is expensive. Carve the production unit price in three stages.
- Spot instances: a GPU batch, if you build in interruption resistance, Spot pays off. AWS Batch can mix Spot/On-Demand, and re-submits on interruption. The worker above safely lets go on SIGTERM (the 2-minute-before notice of interruption), so a job lost to Spot is picked up again by re-delivery, not the DLQ.
- Queue-depth auto-scale: with
ApproximateNumberOfMessagesVisibleas the metric, increase workers when the queue piles up, and shrink to 0 when empty. Don't hold an idle GPU. - Idempotent cache: zero reprocessing with the S3-key existence check. Don't pass the same song through the GPU twice on re-upload, re-send, or Spot re-submission.
💴 The way of thinking about the cost model: unit price ≈ (processing seconds per song ÷ 3600) × instance hourly rate ÷ Spot discount rate. g4dn (T4), as AWS official says "the cheapest GPU for ML inference," is the first candidate for a mass batch of MDX-Net. Confirm the latest exact pricing on the pricing page (this article doesn't assert the absolute value of the unit price).
Observability: a state where you can trace a stalled process at a glance
The scariest thing in a mass batch is "you can't tell which job got stuck where." In addition to the worker's structured logs (emit), prepare at least this.
- Put queue depth and DLQ count into a CloudWatch alarm (DLQ > 0 is immediate notification = poison-message detection).
- Metric-ize processing time, success/failure, and idempotent-skip count to visualize throughput and cost.
- Always pass a correlation ID (MessageId / S3 key) through the logs. Don't emit the audio content (which can be PII).
The systematic build-out of observability is summarized in the OpenTelemetry article, and retry/backoff/circuit-breaker in the resilience-patterns article.
Testability: make a boundary that runs without a GPU
GPU-dependent code is hard to run in CI — that's exactly why carving the boundary is the world's best design.
- Pure logic like
output_prefix/_already_donecan be unit-tested without I/O. - Mock S3 / SQS with moto and verify the worker's delivery, idempotency, and DLQ route without a GPU.
- Swap the separation body (
Separator.separate) via an interface and pass the pipeline with dummy output.
# test_idempotency.py — GPU不要。冪等ロジックだけを検証
from worker import output_prefix # 純関数として切り出しておく
def test_output_prefix_is_deterministic():
a = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
b = output_prefix("uploads/song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx")
assert a == b # 同じ入力×モデルなら必ず同じ出力先=冪等の土台
def test_model_namespaced():
p1 = output_prefix("uploads/x.wav", "model_a")
p2 = output_prefix("uploads/x.wav", "model_b")
assert p1 != p2 # モデルが違えば出力先も分かれる
Frequently asked questions (FAQ)
Q. Why a container (Batch/ECS), not Lambda? A. Source separation is heavy processing that needs a GPU and minutes-long execution time. Since Lambda doesn't support GPU and has constraints on execution time, a GPU container (AWS Batch / ECS) suits. Carving out only light preprocessing (normalization, queue submission) to Lambda is effective.
Q. Should the visibility timeout just be long from the start? A. Too long delays re-delivery on failure, too short causes double processing. So "short + extend with a heartbeat" is the standard. Note that even with extension, 12 hours from the first receipt is the upper limit, so split ultra-long ones with Step Functions.
Q. Won't a job disappear on Spot interruption?
A. It won't. Since the worker delete_messages only on success, an interrupted job is re-delivered on visibility expiry and picked up again by another worker. If you complete only the in-progress on SIGTERM, no misses occur.
Q. What's the guarantee that the same file isn't processed twice? A. Since SQS is at-least-once delivery, "absolutely once" isn't guaranteed. So make it idempotent with an output-S3-key existence check and absorb double execution as success. This also prevents wasted GPU cost.
Q. The cold start is slow with the model's first download.
A. Bake the model into the container image (download once at build time), or point model_file_dir to a persistent volume. Since /tmp is volatile, avoid re-downloading per container (details in the troubleshooting article).
Summary: from a manual script to a "non-crashing foundation"
The essence of scaling source separation in production lies not in the model but in foundation design premised on "at least once, fails, gets interrupted."
- Loosely coupled: S3 → SQS → GPU worker → S3. Each doesn't know the others.
- Idempotent: absorb double processing with an output-S3-key existence check (= cost reduction).
- Resilient: visibility heartbeat, delete only on success, DLQ, Spot graceful termination.
- Cost: Spot × queue-depth auto-scale × idempotent cache.
- Observability / testability: correlation ID, DLQ alarm, a boundary that runs without a GPU.
I actually run this foundation in production on an AI-video-localization platform that has audio separation as its first stage. Please consult me, along with the track record, on the design that raises GPU-using audio/video AI from "works in a PoC" to "doesn't crash while holding down cost in production." With one person × generative AI, I support end-to-end from design to production operation.
Sources / official resources
- SQS visibility timeout: Amazon SQS visibility timeout (default 30 seconds, max 12 hours, heartbeat, at-least-once)
- SQS message limit: SQS quotas / expanded to 1 MiB (2025)
- AWS Batch (GPU): Job definition parameters / What is AWS Batch
- ECS GPU: Working with GPUs on Amazon ECS
- GPU instances: Amazon EC2 G4 (T4)
- S3 event notifications: Amazon S3 Event Notifications (destinations, at-least-once, FIFO unsupported)
※ AWS's specs, limits, and pricing get updated. Always confirm the primary sources before implementing. This article doesn't assert the Spot-interruption behavior or the absolute value of pricing, and presupposes confirmation of the latest official documentation / pricing page.