Incident Architecture Breakdown
Queue-First Cloud Code Execution: Preventing Worker Starvation Under Burst Load
How I shifted from request-coupled execution to queue-worker isolation to keep execution throughput stable under burst traffic.
1. Hook and Stakes
Running untrusted code directly inside synchronous API requests caused latency spikes and resource contention during traffic bursts.
If execution jobs block API threads, the platform fails the reliability and safety expectations of production backend systems.
2. Architecture Diagram
Split control-plane web traffic from execution-plane worker traffic, with queue buffering between API ingestion and sandbox runtimes.
```mermaid
graph LR
    Client[UI / API Client] --> Control["Control API (ALB)"]
    Control --> ExecAPI[Execution API Boundary]
    ExecAPI --> Queue[Execution Queue]
    Queue --> DLQ[Dead Letter Queue]
    Queue --> Worker[Fargate Spot Workers]
    DLQ --> Recovery[EventBridge Replay]
    Recovery --> Queue
    Worker --> Store[(Result Store)]
    Store --> ExecAPI
```
- ALB-backed control and execution API boundaries
- Queue + DLQ buffering for async execution and recovery replay
- Fargate Spot workers with bounded runtime resources
- EventBridge-triggered DLQ recovery automation
- Deterministic result formatting and retrieval
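The queue, DLQ, and worker flow above can be sketched in-process. This is a minimal illustration, not the platform's actual code: `jobs`, `dlq`, and `run_sandboxed` are hypothetical stand-ins for SQS, its dead-letter queue, and the sandboxed runtime.

```python
import queue
import threading

MAX_ATTEMPTS = 3

# In-process stand-ins for the managed services in the diagram.
jobs = queue.Queue()     # Execution Queue
dlq = queue.Queue()      # Dead Letter Queue
results = {}             # Result Store

def submit(job_id, payload):
    """API boundary: enqueue and return a job-state response immediately,
    instead of executing the payload inside the request thread."""
    jobs.put({"id": job_id, "payload": payload, "attempts": 0})
    return {"job_id": job_id, "state": "QUEUED"}

def run_sandboxed(payload):
    # Placeholder for the sandboxed runtime; fails on a marker payload.
    if payload == "poison":
        raise RuntimeError("execution failed")
    return f"ok:{payload}"

def worker():
    # Drain the queue; retry a bounded number of times, then park on the DLQ.
    while True:
        try:
            job = jobs.get(timeout=0.2)
        except queue.Empty:
            return
        job["attempts"] += 1
        try:
            results[job["id"]] = run_sandboxed(job["payload"])
        except Exception:
            if job["attempts"] < MAX_ATTEMPTS:
                jobs.put(job)    # bounded retry
            else:
                dlq.put(job)     # retries exhausted: candidate for replay

submit("a", "print(1)")          # returns {"job_id": "a", "state": "QUEUED"}
submit("b", "poison")
t = threading.Thread(target=worker)
t.start()
t.join()
```

The key property is that `submit` never touches the runtime: it only enqueues and returns a state, so API responsiveness no longer depends on runtime availability.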
3. Stress Test and Breaking Point
Setup: I replayed burst execution submissions while increasing concurrent requests and mixing payload sizes.
Failure Signal: The request-coupled version showed queueing at the API layer and rising tail latency as concurrent jobs contended for runtime resources.
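A minimal replay harness along these lines makes the tail-latency signal measurable; this is a sketch with an assumed `handler` callable, not the actual test rig, which used real endpoints and mixed payloads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def replay_burst(handler, n_requests=200, concurrency=50):
    """Fire n_requests through handler at a fixed fan-out and
    return p50/p99 latency in seconds."""
    latencies = []
    def timed(payload):
        start = time.perf_counter()
        handler(payload)
        latencies.append(time.perf_counter() - start)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(n_requests)))
    latencies.sort()
    return {
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[int(n_requests * 0.99) - 1],
    }
```

Running this against a handler that executes inline versus one that only enqueues shows the divergence directly: the enqueue-only handler's p99 stays flat as concurrency rises, while the inline handler's p99 climbs with contention.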
- Execution pipeline stabilized after queue-worker decoupling and bounded retry controls.
- Fargate Spot worker pools reduced asynchronous compute cost by 70% in burst validation runs.
- DLQ replay automation achieved 100% payload recovery during staged partition drills.
- Worker-level isolation prevented one heavy job class from starving unrelated requests.
4. Bottleneck Root Cause and Resolution
Root Cause: Synchronous execution tied API responsiveness to runtime availability, so bursts exhausted the limited execution capacity and stalled unrelated requests.
Resolution: I moved execution requests into an asynchronous queue, enforced worker runtime ceilings, and returned deterministic job-state responses from the API boundary.
- Asynchronous execution improves reliability but introduces queue latency overhead.
- Strict runtime limits improve safety but can reject edge-case workloads.
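The runtime-ceiling and deterministic-response pieces of the resolution can be sketched as follows. This is an illustrative stand-in: production isolation is the Fargate task boundary, not a bare subprocess, and `RUNTIME_CEILING_SECONDS` is a hypothetical value.

```python
import subprocess
import sys

RUNTIME_CEILING_SECONDS = 5  # hypothetical per-job wall-clock ceiling

def execute_with_ceiling(code, timeout_s=RUNTIME_CEILING_SECONDS):
    """Run a code payload in a subprocess under a hard wall-clock limit,
    mapping every outcome to the same deterministic job-state shape."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        state = "SUCCEEDED" if proc.returncode == 0 else "FAILED"
        return {"state": state, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the caller still
        # receives the same stable response shape.
        return {"state": "TIMED_OUT", "stdout": "",
                "stderr": "runtime ceiling exceeded"}
```

Because success, failure, and timeout all produce the same response shape, the API boundary can report job state without ever exposing raw runtime errors, which is what makes the second tradeoff above explicit: a legitimate long-running job hits the ceiling and is rejected deterministically rather than starving the pool.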
5. Business Impact
- Improved platform reliability under burst load by isolating user-facing and execution-facing concerns.
- Reduced operational risk from untrusted workloads through bounded runtime controls.
- Created an architecture that can be validated directly through live endpoints and system design docs.