Incident Architecture Breakdown
Queue-First Cloud Code Execution: Preventing Worker Starvation Under Burst Load
How I shifted from request-coupled execution to queue-worker isolation to keep execution throughput stable under burst traffic.
1. Hook and Stakes
Running untrusted code directly inside synchronous API requests caused latency spikes and resource contention during traffic bursts.
If execution jobs block API threads, the platform fails the reliability and safety expectations of production backend systems.
2. Architecture Diagram
Split control-plane web traffic from execution-plane worker traffic, with queue buffering between API ingestion and sandbox runtimes.
```mermaid
graph LR
    Client[UI / API Client] --> Control["Control API (ALB)"]
    Control --> ExecAPI[Execution API Boundary]
    ExecAPI --> Queue[Execution Queue]
    Queue --> DLQ[Dead Letter Queue]
    Queue --> Worker[Fargate Spot Workers]
    DLQ --> Recovery[EventBridge Replay]
    Recovery --> Queue
    Worker --> Store[(Result Store)]
    Store --> ExecAPI
```
- ALB-backed control and execution API boundaries
- Queue + DLQ buffering for async execution and recovery replay
- Fargate Spot workers with bounded runtime resources
- EventBridge-triggered DLQ recovery automation
- Deterministic result formatting and retrieval
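The queue, DLQ, and worker flow above can be sketched in-process. This is a minimal illustration, not the platform's actual code: `jobs`, `dlq`, and `run_sandboxed` are hypothetical stand-ins for SQS, its dead-letter queue, and the sandboxed runtime.

```python
import queue
import threading

MAX_ATTEMPTS = 3

# In-process stand-ins for the managed services in the diagram.
jobs = queue.Queue()     # Execution Queue
dlq = queue.Queue()      # Dead Letter Queue
results = {}             # Result Store

def submit(job_id, payload):
    """API boundary: enqueue and return a job-state response immediately,
    instead of executing the payload inside the request thread."""
    jobs.put({"id": job_id, "payload": payload, "attempts": 0})
    return {"job_id": job_id, "state": "QUEUED"}

def run_sandboxed(payload):
    # Placeholder for the sandboxed runtime; fails on a marker payload.
    if payload == "poison":
        raise RuntimeError("execution failed")
    return f"ok:{payload}"

def worker():
    # Drain the queue; retry a bounded number of times, then park on the DLQ.
    while True:
        try:
            job = jobs.get(timeout=0.2)
        except queue.Empty:
            return
        job["attempts"] += 1
        try:
            results[job["id"]] = run_sandboxed(job["payload"])
        except Exception:
            if job["attempts"] < MAX_ATTEMPTS:
                jobs.put(job)    # bounded retry
            else:
                dlq.put(job)     # retries exhausted: candidate for replay

submit("a", "print(1)")          # returns {"job_id": "a", "state": "QUEUED"}
submit("b", "poison")
t = threading.Thread(target=worker)
t.start()
t.join()
```

The key property is that `submit` never touches the runtime: it only enqueues and returns a state, so API responsiveness no longer depends on runtime availability.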
3. Stress Test and Breaking Point
Setup: I replayed burst execution submissions while increasing concurrent requests and mixing payload sizes.
Failure Signal: The request-coupled version showed queueing at the API layer and rising tail latency as concurrent jobs contended for runtime resources.
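A minimal replay harness along these lines makes the tail-latency signal measurable; this is a sketch with an assumed `handler` callable, not the actual test rig, which used real endpoints and mixed payloads.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def replay_burst(handler, n_requests=200, concurrency=50):
    """Fire n_requests through handler at a fixed fan-out and
    return p50/p99 latency in seconds."""
    latencies = []
    def timed(payload):
        start = time.perf_counter()
        handler(payload)
        latencies.append(time.perf_counter() - start)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, range(n_requests)))
    latencies.sort()
    return {
        "p50": latencies[len(latencies) // 2],
        "p99": latencies[int(n_requests * 0.99) - 1],
    }
```

Running this against a handler that executes inline versus one that only enqueues shows the divergence directly: the enqueue-only handler's p99 stays flat as concurrency rises, while the inline handler's p99 climbs with contention.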
- Execution pipeline stabilized after queue-worker decoupling and bounded retry controls.
- Fargate Spot worker pools reduced asynchronous compute cost by 70% in burst validation runs.
- DLQ replay automation achieved 100% payload recovery during staged partition drills.
- Worker-level isolation prevented one heavy job class from starving unrelated requests.
4. Bottleneck Root Cause and Resolution
Root Cause: Synchronous execution tied API responsiveness to runtime availability, so bursts exhausted the limited execution capacity and stalled unrelated requests.
Resolution: I moved execution requests into an asynchronous queue, enforced worker runtime ceilings, and returned deterministic job-state responses from the API boundary.
- Asynchronous execution improves reliability but introduces queue latency overhead.
- Strict runtime limits improve safety but can reject edge-case workloads.
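The runtime-ceiling and deterministic-response pieces of the resolution can be sketched as follows. This is an illustrative stand-in: production isolation is the Fargate task boundary, not a bare subprocess, and `RUNTIME_CEILING_SECONDS` is a hypothetical value.

```python
import subprocess
import sys

RUNTIME_CEILING_SECONDS = 5  # hypothetical per-job wall-clock ceiling

def execute_with_ceiling(code, timeout_s=RUNTIME_CEILING_SECONDS):
    """Run a code payload in a subprocess under a hard wall-clock limit,
    mapping every outcome to the same deterministic job-state shape."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        state = "SUCCEEDED" if proc.returncode == 0 else "FAILED"
        return {"state": state, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the caller still
        # receives the same stable response shape.
        return {"state": "TIMED_OUT", "stdout": "",
                "stderr": "runtime ceiling exceeded"}
```

Because success, failure, and timeout all produce the same response shape, the API boundary can report job state without ever exposing raw runtime errors, which is what makes the second tradeoff above explicit: a legitimate long-running job hits the ceiling and is rejected deterministically rather than starving the pool.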
5. Business Impact
- Improved platform reliability under burst load by isolating user-facing and execution-facing concerns.
- Reduced operational risk from untrusted workloads through bounded runtime controls.
- Created an architecture that can be validated directly through live endpoints and system design docs.