Cloud Sandbox System Design
Scaling & Messaging Systems
Fault-tolerant cloud sandbox platform for isolated code execution, queue resilience, and high-throughput payload processing.
Fargate Spot FinOps + DLQ Recovery | 15k+ Req/Min Burst Tests
Why This Project Matters
Shows SRE-first backend platform engineering where sandbox isolation, autoscaling, queue durability, and cost-efficiency are first-class requirements.
Tech + Architecture Summary
- Tech: Node.js, Redis, AWS Fargate Spot, EventBridge, Terraform (IaC), FinOps
- Architecture: ALB execution ingress -> queue and DLQ lanes -> Fargate Spot worker pool -> result store -> recovery scheduler via EventBridge.
Impact Metrics
- Architected an elastic worker pool on AWS Fargate Spot capacity, provisioned with Terraform, reducing compute costs by 70% for asynchronous payload processing.
- Built a self-healing queue layer using Redis-backed dead-letter queues (DLQs) and scheduled AWS EventBridge triggers, achieving 100% payload recovery during staged network-partition drills.
- Tuned Node.js V8 garbage collection and libuv thread-pool sizing to keep memory use stable during sustained payload spikes of 15,000+ req/min.
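The DLQ recovery flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's code: the arrays stand in for the Redis-backed queue and DLQ, `MAX_ATTEMPTS` is an assumed retry budget, and `redrive` is a hypothetical name for what a scheduled trigger would invoke.

```javascript
// Hypothetical in-memory stand-ins for the Redis-backed queue and DLQ.
const MAX_ATTEMPTS = 3; // assumed retry budget, not taken from the project

function createQueues() {
  return { main: [], dlq: [] };
}

// Process jobs until the main queue is empty.
// A job whose handler throws is re-queued until it has failed
// MAX_ATTEMPTS times, then it is parked in the DLQ.
function drain(queues, handler) {
  const results = [];
  while (queues.main.length > 0) {
    const job = queues.main.shift();
    try {
      results.push({ id: job.id, output: handler(job) });
    } catch (err) {
      job.attempts += 1;
      job.lastError = String(err && err.message);
      if (job.attempts >= MAX_ATTEMPTS) {
        queues.dlq.push(job);
      } else {
        queues.main.push(job);
      }
    }
  }
  return results;
}

// Recovery sweep: move every parked job back to the main queue
// with a fresh retry budget. Returns how many jobs were moved.
function redrive(queues) {
  const parked = queues.dlq.splice(0);
  for (const job of parked) {
    queues.main.push({ ...job, attempts: 0 });
  }
  return parked.length;
}
```

In a drill, jobs that fail during the simulated outage end up in `dlq`; once the dependency is healthy again, a scheduled call to `redrive` followed by `drain` completes them.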
Core Problem
Execute untrusted user code safely while controlling runtime limits, output size, and request-level isolation.
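As a rough illustration of the limit handling this implies, the sketch below clamps caller-requested limits to platform ceilings and truncates output to a byte budget. The names `CEILINGS`, `DEFAULTS`, `resolveLimits`, and `capOutput`, and all numeric values, are assumptions made for the example. The isolation itself would be enforced by the container runtime, not by this code.

```javascript
// Assumed ceilings and defaults; the real values are not stated in this write-up.
const CEILINGS = { timeoutMs: 5000, memoryMb: 256, maxOutputBytes: 65536 };
const DEFAULTS = { timeoutMs: 2000, memoryMb: 128, maxOutputBytes: 16384 };

// Resolve the limits for one request: use the requested value when it is
// a positive number, otherwise the default, and never exceed the ceiling.
function resolveLimits(requested = {}) {
  const limits = {};
  for (const key of Object.keys(CEILINGS)) {
    const value = Number(requested[key]);
    const chosen = Number.isFinite(value) && value > 0 ? value : DEFAULTS[key];
    limits[key] = Math.min(chosen, CEILINGS[key]);
  }
  return limits;
}

// Truncate output to a byte budget and report whether truncation happened.
// Cutting in the middle of a multi-byte UTF-8 character yields a
// replacement character at the end of the returned string.
function capOutput(text, maxBytes) {
  const bytes = Buffer.from(text, "utf8");
  if (bytes.length <= maxBytes) {
    return { output: text, truncated: false };
  }
  return {
    output: bytes.subarray(0, maxBytes).toString("utf8"),
    truncated: true,
  };
}
```

Resolving limits once at the API boundary means the worker never has to trust values supplied by the caller.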
Build Notes
What I Owned
This project is where I practiced separating request intake from execution work so that one slow or unsafe job cannot stall the whole service path.
Hard Lesson
The important lesson was that execution platforms are mostly about isolation and backpressure; the language runner matters less than the safety boundary around it.
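One simple form of backpressure is a queue that refuses new work once it is full, so the intake path can shed load instead of buffering without bound. The sketch below is illustrative only: `BoundedQueue` and its capacity are hypothetical, and a real deployment would apply the same check against the shared queue's depth.

```javascript
// Hypothetical bounded queue: admission fails fast when capacity is reached.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
  }

  // Returns false instead of growing past capacity. An API layer could
  // translate a false result into an HTTP 429 or 503 response.
  offer(job) {
    if (this.items.length >= this.capacity) {
      return false;
    }
    this.items.push(job);
    return true;
  }

  // Remove and return the oldest job, or undefined when empty.
  take() {
    return this.items.shift();
  }

  get depth() {
    return this.items.length;
  }
}
```

Rejecting at admission keeps latency predictable for accepted jobs and leaves the retry decision with the caller.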
Next Enhancement
Next I would add a visible job timeline with queued, running, completed, failed, and DLQ states so reviewers can watch the lifecycle instead of only seeing the API response.
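A job timeline like this could be backed by a small state machine that records every transition. The sketch below is one possible shape under assumptions: the allowed transitions (for example, `failed` going either back to `queued` or to `dlq`) are my reading of the lifecycle, and `createJob` and `advance` are hypothetical helper names.

```javascript
// Assumed transition table for the five states named above.
const TRANSITIONS = {
  queued: ["running"],
  running: ["completed", "failed"],
  failed: ["queued", "dlq"], // retry, or park after the retry budget is spent
  dlq: ["queued"],           // redrive by the recovery scheduler
  completed: [],             // terminal
};

// Create a job in the initial state with a one-entry timeline.
function createJob(id, now = Date.now()) {
  return { id, state: "queued", timeline: [{ state: "queued", at: now }] };
}

// Return a new job object in the next state, or throw if the
// transition is not allowed from the current state.
function advance(job, next, now = Date.now()) {
  const allowed = TRANSITIONS[job.state] || [];
  if (!allowed.includes(next)) {
    throw new Error(`illegal transition ${job.state} -> ${next}`);
  }
  return {
    ...job,
    state: next,
    timeline: [...job.timeline, { state: next, at: now }],
  };
}
```

Because `advance` returns a new object and appends to `timeline`, the full history stays available for display or auditing.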
High-Level Architecture
```mermaid
graph LR
    Client[Web Client] --> Control[Execution Control API]
    Control --> API[Execution API Endpoint]
    API --> Queue[Execution Queue]
    Queue --> Worker[Sandboxed Workers]
    Worker --> Result[Execution Result Store]
    Result --> API
    API --> Control
```
Production-Grade Capabilities
- Asynchronous queue-worker execution model with bounded retries and durable result flow.
- Tenant-aware API boundary with sandbox and runtime guardrails.
- Terraform-managed infrastructure topology with explicit control-plane and execution-plane separation.
Engineering Decisions
- Strict sandbox limits improve safety but can reject edge-case workloads that need higher resource ceilings.
- Queue-based execution improves throughput stability, but adds extra latency compared to direct synchronous execution.
- Splitting web and API deployments improves scalability isolation, but increases operational surface area.
Behavioral + Impact Signals
- Designed around safe defaults for sandboxing and bounded retries.
- Prioritized service isolation to protect user-facing workflows from backend spikes.
- Added operational observability and auditability for execution lifecycle events.
Quality Guarantees
- Every execution request runs under enforced CPU and memory limits.
- Execution output is returned in a deterministic response format.
- Failed runs do not block subsequent queue processing.
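The last two guarantees can be illustrated with a small sketch: every run, successful or not, is mapped to the same result shape, and a failure in one job is caught so the remaining jobs still run. The field names (`jobId`, `status`, `stdout`, `error`) and the `execute` callback are assumptions for the example, not the project's actual response schema.

```javascript
// Build a result with a fixed set of fields regardless of outcome.
function toResult(jobId, outcome) {
  return {
    jobId,
    status: outcome.ok ? "completed" : "failed",
    stdout: outcome.ok ? String(outcome.stdout) : "",
    error: outcome.ok ? null : String(outcome.error),
  };
}

// Run each job in order. An exception from `execute` is converted into a
// failed result, so it never prevents later jobs from being processed.
function runAll(jobs, execute) {
  return jobs.map((job) => {
    try {
      return toResult(job.id, { ok: true, stdout: execute(job) });
    } catch (err) {
      const message = err instanceof Error ? err.message : err;
      return toResult(job.id, { ok: false, error: message });
    }
  });
}
```

Clients can then branch on `status` alone, since the other fields are always present.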
Recent Upgrades
- Introduced Terraform-governed dual-endpoint model for control plane and execution API traffic separation.
- Expanded Cloud Sandbox scope into a mini Replit/Judge0-style platform with async queue-worker execution and tenant quota controls.
- Added stronger sandbox controls: bounded runtime resources, idempotent job handling, and audit visibility.
- Clarified the live ALB endpoint as the execution API proof path and documented queue/DLQ recovery in the production upgrade log.
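Idempotent job handling and tenant quotas, both mentioned in the upgrades above, could look roughly like this. It is a hedged in-memory sketch: `JobRegistry`, the idempotency-key parameter, and the per-tenant job cap are hypothetical, and a real implementation would keep this state in a shared store.

```javascript
// Hypothetical registry: deduplicates submissions by (tenant, idempotency key)
// and enforces a simple per-tenant cap on the number of distinct jobs.
class JobRegistry {
  constructor(maxJobsPerTenant) {
    this.maxJobsPerTenant = maxJobsPerTenant;
    this.byKey = new Map();
    this.countByTenant = new Map();
    this.nextId = 1;
  }

  submit(tenantId, idempotencyKey, payload) {
    // JSON-encode the pair so that separators inside either value cannot collide.
    const key = JSON.stringify([tenantId, idempotencyKey]);

    // A repeated submission returns the original job instead of creating a new one.
    const existing = this.byKey.get(key);
    if (existing) {
      return { accepted: true, created: false, job: existing };
    }

    // New work counts against the tenant's quota.
    const used = this.countByTenant.get(tenantId) ?? 0;
    if (used >= this.maxJobsPerTenant) {
      return { accepted: false, created: false, job: null };
    }

    const job = { id: `job-${this.nextId++}`, tenantId, payload };
    this.byKey.set(key, job);
    this.countByTenant.set(tenantId, used + 1);
    return { accepted: true, created: true, job };
  }
}
```

Checking for an existing key before the quota means a client retry of an already-accepted job is never rejected for quota reasons.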
Outcome Highlights
- Upgraded deployment to separate web app and API endpoints for clearer platform architecture.
- Shipped a public cloud API endpoint for real execution requests.
- Designed for isolation-first execution behavior under backend constraints.
- Implemented execution flow with queue-worker reliability patterns.