Cloud Code Execution Environment System Design
Fault-tolerant asynchronous code execution platform designed for FinOps efficiency, queue resilience, and high-throughput payload processing.
Fargate Spot FinOps + DLQ Recovery | 15k+ Req/Min Burst Tests
Why This Project Matters
Shows SRE-first backend platform engineering where autoscaling, queue durability, and cost-efficiency are designed as first-class requirements.
Tech + Architecture Summary
- Tech: Node.js, AWS Fargate, EventBridge, Terraform (IaC), FinOps
- Architecture: ALB execution ingress -> queue and DLQ lanes -> Fargate Spot worker pool -> result store -> recovery scheduler via EventBridge.
Impact Metrics
- Architected a highly elastic worker pool on AWS Fargate Spot capacity provisioned via Terraform, reducing distributed compute costs by 70% for asynchronous payload processing.
- Engineered a self-healing queue ecosystem using Redis dead-letter queues (DLQs) and AWS EventBridge cron triggers, achieving 100% payload recovery during staged network partition drills.
- Tuned Node.js V8 garbage collection and libuv thread-pool sizing to prevent memory exhaustion during sustained 15,000+ req/min payload spikes.
Core Problem
Execute untrusted user code safely while controlling runtime limits, output size, and request-level isolation.
High-Level Architecture
```mermaid
graph LR
  Client[Web Client] --> Control[Execution Control API]
  Control --> API[Execution API Endpoint]
  API --> Queue[Execution Queue]
  Queue --> Worker[Sandboxed Workers]
  Worker --> Result[Execution Result Store]
  Result --> API
  API --> Control
```
Production-Grade Capabilities
- Asynchronous queue-worker execution model with bounded retries and durable result flow.
- Tenant-aware API boundary with sandbox hardening and runtime guardrails.
- Terraform-managed infrastructure topology with explicit control-plane and execution-plane separation.
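The bounded-retry queue-worker flow can be sketched as below; plain arrays stand in for the Redis main queue and DLQ, and the retry ceiling is an illustrative value rather than the production setting:

```javascript
// Bounded-retry queue-worker sketch: arrays model the Redis queue and DLQ.
const MAX_ATTEMPTS = 3; // assumption: illustrative retry ceiling

const queue = [];
const dlq = [];

function enqueue(payload) {
  queue.push({ payload, attempts: 0 });
}

// Pull one job; on failure, requeue until the attempt budget is spent, then
// park the job on the DLQ so failed runs never block the main queue.
function processOne(handler) {
  const job = queue.shift();
  if (!job) return null;
  try {
    return handler(job.payload);
  } catch (err) {
    job.attempts += 1;
    if (job.attempts >= MAX_ATTEMPTS) {
      dlq.push({ ...job, error: String(err) }); // recovered later by the scheduler
    } else {
      queue.push(job);
    }
    return null;
  }
}
```

Parking exhausted jobs on a DLQ is what makes the EventBridge-driven recovery lane possible: the scheduler can re-drive DLQ entries without touching the hot path.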
Engineering Decisions
- Strict sandbox limits improve safety but can reject edge-case workloads that need higher resource ceilings.
- Queue-based execution improves throughput stability, but adds extra latency compared to direct synchronous execution.
- Splitting web and API deployments improves scalability isolation, but increases operational surface area.
Behavioral + Impact Signals
- Designed around safe defaults for sandboxing and bounded retries.
- Prioritized service isolation to protect user-facing workflows from backend spikes.
- Added operational observability and auditability for execution lifecycle events.
Quality Guarantees
- Every execution request runs with bounded CPU and memory limits.
- Execution output is returned in a deterministic response format.
- Failed runs do not block subsequent queue processing.
Recent Upgrades
- Introduced Terraform-governed dual-endpoint model for control plane and execution API traffic separation.
- Expanded engine scope into a mini Replit/Judge0-style platform with async queue-worker execution and tenant quota controls.
- Added stronger sandbox controls: bounded runtime resources, idempotent job handling, and audit visibility.
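The idempotent job handling mentioned above can be sketched with a processed-key cache; a Map stands in for what would be a Redis set with TTL in production, and all names are illustrative:

```javascript
// Idempotency sketch: duplicate deliveries of the same job become no-ops
// that return the originally computed result.
const processed = new Map(); // idempotencyKey -> cached result

function handleJob(job, execute) {
  if (processed.has(job.idempotencyKey)) {
    return processed.get(job.idempotencyKey); // duplicate delivery: skip re-execution
  }
  const result = execute(job.payload);
  processed.set(job.idempotencyKey, result);
  return result;
}
```

This matters in a queue-plus-DLQ design because recovery re-drives can deliver a payload more than once; keying on an idempotency token keeps retries safe.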
Outcome Highlights
- Upgraded deployment to separate web app and API endpoints for clearer platform architecture.
- Shipped a public cloud API endpoint for real execution requests.
- Designed for isolation-first execution behavior under backend constraints.
- Implemented execution flow with queue-worker reliability patterns.