Cloud Code Execution Environment System Design

Scaling & Messaging Systems

Fault-tolerant asynchronous code execution platform designed for FinOps efficiency, queue resilience, and high-throughput payload processing.

Fargate Spot FinOps + DLQ Recovery | 15k+ Req/Min Burst Tests

Why This Project Matters

Demonstrates SRE-first backend platform engineering, where autoscaling, queue durability, and cost efficiency are treated as first-class requirements rather than afterthoughts.

Tech + Architecture Summary

  • Tech: Node.js, AWS Fargate, EventBridge, Terraform (IaC), FinOps
  • Architecture: ALB execution ingress -> queue and DLQ lanes -> Fargate Spot worker pool -> result store -> recovery scheduler via EventBridge.

Impact Metrics

  • Architected a highly elastic worker pool utilizing AWS Fargate Spot instances via Terraform, reducing distributed compute costs by 70% for asynchronous payload processing.
  • Engineered a self-healing queue ecosystem using Redis Dead Letter Queues (DLQ) and AWS EventBridge cron triggers, achieving 100% payload recovery during staged network partition drills.
  • Tuned Node.js V8 garbage collection and libuv thread-pool sizing to prevent memory leaks during sustained 15,000+ req/min payload spikes.

Core Problem

Execute untrusted user code safely while controlling runtime limits, output size, and request-level isolation.

High-Level Architecture

```mermaid
graph LR
  Client[Web Client]-->Control[Execution Control API]
  Control-->API[Execution API Endpoint]
  API-->Queue[Execution Queue]
  Queue-->Worker[Sandboxed Workers]
  Worker-->Result[Execution Result Store]
  Result-->API
  API-->Control
```
Production-Grade Capabilities

  • Asynchronous queue-worker execution model with bounded retries and durable result flow.
  • Tenant-aware API boundary with sandbox hardening and runtime guardrails.
  • Terraform-managed infrastructure topology with explicit control-plane and execution-plane separation.
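The bounded-retry path feeding the dead-letter lane can be sketched as follows, assuming a fixed attempt ceiling (`MAX_ATTEMPTS`) and an in-memory `dlq` standing in for the Redis DLQ:

```javascript
// Bounded retries: a job gets a fixed number of attempts, then is parked
// in the dead-letter lane for the EventBridge-driven recovery scheduler.
const MAX_ATTEMPTS = 3;
const dlq = [];

async function processWithRetry(job, execute) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return { status: "succeeded", result: await execute(job) };
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        dlq.push({ job, error: err.message, attempts: attempt });
        return { status: "dead-lettered" };
      }
      // otherwise fall through and retry
    }
  }
}
```

Capping attempts keeps a persistently failing payload from consuming worker capacity, while the DLQ preserves it for later replay instead of silent loss.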

Engineering Decisions

  • Strict sandbox limits improve safety but can reject edge-case workloads that need higher resource ceilings.
  • Queue-based execution stabilizes throughput under bursts, but adds latency compared to direct synchronous execution.
  • Splitting web and API deployments improves scalability isolation, but increases operational surface area.

Behavioral + Impact Signals

  • Designed around safe defaults for sandboxing and bounded retries.
  • Prioritized service isolation to protect user-facing workflows from backend spikes.
  • Added operational observability and auditability for execution lifecycle events.

Quality Guarantees

  • Every execution request runs with bounded CPU and memory limits.
  • Execution output is returned in a deterministic response format.
  • Failed runs do not block subsequent queue processing.
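The deterministic response format above can be sketched as a normalizing envelope; the field names are illustrative, not the production schema:

```javascript
// Normalize any run record into one fixed response shape, so clients
// always see the same fields regardless of how the run ended.
function toResponse(run) {
  return {
    id: run.id,
    status: run.status, // e.g. "succeeded" | "failed" | "timeout"
    stdout: run.stdout ?? "",
    stderr: run.stderr ?? "",
    exitCode: Number.isInteger(run.exitCode) ? run.exitCode : null,
  };
}
```

Normalizing at the API edge means a timeout, an OOM kill, and a clean exit all serialize identically, which keeps client-side handling simple.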

Recent Upgrades

  • Introduced Terraform-governed dual-endpoint model for control plane and execution API traffic separation.
  • Expanded engine scope into a mini Replit/Judge0-style platform with async queue-worker execution and tenant quota controls.
  • Added stronger sandbox controls: bounded runtime resources, idempotent job handling, and audit visibility.
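Idempotent job handling can be sketched with a client-supplied idempotency key; the key name and the in-memory `seen` store are assumptions standing in for a durable store:

```javascript
// Deduplicate submissions by idempotency key: a retried or replayed
// request returns the recorded result instead of re-running the job.
const seen = new Map();

function submitIdempotent(idempotencyKey, runJob) {
  if (seen.has(idempotencyKey)) {
    return seen.get(idempotencyKey); // replay: no second execution
  }
  const result = runJob();
  seen.set(idempotencyKey, result);
  return result;
}
```

This is what makes DLQ replay safe: the recovery scheduler can resubmit a payload without risking double execution.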

Outcome Highlights

  • Upgraded deployment to separate web app and API endpoints for clearer platform architecture.
  • Shipped a public cloud API endpoint for real execution requests.
  • Designed for isolation-first execution behavior under backend constraints.
  • Implemented execution flow with queue-worker reliability patterns.