Mini Load Balancer (Go) System Design
Distributed Systems & Cloud APIs
A Go load balancer hosted on a regular ECS service, migrated off AWS App Runner to gain direct control over proxy behavior, service rollout, and networking diagnostics.
App Runner -> Regular ECS | Go pprof + Consul | Prometheus/Grafana
Why This Project Matters
Shows when a networking-heavy service outgrows App Runner and needs regular ECS service-level control for proxying, health management, and deployment behavior.
Tech + Architecture Summary
- Tech: AWS ECS, Go, Consul, Prometheus, Grafana, pprof
- Architecture: miniloadbalancer.io -> ALB/TLS ingress -> regular ECS service running Go proxy + control plane -> Consul discovery -> backend pool -> Prometheus/Grafana telemetry.
Impact Metrics
- Migrated the service from AWS App Runner to regular ECS so service rollout policy, health-probe cadence, task behavior, and ingress wiring could be controlled directly.
- Profiled the runtime with Go pprof to find and eliminate memory-allocation hotspots and reduce goroutine-scheduling overhead in the high-throughput TCP proxy path.
- Integrated dynamic service discovery via Consul, enabling zero-downtime backend node registration, sub-second health-aware failover, and ECS-backed routing introspection.
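The "zero-downtime backend node registration" above boils down to swapping the routing table in one atomic step. A minimal sketch of that mechanism, assuming (hypothetically) that a watcher goroutine feeds it fresh backend sets from Consul:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// BackendPool holds the current healthy backend set. A discovery watcher
// (e.g. a Consul blocking query loop) replaces the whole slice atomically,
// so in-flight requests never observe a partially updated pool.
type BackendPool struct {
	backends atomic.Value // holds []string
}

func NewBackendPool(initial []string) *BackendPool {
	p := &BackendPool{}
	p.backends.Store(initial)
	return p
}

// Update installs a new backend set in one step: zero-downtime registration.
func (p *BackendPool) Update(addrs []string) {
	p.backends.Store(addrs)
}

// Snapshot returns the current backend set for the proxy plane to route over.
func (p *BackendPool) Snapshot() []string {
	return p.backends.Load().([]string)
}

func main() {
	pool := NewBackendPool([]string{"10.0.0.1:8080"})
	// Simulate a discovery update registering a second node.
	pool.Update([]string{"10.0.0.1:8080", "10.0.0.2:8080"})
	fmt.Println(len(pool.Snapshot())) // 2
}
```

The atomic swap is what makes registration safe without pausing the proxy: readers always see either the old set or the new set, never a mix.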
Core Problem
Route traffic predictably under backend failure while minimizing flapping, maintaining idempotent retry behavior, and preserving operational insight.
High-Level Architecture
```mermaid
graph LR
    Client[User Traffic] --> LB[Go Load Balancer]
    LB --> CP[Control Plane /admin/*]
    LB --> Proxy[Proxy Plane /proxy/*]
    Proxy --> B1[Backend A]
    Proxy --> B2[Backend B]
    Proxy --> B3[Backend C]
    LB -. Health Checks .-> B1
    LB -. Health Checks .-> B2
    LB -. Health Checks .-> B3
```
Production-Grade Capabilities
- Multiple runtime-selectable balancing strategies with control-plane visibility.
- Health-aware failover with hysteresis and graceful draining for safer lifecycle transitions.
- Regular ECS runtime with explicit service deployment policy, ingress control, and operator-visible telemetry endpoints.
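One way the "runtime-selectable balancing strategies" could be structured: a strategy interface behind a lock, so a control-plane handler (the `/admin/*` plane above) swaps the active strategy while the proxy keeps serving. A sketch under those assumptions, with `RoundRobin` standing in for the full strategy set:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Strategy picks a backend for the next request. Each balancing mode
// (round-robin, least-connections, consistent hash, ...) implements it.
type Strategy interface {
	Pick(backends []string) string
}

// RoundRobin cycles through the backend list.
type RoundRobin struct{ n uint64 }

func (r *RoundRobin) Pick(backends []string) string {
	i := atomic.AddUint64(&r.n, 1) - 1
	return backends[int(i)%len(backends)]
}

// Balancer holds the active strategy behind an RWMutex so a control-plane
// endpoint can swap it at runtime without stalling the proxy plane.
type Balancer struct {
	mu       sync.RWMutex
	strategy Strategy
}

func (b *Balancer) SetStrategy(s Strategy) {
	b.mu.Lock()
	b.strategy = s
	b.mu.Unlock()
}

func (b *Balancer) Pick(backends []string) string {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.strategy.Pick(backends)
}

func main() {
	b := &Balancer{}
	b.SetStrategy(&RoundRobin{})
	backends := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	fmt.Println(b.Pick(backends), b.Pick(backends)) // alternates backends
}
```

Because the strategy is a single swappable value, the control plane can also report which strategy is active, which is what makes the switch observable.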
Engineering Decisions
- Consistent hashing improves stickiness and cache locality, but can create uneven load if key distribution is skewed.
- Aggressive health probing catches failures quickly, but risks false flaps without hysteresis thresholds.
- Retry with failover improves success rates, but must stay bounded to avoid amplifying tail latency.
- Regular ECS restores deeper service and networking control, but adds task definitions, deployment orchestration, and load balancer/service wiring overhead.
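The consistent-hashing trade-off in the first bullet is usually mitigated with virtual nodes: each backend is hashed onto the ring many times, smoothing skewed key distributions at the cost of a larger ring. A minimal stdlib sketch (FNV-1a hash and replica count are illustrative choices, not necessarily what this project uses):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring with virtual nodes. A key maps
// to the first ring point clockwise from its hash, so adding or removing
// one backend only remaps the keys that land on its points.
type Ring struct {
	points []uint32          // sorted hash points
	owner  map[uint32]string // hash point -> backend
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places each backend on the ring `replicas` times; more replicas
// give a more even split when the key distribution is skewed.
func NewRing(backends []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, b := range backends {
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s#%d", b, i))
			r.points = append(r.points, p)
			r.owner[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Get returns the backend that owns key: stickiness falls out of the fact
// that the same key always hashes to the same ring point.
func (r *Ring) Get(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"backend-a", "backend-b", "backend-c"}, 50)
	fmt.Println(ring.Get("user42") == ring.Get("user42")) // true: sticky
}
```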
Behavioral + Impact Signals
- Designed bounded retry logic to avoid runaway failure loops.
- Added graceful draining and hysteresis to reduce operational flapping.
- Prioritized introspection endpoints for easier incident investigation.
- Accepted more orchestration ownership when the workload demanded lower-level control than App Runner exposed.
Quality Guarantees
- Only healthy backends receive routed traffic, except during an explicit graceful-drain transition.
- Idempotent request retries remain bounded and never recurse indefinitely.
- Routing strategy switches are observable through control-plane and metrics endpoints.
Recent Upgrades
- Migrated the public deployment from AWS App Runner to regular ECS and cut over the live endpoint to miniloadbalancer.io.
- Added reliability mechanisms including circuit breaker, bounded retries, and health-check hysteresis.
- Added Consul service-discovery integration and operator-facing dashboard/control surface.
Outcome Highlights
- Implemented three routing strategies with runtime switch support.
- Added circuit breaker, active health checks, graceful draining, and failover mechanics.
- Cut over the public deployment to miniloadbalancer.io on a regular ECS-backed runtime, with its operational telemetry publicly visible as evidence.