Mini Load Balancer (Go) System Design

Back to Portfolio

Distributed Systems & Cloud APIs

A Go load balancer migrated off AWS App Runner onto a regular ECS service to gain direct control over proxy behavior, service rollouts, and networking diagnostics.

App Runner -> Regular ECS | Go pprof + Consul | Prometheus/Grafana

Why This Project Matters

Shows when a networking-heavy service outgrows App Runner and needs regular ECS service-level control for proxying, health management, and deployment behavior.

Tech + Architecture Summary

  • Tech: AWS ECS, Go, Consul, Prometheus, Grafana, pprof
  • Architecture: miniloadbalancer.io -> ALB/TLS ingress -> regular ECS service running Go proxy + control plane -> Consul discovery -> backend pool -> Prometheus/Grafana telemetry.

Impact Metrics

  • Migrated the service from AWS App Runner to regular ECS so service rollout policy, health-probe cadence, task behavior, and ingress wiring could be controlled directly.
  • Profiled the runtime with Go pprof to identify and eliminate memory-allocation bottlenecks and tune goroutine usage for high-throughput TCP proxying.
  • Integrated dynamic service discovery via Consul, enabling zero-downtime backend node registration, sub-second health-aware failover, and ECS-backed routing introspection.

Core Problem

Route traffic predictably under backend failure while minimizing flapping, maintaining idempotent retry behavior, and preserving operational insight.

High-Level Architecture

```mermaid
graph LR
  Client[User Traffic]-->LB[Go Load Balancer]
  LB-->CP[Control Plane /admin/*]
  LB-->Proxy[Proxy Plane /proxy/*]
  Proxy-->B1[Backend A]
  Proxy-->B2[Backend B]
  Proxy-->B3[Backend C]
  LB-.Health Checks.->B1
  LB-.Health Checks.->B2
  LB-.Health Checks.->B3
```

Production-Grade Capabilities

  • Multiple runtime-selectable balancing strategies with control-plane visibility.
  • Health-aware failover with hysteresis and graceful draining for safer lifecycle transitions.
  • Regular ECS runtime with explicit service deployment policy, ingress control, and operator-visible telemetry endpoints.
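Runtime-selectable strategies hinge on swapping the active strategy without a restart. A minimal sketch, assuming a round-robin default and a hash-based sticky mode, with the strategy name stored atomically so the control plane can flip it mid-traffic (the `Pool` type and strategy names are illustrative):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

// Pool routes picks through whichever strategy name is currently stored.
// atomic.Value lets the control plane swap strategies without locking the
// proxy's hot path.
type Pool struct {
	backends []string
	strategy atomic.Value // "round_robin" or "hash"
	rr       atomic.Uint64
}

func NewPool(backends []string) *Pool {
	p := &Pool{backends: backends}
	p.strategy.Store("round_robin")
	return p
}

func (p *Pool) SetStrategy(name string) { p.strategy.Store(name) }

func (p *Pool) Pick(key string) string {
	switch p.strategy.Load().(string) {
	case "hash": // sticky: the same key always maps to the same backend
		h := fnv.New32a()
		h.Write([]byte(key))
		return p.backends[int(h.Sum32())%len(p.backends)]
	default: // round_robin: rotate through the pool
		n := p.rr.Add(1) - 1
		return p.backends[int(n%uint64(len(p.backends)))]
	}
}

func main() {
	p := NewPool([]string{"a", "b", "c"})
	fmt.Println(p.Pick("k"), p.Pick("k")) // a b
	p.SetStrategy("hash")
	fmt.Println(p.Pick("k") == p.Pick("k")) // true
}
```

Exposing the stored strategy name through a control-plane endpoint is what makes the switch observable, per the guarantee under Quality Guarantees.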

Engineering Decisions

  • Consistent hashing improves stickiness and cache locality, but can create uneven load if key distribution is skewed.
  • Aggressive health probing catches failures quickly, but risks false flaps without hysteresis thresholds.
  • Retry with failover improves success rates, but must stay bounded to avoid amplifying tail latency.
  • Regular ECS restores deeper service and networking control, but adds task definitions, deployment orchestration, and load balancer/service wiring overhead.

Behavioral + Impact Signals

  • Designed bounded retry logic to avoid runaway failure loops.
  • Added graceful draining and hysteresis to reduce operational flapping.
  • Prioritized introspection endpoints for easier incident investigation.
  • Accepted more orchestration ownership when the workload demanded lower-level control than App Runner exposed.

Quality Guarantees

  • Only healthy backends receive routed traffic, unless a backend is explicitly in a graceful-drain transition.
  • Idempotent request retries remain bounded and never recurse indefinitely.
  • Routing strategy switches are observable through control-plane and metrics endpoints.

Recent Upgrades

  • Migrated the public deployment from AWS App Runner to regular ECS and cut over the live endpoint to miniloadbalancer.io.
  • Added reliability mechanisms including circuit breaker, bounded retries, and health-check hysteresis.
  • Added Consul service-discovery integration and operator-facing dashboard/control surface.

Outcome Highlights

  • Implemented three routing strategies with runtime switch support.
  • Added circuit breaker, active health checks, graceful draining, and failover mechanics.
  • Cut over the public deployment to miniloadbalancer.io on a regular ECS-backed runtime with publicly visible operational evidence.