Edge Balancer (Go) System Design

Distributed Systems & Cloud APIs

A Go-based edge balancer hosted on regular ECS, migrated off AWS App Runner to gain direct control over proxy behavior, service rollout, and networking diagnostics.

App Runner -> Regular ECS | Go pprof + Consul | Prometheus/Grafana


Why This Project Matters

Shows when an edge-routing service outgrows App Runner and needs regular ECS service-level control for proxying, health management, and deployment behavior.

Tech + Architecture Summary

Tech: AWS ECS, Go, Consul, Prometheus, Grafana, pprof
Architecture: miniloadbalancer.io -> ALB/TLS ingress -> regular ECS service running Go proxy + control plane -> Consul discovery -> backend pool -> Prometheus/Grafana telemetry.

Impact Metrics

Migrated the service from AWS App Runner to regular ECS so service rollout policy, health-probe cadence, task behavior, and ingress wiring could be controlled directly.
Profiled the running service with Go pprof to find and remove memory-allocation bottlenecks and to tune goroutine usage for high-throughput TCP proxying.
Integrated dynamic service discovery via Consul, enabling zero-downtime backend node registration, sub-second health-aware failover, and ECS-backed routing introspection.
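
As an illustration of the profiling setup described above, here is a minimal sketch of exposing pprof on a dedicated admin mux using only the standard library. The function names `newAdminMux` and `statusFor` are hypothetical, and this is not necessarily how the project wires its own endpoints; the point is only that profiling handlers can live on a mux that the public proxy listener never serves.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newAdminMux returns a mux that serves only profiling endpoints.
// Keeping it separate from the proxy mux means profiles are never
// reachable through the public listener.
func newAdminMux() *http.ServeMux {
	mux := http.NewServeMux()
	// pprof.Index also serves the named profiles (heap, goroutine, allocs, ...).
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor issues an in-process GET against a handler and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newAdminMux()
	fmt.Println(statusFor(mux, "/debug/pprof/heap?debug=1")) // heap profile is served
	fmt.Println(statusFor(mux, "/proxy/anything"))           // proxy paths are not
}
```

In a real deployment the mux would typically be bound to an internal address (for example `127.0.0.1:6060`, an assumed port) and inspected with `go tool pprof http://127.0.0.1:6060/debug/pprof/heap`.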

Core Problem

Route traffic predictably under backend failure while minimizing flapping, maintaining idempotent retry behavior, and preserving operational insight.

Build Notes

What I Owned

This is my networking fundamentals project: I wanted a Go service small enough to explain line by line, yet realistic enough to discuss health checks, retries, draining, and observability.

Hard Lesson

The main lesson was that an edge balancer is dangerous if failure handling is too aggressive; bounded retries and hysteresis matter because naive retries can amplify an outage.

Next Enhancement

Next I would publish a short traffic replay demo that compares round robin, least connections, and consistent hashing under the same backend failure scenario.

High-Level Architecture

mermaid
graph LR
  Client[User Traffic]-->LB[Edge Balancer]
  LB-->CP[Control Plane /admin/*]
  LB-->Proxy[Proxy Plane /proxy/*]
  Proxy-->B1[Backend A]
  Proxy-->B2[Backend B]
  Proxy-->B3[Backend C]
  LB-.Health Checks.->B1
  LB-.Health Checks.->B2
  LB-.Health Checks.->B3

Production-Grade Capabilities

Multiple runtime-selectable balancing strategies with control-plane visibility.
Health-aware failover with hysteresis and graceful draining for safer lifecycle transitions.
Regular ECS runtime with explicit service deployment policy, ingress control, and operator-visible telemetry endpoints.
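
The hysteresis mentioned above can be sketched as a small state tracker. This is a simplified illustration under assumed thresholds (`rise` of 2, `fall` of 3); the type name `healthTracker` is hypothetical and the project's real thresholds are not stated here.

```go
package main

import "fmt"

// healthTracker flips state only after a run of consecutive probe results
// that contradict the current state, so a single slow or dropped probe
// does not cause a flap.
type healthTracker struct {
	rise    int  // consecutive successes needed to mark a backend healthy
	fall    int  // consecutive failures needed to mark a backend unhealthy
	healthy bool // current published state
	streak  int  // length of the current run that contradicts healthy
}

// observe records one probe result and returns the (possibly updated) state.
func (h *healthTracker) observe(ok bool) bool {
	if ok == h.healthy {
		h.streak = 0 // result agrees with current state: reset the run
		return h.healthy
	}
	h.streak++
	threshold := h.fall
	if ok {
		threshold = h.rise
	}
	if h.streak >= threshold {
		h.healthy = ok
		h.streak = 0
	}
	return h.healthy
}

func main() {
	h := &healthTracker{rise: 2, fall: 3, healthy: true}
	probes := []bool{false, false, true, false, false, false, true, true}
	for _, ok := range probes {
		fmt.Print(h.observe(ok), " ")
	}
	fmt.Println()
	// prints: true true true true true false false true
}
```

Two failures followed by a success leave the backend in rotation; only three failures in a row remove it, and it takes two successes in a row to bring it back.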

Engineering Decisions

Consistent hashing improves stickiness and cache locality, but can create uneven load if key distribution is skewed.
Aggressive health probing catches failures quickly, but risks false flaps without hysteresis thresholds.
Retry with failover improves success rates, but must stay bounded to avoid amplifying tail latency.
Regular ECS restores deeper service and networking control, but adds task definitions, deployment orchestration, and Edge Balancer service-wiring overhead.

Behavioral + Impact Signals

Designed bounded retry logic to avoid runaway failure loops.
Added graceful draining and hysteresis to reduce operational flapping.
Prioritized introspection endpoints for easier incident investigation.
Accepted more orchestration ownership when the workload demanded lower-level control than App Runner exposed.

Quality Guarantees

Only healthy backends receive routed traffic, except backends that are explicitly in a drain-aware transition.
Idempotent request retries remain bounded and never recurse indefinitely.
Routing strategy switches are observable through control-plane and metrics endpoints.

Recent Upgrades

Migrated the public deployment from AWS App Runner to regular ECS and cut over the live endpoint to miniloadbalancer.io.
Added reliability mechanisms including circuit breaker, bounded retries, and health-check hysteresis.
Added Consul service-discovery integration and operator-facing dashboard/control surface.
Documented the migration as a workload-fit decision: regular ECS for networking-heavy proxy control, not a blanket rejection of App Runner.

Outcome Highlights

Implemented three routing strategies with runtime switch support.
Added circuit breaker, active health checks, graceful draining, and failover mechanics.
Cut over the public deployment to miniloadbalancer.io on a regular ECS-backed runtime with public operations evidence.
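
One way a runtime strategy switch like the one above could be structured is a small interface guarded by a read-write lock. The sketch below shows two of the three strategies (round robin and least connections); all type names and the strategy labels `round_robin` and `least_connections` are hypothetical, not the project's actual identifiers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type backend struct {
	name   string
	active atomic.Int64 // in-flight requests
}

// strategy picks one backend from a non-empty pool.
type strategy interface {
	name() string
	pick(pool []*backend) *backend
}

type roundRobin struct{ next atomic.Uint64 }

func (r *roundRobin) name() string { return "round_robin" }
func (r *roundRobin) pick(pool []*backend) *backend {
	return pool[(r.next.Add(1)-1)%uint64(len(pool))]
}

type leastConn struct{}

func (leastConn) name() string { return "least_connections" }
func (leastConn) pick(pool []*backend) *backend {
	best := pool[0]
	for _, b := range pool[1:] {
		if b.active.Load() < best.active.Load() {
			best = b
		}
	}
	return best
}

// balancer holds the current strategy; the lock makes swaps safe while
// requests are being routed.
type balancer struct {
	mu   sync.RWMutex
	cur  strategy
	pool []*backend
}

func (lb *balancer) setStrategy(s strategy) {
	lb.mu.Lock()
	lb.cur = s
	lb.mu.Unlock()
}

func (lb *balancer) strategyName() string {
	lb.mu.RLock()
	defer lb.mu.RUnlock()
	return lb.cur.name()
}

func (lb *balancer) pick() *backend {
	lb.mu.RLock()
	s, pool := lb.cur, lb.pool
	lb.mu.RUnlock()
	if len(pool) == 0 {
		return nil
	}
	return s.pick(pool)
}

func main() {
	a, b, c := &backend{name: "a"}, &backend{name: "b"}, &backend{name: "c"}
	lb := &balancer{cur: &roundRobin{}, pool: []*backend{a, b, c}}
	fmt.Println(lb.strategyName(), lb.pick().name, lb.pick().name, lb.pick().name)

	a.active.Store(5)
	b.active.Store(1)
	c.active.Store(3)
	lb.setStrategy(leastConn{}) // switched at runtime, no restart
	fmt.Println(lb.strategyName(), lb.pick().name)
}
```

Exposing `strategyName` through a control-plane endpoint or a metric label is what would make a switch observable, in line with the quality guarantees listed earlier.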