Incident Architecture Breakdown
Go Load Balancer Failure Handling: Circuit Breakers, Hysteresis, and Bounded Retries
A breakdown of how I hardened a Go load balancer against backend flapping with health-aware routing and controlled retry behavior.
1. Hook and Stakes
Basic round-robin routing looked correct while every node was healthy, but degraded quickly once backend health began to oscillate.
Without stable failure handling, transient outages amplify into retry storms and destroy tail latency on production traffic paths.
2. Architecture Diagram
A dual-plane design routes user traffic through proxy logic while exposing an admin control plane for strategy and health inspection.
```mermaid
graph LR
  Client[Incoming Traffic] --> LB[Go Load Balancer]
  LB --> Proxy[Proxy Plane]
  LB --> Admin[Control Plane /admin/*]
  Proxy --> B1[Backend A]
  Proxy --> B2[Backend B]
  Proxy --> B3[Backend C]
  LB -. Health Checks .-> B1
  LB -. Health Checks .-> B2
  LB -. Health Checks .-> B3
```
- Runtime-selectable routing (round robin, least connections, consistent hashing)
- Active health checks with hysteresis thresholds
- Circuit breaker with bounded retries
- Metrics endpoints for routing + backend health state visibility
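Runtime-selectable routing can be modeled as a strategy interface behind a lock, so the admin plane can swap the algorithm while the proxy plane keeps serving. This is a minimal sketch, not the project's actual code: the `Backend`, `Strategy`, and `Selector` names are illustrative, and consistent hashing is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Backend is a hypothetical backend record; field names are illustrative.
type Backend struct {
	Addr        string
	ActiveConns int64
}

// Strategy picks the next backend; implementations are swappable at runtime.
type Strategy interface {
	Next(backends []*Backend) *Backend
}

// RoundRobin cycles through backends with an atomic counter.
type RoundRobin struct{ n uint64 }

func (r *RoundRobin) Next(bs []*Backend) *Backend {
	i := atomic.AddUint64(&r.n, 1)
	return bs[i%uint64(len(bs))]
}

// LeastConnections picks the backend with the fewest active connections.
type LeastConnections struct{}

func (LeastConnections) Next(bs []*Backend) *Backend {
	best := bs[0]
	for _, b := range bs[1:] {
		if atomic.LoadInt64(&b.ActiveConns) < atomic.LoadInt64(&best.ActiveConns) {
			best = b
		}
	}
	return best
}

// Selector holds the active strategy behind an RWMutex so the admin
// plane can swap it while the proxy plane keeps routing.
type Selector struct {
	mu sync.RWMutex
	s  Strategy
}

func (sel *Selector) Set(s Strategy) { sel.mu.Lock(); sel.s = s; sel.mu.Unlock() }

func (sel *Selector) Next(bs []*Backend) *Backend {
	sel.mu.RLock()
	defer sel.mu.RUnlock()
	return sel.s.Next(bs)
}

func main() {
	bs := []*Backend{{Addr: "a:80"}, {Addr: "b:80"}, {Addr: "c:80"}}
	sel := &Selector{s: &RoundRobin{}}
	fmt.Println(sel.Next(bs).Addr, sel.Next(bs).Addr)
	sel.Set(LeastConnections{})
	fmt.Println(sel.Next(bs).Addr)
}
```

The key design point is that strategy selection is a data-plane read (`RLock`) and strategy replacement a control-plane write (`Lock`), so switching algorithms at runtime never blocks in-flight routing for long.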
3. Stress Test and Breaking Point
Setup: I injected backend instability while replaying concurrent requests across all routing strategies.
Failure Signal: Without hysteresis and bounded retry controls, backends repeatedly flipped state and created noisy failover loops.
- Circuit-breaker + hysteresis rules reduced backend flapping during instability windows.
- Bounded retries prevented recursive retry amplification under partial outage.
- Routing strategy visibility through metrics endpoints made failure behavior debuggable during load tests.
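The bounded-retry behavior can be sketched as a small wrapper: retries are hard-capped per request and only attempted for failures classified as transient, so one client request can never fan out into an unbounded retry storm. The function and error names here are illustrative assumptions, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient marks failures worth retrying; the name is illustrative.
var errTransient = errors.New("transient backend failure")

// doWithRetries runs fn at most maxAttempts times, retrying only
// transient errors. The hard cap is what prevents recursive retry
// amplification during a partial outage.
func doWithRetries(maxAttempts int, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if !errors.Is(err, errTransient) {
			return err // permanent error: fail fast, no retry
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := doWithRetries(3, func() error {
		calls++
		return errTransient // backend never recovers in this run
	})
	fmt.Println(calls, err != nil) // attempts stop at the cap
}
```

Distinguishing transient from permanent errors matters as much as the cap: retrying a request that can never succeed burns backend capacity exactly when it is scarcest.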
4. Bottleneck Root Cause and Resolution
Root Cause: Health checks were too eager and retries were too permissive, causing transient backend failures to propagate as system-wide instability.
Resolution: I added health-check hysteresis, explicit circuit-breaker state transitions, and retry bounds so failover remains controlled and observable.
- Conservative circuit-breaker thresholds reduce flapping but can delay re-entry for recovered nodes.
- Retry limits protect latency tails but can reduce best-effort success rate for borderline requests.
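Health-check hysteresis itself reduces to two consecutive-result counters with separate up and down thresholds, which is exactly the trade-off the first bullet describes: raising the thresholds suppresses flapping but delays both ejection and re-entry. A minimal sketch, with illustrative names and threshold values:

```go
package main

import "fmt"

// HealthTracker applies hysteresis to raw probe results: a backend must
// fail downAfter consecutive probes to be marked down, and pass upAfter
// consecutive probes to be marked up again. Thresholds are illustrative.
type HealthTracker struct {
	healthy   bool
	successes int
	failures  int
	upAfter   int
	downAfter int
}

// Observe records one probe result and returns the (possibly unchanged)
// health state. A single flapped probe no longer flips routing decisions.
func (h *HealthTracker) Observe(ok bool) bool {
	if ok {
		h.successes++
		h.failures = 0
		if !h.healthy && h.successes >= h.upAfter {
			h.healthy = true
		}
	} else {
		h.failures++
		h.successes = 0
		if h.healthy && h.failures >= h.downAfter {
			h.healthy = false
		}
	}
	return h.healthy
}

func main() {
	h := &HealthTracker{healthy: true, upAfter: 3, downAfter: 3}
	fmt.Println(h.Observe(false)) // one flap: still healthy
	h.Observe(false)
	fmt.Println(h.Observe(false)) // three in a row: marked down
	h.Observe(true)
	h.Observe(true)
	fmt.Println(h.Observe(true)) // sustained recovery: marked up
}
```

Resetting the opposite counter on every observation is the detail that makes this hysteresis rather than a moving average: only sustained runs in one direction change state.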
5. Business Impact
- Improved service continuity under backend degradation scenarios.
- Reduced incident triage time through explicit control-plane and metrics evidence.
- Demonstrated production-style systems thinking relevant to infra and platform teams.