Incident Architecture Breakdown
Go Load Balancer Failure Handling: Circuit Breakers, Hysteresis, and Bounded Retries
A breakdown of how I hardened a Go load balancer against backend flapping with health-aware routing and controlled retry behavior.
1. Hook and Stakes
Basic round-robin routing looked correct while every node was healthy, but degraded quickly once backend health began to oscillate.
Without stable failure handling, transient outages amplify into retry storms and destroy tail latency on production traffic paths.
2. Architecture Diagram
A dual-plane design routes user traffic through proxy logic while exposing an admin control plane for strategy and health inspection.
```mermaid
graph LR
  Client[Incoming Traffic] --> LB[Go Load Balancer]
  LB --> Proxy[Proxy Plane]
  LB --> Admin[Control Plane /admin/*]
  Proxy --> B1[Backend A]
  Proxy --> B2[Backend B]
  Proxy --> B3[Backend C]
  LB -. Health Checks .-> B1
  LB -. Health Checks .-> B2
  LB -. Health Checks .-> B3
```
- Runtime-selectable routing (round robin, least connections, consistent hashing)
- Active health checks with hysteresis thresholds
- Circuit breaker with bounded retries
- Metrics endpoints for routing + backend health state visibility
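Runtime-selectable routing can be modeled as a strategy interface behind a lock, so the admin plane can swap the algorithm while the proxy plane keeps serving. This is a minimal sketch, not the project's actual code: the `Backend`, `Strategy`, and `Selector` names are illustrative, and consistent hashing is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Backend is a hypothetical backend record; field names are illustrative.
type Backend struct {
	Addr        string
	ActiveConns int64
}

// Strategy picks the next backend; implementations are swappable at runtime.
type Strategy interface {
	Next(backends []*Backend) *Backend
}

// RoundRobin cycles through backends with an atomic counter.
type RoundRobin struct{ n uint64 }

func (r *RoundRobin) Next(bs []*Backend) *Backend {
	i := atomic.AddUint64(&r.n, 1)
	return bs[i%uint64(len(bs))]
}

// LeastConnections picks the backend with the fewest active connections.
type LeastConnections struct{}

func (LeastConnections) Next(bs []*Backend) *Backend {
	best := bs[0]
	for _, b := range bs[1:] {
		if atomic.LoadInt64(&b.ActiveConns) < atomic.LoadInt64(&best.ActiveConns) {
			best = b
		}
	}
	return best
}

// Selector holds the active strategy behind an RWMutex so the admin
// plane can swap it while the proxy plane keeps routing.
type Selector struct {
	mu sync.RWMutex
	s  Strategy
}

func (sel *Selector) Set(s Strategy) { sel.mu.Lock(); sel.s = s; sel.mu.Unlock() }

func (sel *Selector) Next(bs []*Backend) *Backend {
	sel.mu.RLock()
	defer sel.mu.RUnlock()
	return sel.s.Next(bs)
}

func main() {
	bs := []*Backend{{Addr: "a:80"}, {Addr: "b:80"}, {Addr: "c:80"}}
	sel := &Selector{s: &RoundRobin{}}
	fmt.Println(sel.Next(bs).Addr, sel.Next(bs).Addr)
	sel.Set(LeastConnections{})
	fmt.Println(sel.Next(bs).Addr)
}
```

The key design point is that strategy selection is a data-plane read (`RLock`) and strategy replacement a control-plane write (`Lock`), so switching algorithms at runtime never blocks in-flight routing for long.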
3. Stress Test and Breaking Point
Setup: I injected backend instability while replaying concurrent requests across all routing strategies.
Failure Signal: Without hysteresis and bounded retry controls, backends repeatedly flipped state and created noisy failover loops.
- Circuit-breaker + hysteresis rules reduced backend flapping during instability windows.
- Bounded retries prevented recursive retry amplification under partial outage.
- Routing strategy visibility through metrics endpoints made failure behavior debuggable during load tests.
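The bounded-retry behavior can be sketched as a small wrapper: retries are hard-capped per request and only attempted for failures classified as transient, so one client request can never fan out into an unbounded retry storm. The function and error names here are illustrative assumptions, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient marks failures worth retrying; the name is illustrative.
var errTransient = errors.New("transient backend failure")

// doWithRetries runs fn at most maxAttempts times, retrying only
// transient errors. The hard cap is what prevents recursive retry
// amplification during a partial outage.
func doWithRetries(maxAttempts int, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if !errors.Is(err, errTransient) {
			return err // permanent error: fail fast, no retry
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := doWithRetries(3, func() error {
		calls++
		return errTransient // backend never recovers in this run
	})
	fmt.Println(calls, err != nil) // attempts stop at the cap
}
```

Distinguishing transient from permanent errors matters as much as the cap: retrying a request that can never succeed burns backend capacity exactly when it is scarcest.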
4. Bottleneck Root Cause and Resolution
Root Cause: Health checks were too eager and retries were too permissive, causing transient backend failures to propagate as system-wide instability.
Resolution: I added health-check hysteresis, explicit circuit-breaker state transitions, and retry bounds so failover remains controlled and observable.
- Conservative circuit-breaker thresholds reduce flapping but can delay re-entry for recovered nodes.
- Retry limits protect latency tails but can reduce best-effort success rate for borderline requests.
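Health-check hysteresis itself reduces to two consecutive-result counters with separate up and down thresholds, which is exactly the trade-off the first bullet describes: raising the thresholds suppresses flapping but delays both ejection and re-entry. A minimal sketch, with illustrative names and threshold values:

```go
package main

import "fmt"

// HealthTracker applies hysteresis to raw probe results: a backend must
// fail downAfter consecutive probes to be marked down, and pass upAfter
// consecutive probes to be marked up again. Thresholds are illustrative.
type HealthTracker struct {
	healthy   bool
	successes int
	failures  int
	upAfter   int
	downAfter int
}

// Observe records one probe result and returns the (possibly unchanged)
// health state. A single flapped probe no longer flips routing decisions.
func (h *HealthTracker) Observe(ok bool) bool {
	if ok {
		h.successes++
		h.failures = 0
		if !h.healthy && h.successes >= h.upAfter {
			h.healthy = true
		}
	} else {
		h.failures++
		h.successes = 0
		if h.healthy && h.failures >= h.downAfter {
			h.healthy = false
		}
	}
	return h.healthy
}

func main() {
	h := &HealthTracker{healthy: true, upAfter: 3, downAfter: 3}
	fmt.Println(h.Observe(false)) // one flap: still healthy
	h.Observe(false)
	fmt.Println(h.Observe(false)) // three in a row: marked down
	h.Observe(true)
	h.Observe(true)
	fmt.Println(h.Observe(true)) // sustained recovery: marked up
}
```

Resetting the opposite counter on every observation is the detail that makes this hysteresis rather than a moving average: only sustained runs in one direction change state.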
5. Business Impact
- Improved service continuity under backend degradation scenarios.
- Reduced incident triage time through explicit control-plane and metrics evidence.
- Demonstrated production-style systems thinking relevant to infra and platform teams.