Distributed Uptime Monitoring System
NetPulse
Queue-based uptime monitoring system with retry logic, regional workers, PgBouncer connection pooling, mTLS worker communication, and incident lifecycle controls.
Simple Architecture
API Layer → Queue → Regional Worker Pool → PgBouncer + PostgreSQL + Redis → Alerting + Dashboard
Queue-based architecture protects the database during burst traffic and lets worker throughput scale independently from the dashboard/API layer.
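One way to picture how the queue shields the database: workers push results into a bounded buffer, and a writer drains them in batches, so a burst turns into a few bulk writes instead of thousands of single-row inserts. The sketch below is illustrative Python using an in-process `queue.Queue`. The names `CheckResult` and `drain_batch`, the field names, and the batch size are assumptions made for the example, not NetPulse's actual queue technology or schema.

```python
import queue
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckResult:
    # Hypothetical shape of one uptime check result.
    monitor_id: str
    region: str
    ok: bool
    latency_ms: float


def drain_batch(q: "queue.Queue[CheckResult]", max_batch: int) -> List[CheckResult]:
    """Pull up to max_batch results off the queue without blocking."""
    batch: List[CheckResult] = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break  # queue is empty; return whatever was collected
    return batch


def drain_all(q: "queue.Queue[CheckResult]", max_batch: int) -> List[List[CheckResult]]:
    """Drain the queue into batches; each batch stands in for one bulk database write."""
    batches: List[List[CheckResult]] = []
    while True:
        batch = drain_batch(q, max_batch)
        if not batch:
            return batches
        batches.append(batch)
```

With 250 queued results and a batch size of 100, `drain_all` yields three batches, so the database would see three writes rather than 250. The bounded `maxsize` on the queue is what provides backpressure when workers outpace the writer.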
Metrics
Load-tested with 10,000+ concurrent regional worker writes
P95 check-to-dashboard latency held under 45 ms
PgBouncer added to prevent PostgreSQL connection exhaustion
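For readers unfamiliar with the P95 figure: it is the latency that 95% of samples fall at or below. A minimal sketch of the nearest-rank calculation follows. The function name `percentile_nearest_rank` and the sample values in the usage note are made up for illustration and are not NetPulse measurements.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], percent: int) -> float:
    """Return the nearest-rank percentile (percent is an integer from 1 to 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of percent/100 * n, which avoids floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For twenty hypothetical samples of 1 ms through 20 ms, the rank is 19, so the P95 is 19 ms: one slow outlier above that rank would not move the figure.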
What Broke
- PostgreSQL connection exhaustion under burst traffic
- Duplicate incident alerts during retry windows
- Noisy single-sample failures creating low-trust alerts
Fixes
- PgBouncer connection pooling
- Queue-based ingestion with retry/debounce controls
- Incident lifecycle rules with alert deduplication
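The debounce and deduplication fixes above can be sketched as a small state machine: an incident opens only after several consecutive failures, and while it is open, further failures do not raise new alerts. This is a minimal illustration, not the NetPulse implementation. The class names and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

FAILURE_THRESHOLD = 3   # assumed: consecutive failures needed to open an incident
RECOVERY_THRESHOLD = 2  # assumed: consecutive successes needed to resolve it


@dataclass
class MonitorState:
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    incident_open: bool = False


class IncidentTracker:
    """Tracks per-monitor state and emits at most one alert per lifecycle transition."""

    def __init__(self) -> None:
        self.states: Dict[str, MonitorState] = {}
        self.alerts: List[str] = []

    def record(self, monitor_id: str, ok: bool) -> None:
        state = self.states.setdefault(monitor_id, MonitorState())
        if ok:
            state.consecutive_failures = 0
            state.consecutive_successes += 1
            if state.incident_open and state.consecutive_successes >= RECOVERY_THRESHOLD:
                state.incident_open = False
                self.alerts.append("RESOLVED " + monitor_id)
        else:
            state.consecutive_successes = 0
            state.consecutive_failures += 1
            # Debounce: a single failed sample is not enough to alert.
            # Dedup: once an incident is open, later failures add no alerts.
            if not state.incident_open and state.consecutive_failures >= FAILURE_THRESHOLD:
                state.incident_open = True
                self.alerts.append("OPENED " + monitor_id)
```

Feeding the sequence fail, ok, fail, fail, fail, fail, fail, ok, ok for one monitor produces exactly two alerts: one when the third consecutive failure arrives and one when the second consecutive success arrives. The isolated first failure and the repeated failures during the open incident stay silent.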
NetPulse Phase 2-3 Upgrade Track
Phase 2
Implemented: Production Onboarding + Reviewer-Safe Demo Access
Turned NetPulse from an architecture demo into a reviewable SaaS workflow with Cognito registration, email verification, login, and demo-safe access paths.
- Added tenant-aware registration and login flow so the system can be evaluated like a real monitoring product.
- Introduced read-only demo API behavior for external reviewers without exposing privileged user actions.
- Expanded deployment and integration validation around the onboarding and monitoring paths.
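The read-only demo behavior described above could be enforced with a guard along these lines: demo principals may only use HTTP methods that do not change state. This is a hedged sketch. `Principal`, its `is_demo` flag, and `is_allowed` are hypothetical names, and how NetPulse actually derives the demo flag from a Cognito session is not shown here.

```python
from dataclasses import dataclass

# HTTP methods that do not modify server state.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass(frozen=True)
class Principal:
    # Hypothetical view of the authenticated caller.
    tenant_id: str
    is_demo: bool


def is_allowed(principal: Principal, method: str) -> bool:
    """Demo principals get read-only access; regular users are not restricted here."""
    if principal.is_demo:
        return method.upper() in SAFE_METHODS
    return True
```

Checking the method in one place, before any handler runs, keeps the rule from depending on each endpoint remembering to test for demo mode.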
Phase 3
Next Build Target: Real-User Evidence + Public Status Page Upgrade
Moves NetPulse from staged validation language toward verifiable live-product evidence through public status pages, timestamped uptime summaries, and safer demo-mode boundaries.
- Expose public status pages that hide private owner data while showing uptime, incidents, and last-check time.
- Keep demo mode isolated so reviewers can inspect behavior without mutating privileged tenant data.
- Differentiate real usage evidence from staged JMeter/load-test validation in portfolio copy.
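Since the status-page work is a planned target, the following is only a sketch of one way to hide private owner data: build the public view from an allowlist of fields, so anything not explicitly listed never reaches the page. All field names here are assumptions made for illustration, not the NetPulse data model.

```python
from typing import Any, Dict

# Assumed public fields: uptime, incidents, and last-check time, plus a display name.
PUBLIC_FIELDS = ("name", "uptime_percent", "open_incidents", "last_checked_at")


def to_public_status(monitor: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; owner and tenant details are dropped by default."""
    return {key: monitor[key] for key in PUBLIC_FIELDS if key in monitor}
```

An allowlist fails safe: if a new private column such as a hypothetical `owner_email` is added to the monitor record later, it stays off the public page until someone deliberately adds it to `PUBLIC_FIELDS`.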
Build Notes
What I Owned
I use NetPulse as my main proof-of-work because it connects monitoring, queueing, pooling, authentication, and incident UX into one system instead of a single isolated demo.
Hard Lesson
The main lesson was that alerting is a trust problem: retry windows, debounce rules, and incident lifecycle state matter as much as detecting downtime.
Next Enhancement
Next I would add anonymized real-user uptime checks and a public status-page demo so the evidence shifts from staged validation to production usage.
Engineering Decisions
| Decision | Reason |
|---|---|
| Queue-based ingestion | Protect database during burst traffic |
| PgBouncer pooling | Avoid connection exhaustion |
| mTLS checker traffic | Secure worker-to-engine communication |
| Alert debouncing | Reduce duplicate incident noise |
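For the mTLS row in the table above, the essential server-side setting is that the engine requires and verifies a client certificate from each worker. A minimal sketch using Python's standard `ssl` module follows. The function names and file-path parameters are hypothetical, and NetPulse's actual TLS stack and certificate layout are not described in this write-up.

```python
import ssl


def new_engine_tls_context() -> ssl.SSLContext:
    """Server-side context that refuses peers without a valid client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Requiring a client certificate is what makes the TLS session mutual.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_identity(ctx: ssl.SSLContext, cert_file: str, key_file: str, worker_ca_file: str) -> None:
    """Load the engine's own certificate and the CA used to verify worker certificates."""
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=worker_ca_file)
```

Workers would use a matching client-side context that loads their own certificate and trusts the engine's CA, so each side authenticates the other before any check result is sent.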