NetPulse: Distributed Uptime Monitoring SaaS System Design
Distributed Systems & Cloud APIs
Zero-trust distributed uptime monitoring platform with hardened database pooling and secure inter-region checker communication.
PgBouncer + mTLS | 10k+ Regional Write Spike Validation
Why This Project Matters
Demonstrates a secure, high-concurrency monitoring architecture with reliability controls that remain stable under aggressive regional write spikes.
Tech + Architecture Summary
- Tech: Next.js, Node.js, PostgreSQL, PgBouncer, mTLS, Docker
- Architecture: mTLS regional checkers -> queue -> monitoring engine -> PgBouncer + Postgres/Redis -> status dashboard + incident lifecycle.
Impact Metrics
- Implemented PgBouncer for advanced PostgreSQL connection pooling, preventing database connection exhaustion during 10,000+ concurrent regional worker write load tests.
- Enforced Zero-Trust architecture by establishing Mutual TLS (mTLS) encryption between distributed regional checkers and the centralized monitoring engine.
- Maintained P95 check-to-dashboard update latency under 45 ms during aggressive JMeter staging validation runs.
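The pooling setup above can be sketched as a PgBouncer configuration. This is an illustrative `pgbouncer.ini` fragment, not the production file: the database name, file paths, and pool sizes are assumptions.

```ini
; Illustrative pgbouncer.ini sketch — names and pool sizes are assumptions.
[databases]
netpulse = host=127.0.0.1 port=5432 dbname=netpulse

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Transaction pooling lets thousands of short-lived checker writes share a
; small number of real Postgres connections, preventing connection exhaustion.
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
reserve_pool_size = 10
server_idle_timeout = 60
```

With `pool_mode = transaction`, a server connection is returned to the pool after each transaction, which is what lets `max_client_conn` greatly exceed `default_pool_size` during write spikes.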
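The mTLS enforcement between checkers and the engine can be sketched with Node's `tls` options. This is a minimal sketch, assuming certs issued by an internal CA; the function names and the idea of passing key material as buffers are illustrative, not the project's actual API.

```typescript
import * as tls from "node:tls";

// Engine side: present the engine's own cert and require every regional
// checker to present a certificate signed by the internal CA.
export function engineTlsOptions(key: Buffer, cert: Buffer, ca: Buffer): tls.TlsOptions {
  return {
    key,
    cert,
    ca,                        // trust only the internal CA, not system roots
    requestCert: true,         // demand a client certificate (the "mutual" in mTLS)
    rejectUnauthorized: true,  // refuse checkers whose cert fails CA validation
  };
}

// Checker side: present a client certificate and verify the engine's
// certificate against the same internal CA.
export function checkerTlsOptions(key: Buffer, cert: Buffer, ca: Buffer): tls.ConnectionOptions {
  return { key, cert, ca, rejectUnauthorized: true };
}
```

The key properties are `requestCert` plus `rejectUnauthorized` on the engine side: without both, a checker without a valid client certificate could still connect.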
Core Problem
Deliver trustworthy uptime checks and incident alerts without flooding users with false positives or delayed notifications.
High-Level Architecture
```mermaid
graph TD
    Checkers[Regional Check Workers] --> Queue[Job Queue]
    Queue --> Engine[Monitoring Engine]
    Engine --> Store[(PostgreSQL)]
    Engine --> Cache[(Redis)]
    Store --> Dashboard[Status Dashboard]
    Engine --> Alerts[Incident Alerts]
```
Production-Grade Capabilities
- Tenant-aware authentication and onboarding via Cognito registration + login.
- Public demo-safe read-only API mode for external review without privileged credentials.
- Multi-region probe workflow with incident lifecycle and alert deduplication controls.
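The read-only demo mode above can be sketched as an HTTP method guard. This is a hedged sketch: the function names, the Express-style middleware shape, and the choice of safe methods are assumptions, not the project's actual implementation.

```typescript
// In demo mode, only non-mutating HTTP methods are allowed through.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

export function isAllowedInDemoMode(method: string): boolean {
  return SAFE_METHODS.has(method.toUpperCase());
}

// Express-style middleware factory (shape is illustrative): blocks writes
// with 403 when demo mode is on, otherwise passes the request along.
export function demoModeGuard(demoMode: boolean) {
  return (
    req: { method: string },
    res: { status(code: number): { json(body: unknown): void } },
    next: () => void,
  ) => {
    if (demoMode && !isAllowedInDemoMode(req.method)) {
      res.status(403).json({ error: "Demo mode is read-only" });
      return;
    }
    next();
  };
}
```

Gating at the middleware layer means external reviewers can exercise every read endpoint without tenant credentials while all mutating routes stay closed.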
Engineering Decisions
- Shorter check intervals improve freshness but can increase noise, so I added retry windows and alert debouncing before incident creation.
- Persisting complete check history improves debugging, but storage grows quickly; I retained detailed recent logs and summarized older events into aggregates.
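The retry-window-plus-debounce gate described above can be sketched as a consecutive-failure counter: an incident opens only after N consecutive failed checks, so a single flaky probe never pages anyone. The class name and the threshold of 3 are assumptions for illustration.

```typescript
type CheckResult = { monitorId: string; ok: boolean };

export class IncidentGate {
  private failures = new Map<string, number>();

  constructor(private readonly threshold = 3) {}

  // Returns true exactly once, when the failure streak reaches the threshold
  // — that is the moment to create an incident.
  record(result: CheckResult): boolean {
    if (result.ok) {
      this.failures.set(result.monitorId, 0); // any success resets the window
      return false;
    }
    const streak = (this.failures.get(result.monitorId) ?? 0) + 1;
    this.failures.set(result.monitorId, streak);
    return streak === this.threshold; // fire once, not on every later failure
  }
}
```

With a threshold of 3, the first two failures return `false`, the third returns `true`, and further failures stay `false` until a success resets the streak.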
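The retention tradeoff above — detailed recent logs, summarized older events — can be sketched as a roll-up pass. Names, types, and the daily-bucket granularity here are illustrative, not the production schema.

```typescript
type CheckEvent = { monitorId: string; ts: number; ok: boolean }; // ts = epoch ms
type DailySummary = { monitorId: string; day: string; total: number; failures: number };

// Events at or after `cutoff` are kept verbatim; older events are rolled up
// into per-monitor, per-day counts.
export function summarizeOldEvents(
  events: CheckEvent[],
  cutoff: number,
): { recent: CheckEvent[]; summaries: DailySummary[] } {
  const recent: CheckEvent[] = [];
  const buckets = new Map<string, DailySummary>();
  for (const e of events) {
    if (e.ts >= cutoff) {
      recent.push(e); // keep detailed recent logs untouched
      continue;
    }
    const day = new Date(e.ts).toISOString().slice(0, 10); // e.g. "2024-05-01"
    const key = `${e.monitorId}|${day}`;
    const s = buckets.get(key) ?? { monitorId: e.monitorId, day, total: 0, failures: 0 };
    s.total += 1;
    if (!e.ok) s.failures += 1;
    buckets.set(key, s);
  }
  return { recent, summaries: [...buckets.values()] };
}
```

A summary row is a few dozen bytes regardless of how many raw events it replaces, which keeps long-term storage roughly proportional to monitors × days rather than to check frequency.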
Behavioral + Impact Signals
- Built failure-handling logic with retries and alert debouncing.
- Documented reliability/cost tradeoffs for retention and probing intervals.
- Shipped live system plus docs so reviewers can validate claims quickly.
Quality Guarantees
- Each check result is timestamped, attributable, and persisted.
- Incident state changes are auditable from detection to resolution.
- Alert pipelines avoid duplicate notifications for the same incident window.
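The duplicate-suppression guarantee above can be sketched as a per-incident rate limiter: at most one notification per incident per suppression window. The class name and the default 15-minute window are assumptions.

```typescript
export class AlertDeduper {
  private lastSent = new Map<string, number>();

  constructor(private readonly windowMs = 15 * 60_000) {}

  // Returns true if the alert should be sent now, false if it falls inside
  // the suppression window of a previously sent alert for the same incident.
  shouldSend(incidentId: string, now: number): boolean {
    const last = this.lastSent.get(incidentId);
    if (last !== undefined && now - last < this.windowMs) return false;
    this.lastSent.set(incidentId, now);
    return true;
  }
}
```

Keying on the incident ID (rather than the monitor or check) is what lets repeated failed checks within one incident window collapse into a single notification.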
Recent Upgrades
- Added dedicated registration with Cognito email verification and full login flow for production-style onboarding.
- Introduced public read-only demo API mode so external reviewers can evaluate functionality without tenant credentials.
- Expanded integration test + deployment workflows for stronger end-to-end operational confidence.
Outcome Highlights
- Shipped a production deployment and public source code for external technical review.
- Implemented status dashboards that let users track service health over time instead of isolated check events.
- Designed alert flow with retry/debounce behavior to reduce noisy false alarms.