NetPulse: Distributed Uptime Monitoring SaaS System Design
Distributed Systems & Cloud APIs
Zero-trust distributed uptime monitoring platform with hardened database pooling and secure inter-region checker communication.
PgBouncer + mTLS | 10k+ Regional Write Spike Validation
Why This Project Matters
Demonstrates a secure, high-concurrency monitoring architecture with reliability controls that remain stable under aggressive regional write spikes.
Tech + Architecture Summary
- Tech: Next.js, Node.js, PostgreSQL, PgBouncer, mTLS, Docker
- Architecture: mTLS regional checkers -> queue -> monitoring engine -> PgBouncer + Postgres/Redis -> status dashboard + incident lifecycle.
Impact Metrics
- Implemented PgBouncer for advanced PostgreSQL connection pooling, preventing database connection exhaustion during 10,000+ concurrent regional worker write load tests.
- Enforced Zero-Trust architecture by establishing Mutual TLS (mTLS) encryption between distributed regional checkers and the centralized monitoring engine.
- Maintained P95 check-to-dashboard update latency under 45 ms during aggressive JMeter staging validation runs.
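The pooling setup above can be sketched as a PgBouncer configuration. This is an illustrative `pgbouncer.ini` fragment, not the production file: the database name, file paths, and pool sizes are assumptions.

```ini
; Illustrative pgbouncer.ini sketch — names and pool sizes are assumptions.
[databases]
netpulse = host=127.0.0.1 port=5432 dbname=netpulse

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Transaction pooling lets thousands of short-lived checker writes share a
; small number of real Postgres connections, preventing connection exhaustion.
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
reserve_pool_size = 10
server_idle_timeout = 60
```

With `pool_mode = transaction`, a server connection is returned to the pool after each transaction, which is what lets `max_client_conn` greatly exceed `default_pool_size` during write spikes.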
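The mTLS enforcement between checkers and the engine can be sketched with Node's `tls` options. This is a minimal sketch, assuming certs issued by an internal CA; the function names and the idea of passing key material as buffers are illustrative, not the project's actual API.

```typescript
import * as tls from "node:tls";

// Engine side: present the engine's own cert and require every regional
// checker to present a certificate signed by the internal CA.
export function engineTlsOptions(key: Buffer, cert: Buffer, ca: Buffer): tls.TlsOptions {
  return {
    key,
    cert,
    ca,                        // trust only the internal CA, not system roots
    requestCert: true,         // demand a client certificate (the "mutual" in mTLS)
    rejectUnauthorized: true,  // refuse checkers whose cert fails CA validation
  };
}

// Checker side: present a client certificate and verify the engine's
// certificate against the same internal CA.
export function checkerTlsOptions(key: Buffer, cert: Buffer, ca: Buffer): tls.ConnectionOptions {
  return { key, cert, ca, rejectUnauthorized: true };
}
```

The key properties are `requestCert` plus `rejectUnauthorized` on the engine side: without both, a checker without a valid client certificate could still connect.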
Core Problem
Deliver trustworthy uptime checks and incident alerts without flooding users with false positives or delayed notifications.
High-Level Architecture
```mermaid
graph TD
    Checkers[Regional Check Workers] --> Queue[Job Queue]
    Queue --> Engine[Monitoring Engine]
    Engine --> Store[(PostgreSQL)]
    Engine --> Cache[(Redis)]
    Store --> Dashboard[Status Dashboard]
    Engine --> Alerts[Incident Alerts]
```
Production-Grade Capabilities
- Tenant-aware authentication and onboarding via Cognito registration + login.
- Public demo-safe read-only API mode for external review without privileged credentials.
- Multi-region probe workflow with incident lifecycle and alert deduplication controls.
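The read-only demo mode above can be sketched as an HTTP method guard. This is a hedged sketch: the function names, the Express-style middleware shape, and the choice of safe methods are assumptions, not the project's actual implementation.

```typescript
// In demo mode, only non-mutating HTTP methods are allowed through.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

export function isAllowedInDemoMode(method: string): boolean {
  return SAFE_METHODS.has(method.toUpperCase());
}

// Express-style middleware factory (shape is illustrative): blocks writes
// with 403 when demo mode is on, otherwise passes the request along.
export function demoModeGuard(demoMode: boolean) {
  return (
    req: { method: string },
    res: { status(code: number): { json(body: unknown): void } },
    next: () => void,
  ) => {
    if (demoMode && !isAllowedInDemoMode(req.method)) {
      res.status(403).json({ error: "Demo mode is read-only" });
      return;
    }
    next();
  };
}
```

Gating at the middleware layer means external reviewers can exercise every read endpoint without tenant credentials while all mutating routes stay closed.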
Engineering Decisions
- Shorter check intervals improve freshness but can increase noise, so I added retry windows and alert debouncing before incident creation.
- Persisting complete check history improves debugging, but storage grows quickly; I retained detailed recent logs and summarized older events into aggregates.
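The retry-window-plus-debounce gate described above can be sketched as a consecutive-failure counter: an incident opens only after N consecutive failed checks, so a single flaky probe never pages anyone. The class name and the threshold of 3 are assumptions for illustration.

```typescript
type CheckResult = { monitorId: string; ok: boolean };

export class IncidentGate {
  private failures = new Map<string, number>();

  constructor(private readonly threshold = 3) {}

  // Returns true exactly once, when the failure streak reaches the threshold
  // — that is the moment to create an incident.
  record(result: CheckResult): boolean {
    if (result.ok) {
      this.failures.set(result.monitorId, 0); // any success resets the window
      return false;
    }
    const streak = (this.failures.get(result.monitorId) ?? 0) + 1;
    this.failures.set(result.monitorId, streak);
    return streak === this.threshold; // fire once, not on every later failure
  }
}
```

With a threshold of 3, the first two failures return `false`, the third returns `true`, and further failures stay `false` until a success resets the streak.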
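The retention tradeoff above — detailed recent logs, summarized older events — can be sketched as a roll-up pass. Names, types, and the daily-bucket granularity here are illustrative, not the production schema.

```typescript
type CheckEvent = { monitorId: string; ts: number; ok: boolean }; // ts = epoch ms
type DailySummary = { monitorId: string; day: string; total: number; failures: number };

// Events at or after `cutoff` are kept verbatim; older events are rolled up
// into per-monitor, per-day counts.
export function summarizeOldEvents(
  events: CheckEvent[],
  cutoff: number,
): { recent: CheckEvent[]; summaries: DailySummary[] } {
  const recent: CheckEvent[] = [];
  const buckets = new Map<string, DailySummary>();
  for (const e of events) {
    if (e.ts >= cutoff) {
      recent.push(e); // keep detailed recent logs untouched
      continue;
    }
    const day = new Date(e.ts).toISOString().slice(0, 10); // e.g. "2024-05-01"
    const key = `${e.monitorId}|${day}`;
    const s = buckets.get(key) ?? { monitorId: e.monitorId, day, total: 0, failures: 0 };
    s.total += 1;
    if (!e.ok) s.failures += 1;
    buckets.set(key, s);
  }
  return { recent, summaries: [...buckets.values()] };
}
```

A summary row is a few dozen bytes regardless of how many raw events it replaces, which keeps long-term storage roughly proportional to monitors × days rather than to check frequency.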
Behavioral + Impact Signals
- Built failure-handling logic with retries and alert debouncing.
- Documented reliability/cost tradeoffs for retention and probing intervals.
- Shipped live system plus docs so reviewers can validate claims quickly.
Quality Guarantees
- Each check result is timestamped, attributable, and persisted.
- Incident state changes are auditable from detection to resolution.
- Alert pipelines avoid duplicate notifications for the same incident window.
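The duplicate-suppression guarantee above can be sketched as a per-incident rate limiter: at most one notification per incident per suppression window. The class name and the default 15-minute window are assumptions.

```typescript
export class AlertDeduper {
  private lastSent = new Map<string, number>();

  constructor(private readonly windowMs = 15 * 60_000) {}

  // Returns true if the alert should be sent now, false if it falls inside
  // the suppression window of a previously sent alert for the same incident.
  shouldSend(incidentId: string, now: number): boolean {
    const last = this.lastSent.get(incidentId);
    if (last !== undefined && now - last < this.windowMs) return false;
    this.lastSent.set(incidentId, now);
    return true;
  }
}
```

Keying on the incident ID (rather than the monitor or check) is what lets repeated failed checks within one incident window collapse into a single notification.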
Recent Upgrades
- Added dedicated registration with Cognito email verification and full login flow for production-style onboarding.
- Introduced public read-only demo API mode so external reviewers can evaluate functionality without tenant credentials.
- Expanded integration test + deployment workflows for stronger end-to-end operational confidence.
Outcome Highlights
- Shipped a production deployment and public source code for external technical review.
- Implemented status dashboards that let users track service health over time instead of isolated check events.
- Designed alert flow with retry/debounce behavior to reduce noisy false alarms.