NetPulse: Distributed Uptime Monitoring SaaS System Design

Distributed Systems & Cloud APIs

Zero-trust distributed uptime monitoring platform with hardened database pooling and secure inter-region checker communication.

PgBouncer + mTLS | 10k+ Regional Write Spike Validation

Why This Project Matters

Demonstrates a secure, high-concurrency monitoring architecture whose reliability controls stay stable under aggressive regional write spikes.

Tech + Architecture Summary

  • Tech: Next.js, Node.js, PostgreSQL, PgBouncer, Redis, Cognito, mTLS, Docker
  • Architecture: mTLS regional checkers -> queue -> monitoring engine -> PgBouncer + Postgres/Redis -> status dashboard + incident lifecycle.

Impact Metrics

  • Implemented PgBouncer connection pooling in front of PostgreSQL, which prevented database connection exhaustion during load tests with 10,000+ concurrent regional worker writes.
  • Enforced a zero-trust posture by requiring mutual TLS (mTLS) authentication and encryption between distributed regional checkers and the centralized monitoring engine.
  • Held P95 check-to-dashboard update latency under 45 ms during JMeter staging validation runs.
  • Expanded the product story beyond staged load validation with Phase 2 onboarding/demo-access work and a Phase 3 real-user evidence plan.
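
To make the mTLS bullet above concrete, here is a minimal sketch of how the monitoring engine could require client certificates from regional checkers. The option names (`key`, `cert`, `ca`, `requestCert`, `rejectUnauthorized`, `minVersion`) mirror Node.js TLS server options; the `MtlsServerOptions` interface, the `buildMtlsServerOptions` and `isAllowedChecker` helpers, and the `checker.<region>` naming scheme are assumptions for illustration, not NetPulse's actual code.

```typescript
// Hypothetical option shape; field names mirror Node.js tls/https server options.
interface MtlsServerOptions {
  key: string;                 // engine private key (PEM)
  cert: string;                // engine certificate (PEM)
  ca: string[];                // CA certificate(s) that sign checker client certs
  requestCert: boolean;        // ask every connecting client for a certificate
  rejectUnauthorized: boolean; // refuse clients whose cert does not chain to `ca`
  minVersion: "TLSv1.2" | "TLSv1.3";
}

// Build server options that make the TLS handshake mutual.
function buildMtlsServerOptions(pem: {
  key: string;
  cert: string;
  checkerCa: string;
}): MtlsServerOptions {
  return {
    key: pem.key,
    cert: pem.cert,
    ca: [pem.checkerCa],
    requestCert: true,
    rejectUnauthorized: true,
    minVersion: "TLSv1.2",
  };
}

// Assumed extra check after the handshake: the verified certificate's
// common name must follow "checker.<region>" with an allowlisted region.
function isAllowedChecker(
  subjectCommonName: string,
  allowedRegions: ReadonlySet<string>,
): boolean {
  const prefix = "checker.";
  if (!subjectCommonName.startsWith(prefix)) return false;
  return allowedRegions.has(subjectCommonName.slice(prefix.length));
}
```

In a Node.js service, an object like this would typically be passed to `https.createServer(options, handler)`, and the common name could be read from `req.socket.getPeerCertificate().subject.CN` inside the handler. The allowlist step is a design choice: a valid certificate proves identity, while the region check decides whether that identity may submit results.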

Core Problem

Deliver trustworthy uptime checks and incident alerts without flooding users with false positives or delayed notifications.

Build Notes

What I Owned

I use NetPulse as my main proof-of-work because it connects monitoring, queueing, pooling, authentication, and incident UX into one system instead of a single isolated demo.

Hard Lesson

The main lesson was that alerting is a trust problem: retry windows, debounce rules, and incident lifecycle state matter as much as detecting downtime.
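
As an illustration of the retry-window and debounce idea, the sketch below opens an incident only after a monitor fails a set number of times within a time window. The `FailureDebouncer` class, its threshold, and its window length are hypothetical; the source does not state the actual rules NetPulse uses.

```typescript
type CheckResult = { monitorId: string; ok: boolean; at: number }; // at = epoch ms

class FailureDebouncer {
  private readonly minFailures: number; // hypothetical threshold
  private readonly windowMs: number;    // hypothetical retry window
  private readonly streaks = new Map<string, { count: number; firstFailureAt: number }>();

  constructor(minFailures: number, windowMs: number) {
    this.minFailures = minFailures;
    this.windowMs = windowMs;
  }

  // Returns true exactly once per failure streak: when the threshold is reached.
  record(result: CheckResult): boolean {
    if (result.ok) {
      // Any success clears the streak, so a single blip never alerts.
      this.streaks.delete(result.monitorId);
      return false;
    }
    const prev = this.streaks.get(result.monitorId);
    // Continue the streak only if it started inside the window; otherwise restart.
    const streak =
      prev !== undefined && result.at - prev.firstFailureAt <= this.windowMs
        ? { count: prev.count + 1, firstFailureAt: prev.firstFailureAt }
        : { count: 1, firstFailureAt: result.at };
    this.streaks.set(result.monitorId, streak);
    return streak.count === this.minFailures;
  }
}
```

With `new FailureDebouncer(3, 60000)`, the first two failures for a monitor return `false`, the third returns `true`, and further failures in the same streak return `false` so the same outage does not reopen an incident.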

Next Enhancement

Next I would add anonymized real-user uptime checks and a public status-page demo so the evidence shifts from staged validation to production usage.

High-Level Architecture

mermaid
graph TD
  Checkers[Regional Check Workers]-->Queue[Job Queue]
  Queue-->Engine[Monitoring Engine]
  Engine-->Store[(PostgreSQL)]
  Engine-->Cache[(Redis)]
  Store-->Dashboard[Status Dashboard]
  Engine-->Alerts[Incident Alerts]

Production-Grade Capabilities

  • Tenant-aware authentication and onboarding via Cognito registration + login.
  • Public demo-safe read-only API mode for external review without privileged credentials.
  • Multi-region probe workflow with incident lifecycle and alert deduplication controls.
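
The read-only demo mode above can be pictured as a small authorization rule applied before any route handler runs. This is a sketch under assumptions: the `Principal` shape, the `authorize` function, and the set of methods treated as safe are illustrative and not taken from the NetPulse codebase.

```typescript
// Hypothetical caller identities: a signed-in tenant user or an anonymous demo reviewer.
type Principal = { kind: "tenant"; tenantId: string } | { kind: "demo" };

type Decision =
  | { allowed: true }
  | { allowed: false; status: 401 | 403; reason: string };

// HTTP methods that do not mutate state.
const READ_ONLY_METHODS = new Set<string>(["GET", "HEAD", "OPTIONS"]);

function authorize(method: string, principal: Principal | null): Decision {
  if (principal === null) {
    return { allowed: false, status: 401, reason: "authentication required" };
  }
  // Demo principals may read, but every mutating method is rejected.
  if (principal.kind === "demo" && !READ_ONLY_METHODS.has(method.toUpperCase())) {
    return { allowed: false, status: 403, reason: "demo mode is read-only" };
  }
  return { allowed: true };
}
```

Checking the method centrally, rather than inside each handler, keeps a newly added write endpoint closed to demo callers by default.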

Engineering Decisions

  • Shorter check intervals improve freshness but can increase noise, so I added retry windows and alert debouncing before incident creation.
  • Persisting complete history improves debugging, but storage grows quickly, so I retained detailed recent logs and summarized older events.
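
The retention tradeoff above could look roughly like the following: recent check records stay as-is, while older ones are folded into hourly summaries per monitor. The record shapes, the hourly bucket size, and the `compactHistory` name are assumptions chosen for the example; the source does not say how NetPulse summarizes old events.

```typescript
type CheckRecord = { monitorId: string; at: number; ok: boolean; latencyMs: number };
type HourlySummary = {
  monitorId: string;
  hourStart: number; // epoch ms, start of the hour bucket
  checks: number;
  failures: number;
  avgLatencyMs: number;
};

const HOUR_MS = 60 * 60 * 1000;

// Keep records newer than `keepDetailedMs`; roll everything older into hourly summaries.
function compactHistory(
  records: CheckRecord[],
  now: number,
  keepDetailedMs: number,
): { recent: CheckRecord[]; summaries: HourlySummary[] } {
  const cutoff = now - keepDetailedMs;
  const recent: CheckRecord[] = [];
  const buckets = new Map<
    string,
    { monitorId: string; hourStart: number; checks: number; failures: number; latencySum: number }
  >();

  for (const r of records) {
    if (r.at >= cutoff) {
      recent.push(r);
      continue;
    }
    const hourStart = Math.floor(r.at / HOUR_MS) * HOUR_MS;
    const key = `${r.monitorId}:${hourStart}`;
    const b =
      buckets.get(key) ??
      { monitorId: r.monitorId, hourStart, checks: 0, failures: 0, latencySum: 0 };
    b.checks += 1;
    if (!r.ok) b.failures += 1;
    b.latencySum += r.latencyMs;
    buckets.set(key, b);
  }

  const summaries = Array.from(buckets.values()).map((b) => ({
    monitorId: b.monitorId,
    hourStart: b.hourStart,
    checks: b.checks,
    failures: b.failures,
    avgLatencyMs: b.latencySum / b.checks,
  }));
  return { recent, summaries };
}
```

Summaries keep the counts needed to recompute uptime for old periods, which is usually what a dashboard needs once per-check detail is no longer useful for debugging.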

Behavioral + Impact Signals

  • Built failure-handling logic with retries and alert debouncing.
  • Documented reliability/cost tradeoffs for retention and probing intervals.
  • Shipped a live system plus docs so reviewers can validate claims quickly.

Quality Guarantees

  • Each check result is timestamped, attributable, and persisted.
  • Incident state changes are auditable from detection to resolution.
  • Alert pipelines avoid duplicate notifications for the same incident window.
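
One way to picture the last two guarantees is an incident object that records every state change and allows at most one notification per state. The state names (`detected`, `confirmed`, `resolved`), the allowed transitions, and the `Incident` class are assumptions for this sketch; the source only says that changes are auditable from detection to resolution.

```typescript
type IncidentState = "detected" | "confirmed" | "resolved";

// Hypothetical transition table: which states may follow each state.
const NEXT_STATES: Record<IncidentState, IncidentState[]> = {
  detected: ["confirmed", "resolved"],
  confirmed: ["resolved"],
  resolved: [],
};

type AuditEntry = { incidentId: string; from: IncidentState | null; to: IncidentState; at: number };

class Incident {
  readonly id: string;
  readonly audit: AuditEntry[] = [];
  private current: IncidentState = "detected";
  private readonly notified = new Set<IncidentState>();

  constructor(id: string, at: number) {
    this.id = id;
    this.audit.push({ incidentId: id, from: null, to: "detected", at });
  }

  get state(): IncidentState {
    return this.current;
  }

  // Apply a legal transition and append it to the audit trail.
  transition(to: IncidentState, at: number): void {
    if (!NEXT_STATES[this.current].includes(to)) {
      throw new Error(`illegal transition ${this.current} -> ${to}`);
    }
    this.audit.push({ incidentId: this.id, from: this.current, to, at });
    this.current = to;
  }

  // True only the first time a notification is requested for the current state.
  shouldNotify(): boolean {
    if (this.notified.has(this.current)) return false;
    this.notified.add(this.current);
    return true;
  }
}
```

Because the audit array is append-only and each entry carries `from`, `to`, and a timestamp, the full path of an incident can be replayed after the fact.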

Recent Upgrades

  • Added dedicated registration with Cognito email verification and full login flow for production-style onboarding.
  • Introduced public read-only demo API mode so external reviewers can evaluate functionality without tenant credentials.
  • Expanded integration-test and deployment workflows for stronger end-to-end operational confidence.
  • Linked NetPulse into the cross-project production architecture upgrade log so pooling, mTLS, queueing, and incident lifecycle decisions are easier to verify.

Phase Improvements

Phase 2

Implemented

Production Onboarding + Reviewer-Safe Demo Access

Strengthened NetPulse from an architecture demo into a reviewable SaaS workflow with Cognito registration, email verification, login, and demo-safe access paths.

  • Added tenant-aware registration and login flow so the system can be evaluated like a real monitoring product.
  • Introduced read-only demo API behavior for external reviewers without exposing privileged user actions.
  • Expanded deployment and integration validation around the onboarding and monitoring paths.

Phase 3

Next Build Target

Real-User Evidence + Public Status Page Upgrade

Moves NetPulse from staged validation language toward verifiable live-product evidence through public status pages, timestamped uptime summaries, and safer demo-mode boundaries.

  • Expose public status pages that hide private owner data while showing uptime, incidents, and last-check time.
  • Keep demo mode isolated so reviewers can inspect behavior without mutating privileged tenant data.
  • Differentiate real usage evidence from staged JMeter/load-test validation in portfolio copy.
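
Since Phase 3 is still a plan, the following is only one possible shape for the "hide private owner data" requirement: build the public view from an explicit allowlist of fields instead of deleting private ones. The `MonitorRecord` and `PublicStatus` types and the `toPublicStatus` function are hypothetical.

```typescript
// Hypothetical internal record, including fields that must never be public.
type MonitorRecord = {
  id: string;
  name: string;
  url: string;
  tenantId: string;
  ownerEmail: string;
  lastCheckAt: number; // epoch ms
  checks: number;
  failures: number;
  openIncidents: number;
};

// Hypothetical public view: only uptime, incidents, and last-check time.
type PublicStatus = {
  name: string;
  uptimePercent: number | null; // null until at least one check has run
  openIncidents: number;
  lastCheckAt: string; // ISO 8601 timestamp
};

function toPublicStatus(m: MonitorRecord): PublicStatus {
  const uptime = m.checks === 0 ? null : ((m.checks - m.failures) / m.checks) * 100;
  // Copy fields one by one so a new private field is never exposed by accident.
  return {
    name: m.name,
    uptimePercent: uptime === null ? null : Math.round(uptime * 100) / 100,
    openIncidents: m.openIncidents,
    lastCheckAt: new Date(m.lastCheckAt).toISOString(),
  };
}
```

An allowlist fails safe: adding a column to the internal record changes nothing on the public page until someone deliberately maps it.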

Outcome Highlights

  • Shipped a production deployment and public source code for external technical review.
  • Implemented status dashboards that let users track service health over time instead of isolated check events.
  • Designed the alert flow with retry/debounce behavior to reduce noisy false alarms.