Distributed Systems & Cloud APIs
Zero-trust distributed uptime monitoring platform with hardened database pooling and secure inter-region checker communication.
PgBouncer + mTLS | 10k+ Regional Write Spike Validation
Why This Project Matters
Demonstrates a secure, high-concurrency monitoring architecture whose reliability controls stay stable under heavy regional write spikes.
Tech + Architecture Summary
- Tech: Next.js, Node.js, PostgreSQL, PgBouncer, Redis, Cognito, mTLS, Docker
- Architecture: mTLS regional checkers -> queue -> monitoring engine -> PgBouncer + Postgres/Redis -> status dashboard + incident lifecycle.
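The mTLS hop between regional checkers and the monitoring engine could look roughly like the sketch below on the engine side. The option names mirror those accepted by Node's `tls.createServer` and `https.createServer`; the helper names `buildMtlsServerOptions` and `isAllowedChecker`, and the idea of an allowlist keyed by certificate common name, are assumptions for illustration and not the project's actual code.

```typescript
// Field names mirror the options accepted by Node's tls.createServer /
// https.createServer. The interface is declared locally so the sketch
// stays self-contained.
interface MtlsServerOptions {
  key: string; // engine private key (PEM)
  cert: string; // engine certificate (PEM)
  ca: string[]; // CA(s) trusted to sign checker certificates
  requestCert: boolean; // ask every connecting client for a certificate
  rejectUnauthorized: boolean; // refuse clients whose certificate fails verification
  minVersion: "TLSv1.2" | "TLSv1.3";
}

// Hypothetical helper: builds engine-side options that require client certificates.
function buildMtlsServerOptions(
  key: string,
  cert: string,
  checkerCa: string
): MtlsServerOptions {
  return {
    key,
    cert,
    ca: [checkerCa],
    requestCert: true,
    rejectUnauthorized: true,
    minVersion: "TLSv1.2",
  };
}

// Hypothetical second gate: a valid certificate is not enough, the checker's
// common name must also be on an explicit allowlist of known regions.
function isAllowedChecker(
  commonName: string | undefined,
  allowed: ReadonlySet<string>
): boolean {
  return commonName !== undefined && allowed.has(commonName);
}
```

With both `requestCert` and `rejectUnauthorized` set, a checker without a certificate signed by the trusted CA is dropped during the handshake, before any application code runs.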
Impact Metrics
- Implemented PgBouncer connection pooling in front of PostgreSQL, preventing database connection exhaustion during load tests with 10,000+ concurrent regional worker writes.
- Enforced a zero-trust posture by requiring mutual TLS (mTLS) between distributed regional checkers and the centralized monitoring engine, so each side authenticates the other over an encrypted channel.
- Kept P95 check-to-dashboard update latency under 45 ms during JMeter staging validation runs.
- Expanded the product story beyond staged load validation with Phase 2 onboarding/demo-access work and a Phase 3 real-user evidence plan.
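To illustrate why a pooler prevents connection exhaustion, here is a small in-process model of the idea: many client tasks share a fixed number of slots, and extra tasks wait instead of opening new connections. This is only a conceptual sketch, not PgBouncer itself; `BoundedPool`, `simulateWriteSpike`, and the pool size of 20 in the usage note are hypothetical.

```typescript
// A fixed-size pool: at most `size` tasks hold a slot at any time.
class BoundedPool {
  private inUse = 0;
  private peak = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly size: number;

  constructor(size: number) {
    this.size = size;
  }

  private async acquire(): Promise<void> {
    if (this.inUse < this.size) {
      this.inUse++;
      this.peak = Math.max(this.peak, this.inUse);
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter; inUse is unchanged
    } else {
      this.inUse--;
    }
  }

  async run<T>(work: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await work();
    } finally {
      this.release();
    }
  }

  get peakInUse(): number {
    return this.peak;
  }
}

// Simulates `clients` concurrent writes multiplexed onto `poolSize` slots.
async function simulateWriteSpike(
  clients: number,
  poolSize: number
): Promise<{ completed: number; peak: number }> {
  const pool = new BoundedPool(poolSize);
  let completed = 0;
  const write = async (): Promise<void> => {
    await Promise.resolve(); // stand-in for a database round trip
    completed++;
  };
  await Promise.all(Array.from({ length: clients }, () => pool.run(write)));
  return { completed, peak: pool.peakInUse };
}
```

Calling `simulateWriteSpike(10000, 20)` completes all 10,000 writes while never holding more than 20 slots at once, which is the property the pooler provides to the database.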
Core Problem
Deliver trustworthy uptime checks and incident alerts without flooding users with false positives or delaying real notifications.
Build Notes
What I Owned
I use NetPulse as my main proof-of-work because it connects monitoring, queueing, pooling, authentication, and incident UX into one system instead of a single isolated demo.
Hard Lesson
The main lesson was that alerting is a trust problem: retry windows, debounce rules, and incident lifecycle state matter as much as detecting downtime.
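A debounce rule of the kind described here might be sketched as below: an incident is only opened when the most recent checks all failed and those failures fall inside one retry window. The names `shouldOpenIncident`, `failureThreshold`, and `windowMs` are hypothetical, and the actual thresholds used in NetPulse are not stated in this writeup.

```typescript
interface CheckResult {
  ok: boolean;
  at: number; // epoch milliseconds
}

interface DebounceConfig {
  failureThreshold: number; // consecutive failures required before an incident opens
  windowMs: number; // those failures must all fall within this window
}

// `results` is assumed to be sorted by time, oldest first.
function shouldOpenIncident(results: CheckResult[], cfg: DebounceConfig): boolean {
  if (cfg.failureThreshold < 1) {
    throw new RangeError("failureThreshold must be at least 1");
  }
  if (results.length < cfg.failureThreshold) return false;

  const recent = results.slice(-cfg.failureThreshold);
  const allFailed = recent.every((r) => !r.ok);
  const span = recent[recent.length - 1].at - recent[0].at;
  return allFailed && span <= cfg.windowMs;
}
```

A single failed probe, or a fail/pass/fail flap, never opens an incident under this rule; only a sustained run of failures does.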
Next Enhancement
Next I would add anonymized real-user uptime checks and a public status-page demo so the evidence shifts from staged validation to production usage.
High-Level Architecture
mermaid
graph TD
Checkers[Regional Check Workers]-->Queue[Job Queue]
Queue-->Engine[Monitoring Engine]
Engine-->Store[(PostgreSQL)]
Engine-->Cache[(Redis)]
Store-->Dashboard[Status Dashboard]
Engine-->Alerts[Incident Alerts]
Production-Grade Capabilities
- Tenant-aware authentication and onboarding via Cognito registration + login.
- Public demo-safe read-only API mode for external review without privileged credentials.
- Multi-region probe workflow with incident lifecycle and alert deduplication controls.
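Alert deduplication across regions can be sketched as a per-monitor suppression window: when several regional checkers report the same outage, only the first report inside the window triggers a notification. `AlertDeduplicator` and its keying by monitor ID are assumptions made for this example.

```typescript
class AlertDeduplicator {
  // monitorId -> time of the last notification that was sent (epoch ms)
  private readonly lastSent = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true if a notification should go out now, and records it.
  shouldNotify(monitorId: string, nowMs: number): boolean {
    const last = this.lastSent.get(monitorId);
    if (last !== undefined && nowMs - last < this.windowMs) {
      return false; // already notified inside this window
    }
    this.lastSent.set(monitorId, nowMs);
    return true;
  }
}
```

In a multi-instance deployment the map would need to live in shared storage; the in-memory version only shows the rule.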
Engineering Decisions
- Shorter check intervals improve freshness but can increase noise, so I added retry windows and alert debouncing before incident creation.
- Persisting complete history improves debugging, but can grow quickly; I retained detailed recent logs and summarized older events.
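The retention tradeoff above can be sketched as a compaction pass that keeps recent check records in full and rolls older ones into hourly summaries. The record shapes, the hourly bucket size, and the `compactHistory` name are assumptions for illustration; the writeup does not specify the actual retention windows.

```typescript
interface CheckRecord {
  monitorId: string;
  at: number; // epoch milliseconds
  ok: boolean;
  latencyMs: number;
}

interface HourlySummary {
  monitorId: string;
  hourStart: number; // epoch milliseconds, aligned to the hour
  checks: number;
  failures: number;
  avgLatencyMs: number;
}

const HOUR_MS = 3600000;

// Keeps records newer than the retention cutoff; summarizes the rest per monitor per hour.
function compactHistory(
  records: CheckRecord[],
  nowMs: number,
  detailedRetentionMs: number
): { detailed: CheckRecord[]; summaries: HourlySummary[] } {
  const cutoff = nowMs - detailedRetentionMs;
  const detailed: CheckRecord[] = [];
  const buckets = new Map<
    string,
    { monitorId: string; hourStart: number; checks: number; failures: number; latencySum: number }
  >();

  for (const r of records) {
    if (r.at >= cutoff) {
      detailed.push(r);
      continue;
    }
    const hourStart = Math.floor(r.at / HOUR_MS) * HOUR_MS;
    const key = `${r.monitorId}:${hourStart}`;
    const bucket =
      buckets.get(key) ??
      { monitorId: r.monitorId, hourStart, checks: 0, failures: 0, latencySum: 0 };
    bucket.checks++;
    if (!r.ok) bucket.failures++;
    bucket.latencySum += r.latencyMs;
    buckets.set(key, bucket);
  }

  const summaries = Array.from(buckets.values()).map((b) => ({
    monitorId: b.monitorId,
    hourStart: b.hourStart,
    checks: b.checks,
    failures: b.failures,
    avgLatencyMs: b.latencySum / b.checks,
  }));
  return { detailed, summaries };
}
```

Older data loses per-check detail but still answers the questions a status page asks: how many checks ran, how many failed, and how slow they were on average.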
Behavioral + Impact Signals
- Built failure-handling logic with retries and alert debouncing.
- Documented reliability/cost tradeoffs for retention and probing intervals.
- Shipped a live system plus docs so reviewers can validate claims quickly.
Quality Guarantees
- Each check result is timestamped, attributable, and persisted.
- Incident state changes are auditable from detection to resolution.
- Alert pipelines avoid duplicate notifications for the same incident window.
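One way to make incident state changes auditable from detection to resolution is an explicit state machine that appends every transition to a history list. The state names (`suspected`, `open`, `resolved`, `dismissed`) and the `transition` helper are hypothetical; the actual lifecycle states in NetPulse are not listed here.

```typescript
type IncidentState = "suspected" | "open" | "resolved" | "dismissed";

// Which states each state may move to. Terminal states allow nothing.
const ALLOWED: Record<IncidentState, readonly IncidentState[]> = {
  suspected: ["open", "dismissed"],
  open: ["resolved"],
  resolved: [],
  dismissed: [],
};

interface Transition {
  from: IncidentState;
  to: IncidentState;
  at: string; // ISO 8601 timestamp
  reason: string;
}

interface Incident {
  id: string;
  state: IncidentState;
  history: Transition[];
}

// Returns a new incident with the transition applied and recorded.
// Throws if the move is not permitted from the current state.
function transition(
  incident: Incident,
  to: IncidentState,
  reason: string,
  at: Date
): Incident {
  if (!ALLOWED[incident.state].includes(to)) {
    throw new Error(`illegal transition ${incident.state} -> ${to}`);
  }
  const entry: Transition = {
    from: incident.state,
    to,
    at: at.toISOString(),
    reason,
  };
  return { ...incident, state: to, history: [...incident.history, entry] };
}
```

Because the function returns a new object and only ever appends to `history`, the full path of an incident can be replayed after the fact.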
Recent Upgrades
- Added dedicated registration with Cognito email verification and a full login flow for production-style onboarding.
- Introduced public read-only demo API mode so external reviewers can evaluate functionality without tenant credentials.
- Expanded integration test + deployment workflows for stronger end-to-end operational confidence.
- Linked NetPulse into the cross-project production architecture upgrade log so pooling, mTLS, queueing, and incident lifecycle decisions are easier to verify.
Phase Improvements
Production Onboarding + Reviewer-Safe Demo Access
Strengthened NetPulse from an architecture demo into a reviewable SaaS workflow with Cognito registration, email verification, login, and demo-safe access paths.
- Added tenant-aware registration and login flow so the system can be evaluated like a real monitoring product.
- Introduced read-only demo API behavior for external reviewers without exposing privileged user actions.
- Expanded deployment and integration validation around the onboarding and monitoring paths.
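The read-only demo behavior could be enforced with a small guard like the one below, which lets demo callers use only non-mutating HTTP methods. The `Principal` type and `isRequestAllowed` function are illustrative assumptions; how NetPulse actually distinguishes demo and tenant callers is not described in this writeup.

```typescript
type Principal =
  | { kind: "tenant"; tenantId: string } // authenticated tenant user
  | { kind: "demo" }; // unauthenticated reviewer in demo mode

// HTTP methods that do not change server state.
const SAFE_METHODS: ReadonlySet<string> = new Set(["GET", "HEAD", "OPTIONS"]);

// Tenant users may call any method; demo callers are limited to safe methods.
function isRequestAllowed(principal: Principal, method: string): boolean {
  if (principal.kind === "tenant") return true;
  return SAFE_METHODS.has(method.toUpperCase());
}
```

A guard like this belongs in middleware in front of every route, so a newly added mutating endpoint is read-only for demo callers by default.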
Real-User Evidence + Public Status Page Upgrade
Moves NetPulse from staged validation language toward verifiable live-product evidence through public status pages, timestamped uptime summaries, and safer demo-mode boundaries.
- Expose public status pages that hide private owner data while showing uptime, incidents, and last-check time.
- Keep demo mode isolated so reviewers can inspect behavior without mutating privileged tenant data.
- Differentiate real usage evidence from staged JMeter/load-test validation in portfolio copy.
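Hiding private owner data on a public status page can be done by projecting the internal record onto an explicit public shape, so new private fields stay hidden unless someone deliberately exposes them. The field names in `MonitorRecord` and the `toPublicStatus` helper are hypothetical.

```typescript
// Internal record, including fields that must never reach a public page.
interface MonitorRecord {
  id: string;
  name: string;
  ownerEmail: string;
  tenantId: string;
  checksTotal: number;
  checksOk: number;
  openIncidents: number;
  lastCheckAt: string; // ISO 8601 timestamp
}

// The only fields a public status page is allowed to show.
interface PublicStatus {
  name: string;
  uptimePercent: number | null; // null when no checks have run yet
  openIncidents: number;
  lastCheckAt: string;
}

// Builds the public view field by field instead of copying and deleting,
// so owner and tenant identifiers are excluded by construction.
function toPublicStatus(m: MonitorRecord): PublicStatus {
  const uptimePercent =
    m.checksTotal === 0
      ? null
      : Math.round((m.checksOk / m.checksTotal) * 10000) / 100;
  return {
    name: m.name,
    uptimePercent,
    openIncidents: m.openIncidents,
    lastCheckAt: m.lastCheckAt,
  };
}
```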
Outcome Highlights
- Shipped a production deployment and public source code for external technical review.
- Implemented status dashboards that let users track service health over time instead of isolated check events.
- Designed the alert flow with retry/debounce behavior to reduce noisy false alarms.