May 26, 2026 · 2 min read · architecture

Why we wait for consensus before paging you

A single bad PoP is the classic source of 3am false pages. gochron only opens an incident when multiple regions agree the target is down. Here is the design and what it gets us.

Pick any monitoring tool and run it for a quarter. You will get paged at 3am for an outage that wasn't, because one prober region had a bad minute: a flaky anycast route, packet loss at a single transit provider, an upstream DNS hiccup that resolved itself thirty seconds later. The probe was honest. It failed. The target was fine.

By default, gochron does not page anyone for that kind of single-region blip. We call it consensus alerting: the gate between "a probe failed" and "a human gets woken up."

The shape

Probes run from six regions: iad, fra, nrt, syd, gru, hkg. For an HTTP, TCP, SSL, or DNS monitor configured against multiple regions, every check interval runs the probe from each enabled region and writes the result to the same Postgres cluster.

When evaluating whether to open an incident, the scheduler looks at the most recent result from each region. If a single region is failing while the others are succeeding, that fact is recorded but does not escalate. The per-region result is still visible on the monitor detail page, so you can debug a regional issue without being woken up by it.

If N or more regions agree the target is down, where N is the monitor's consensus threshold, the incident opens and the matching alert rules fire.
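The gate described above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not gochron's actual scheduler code: the ProbeResult type and the FailingRegions and ShouldOpenIncident functions are hypothetical names, and the input is assumed to hold exactly one entry per enabled region (its most recent result).

```go
package main

// ProbeResult is a hypothetical record of the most recent check from one region.
type ProbeResult struct {
	Region string // region code, e.g. "iad" or "fra"
	Up     bool   // true if the latest probe from this region succeeded
}

// FailingRegions returns the regions whose most recent probe failed.
func FailingRegions(latest []ProbeResult) []string {
	var down []string
	for _, r := range latest {
		if !r.Up {
			down = append(down, r.Region)
		}
	}
	return down
}

// ShouldOpenIncident reports whether at least threshold regions agree
// that the target is down. A threshold below 1 is treated as 1
// (an assumption of this sketch, so the gate can never be vacuously true).
func ShouldOpenIncident(latest []ProbeResult, threshold int) bool {
	if threshold < 1 {
		threshold = 1
	}
	return len(FailingRegions(latest)) >= threshold
}
```

With a threshold of 2, one failing region out of six is recorded but does not open an incident; a second failing region does.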

Why not page on first failure

A naive "alert if down from anywhere" check minimizes detection latency at the cost of false positives. In practice the false-positive cost is not theoretical. Every false page erodes trust in the alerting system, which is how teams end up muting Slack channels and rationalizing "probably nothing" at 3am. That is the worst failure mode for monitoring.

The tradeoff is detection latency. Consensus alerting waits for multiple regions to confirm before opening an incident, which adds time. For most production targets, that delay is worth orders of magnitude fewer false pages. For a tighter SLO where seconds matter, consensus is the wrong default, and you can tune the threshold per monitor.

What about a real single-region outage

The honest answer: with consensus alerting on and the threshold above 1, you won't be paged when only one region's users are seeing a real outage. That's a deliberate choice. The dashboard still shows the regional failure clearly, and status-page region badges (opt-in) let your users see the same breakdown.

If single-region visibility is critical for your product (say a CDN, or a regional API), the right move is a per-region monitor with the consensus threshold dropped to 1. Same probe data, different gate.
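To make "same probe data, different gate" concrete, here is a minimal Go sketch. The Monitor struct, its field names, and the ShouldPage method are assumptions made for illustration and do not reflect gochron's real configuration schema. Each monitor counts failures only among its own regions and compares that count to its own threshold.

```go
package main

// Monitor is a hypothetical monitor definition: which regions probe the
// target, and how many of them must fail before an incident opens.
type Monitor struct {
	Name               string
	Regions            []string
	ConsensusThreshold int
}

// ShouldPage counts failures among this monitor's regions only and
// compares the count to the monitor's consensus threshold.
// failed maps a region code to true when its latest probe failed.
func (m Monitor) ShouldPage(failed map[string]bool) bool {
	threshold := m.ConsensusThreshold
	if threshold < 1 {
		threshold = 1 // assumption of this sketch: never gate on zero failures
	}
	count := 0
	for _, region := range m.Regions {
		if failed[region] {
			count++
		}
	}
	return count >= threshold
}
```

Given a failure seen only from syd, a six-region monitor with a threshold of 2 stays quiet, while a syd-only monitor with a threshold of 1 pages. Both read the same probe results.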

Where to dig in

Every monitor's detail page shows per-region probe results with the consensus threshold annotated. Incidents carry a snapshot of which regions agreed at open-time, so a post-mortem can reconstruct exactly why we paged or didn't.
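One way to picture the open-time snapshot is as a small immutable record taken at the moment the gate is evaluated. The Go sketch below is hypothetical: the Snapshot type, its fields, and the TakeSnapshot function are illustrative names, not gochron's stored incident format.

```go
package main

import (
	"sort"
	"time"
)

// Snapshot is a hypothetical record of what each region reported when
// the consensus gate was evaluated.
type Snapshot struct {
	TakenAt   time.Time
	Threshold int
	Failing   []string // regions whose latest probe failed, sorted
	Healthy   []string // regions whose latest probe succeeded, sorted
}

// TakeSnapshot freezes the latest per-region results (region -> up)
// together with the threshold in force at that moment.
func TakeSnapshot(latest map[string]bool, threshold int) Snapshot {
	s := Snapshot{TakenAt: time.Now(), Threshold: threshold}
	for region, up := range latest {
		if up {
			s.Healthy = append(s.Healthy, region)
		} else {
			s.Failing = append(s.Failing, region)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Strings(s.Failing)
	sort.Strings(s.Healthy)
	return s
}

// Paged reports whether the snapshot met the consensus threshold.
func (s Snapshot) Paged() bool {
	return len(s.Failing) >= s.Threshold
}
```

Because the threshold is stored next to the per-region results, a later change to the monitor's settings would not alter what the record says about why the page did or did not fire.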

If you've been bitten by single-region false pages in another tool, this is the design decision meant to fix that pattern.
