Dead man's switch monitoring explained

The phrase "dead man's switch" comes from 19th-century railway engineering. Trains were fitted with a lever the driver had to keep pressed while moving. If the driver released it (having fallen asleep, been incapacitated, or worse), the brakes applied automatically. The absence of a signal, not the presence of one, triggered the safety response.

The concept maps directly to scheduled task monitoring. Your job is expected to check in periodically. The absence of a check-in triggers an alert.

How it differs from active monitoring

Most infrastructure monitoring is active: a probe regularly hits an endpoint and alerts if the endpoint doesn't respond. Load balancer health checks, uptime services, and synthetic checks all work this way.

Dead man's switch monitoring inverts this model. Instead of a probe checking on a job, the job checks in with the monitor. The monitor's role is simply to wait. If the signal doesn't arrive within the expected window, it alerts.
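
As a rough illustration of the job side of this model, here is a minimal Python sketch. It assumes the monitor accepts a plain HTTP GET as a check-in; the names http_ping and run_and_check_in are hypothetical, chosen for this example, and are not part of any particular product's API.

```python
import urllib.request
from typing import Callable


def http_ping(url: str, timeout: float = 10.0) -> None:
    """Send a check-in by issuing an HTTP GET to the monitor's ping URL.

    Assumption: the monitor treats any GET to this URL as a check-in.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def run_and_check_in(job: Callable[[], None], ping: Callable[[], None]) -> bool:
    """Run the job, then check in only if it finished without raising.

    Returns True if the check-in was sent. A failed job sends nothing,
    so the monitor's deadline passes and the alert fires.
    """
    try:
        job()
    except Exception:
        return False  # no ping: silence is the failure signal
    try:
        ping()
    except OSError:
        # Network errors (including urllib's URLError) must not crash the job.
        # The monitor will still alert on the missing check-in.
        return False
    return True
```

A nightly backup could then be wrapped as run_and_check_in(backup, lambda: http_ping("https://monitor.example/ping/your-monitor-id")), where the URL is a placeholder. Passing the ping in as a callable keeps the wrapper easy to exercise without network access.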

This inversion is what makes it effective for scheduled tasks. An HTTP health check on /health tells you whether a service is responding right now. It tells you nothing about whether the 2am backup ran. The backup exits in seconds, leaves no running process, and has no endpoint to probe. There's nothing for active monitoring to check.

The state machine

At Tymo, each monitor is a four-state machine:

pending — Created but never pinged. Waiting for the first check-in.
up — Last ping arrived within the expected window.
down — The deadline passed (schedule + grace period) with no ping. Alert sent.
paused — Monitoring suspended. Pings are logged but ignored.

Transitions between pending, up, and down happen in two places: when a ping arrives (moving to up), and on the scheduler tick (moving to down when the deadline is missed). Recovery is automatic: a ping on a down monitor sends a recovery notification and moves the state back to up, with no manual intervention.
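
The states and transitions described above can be sketched in a few lines of Python. This is an illustrative model rather than Tymo's actual implementation. It assumes a fixed-interval schedule (a period) instead of a cron expression, a deadline of last ping + period + grace, and that a pending monitor never alerts. The Monitor class and its method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class State(Enum):
    PENDING = "pending"
    UP = "up"
    DOWN = "down"
    PAUSED = "paused"


@dataclass
class Monitor:
    period: timedelta  # assumed: expected interval between check-ins
    grace: timedelta   # extra time allowed before alerting
    state: State = State.PENDING
    last_ping: Optional[datetime] = None

    def on_ping(self, now: datetime) -> Optional[str]:
        """Handle a check-in. Returns "recovered" if it ends an outage."""
        if self.state is State.PAUSED:
            return None  # a real system would log the ping; state is unchanged
        was_down = self.state is State.DOWN
        self.last_ping = now
        self.state = State.UP
        return "recovered" if was_down else None

    def on_tick(self, now: datetime) -> Optional[str]:
        """Scheduler tick. Returns "down" once, when the deadline is missed."""
        if self.state is not State.UP or self.last_ping is None:
            return None  # pending, paused, and already-down monitors do not alert
        if now > self.last_ping + self.period + self.grace:
            self.state = State.DOWN
            return "down"
        return None
```

Returning the notification as a value, instead of sending it from inside the methods, keeps the transition logic separate from alert delivery and makes it straightforward to test with fixed timestamps.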

Grace periods matter more than you think

Jobs don't run in zero time, and their runtimes vary. A backup that usually finishes in 10 minutes might take 45 minutes after a large data import. If the job checks in when it finishes and your grace period is shorter than the worst-case runtime, you'll get false alerts. False alerts are worse than no alerts, because they train your team to ignore the monitoring.

A reasonable approach: measure the job's longest observed runtime over the past 30 days and add 50% as a buffer. If the job has occasionally taken up to 40 minutes, set a 60-minute grace period (40 × 1.5 = 60). The target is zero false positives. Every alert should represent a real problem.
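
The rule of thumb above is simple enough to write down as a small helper. The function name suggested_grace_minutes is hypothetical, and the sketch assumes you already have the job's runtimes for the past 30 days as a list of minutes.

```python
import math
from typing import Sequence


def suggested_grace_minutes(runtimes_minutes: Sequence[float], buffer: float = 0.5) -> int:
    """Longest observed runtime plus a proportional buffer, rounded up to whole minutes."""
    if not runtimes_minutes:
        raise ValueError("need at least one observed runtime")
    return math.ceil(max(runtimes_minutes) * (1 + buffer))
```

For observed runtimes of 10, 12, 40, and 9 minutes, the longest is 40, so the helper suggests a 60-minute grace period.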

Where it fits in a monitoring stack

Dead man's switch monitoring is one layer in a stack, not a replacement for other approaches. Use it alongside:

Log aggregation — for what happened during job execution and why it failed
Active probes on long-running services — for uptime and response time
Error rate alerting — for background workers that process queues continuously
Database query monitoring — for jobs that read or write critical data

For scheduled tasks specifically, the dead man's switch layer is the one most commonly missing and the one that carries the highest cost when absent. A backup that silently fails for three months often goes unnoticed until a data loss event. A nightly report that stopped generating means decisions were made on stale data. The challenge is not detecting the failure quickly; it is detecting it at all.