Watchdog Timer Best Practices: Preventing System Hangs and Failures
1. Choose the right type
- Hardware watchdog for critical systems requiring independent recovery.
- Software watchdog for lower-criticality tasks or where hardware isn’t available.
- Use a dual-layer approach (hardware + software watchdog) when reliability is paramount.
2. Set an appropriate timeout
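- Determine the worst-case interval between kicks: start from the worst-case execution time (WCET) of the monitored loop and include interrupt load, blocking I/O, and wake-up latency from low-power modes.
- Add a safety margin on top of that figure (a timeout of 1.5×–2× the WCET is typical) to avoid false resets.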
- Use different timeouts for different operational modes (e.g., boot, runtime, low-power).
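The margin rule above can be sketched as a small calculation. This is a minimal illustration, not a sizing tool: the mode names, the millisecond figures, and the `step_ms` parameter (standing in for the granularity of a hardware prescaler) are all assumptions chosen for the example.

```python
import math

# Hypothetical worst-case intervals between kicks, in milliseconds, per mode.
# Placeholder values: measure these on the real target.
WORST_CASE_KICK_INTERVAL_MS = {
    "boot": 4000,
    "runtime": 200,
    "low_power": 8000,
}


def watchdog_timeout_ms(worst_case_ms, margin=1.5, step_ms=1):
    """Return worst_case_ms * margin, rounded up to a multiple of step_ms."""
    if worst_case_ms <= 0 or step_ms <= 0:
        raise ValueError("worst_case_ms and step_ms must be positive")
    if margin < 1.0:
        raise ValueError("a margin below 1.0 would guarantee false resets")
    raw = worst_case_ms * margin
    # Round up, never down, so the margin is not silently eroded.
    return math.ceil(raw / step_ms) * step_ms


# One timeout per operational mode, assuming a 250 ms timer granularity.
TIMEOUTS_MS = {
    mode: watchdog_timeout_ms(worst_case, margin=2.0, step_ms=250)
    for mode, worst_case in WORST_CASE_KICK_INTERVAL_MS.items()
}
```

Rounding up to the timer granularity matters because a timeout that is rounded down can end up shorter than the margin intended.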
3. Implement robust petting/kick logic
- Only pet the watchdog when the system is in a known-good state; never kick unconditionally from a periodic timer, which can keep firing after the main loop has hung.
- Use a centralized “health monitor” task that verifies critical subsystems before kicking.
- Avoid multiple unsynchronized tasks independently petting the watchdog.
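A centralized health monitor of this kind might look like the sketch below. The class name `HealthMonitor`, the `kick` callable (a stand-in for whatever actually services the watchdog on the target), and the per-task silence limits are hypothetical names introduced for illustration.

```python
import time


class HealthMonitor:
    """Single owner of the watchdog kick; tasks only report heartbeats."""

    def __init__(self, kick, max_silence_s, clock=time.monotonic):
        self._kick = kick                          # services the real watchdog
        self._max_silence_s = dict(max_silence_s)  # task name -> allowed gap (s)
        self._clock = clock
        now = clock()
        self._last_seen = {name: now for name in self._max_silence_s}

    def heartbeat(self, name):
        """Called by a task to report that it is alive and making progress."""
        if name not in self._last_seen:
            raise KeyError(f"unregistered task: {name}")
        self._last_seen[name] = self._clock()

    def stale_tasks(self):
        """Return the names of tasks that have been silent for too long."""
        now = self._clock()
        return [
            name
            for name, seen in self._last_seen.items()
            if now - seen > self._max_silence_s[name]
        ]

    def service(self):
        """Kick only if every registered task reported in time."""
        if self.stale_tasks():
            return False  # withhold the kick and let the watchdog expire
        self._kick()
        return True
```

Because only `service()` ever calls `kick`, a single hung task is enough to stop the kicks, which is the behavior that unsynchronized per-task petting cannot provide.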
4. Monitor critical subsystems explicitly
- Check CPU load, memory usage, key peripheral responses, task liveness, and communication links.
- Use heartbeat signals from important threads/processes; escalate when a heartbeat is missed.
- Record fault counters to detect intermittent failures before reset.
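One way to track intermittent failures is a "leaky" fault counter that rises on each failed check and decays on each good one. The sketch below assumes hypothetical thresholds and level names ("ok", "warn", "fail"); the right values depend on the subsystem being monitored.

```python
class FaultCounter:
    """Leaky counter: failed checks raise it, successful checks lower it."""

    def __init__(self, warn_at=3, fail_at=6):
        if not 0 < warn_at <= fail_at:
            raise ValueError("expected 0 < warn_at <= fail_at")
        self.warn_at = warn_at
        self.fail_at = fail_at
        self.count = 0          # current (decaying) fault level
        self.total_faults = 0   # lifetime total, useful for trend logging

    def record(self, ok):
        """Record one check result and return the current escalation level."""
        if ok:
            self.count = max(0, self.count - 1)
        else:
            self.count += 1
            self.total_faults += 1
        return self.level()

    def level(self):
        if self.count >= self.fail_at:
            return "fail"   # e.g. stop kicking the watchdog
        if self.count >= self.warn_at:
            return "warn"   # e.g. log and report, but keep running
        return "ok"
```

An occasional glitch decays away, while a burst of failures climbs through "warn" to "fail" and can be logged before any reset happens.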
5. Detect and differentiate failure modes
- Use non-volatile storage (or reserved RAM across warm resets) to log reset causes and stack traces.
- Classify resets (watchdog, brownout, software-initiated) to tailor recovery actions.
- On repeated watchdog resets, switch to a safe mode, disable nonessential features, or enter firmware recovery.
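The escalation on repeated watchdog resets can be expressed as a small decision function. In this sketch the enum members, the thresholds, and the idea that the consecutive-reset count is read from and written back to non-volatile storage by the caller are all assumptions; how the reset cause is actually read is hardware-specific and not shown.

```python
from enum import Enum


class ResetCause(Enum):
    POWER_ON = "power_on"
    BROWNOUT = "brownout"
    WATCHDOG = "watchdog"
    SOFTWARE = "software"


class BootMode(Enum):
    NORMAL = "normal"
    SAFE = "safe"          # nonessential features disabled
    RECOVERY = "recovery"  # firmware recovery


def next_boot_state(cause, consecutive_wdt, safe_after=3, recovery_after=5):
    """Return (boot_mode, updated_consecutive_watchdog_count).

    The caller is assumed to persist the returned count across resets.
    """
    if cause is not ResetCause.WATCHDOG:
        # Any other reset cause breaks the streak.
        return BootMode.NORMAL, 0
    count = consecutive_wdt + 1
    if count >= recovery_after:
        return BootMode.RECOVERY, count
    if count >= safe_after:
        return BootMode.SAFE, count
    return BootMode.NORMAL, count
```

A real system would probably also clear the count after a period of stable operation, so that rare, unrelated watchdog resets do not accumulate into safe mode.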
6. Provide graceful recovery paths
- If time allows before the reset, attempt a controlled shutdown of peripherals and flush pending data.
- After reset, run self-tests and decide whether to resume, roll back to known-good firmware, or enter maintenance mode.
- Offer remote reporting of reset events for diagnostics.
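The post-reset decision could be sketched as follows. The policy shown (resume if all self-tests pass, otherwise roll back if a known-good image exists, otherwise enter maintenance mode) is one plausible ordering, not the only one, and the function name, action strings, and self-test callables are hypothetical.

```python
def choose_recovery_action(self_tests, rollback_available):
    """Run self-tests and return (action, names_of_failed_tests).

    self_tests maps a test name to a zero-argument callable returning
    True on success. A test that raises is treated as a failure.
    """
    failed = []
    for name, test in self_tests.items():
        try:
            passed = bool(test())
        except Exception:
            passed = False
        if not passed:
            failed.append(name)

    if not failed:
        return "resume", failed
    if rollback_available:
        return "rollback", failed
    return "maintenance", failed
```

Returning the list of failed tests alongside the action gives the remote report something concrete to include.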
7. Protect against accidental disabling or starvation
- Restrict who/what can disable the watchdog (privileged code paths only).
- Implement watchdog kick tokens or counters so that misbehaving code cannot keep petting the watchdog indefinitely.
- In multi-core systems, require consensus or a central monitor before disabling.
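One common form of the token idea is a checkpoint sequence: each stage of the main loop hands in its token in order, and the kick is allowed only when the full sequence has been collected. The sketch below uses hypothetical names (`CheckpointWatchdog`, the stage names, the `kick` callable) and is a simulation of the logic rather than driver code.

```python
class CheckpointWatchdog:
    """Allows a kick only after all checkpoints were reached in order."""

    def __init__(self, kick, checkpoints):
        self._kick = kick                    # services the real watchdog
        self._required = tuple(checkpoints)  # expected order of stages
        self._position = 0                   # index of the next expected stage

    def checkpoint(self, name):
        """Record that a stage ran; out-of-order stages discard all progress."""
        if (self._position < len(self._required)
                and name == self._required[self._position]):
            self._position += 1
        else:
            self._position = 0

    def try_kick(self):
        """Kick if the full sequence was seen; tokens are consumed on success."""
        if self._position != len(self._required):
            return False
        self._position = 0
        self._kick()
        return True
```

A runaway loop that only calls `try_kick()` never accumulates the tokens, so the watchdog still expires even though the kicking code keeps running.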
8. Test thoroughly and include fault injection
- Simulate hangs, memory leaks, high interrupt load, and peripheral failures.
- Verify that timeouts trigger resets and that post-reset behavior is correct.
- Include watchdog behavior in automated regression tests.
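Fault injection can start on the host with a simulated watchdog before moving to hardware. The sketch below is an assumption-laden model: time is counted in abstract ticks, and `SimulatedWatchdog` and `run_scenario` are hypothetical helpers, not part of any real test framework.

```python
class SimulatedWatchdog:
    """Tick-driven countdown that latches once it expires."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def kick(self):
        if not self.expired:          # an expired watchdog cannot be revived
            self.remaining = self.timeout_ticks

    def tick(self):
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True


def run_scenario(total_ticks, timeout_ticks, kick_period, hang_at=None):
    """Return the tick at which the watchdog expired, or None if it never did.

    The simulated task kicks every kick_period ticks until hang_at,
    after which it stops kicking (an injected hang).
    """
    wd = SimulatedWatchdog(timeout_ticks)
    for now in range(1, total_ticks + 1):
        hung = hang_at is not None and now >= hang_at
        if not hung and now % kick_period == 0:
            wd.kick()
        wd.tick()
        if wd.expired:
            return now
    return None
```

A regression suite can then assert three things: a healthy run never expires, an injected hang expires within one timeout of the hang, and a kick period longer than the timeout is caught as a false reset.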
9. Consider power and low-power interactions
- Ensure the watchdog keeps running, or is deliberately reconfigured, during sleep modes.
- For battery-powered devices, avoid unnecessary resets that drain power; use longer timeouts in low-power modes.
10. Document and log
- Document watchdog policy, timeouts per mode, and petting logic in design docs.
- Log reset reasons and timestamps to aid debugging and trend analysis.
Implementing these practices will reduce false resets, improve system resilience, and ensure that genuine hangs lead to timely, diagnosable recovery.