Maybe one of the main differences between embedded programmers and server/backend programmers is their attitude toward system resets. A server programmer will try as hard as possible to avoid any sort of reboot, since it could make a bad situation even worse. They strive for graceful service degradation (i.e. the system no longer provides its full, top-level service) whenever forced to react to unexpected or failing conditions.
On the other hand, embedded programmers are more inclined to have the system restart from scratch when something weird happens, possibly because most embedded systems are installed in such a way that accessing them for inspection or a manual power cycle is impossible (or extremely difficult). A wise firmware programmer knows that bugs or unexpected conditions may lock up the micro, turning it into an expensive and convoluted heat sink.
A watchdog is like a hardware-implemented time bomb. Usually it sits inside the micro, but sometimes it is implemented in an external IC. The application code needs to reset the timer before it runs out, in order to avoid triggering the bomb (actually a system-reset kind of explosion – possibly harmless).
As a countermeasure, it is a bit extreme, but sometimes it is the only choice.
In my designs I often need several activities that progress concurrently; they could be threads/tasks, but also state machines. The micro I’m currently using has just one hardware watchdog (well, not entirely true, but bear with me), so I need to multiplex it, creating a virtualized watchdog service. In this way every activity gets the illusion of having its own personal, private watchdog.
Watchdog virtualization is pretty straightforward: it is implemented with a bit-mask. Activities are numbered; when an activity kicks the virtual watchdog, the corresponding bit in the bit-mask is cleared. When all bits are clear, the virtual watchdog code kicks the physical watchdog and resets the bit-mask to all 1s.
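The bit-mask scheme can be sketched in a few lines of C. This is a minimal sketch, not my actual code: the names (`vwdg_kick`, `vwdg_pending`), the activity count, and the stubbed `hw_watchdog_kick` are all assumptions; on a real part the hardware kick writes the watchdog refresh register, and the kick function should be protected against concurrent access (a critical section or an atomic and/or).

```c
#include <stdint.h>
#include <stdbool.h>

#define ACTIVITY_COUNT 10u
#define ALL_ACTIVITIES ((1u << ACTIVITY_COUNT) - 1u)

/* Bits still waiting for a kick; bit i belongs to activity i. */
static uint32_t vwdg_pending = ALL_ACTIVITIES;

/* Hardware kick is platform specific; stubbed here as a counter. */
static unsigned hw_kicks = 0;
static void hw_watchdog_kick(void) { ++hw_kicks; }

/* Called by each activity; returns true when the last pending bit
   was cleared and the physical watchdog was actually kicked. */
bool vwdg_kick(unsigned activity_id)
{
    vwdg_pending &= ~(1u << activity_id);
    if (vwdg_pending == 0u) {
        hw_watchdog_kick();             /* everybody checked in */
        vwdg_pending = ALL_ACTIVITIES;  /* start a new round */
        return true;
    }
    return false;
}
```

Note that the physical watchdog is kicked only when the slowest activity of the round checks in, which is why its timeout must be sized on the slowest activity, as discussed below.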
The physical watchdog timeout needs to be tuned so that it lasts at least twice the time needed by the slowest activity to kick the watchdog; otherwise the watchdog could trigger even in normal conditions.
So far so good: you saved the day, and in production your firmware will be rock solid, ready to restart should anything go wrong. But… but for now we’re still in the development stage, so activities may have bugs that prevent them from kicking the watchdog.
It happens, not that infrequently, that after some changes to the firmware, or during a port to a different board, the watchdog resets the board, leaving you to wonder which one, among the ten activities, is idly stuck somewhere instead of periodically kicking the watchdog.
My first attempt to ease the detective work was to move the virtual-watchdog bit-mask into a memory region that isn’t cleared at boot. My idea was that the offending task would hide among the bits still set.
At boot, my code checked whether the system had been reset by the watchdog and, in that case, recorded the bit-mask of the virtual watchdog in the diagnostic log.
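A boot-time check along these lines might look as follows. This is a hedged sketch: keeping a variable out of the startup zeroing is platform specific (with GCC it is often done via a `.noinit` section reserved in the linker script; `USE_NOINIT_SECTION` is a hypothetical build flag), and the reset-cause test, which on a real part reads a reset-status register, is stubbed here along with the diagnostic log.

```c
#include <stdint.h>
#include <stdbool.h>

/* The mask survives a watchdog reset only if the startup code skips it.
   With GCC this is commonly done by placing it in a ".noinit" section;
   the guard keeps this sketch buildable on a host as well. */
#if defined(__GNUC__) && defined(USE_NOINIT_SECTION)
__attribute__((section(".noinit")))
#endif
uint32_t vwdg_pending_noinit;

/* Platform specific: on a real MCU this reads a reset-status register.
   Stubbed with a flag for illustration. */
static bool fake_wdg_reset = false;
static bool reset_was_watchdog(void) { return fake_wdg_reset; }

/* Hypothetical diagnostic log, stubbed to capture the value. */
static uint32_t logged_mask = 0;
static void diag_log_mask(uint32_t mask) { logged_mask = mask; }

/* Call early at boot, before the bit-mask is reinitialized. */
void vwdg_boot_check(void)
{
    if (reset_was_watchdog())
        /* bits still set = activities that never kicked */
        diag_log_mask(vwdg_pending_noinit);
}
```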
Once implemented, this solution turned out to be not very practical: quite often two (or more) bits were set. Maybe one activity was blocking another, or an unexpected timeout occurred inside a critical section. A post-mortem analysis based just on the stuck activities can range from difficult to impossible.
It would be really useful to perform such an analysis right before the watchdog performs its reset. So I devised the following trick: each time an activity kicks the watchdog, I note down the system time, so that I can always tell when each activity last kicked the dog.
Then I add a periodic check, triggered from the system tick (say every 100 ms), to determine which activity, if any, didn’t kick the watchdog in the last few seconds. How many seconds depends on the watchdog timeout: I have a 4-second watchdog and test for 3 seconds of inactivity, which is plenty of time, since activities should kick the watchdog with a period below 1 second.
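The timestamp-and-check trick can be sketched like this, with the numbers from the text (3-second threshold against a 4-second hardware timeout). The names and the millisecond tick source are assumptions; here the tick counter is stubbed with a variable so the logic is self-contained.

```c
#include <stdint.h>

#define ACTIVITY_COUNT     10u
#define STUCK_THRESHOLD_MS 3000u  /* 3 s of silence, below the 4 s hardware timeout */

/* Stand-in for the real millisecond tick counter (platform specific). */
static uint32_t fake_now_ms = 0;
static uint32_t system_time_ms(void) { return fake_now_ms; }

/* Last kick time of each activity. */
static uint32_t last_kick_ms[ACTIVITY_COUNT];

/* Called together with every virtual-watchdog kick. */
void vwdg_note_kick(unsigned activity_id)
{
    last_kick_ms[activity_id] = system_time_ms();
}

/* Run from the periodic tick (e.g. every 100 ms):
   returns the index of the first stuck activity, or -1 if all are alive. */
int vwdg_find_stuck(void)
{
    uint32_t now = system_time_ms();
    for (unsigned i = 0; i < ACTIVITY_COUNT; ++i) {
        /* unsigned subtraction also handles tick-counter wrap-around */
        if ((uint32_t)(now - last_kick_ms[i]) > STUCK_THRESHOLD_MS)
            return (int)i;
    }
    return -1;
}
```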
If my check fails, I trigger an assertion. My assertions are defined so that they stop in the debugger when the code runs under one; this way I get a chance to examine the system before the reset.
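One possible shape for such an assertion, sketched for a Cortex-M target (an assumption, not necessarily my setup): the `C_DEBUGEN` bit of the CoreDebug `DHCSR` register tells whether a debugger is attached, and a `bkpt` instruction halts execution right there. On a host build, or when no debugger is attached, it just records the failure location for later inspection; the macro and variable names are illustrative.

```c
#include <stdint.h>

/* Last failed assertion, for post-mortem inspection. */
static const char *assert_file = 0;
static int assert_line = 0;

static void assert_failed(const char *file, int line)
{
#if defined(__ARM_ARCH)
    /* Cortex-M: CoreDebug DHCSR at 0xE000EDF0, bit 0 = C_DEBUGEN.
       Stop in the debugger only if one is actually attached,
       otherwise bkpt would fault. */
    if (*(volatile uint32_t *)0xE000EDF0 & 1u)
        __asm volatile ("bkpt 0");
#endif
    assert_file = file;   /* leave a trace for the diagnostic log */
    assert_line = line;
}

#define MY_ASSERT(cond) \
    do { if (!(cond)) assert_failed(__FILE__, __LINE__); } while (0)
```

The guard on `C_DEBUGEN` matters: executing `bkpt` with no debugger attached raises a fault instead of halting, which would spoil the quiet wait for the watchdog reset.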
And that pretty much solves my problem.
Some MCUs provide a pre-watchdog interrupt that fires right before the watchdog itself. By hooking this interrupt you could do the same thing I did with the timer, but then your code is much more platform dependent.
My friend Roberto at Tecniplast proposed to also copy the stack of each task right before the watchdog reset. This would help forensic analysis when you have no debugger access to the device, provided you manage to dump those stacks somehow (possibly at the next reset).
No dogs were actually harmed (not even kicked) while writing this post. I prefer cats, but they are less reliable to reset the MCU.
2 thoughts on “Watchdogging the Watchdog”
Great stuff Max!!
Also it is a great idea to copy the stack of each task. Not an easy thing to do, I believe, but it can be useful.
The use of assertion was brilliant!!!
When the assertion is triggered, the debugger stops executing the FW. At that point, doesn’t a reset of the micro take place anyway?
Thank you Corrado!
Saving stacks for later retrieval is a good idea indeed, the two drawbacks are that you need enough RAM to copy stack contents and then enough bandwidth to recover the data.
When the assertion is hit, you still have some time before the watchdog actually triggers, so you can examine the memory and the stacks of the various tasks, and even continue execution for a while before the reset. Anyway, it depends on how the assertion timeout is defined with respect to the watchdog timeout.
Happy New Year!