I am uncertain as to how to position the watchdog daemon in the initscript start-up sequence.
If the watchdog daemon is started early in the initscript sequence, then there is a chance it could trigger erroneously because a process it is monitoring (via pidfile) hasn't started yet, even though it will start up within a few more seconds.
If the watchdog daemon is started late in the initscript sequence, then there is a chance that the system could lock up without a watchdog reboot, if a process that starts before it gets "stuck" during initialisation. The watchdog won't trigger because it hasn't started yet.
It could be good to start the watchdog daemon early, but have a start-up hold-off time to prevent the watchdog daemon from triggering a shutdown during the time that processes are still starting up.
Is that a good strategy, or is there a better way to handle this?
The way I prefer is to start watchdog last for exactly the reasons you
describe.
Right, that's why there is a second binary in the package: wd_keepalive. This
one should be started as early as possible. It only takes care of the hardware
watchdog, if there is one, but not the additional checks. Therefore there
should be no false positive. When starting watchdog later, you just have to
make sure to stop wd_keepalive.
Michael
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Thanks for explaining that, Michael. One limitation of wd_keepalive I see is that it wouldn't handle the scenario where an initscript process "locks up", e.g. in an infinite loop. wd_keepalive would keep patting the hardware watchdog forever, as long as the kernel is still running fine.
It would be good to have a time-out parameter for wd_keepalive, so it can handle that scenario.
Or, have an initial hold-off time for the watchdog daemon, so it won't start looking for pidfiles until after an initial hold-off time to allow for expected initscript run time.
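To make the hold-off idea concrete, here is a rough sketch in Python. It is only an illustration of the logic, not how the watchdog daemon is implemented; the names should_reboot, holdoff and check_pidfile are made up for this example.

```python
import os


def check_pidfile(path):
    """Return True if the pidfile exists and names a live process."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unreadable file, or garbage contents
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def should_reboot(elapsed, holdoff, pidfiles, check=check_pidfile):
    """Report a failure only once the hold-off period (hypothetical) has passed."""
    if elapsed < holdoff:
        return False  # still in start-up: ignore missing processes
    return any(not check(p) for p in pidfiles)
```

With a hold-off of 30 seconds, a missing pidfile at 5 seconds is ignored, while the same missing pidfile at 30 seconds or later counts as a failure.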
I see the latest wd_keepalive in the git repository (not yet released) has a --loop-exit option. That's the sort of thing I was asking for.
Is a release 5.15 planned sometime soon?
Last edit: Craig McQueen 2015-11-30
Hi Craig,
The --loop-exit option as it stands is not going to do what you want as
it also shuts down the watchdog before exit (as desirable for testing).
You could make it do this by enabling "no way out" for the watchdog
driver (the kernel's CONFIG_WATCHDOG_NOWAYOUT option, or the driver's
nowayout module parameter) so it cannot be stopped, but then you will
not be able to stop the watchdog under normal operations. For example:
http://stackoverflow.com/questions/25247317/linux-watch-dog-change-the-noway-out-config-at-runtime
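For readers unfamiliar with the mechanism, the difference between "exit and stop the watchdog" and "exit and leave it running" can be sketched as below. This is a simplified illustration in Python, not the wd_keepalive source. It relies on the Linux watchdog device convention that any write counts as a keepalive and that writing the character 'V' just before close asks the driver to stop the timer (on drivers that support this "magic close", and ignored when nowayout is set). The function name and its parameters are hypothetical.

```python
import time


def keepalive_loop(dev, loops, interval, disarm, sleep=time.sleep):
    """Pat the watchdog a bounded number of times, then return.

    dev      -- binary file object for the watchdog device
    loops    -- number of keepalive writes before giving up (assumed parameter)
    interval -- seconds to wait between writes
    disarm   -- if True, send the magic 'V' so that closing the device
                stops the timer; if False, the timer is left running
    """
    for _ in range(loops):
        dev.write(b"\0")  # any write counts as a keepalive
        dev.flush()
        sleep(interval)
    if disarm:
        dev.write(b"V")  # magic close request
        dev.flush()
```

Opening the device with open("/dev/watchdog", "wb", buffering=0) and calling keepalive_loop(dev, 60, 1.0, disarm=False) would pat for about a minute and then stop patting, so a boot sequence that is still stuck at that point ends in a hardware reset.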
However, you have spotted a problem with the current way of doing
things, but it is not one that has a simple solution. Part of the reason
for swapping the watchdog daemon with the wd_keepalive daemon is to
support updating of either the binary or the configuration file while
the system is running, etc, and at the same time not rebooting the
machine if CONFIG_WATCHDOG_NOWAYOUT was used.
Using a variation of the --loop-exit option would protect against a hung
system during start up or shut down, but it would present a bit of a
configuration issue in terms of what "stopping" the watchdog under
normal operation is actually going to do. I don't know enough about the
init script options to know if there is any sane way of adding a
"safestop" command so the usual start|stop has a time-out but an
administrator can really stop things without an unexpected reboot 60
seconds later.
Regards, Paul
Currently, I'm using the wd_keepalive --loop-exit option together with the kernel's CONFIG_WATCHDOG_NOWAYOUT.
Thanks for letting us know. That will work; it is not very elegant, but it is necessary in some cases.
I have seen one machine go down to running wd_keepalive, but then something stopped its reboot sequence AFTER it had taken down ssh access, so it needed a physical visit to reboot. Your solution would have saved that case. More generally, the watchdog start/stop script might need to be aware of the runlevel, so that it preserves the ability to stop watchdog checks indefinitely by admin action but times out when bringing the machine down.
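The runlevel-aware decision could look something like this sketch. It is purely hypothetical and not part of the watchdog package: the function name and the 60 second default are invented, and it assumes the SysV convention that runlevels 0 and 6 mean halt and reboot. A real script would have to obtain the runlevel itself, for instance from the runlevel command.

```python
def keepalive_limit(action, runlevel, shutdown_limit=60):
    """Return how many seconds wd_keepalive may keep patting, or None for no limit.

    action         -- "start" or "stop", as passed to the init script
    runlevel       -- target runlevel as a string, e.g. "0", "2", "6"
    shutdown_limit -- assumed time allowed for the machine to go down
    """
    if action == "stop" and runlevel in ("0", "6"):
        # Halt or reboot in progress: give up after a while so that a
        # stuck shutdown still ends in a hardware reset.
        return shutdown_limit
    # Admin stopped the watchdog in a normal runlevel: keep patting
    # indefinitely so there is no unexpected reboot.
    return None
```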
Also, it might be useful to have a variation of --loop-exit that, instead of quitting, calls sync() a couple of times and then forces a hardware reboot after a short delay (like the end game for a watchdog reboot, but minus the complexity of unmounting file systems).
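As a rough sketch of that sequence (Python, illustrative only): sync a couple of times, wait briefly, then force the reboot. Note that the default reboot step here is a stand-in, not what is proposed above. It uses the kernel's magic SysRq trigger rather than a short hardware watchdog timeout, and it assumes root privileges and that SysRq is enabled. All names are hypothetical.

```python
import os
import time


def sysrq_reboot():
    """Immediate reboot via magic SysRq (assumes root and SysRq enabled)."""
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("b")


def sync_then_reboot(syncs=2, delay=1.0, sync=os.sync,
                     sleep=time.sleep, reboot=sysrq_reboot):
    """Flush file system buffers, pause briefly, then force a reboot."""
    for _ in range(syncs):
        sync()  # ask the kernel to write dirty buffers to disk
    sleep(delay)  # short grace period before the reset
    reboot()  # no unmounting is attempted, as in the suggestion above
```

The sync, sleep and reboot steps are passed in as callables so the ordering can be exercised without actually rebooting anything.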
Seems the work-around is acceptable so closing this.