Watchdog / Bugs / #13 Start-up hold-off time

Michael Meskes - 2015-11-27

I am uncertain as to how to position the watchdog daemon in the initscript
start-up sequence.
...

The way I prefer is to start watchdog last for exactly the reasons you
describe.

If the watchdog daemon is started late in the initscript sequence, then
there is a chance that the system could lock up without a watchdog reboot,
if a process that starts before it gets "stuck" during initialisation. The
watchdog won't trigger because it hasn't started yet.

Right, that's why there is a second binary in the package: wd_keepalive. This
one should be started as early as possible. It only takes care of the hardware
watchdog, if there is one, but not the additional checks. Therefore there
should be no false positive. When starting watchdog later, you just have to
make sure to stop wd_keepalive.

Michael

Craig McQueen - 2015-11-29

Thanks for explaining that, Michael. One limitation of wd_keepalive I see is that it wouldn't handle the scenario where an initscript process "locks up", i.e. infinite loop. It would keep patting the hardware watchdog forever, as long as the kernel is still running fine.

It would be good to have a time-out parameter for wd_keepalive, so it can handle that scenario.

Or have an initial hold-off time for the watchdog daemon, so that it doesn't start looking for pidfiles until the expected initscript run time has passed.
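
To make the time-out idea concrete, here is a minimal Python sketch of a keepalive loop that gives up after a fixed period. It is not wd_keepalive's code: `keepalive_until`, `pat`, and every parameter are hypothetical names, and `pat` stands in for whatever refreshes the hardware timer. It also assumes the watchdog stays armed once the loop returns.

```python
import time


def keepalive_until(pat, timeout_s, interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call pat() every interval_s seconds until timeout_s has elapsed.

    Returns the number of pats issued. After this returns nothing refreshes
    the timer any more, so a still-armed hardware watchdog would eventually
    expire and reset the machine.
    """
    deadline = clock() + timeout_s
    pats = 0
    while clock() < deadline:
        pat()            # hypothetical: refresh the hardware watchdog
        pats += 1
        sleep(interval_s)
    return pats
```

With a limit somewhat longer than the expected initscript run time, a start-up process stuck in an infinite loop would only be covered by pats until the limit is reached.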

Craig McQueen - 2015-11-30

I see the latest wd_keepalive in the git repository (not yet released) has a --loop-exit option. That's the sort of thing I was asking for.

Is a release 5.15 planned sometime soon?

Last edit: Craig McQueen 2015-11-30

Paul Crawford - 2015-11-30

The --loop-exit option as it stands is not going to do what you want as it also shuts down the watchdog before exit (as desirable for testing).

You could make it do this by using the CONFIG_WATCHDOG_NOWAYOUT kernel option (or the watchdog module's nowayout parameter) so it cannot be stopped, but then you will not be able to stop the watchdog under normal operations. For example:

http://stackoverflow.com/questions/25247317/linux-watch-dog-change-the-noway-out-config-at-runtime

You have, however, spotted a problem with the current way of doing things, though it is not one that has a simple solution. Part of the reason for swapping the watchdog daemon with the wd_keepalive daemon is to support updating of either the binary or the configuration file while the system is running, etc, and at the same time not rebooting the machine if CONFIG_WATCHDOG_NOWAYOUT was used.

Using a variation of the --loop-exit option would protect against a hung system during start up or shut down, but it would present a bit of a configuration issue in terms of what "stopping" the watchdog under normal operation is actually going to do. I don't know enough about the init script options to know if there is any sane way of adding a "safestop" command so the usual start|stop has a time-out but an administrator can really stop things without an unexpected reboot 60 seconds later.
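
The difference between stopping the watchdog on exit and leaving it running can be illustrated with the Linux watchdog device's "magic close" convention: on drivers that support it, writing the character 'V' just before closing asks the driver to stop the timer, and the nowayout setting makes the driver ignore that request. The following is a minimal Python sketch that assumes such a driver; `stop_patting` and its parameters are hypothetical names, not part of the watchdog package.

```python
def stop_patting(dev, disarm):
    """Finish a keepalive session on an already-open watchdog device file.

    dev    -- binary file object for the watchdog device (assumed to be
              opened unbuffered, e.g. open("/dev/watchdog", "wb", buffering=0))
    disarm -- True: write the magic character 'V' before closing, asking the
              driver to stop the timer (has no effect when nowayout is set).
              False: close without it, so a magic-close driver keeps the
              timer running and the machine resets when it expires.
    """
    if disarm:
        dev.write(b"V")
    dev.close()
```

In these terms, a "safestop" for the administrator would correspond to `disarm=True`, while a timed-out start-up or shutdown would correspond to `disarm=False`.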

Regards, Paul

Craig McQueen - 2015-12-21

Currently, I'm using the wd_keepalive --loop-exit option together with the kernel's CONFIG_WATCHDOG_NOWAYOUT.

Paul Crawford - 2015-12-22

Thanks for letting us know. That will work; it is not very elegant, but it is necessary in some cases.

I have seen one machine go down to running wd_keepalive, but then something stopped its reboot sequence AFTER it had taken down ssh access, so it needed a physical visit to reboot. Your solution would have saved that case. More generally, it might need the watchdog start/stop script to be aware of the runlevel, so that it preserves the ability to stop watchdog checks indefinitely by admin action but times out when bringing the machine down.

It might also be useful to have a variation of --loop-exit that, instead of quitting, calls sync() a couple of times and then forces a hardware reboot after a short delay (like the end game of a watchdog reboot, but without the complexity of unmounting file systems).
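
The sync-then-reboot variation could look roughly like the Python sketch below. It is only an illustration: `sync_then_reboot`, `force_reboot`, and the defaults are hypothetical, and `force_reboot` stands in for whatever actually triggers the hardware reset (for example, no longer patting an armed watchdog). `os.sync()` is the standard-library wrapper for sync() on Unix.

```python
import os
import time


def sync_then_reboot(force_reboot, passes=2, pause_s=1.0,
                     sync=os.sync, sleep=time.sleep):
    """Flush file system buffers a few times, then hand over to force_reboot().

    No file systems are unmounted; the pauses only give pending writes a
    moment to reach the disk before the reset is triggered.
    """
    for _ in range(passes):
        sync()
        sleep(pause_s)
    force_reboot()   # hypothetical: trigger the short-delay hardware reset
```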

Paul Crawford - 2023-03-26

status: open --> closed

Paul Crawford - 2023-03-26

Seems the work-around is acceptable so closing this.
