
#13 Start-up hold-off time

v1.0 (example)
closed
nobody
None
5
2023-03-26
2015-11-27
No

I am uncertain as to how to position the watchdog daemon in the initscript start-up sequence.

If the watchdog daemon is started early in the initscript sequence, then there is a chance it could trigger erroneously because a process it is monitoring (via pidfile) hasn't started yet, even though it will start up within a few more seconds.

If the watchdog daemon is started late in the initscript sequence, then there is a chance that the system could lock up without a watchdog reboot, if a process that starts before it gets "stuck" during initialisation. The watchdog won't trigger because it hasn't started yet.

It could be good to start the watchdog daemon early, but have a start-up hold-off time to prevent the watchdog daemon from triggering a shutdown during the time that processes are still starting up.

Is that a good strategy, or is there a better way to handle this?
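
To make the hold-off idea concrete, here is a minimal Python sketch. It is only an illustration, not something the watchdog daemon implements: a missing monitored process counts as a failure only once a grace period since daemon start has elapsed. The names `pid_alive`, `HoldOffMonitor` and `holdoff_s` are hypothetical, and the clock is injectable so the logic can be exercised without waiting.

```python
import os
import time


def pid_alive(pidfile):
    """Return True if pidfile exists and names a running process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unreadable pidfile, or bad contents
    if pid <= 0:
        return False  # not a valid single-process PID
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


class HoldOffMonitor:
    """Hypothetical pidfile monitor with a start-up hold-off time."""

    def __init__(self, pidfiles, holdoff_s, clock=time.monotonic):
        self.pidfiles = list(pidfiles)
        self.holdoff_s = holdoff_s
        self.clock = clock
        self.started = clock()

    def should_trigger(self):
        # During the hold-off window, never report a failure.
        if self.clock() - self.started < self.holdoff_s:
            return False
        # Afterwards, any dead or missing process is a failure.
        return any(not pid_alive(p) for p in self.pidfiles)
```

With a hold-off of, say, 30 seconds, `should_trigger()` stays False for the first 30 seconds regardless of pidfile state. A process stuck in initialisation is still caught once the window closes, provided the daemon is also patting the hardware watchdog during that time.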

Discussion

  • Michael Meskes

    Michael Meskes - 2015-11-27

    I am uncertain as to how to position the watchdog daemon in the initscript
    start-up sequence.
    ...

    The way I prefer is to start watchdog last for exactly the reasons you
    describe.

    If the watchdog daemon is started late in the initscript sequence, then
    there is a chance that the system could lock up without a watchdog reboot,
    if a process that starts before it gets "stuck" during initialisation. The
    watchdog won't trigger because it hasn't started yet.

    Right, that's why there is a second binary in the package: wd_keepalive. This
    one should be started as early as possible. It only takes care of the hardware
    watchdog, if there is one, but not the additional checks. Therefore there
    should be no false positive. When starting watchdog later, you just have to
    make sure to stop wd_keepalive.

    Michael

    Michael Meskes
    Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
    Meskes at (Debian|Postgresql) dot Org
    Jabber: michael.meskes at gmail dot com
    VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL

     
  • Craig McQueen

    Craig McQueen - 2015-11-29

    Thanks for explaining that, Michael. One limitation of wd_keepalive I see is that it wouldn't handle the scenario where an initscript process "locks up", i.e. infinite loop. It would keep patting the hardware watchdog forever, as long as the kernel is still running fine.

    It would be good to have a time-out parameter for wd_keepalive, so it can handle that scenario.

    Or, have an initial hold-off time for the watchdog daemon, so it won't start looking for pidfiles until after an initial hold-off time to allow for expected initscript run time.
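
A time-limited keepalive could look roughly like the Python sketch below. It is not how wd_keepalive is implemented; `bounded_keepalive`, `max_pats` and `interval_s` are hypothetical names. The device is passed in as a file-like object so the loop can be tried without real hardware; on a Linux system it would typically be /dev/watchdog opened for unbuffered binary writing. Once the limit is reached the function simply stops patting. Whether the hardware timer keeps running after the process exits depends on the driver and on kernel settings such as CONFIG_WATCHDOG_NOWAYOUT.

```python
import io
import time


def bounded_keepalive(dev, max_pats, interval_s, sleep=time.sleep):
    """Pat the watchdog max_pats times, then stop patting and return."""
    pats = 0
    while pats < max_pats:
        # A write to the device counts as a keepalive. The byte must not
        # be b"V", which some drivers treat as a request to disarm on close.
        dev.write(b"\0")
        dev.flush()
        pats += 1
        if pats < max_pats:
            sleep(interval_s)
    return pats


if __name__ == "__main__":
    # Dry run against an in-memory stand-in for the watchdog device.
    fake = io.BytesIO()
    bounded_keepalive(fake, max_pats=3, interval_s=0.01)
```

The total cover time is roughly `max_pats * interval_s` plus the hardware time-out, so the limit would need to exceed the expected initscript run time with some margin.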

     
  • Craig McQueen

    Craig McQueen - 2015-11-30

    I see the latest wd_keepalive in the git repository (not yet released) has a --loop-exit option. That's the sort of thing I was asking for.

    Is a release 5.15 planned sometime soon?

     

    Last edit: Craig McQueen 2015-11-30
    • Paul Crawford

      Paul Crawford - 2015-11-30

      Hi Craig,
      The --loop-exit option as it stands is not going to do what you want as
      it also shuts down the watchdog before exit (as desirable for testing).

      You could make it do this by loading the watchdog module with the
      CONFIG_WATCHDOG_NOWAYOUT option so it cannot be stopped, but then you
      will not be able to stop the watchdog under normal operations. For example:

      http://stackoverflow.com/questions/25247317/linux-watch-dog-change-the-noway-out-config-at-runtime

      However, you have spotted a problem with the current way of doing
      things, but it is not one that has a simple solution. Part of the reason
      for swapping the watchdog daemon with the wd_keepalive daemon is to
      support updating of either the binary or the configuration file while
      the system is running, etc, and at the same time not rebooting the
      machine if CONFIG_WATCHDOG_NOWAYOUT was used.

      Using a variation of the --loop-exit option would protect against a hung
      system during start up or shut down, but it would present a bit of a
      configuration issue in terms of what "stopping" the watchdog under
      normal operation is actually going to do. I don't know enough about the
      init script options to know if there is any sane way of adding a
      "safestop" command so the usual start|stop has a time-out but an
      administrator can really stop things without an unexpected reboot 60
      seconds later.

      Regards, Paul

       
  • Craig McQueen

    Craig McQueen - 2015-12-21

    Currently, I'm using wd_keepalive --loop-exit option, together with kernel CONFIG_WATCHDOG_NOWAYOUT.

     
  • Paul Crawford

    Paul Crawford - 2015-12-22

    Thanks for letting us know. That will work; it is not very elegant, but it is necessary in some cases.

    I have seen one machine drop down to running wd_keepalive, but then something stalled its reboot sequence AFTER ssh access had been taken down, so it needed a physical visit to reboot. Your solution would have saved that case. More generally, the watchdog start/stop script might need to be aware of the runlevel, so that it preserves the ability to stop watchdog checks indefinitely by admin action but times out when bringing the machine down.

    Also, it might be useful to have a variation of --loop-exit that, instead of quitting, calls sync() a couple of times and then forces a hardware reboot after a short delay (like the end game of a watchdog reboot, but without the complexity of unmounting file systems).

     
  • Paul Crawford

    Paul Crawford - 2023-03-26
    • status: open --> closed
     
  • Paul Crawford

    Paul Crawford - 2023-03-26

    Seems the work-around is acceptable so closing this.

     
