From time to time /proc deadlocks
For quite some time (maybe even on the old 2.4-based kernel) we've occasionally seen a problem where the /proc filesystem seems to be deadlocked: any attempt to read /proc hangs.
When this happens, rebooting one of the nodes (not any node, it has to be the "right" one) will free up the system and things will continue as normal.
Today I just noticed that when I rebooted the node that was "causing" the problem I had the following messages on the init node:
Node 6 has gone down!!!
Assertion failed! origin_lock != ((void *)0), cluster/ssi/vproc/dvp_pvpsops.c, pvpsop_get_execnode, line=376
nm_add_node: Node 6 added
Is this a clue?
Related to this origin_lock assertion is a possible race in the vproc_origin_list traversal. It was supposedly fixed by the code under #ifdef VOD_HLIST, present since SSI-1.9.x, but that fix introduced a possible deadlock, which should be fixed in 1.9.6.
Lock ordering pre-1.9.6 (with #ifdef VOD_HLIST):
-> vproc_origin_cleanup (down_read origin list)
-> vproc_origin_fgpgrp_cleanup
-> pvpop_getctty
-> rpvpop_start_op
-> pvpsop_get_execnode
-> vproc_lock_origin_node
-> vproc_origin_find (down_read origin list)
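One plausible reading of why this ordering can hang, sketched as a toy model rather than the actual SSI or kernel code: Linux reader/writer semaphores queue fairly, so a reader that arrives after a writer has queued must wait behind it. If a writer (assumed here; the trace above does not show one) asks for the origin list between the outer and the inner down_read, the inner down_read waits for the writer, and the writer waits for the outer reader, which never gets to release. All toy_* names and nested_read_deadlocks are hypothetical, introduced only for illustration.

```c
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore. NOT the kernel's rw_semaphore. */
struct toy_rwsem {
    int  active_readers;  /* readers currently holding the lock */
    bool writer_queued;   /* a writer is waiting for the readers to drain */
};

/* Reader attempt: refused once a writer is queued (fair queuing). */
static bool toy_down_read_trylock(struct toy_rwsem *s)
{
    if (s->writer_queued)
        return false;     /* a real down_read() would sleep here */
    s->active_readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
    s->active_readers--;
}

/* Writer attempt: with readers active it can only queue and wait. */
static bool toy_down_write_trylock(struct toy_rwsem *s)
{
    if (s->active_readers > 0) {
        s->writer_queued = true;
        return false;
    }
    return true;
}

/*
 * Replays the nested ordering from the call chain above.
 * Returns true if the inner read lock would block, i.e. the task would hang.
 */
static bool nested_read_deadlocks(bool writer_arrives_between)
{
    struct toy_rwsem origin_list = { 0, false };
    bool inner_ok;

    toy_down_read_trylock(&origin_list);            /* outer: vproc_origin_cleanup */
    if (writer_arrives_between)
        toy_down_write_trylock(&origin_list);       /* assumed writer queues here */
    inner_ok = toy_down_read_trylock(&origin_list); /* inner: vproc_origin_find */

    /* Cleanup for the model only; a hung task never reaches this point. */
    if (inner_ok)
        toy_up_read(&origin_list);
    toy_up_read(&origin_list);
    return !inner_ok;
}
```

In this model nested_read_deadlocks(false) reports no hang and nested_read_deadlocks(true) reports one, which would match a bug that only shows up "from time to time".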
A fix is going into CVS.
With current CVS (20/11/2008) I still see this bug. On coming in to work I found my (non-init) node stuck, apparently in the screensaver. When I tried to see what was going on from the init node, every stat on "/proc/1" hung. stat on other things in /proc was working, for example stat /proc/self or stat /proc/$$.
When I turned off the node that was stuck, the hung "stat" operations on the init node sprang back to life and I saw messages like this in the log:
Node 6 has gone down!!!
Assertion failed! origin_lock != ((void *)0), cluster/ssi/vproc/dvp_pvpsops.c, pvpsop_get_execnode, line=379
Assertion failed! origin_lock != ((void *)0), cluster/ssi/vproc/dvp_pvpsops.c, pvpsop_get_execnode, line=379
Assertion failed! origin_lock != ((void *)0), cluster/ssi/vproc/dvp_pvpsops.c, pvpsop_get_execnode, line=379
nm_add_node: Node 6 added
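For anyone wanting to check for this state without hanging their own shell, here is a minimal sketch of a probe that runs stat in a child process and gives up after a timeout. The function name stat_hangs and the 10 ms polling interval are my own choices, not anything from SSI.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Returns 0 if stat(path) completed within timeout_ms (whether or not the
 * path exists), 1 if it was still blocked at the deadline, -1 if fork failed.
 */
static int stat_hangs(const char *path, int timeout_ms)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct stat st;
        stat(path, &st);  /* result ignored; we only care that it returns */
        _exit(0);
    }

    /* Poll for the child every 10 ms instead of blocking in waitpid(). */
    for (int waited = 0; waited < timeout_ms; waited += 10) {
        struct timespec ts = { 0, 10 * 1000 * 1000 };
        int status;

        if (waitpid(pid, &status, WNOHANG) == pid)
            return 0;
        nanosleep(&ts, NULL);
    }

    /*
     * Still blocked. SIGKILL may not take effect while the child sleeps
     * uninterruptibly in the kernel, so we do not wait for it to exit.
     */
    kill(pid, SIGKILL);
    return 1;
}
```

Calling stat_hangs("/proc/1", 2000) on the init node would be expected to return 1 while the hang is in effect, and stat_hangs("/proc/self", 2000) to return 0.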
Sorry, wasn't clear above - this is not straight CVS, it's my port of current CVS to 2.6.12. However I'm pretty sure this part of the port is good.
Well, I finally found a (crazy) way to duplicate this: launch a Windows app with Wine and hit Ctrl-C before it gets going. Eventually it will provoke the hang.
A "bta A" trace of the running processes is attached.
output of bta A when /proc is hung
Please try the latest CVS (March 24th).
Bug fixed?