System lockup with recent pull from Bruce's git tree

Trond Myklebust trond.myklebust at fys.uio.no
Sun Oct 29 21:49:16 EST 2006


On Sun, 2006-10-29 at 23:01 +0100, Christophe Saout wrote:
> On Sunday, 29.10.2006 at 22:43 +0100, Christophe Saout wrote:
> > On Friday, 08.09.2006 at 12:04 -0400, J. Bruce Fields wrote:
> > > On Fri, Sep 08, 2006 at 08:58:04AM -0700, Frank Filz wrote:
> > > > After an overnight run, this has not re-created when the server is a
> > > > separate machine.
> > > > 
> > > > One thought though - if the problem is the response coming back before
> > > > the client can get into the wait code, the problem might only be visible
> > > > when the server is the same machine as the client, with an instantaneous
> > > > transmission of the request and response.
> > > 
> > > Could be.  Maybe it could be reproduced with UML or Xen or something.
> > 
> > It looks like I'm hitting this or a similar bug on a large Xen machine.
> > On a heavily loaded Apache machine that imports the web directories via
> > NFSv4 (2.6.18.1 client and server without additional patches), after
> > some days the load would explode due to Apache processes waiting on the
> > fs. In certain directories an ls would hang, with no entries in the
> > kernel log. Trying to unmount the fs gives a "busy inodes after umount,
> > self-destruct in 5 seconds".
> > 
> > No success reproducing this with fsstress so far...
> 
> Update: I think I hit it.
> 
> tuek:/home/www/traffic # ls -l
> 
> Hangs indefinitely. Strangely, this is not the directory I ran the
> fsstress in... but I also have an un"rm"able file there, where rm shows
> the same symptoms of just hanging.
> 
> The ugly thing is that the rm hangs in an rpc call whereas the ls hangs
> somewhere else.
> 
> Call trace from sysrq-t for the hanging "rm":

Are you certain of this? I don't see any reason for 'rm' to be calling
open().

> Oct 29 22:46:47 tuek Call Trace:
> Oct 29 22:46:47 tuek [<ffffffff8026b251>] _spin_lock_irqsave+0x11/0x30
> Oct 29 22:46:47 tuek [<ffffffff8026b327>] _spin_unlock_irqrestore+0x27/0x30
> Oct 29 22:46:47 tuek [<ffffffff8025709a>] finish_wait+0x6a/0x80
> Oct 29 22:46:47 tuek [<ffffffff804e0ed2>] rpc_wait_bit_interruptible+0x22/0x30
> Oct 29 22:46:47 tuek [<ffffffff80269d85>] __wait_on_bit+0x45/0x80
> Oct 29 22:46:47 tuek [<ffffffff804e0eb0>] rpc_wait_bit_interruptible+0x0/0x30
> Oct 29 22:46:47 tuek [<ffffffff804e0eb0>] rpc_wait_bit_interruptible+0x0/0x30
> Oct 29 22:46:47 tuek [<ffffffff80269e38>] out_of_line_wait_on_bit+0x78/0x90
> Oct 29 22:46:47 tuek [<ffffffff8029d1d0>] wake_bit_function+0x0/0x40
> Oct 29 22:46:47 tuek [<ffffffff804e0f1a>] __rpc_wait_for_completion_task+0x3a/0x40   
> Oct 29 22:46:47 tuek [<ffffffff8033ee31>] nfs4_wait_for_completion_rpc_task+0x31/0x60
> Oct 29 22:46:47 tuek [<ffffffff804e159e>] rpc_run_task+0x4e/0x60
> Oct 29 22:46:47 tuek [<ffffffff80341c16>] _nfs4_proc_open+0x76/0x1d0
> Oct 29 22:46:47 tuek [<ffffffff80341fe7>] nfs4_do_open+0xc7/0x1c0
> Oct 29 22:46:47 tuek [<ffffffff80343ae4>] nfs4_atomic_open+0xd4/0x170  
> Oct 29 22:46:47 tuek [<ffffffff8020ac99>] kmem_cache_alloc+0x79/0x90   
> Oct 29 22:46:47 tuek [<ffffffff8032dd6e>] nfs_atomic_lookup+0x12e/0x1c0
> Oct 29 22:46:47 tuek [<ffffffff8020ce41>] do_lookup+0xe1/0x1b0
> Oct 29 22:46:47 tuek [<ffffffff8020a2b2>] __link_path_walk+0x992/0xec0
> Oct 29 22:46:47 tuek [<ffffffff8020e7ab>] link_path_walk+0x7b/0x120   
> Oct 29 22:46:47 tuek [<ffffffff80241437>] do_unlinkat+0x127/0x1b0     
> Oct 29 22:46:47 tuek [<ffffffff8021679a>] get_unused_fd+0x7a/0x100  
> Oct 29 22:46:47 tuek [<ffffffff8020cbf4>] do_path_lookup+0x284/0x2c0
> Oct 29 22:46:47 tuek [<ffffffff80224f67>] __path_lookup_intent_open+0x67/0xc0
> Oct 29 22:46:47 tuek [<ffffffff802c4aac>] path_lookup_open+0xc/0x10
> Oct 29 22:46:47 tuek [<ffffffff8021bf4b>] open_namei+0x7b/0x730
> Oct 29 22:46:47 tuek [<ffffffff8020d762>] permission+0x82/0xb0  
> Oct 29 22:46:47 tuek [<ffffffff8023d4f2>] may_delete+0x52/0x140  
> Oct 29 22:46:47 tuek [<ffffffff80229588>] do_filp_open+0x28/0x50  
> Oct 29 22:46:47 tuek [<ffffffff80241437>] do_unlinkat+0x127/0x1b0 
> Oct 29 22:46:47 tuek [<ffffffff8021679a>] get_unused_fd+0x7a/0x100
> Oct 29 22:46:47 tuek [<ffffffff8021aa2a>] do_sys_open+0x5a/0xf0   
> Oct 29 22:46:47 tuek [<ffffffff8024033c>] sys_openat+0xc/0x10     
> Oct 29 22:46:47 tuek [<ffffffff80266a12>] system_call+0x86/0x8b
> Oct 29 22:46:47 tuek [<ffffffff8026698c>] system_call+0x0/0x8b
> 
> Call trace from sysrq-t for the hanging "ls":

I don't really understand this one either. Are you running with
CONFIG_FRAME_POINTER and CONFIG_STACK_UNWIND enabled?
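
If it helps, here is a minimal sketch for checking those two options from
the kernel config file. The helper name unwind_options is made up for
illustration, and the path you pass in is an assumption about your setup
(often /boot/config-<version>, or /proc/config.gz if the kernel exposes
its config there).

```python
import gzip
from pathlib import Path

# Hypothetical helper: report whether the two unwinder-related options
# are set to "y" in a kernel config file (plain text or gzip-compressed).
def unwind_options(config_path):
    path = Path(config_path)
    # /proc/config.gz is gzip-compressed; /boot/config-* is plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    wanted = ("CONFIG_FRAME_POINTER", "CONFIG_STACK_UNWIND")
    with opener(path, "rt") as fh:
        lines = {line.strip() for line in fh}
    # An option counts as enabled only if "OPTION=y" appears as a full line.
    return {opt: f"{opt}=y" in lines for opt in wanted}
```

Calling unwind_options("/proc/config.gz") returns a dict with one boolean
per option; a line such as "# CONFIG_STACK_UNWIND is not set" counts as
not enabled.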

> Oct 29 22:56:13 tuek Call Trace:
> Oct 29 22:56:13 tuek [<ffffffff802179a8>] __page_set_anon_rmap+0x38/0x40
> Oct 29 22:56:13 tuek [<ffffffff8026b251>] _spin_lock_irqsave+0x11/0x30
> Oct 29 22:56:13 tuek [<ffffffff8026b327>] _spin_unlock_irqrestore+0x27/0x30
> Oct 29 22:56:13 tuek [<ffffffff802238bb>] __up_read+0x9b/0xb0
> Oct 29 22:56:13 tuek [<ffffffff8029f869>] up_read+0x9/0x10
> Oct 29 22:56:13 tuek [<ffffffff8026a093>] __mutex_lock_slowpath+0x73/0xb5
> Oct 29 22:56:13 tuek [<ffffffff80227820>] filldir+0x0/0xe0
> Oct 29 22:56:13 tuek [<ffffffff8026a0e4>] .text.lock.mutex+0xf/0x1b
> Oct 29 22:56:13 tuek [<ffffffff80239286>] vfs_readdir+0x56/0xb0
> Oct 29 22:56:13 tuek [<ffffffff8023ceb1>] sys_getdents+0x81/0xd0
> Oct 29 22:56:13 tuek [<ffffffff80267133>] error_exit+0x0/0x6e
> Oct 29 22:56:13 tuek [<ffffffff80266a12>] system_call+0x86/0x8b
> Oct 29 22:56:13 tuek [<ffffffff8026698c>] system_call+0x0/0x8b
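
As a rough aid for reading traces like the two above, the sketch below
lists entries that show up more than once at the same address. My
assumption, which the traces alone do not prove, is that such repeats may
be leftover words on the stack rather than live frames, as can happen
when the kernel is built without frame pointers. The helper name
repeated_frames is made up for illustration.

```python
import re
from collections import Counter

# Matches "[<address>] symbol+offset/size" as printed in the sysrq-t output.
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\S+)")

# Hypothetical helper: return the (address, symbol) pairs that occur more
# than once in the pasted trace text, with their counts.
def repeated_frames(trace_text):
    counts = Counter(FRAME_RE.findall(trace_text))
    return {frame: n for frame, n in counts.items() if n > 1}
```

Paste the syslog lines into a string and pass it to repeated_frames; the
quoting prefix and the timestamp on each line are ignored.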

Cheers,
  Trond