[pnfs] recursive lock problem with nfs41_validate_state

William A. (Andy) Adamson andros at citi.umich.edu
Tue Mar 18 15:16:16 EDT 2008


On Tue, Mar 18, 2008 at 1:56 PM, Benny Halevy <bhalevy at panasas.com> wrote:

> while testing with CONFIG_PROVE_LOCKING the following
> report for recursive locking is consistently reproducible:
>
> Mar 17 19:28:13 bh-testlin1 kernel: [ INFO: possible recursive locking
> detected ]
> Mar 17 19:28:13 bh-testlin1 kernel: 2.6.24-pnfs #80
> Mar 17 19:28:13 bh-testlin1 kernel:
> ---------------------------------------------
> Mar 17 19:28:13 bh-testlin1 kernel: test1/2756 is trying to acquire lock:
> Mar 17 19:28:13 bh-testlin1 kernel:  (&clp->cl_sem){----}, at:
> [<ffffffff881f72f3>] nfs4_recover_expired_lease+0x18/0x3d [nfs]
> Mar 17 19:28:13 bh-testlin1 kernel:
> Mar 17 19:28:13 bh-testlin1 kernel: but task is already holding lock:
> Mar 17 19:28:13 bh-testlin1 kernel:  (&clp->cl_sem){----}, at:
> [<ffffffff881fa005>] nfs4_do_open+0x121/0x28c [nfs]
> Mar 17 19:28:13 bh-testlin1 kernel:
> Mar 17 19:28:13 bh-testlin1 kernel: other info that might help us debug
> this:
> Mar 17 19:28:13 bh-testlin1 kernel: 2 locks held by test1/2756:
> Mar 17 19:28:13 bh-testlin1 kernel:  #0:
>  (&type->i_mutex_dir_key#6){--..}, at: [<ffffffff810a8d66>]
> open_namei+0xf8/0x64d
> Mar 17 19:28:13 bh-testlin1 kernel:  #1:  (&clp->cl_sem){----}, at:
> [<ffffffff881fa005>] nfs4_do_open+0x121/0x28c [nfs]
> Mar 17 19:28:13 bh-testlin1 kernel:
> Mar 17 19:28:13 bh-testlin1 kernel: stack backtrace:
> Mar 17 19:28:13 bh-testlin1 kernel: Pid: 2756, comm: test1 Not tainted
> 2.6.24-pnfs #80
> Mar 17 19:28:13 bh-testlin1 kernel:
> Mar 17 19:28:13 bh-testlin1 kernel: Call Trace:
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff81056760>]
> __lock_acquire+0x1c5/0xc7a
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff8105761d>]
> lock_acquire+0x51/0x6c
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881f72f3>]
> :nfs:nfs4_recover_expired_lease+0x18/0x3d
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881f7280>]
> :nfs:nfs4_wait_clnt_recover+0x43/0x9e
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881f72f3>]
> :nfs:nfs4_recover_expired_lease+0x18/0x3d
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881f7342>]
> :nfs:nfs41_validate_state+0x2a/0x50
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881f9d45>]
> :nfs:_nfs4_proc_open+0x5f/0x1fe
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881fa047>]
> :nfs:nfs4_do_open+0x163/0x28c
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881fec41>]
> :nfs:nfs4_atomic_open+0xdb/0x1a2
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff81056128>]
> trace_hardirqs_on+0x116/0x13a
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff881e6b1f>]
> :nfs:nfs_atomic_lookup+0xb2/0x106
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff810a5c12>]
> __lookup_hash+0xe9/0x10d
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff810a8d6e>]
> open_namei+0x100/0x64d
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff810979fb>]
> check_object+0x156/0x205
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff8109d26d>]
> do_filp_open+0x1c/0x38
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff812498de>]
> _spin_unlock+0x17/0x20
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff8109cff2>]
> get_unused_fd_flags+0x111/0x11f
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff8109d2cf>]
> do_sys_open+0x46/0xc3
> Mar 17 19:28:13 bh-testlin1 kernel:  [<ffffffff8100c08a>]
> tracesys+0xdc/0xe1
>
> what triggers that is the call to nfs41_validate_state from
> _nfs4_proc_open after the cl_sem semaphore is acquired.


There is already a call to nfs4_recover_expired_lease in _nfs4_do_open just
prior to acquiring cl_sem, so the NFS4_VALIDATE_STATE call in
_nfs4_proc_open() is redundant.
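
To make the ordering concrete, here is a small userspace sketch of the two call
sequences. It is only a toy model, not the kernel code: every toy_* name is
made up for illustration, and the "semaphore" just counts how deep the current
task holds it, flagging a second acquisition the way the lockdep report above
does for cl_sem.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for clp->cl_sem: tracks only this task's hold depth. */
struct toy_sem {
    int depth;
    bool recursion_seen;    /* set where lockdep would complain */
};

static void toy_down_read(struct toy_sem *s)
{
    if (s->depth > 0)
        s->recursion_seen = true;
    s->depth++;
}

static void toy_up_read(struct toy_sem *s)
{
    s->depth--;
}

/*
 * Models the acquisition that the report attributes to
 * nfs4_recover_expired_lease (hypothetical simplification).
 */
static void toy_recover_expired_lease(struct toy_sem *s)
{
    toy_down_read(s);
    toy_up_read(s);
}

/* Ordering from the report: recovery runs again while the lock is held. */
bool toy_open_with_inner_validate(void)
{
    struct toy_sem sem = { 0, false };

    toy_recover_expired_lease(&sem);   /* before taking the lock: fine */
    toy_down_read(&sem);               /* open path now holds the lock */
    toy_recover_expired_lease(&sem);   /* validate-state call under it */
    toy_up_read(&sem);
    return sem.recursion_seen;
}

/* Same path with the inner validate-state call dropped. */
bool toy_open_without_inner_validate(void)
{
    struct toy_sem sem = { 0, false };

    toy_recover_expired_lease(&sem);   /* the only recovery call */
    toy_down_read(&sem);
    toy_up_read(&sem);
    return sem.recursion_seen;
}
```

In this model only the first function trips the recursion flag, which matches
the idea that the call made before the lock is taken is harmless and the one
made under it is the problem.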


>
> When I started thinking about how to fix this, I began to question what
> purpose nfs41_validate_state currently serves beyond our implicit recovery
> logic in the validate_args RPC execution step.
>
> Currently, the nfs41_validate_state function only calls
> nfs4_recover_expired_lease in a loop (while nfs4_recover_expired_lease
> returns a non-zero status).
>
> One more note: simply removing all calls to NFS4_VALIDATE_STATE did not
> work. The mount hangs after the server returns a stale clientid error to
> CREATE_SESSION; it seems that exchange_id never happens...


Perhaps we should just use the nfs4_handle_exception() framework, which
avoids this deadlock.
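
Roughly what I have in mind, as a userspace sketch rather than a patch. The
assumption here is that the operation drops the lock before returning an
error, and recovery then runs from the retry loop with no lock held. All toy_*
names and the TOY_ERR_STALE code are invented for this example; only the
overall do/while retry shape is meant to resemble how callers use
nfs4_handle_exception().

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_OK        0
#define TOY_ERR_STALE 1     /* hypothetical recoverable error */

struct toy_client {
    int lock_depth;
    bool lease_valid;
    int recoveries;
    bool recursed;          /* set if the lock is taken while held */
};

struct toy_exception {
    bool retry;
};

static void toy_lock(struct toy_client *c)
{
    if (c->lock_depth > 0)
        c->recursed = true;
    c->lock_depth++;
}

static void toy_unlock(struct toy_client *c)
{
    c->lock_depth--;
}

/* Recovery takes the lock itself, so it must run with the lock dropped. */
static void toy_recover(struct toy_client *c)
{
    toy_lock(c);
    c->lease_valid = true;
    c->recoveries++;
    toy_unlock(c);
}

/* One attempt: the lock is released before any error is returned. */
static int toy_open_once(struct toy_client *c)
{
    int err;

    toy_lock(c);
    err = c->lease_valid ? TOY_OK : TOY_ERR_STALE;
    toy_unlock(c);
    return err;
}

/* Recover on a stale error and ask the caller to retry. */
static int toy_handle_exception(struct toy_client *c, int err,
                                struct toy_exception *exc)
{
    exc->retry = false;
    if (err == TOY_ERR_STALE) {
        toy_recover(c);
        exc->retry = true;
        return TOY_OK;
    }
    return err;
}

int toy_open(struct toy_client *c)
{
    struct toy_exception exc = { false };
    int err;

    do {
        err = toy_handle_exception(c, toy_open_once(c), &exc);
    } while (exc.retry);
    return err;
}
```

Starting from an invalid lease, toy_open() fails once, recovers once with the
lock released, and then succeeds on the retry without ever taking the lock
recursively.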

-->Andy

>
>
> Benny