[pnfs] recursive lock problem with nfs41_validate_state
Benny Halevy
bhalevy at panasas.com
Tue Mar 18 13:56:40 EDT 2008
While testing with CONFIG_PROVE_LOCKING enabled, the following
recursive locking report is consistently reproducible:
Mar 17 19:28:13 bh-testlin1 kernel: [ INFO: possible recursive locking detected ]
Mar 17 19:28:13 bh-testlin1 kernel: 2.6.24-pnfs #80
Mar 17 19:28:13 bh-testlin1 kernel: ---------------------------------------------
Mar 17 19:28:13 bh-testlin1 kernel: test1/2756 is trying to acquire lock:
Mar 17 19:28:13 bh-testlin1 kernel: (&clp->cl_sem){----}, at: [<ffffffff881f72f3>] nfs4_recover_expired_lease+0x18/0x3d [nfs]
Mar 17 19:28:13 bh-testlin1 kernel:
Mar 17 19:28:13 bh-testlin1 kernel: but task is already holding lock:
Mar 17 19:28:13 bh-testlin1 kernel: (&clp->cl_sem){----}, at: [<ffffffff881fa005>] nfs4_do_open+0x121/0x28c [nfs]
Mar 17 19:28:13 bh-testlin1 kernel:
Mar 17 19:28:13 bh-testlin1 kernel: other info that might help us debug this:
Mar 17 19:28:13 bh-testlin1 kernel: 2 locks held by test1/2756:
Mar 17 19:28:13 bh-testlin1 kernel: #0: (&type->i_mutex_dir_key#6){--..}, at: [<ffffffff810a8d66>] open_namei+0xf8/0x64d
Mar 17 19:28:13 bh-testlin1 kernel: #1: (&clp->cl_sem){----}, at: [<ffffffff881fa005>] nfs4_do_open+0x121/0x28c [nfs]
Mar 17 19:28:13 bh-testlin1 kernel:
Mar 17 19:28:13 bh-testlin1 kernel: stack backtrace:
Mar 17 19:28:13 bh-testlin1 kernel: Pid: 2756, comm: test1 Not tainted 2.6.24-pnfs #80
Mar 17 19:28:13 bh-testlin1 kernel:
Mar 17 19:28:13 bh-testlin1 kernel: Call Trace:
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff81056760>] __lock_acquire+0x1c5/0xc7a
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff8105761d>] lock_acquire+0x51/0x6c
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881f72f3>] :nfs:nfs4_recover_expired_lease+0x18/0x3d
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881f7280>] :nfs:nfs4_wait_clnt_recover+0x43/0x9e
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881f72f3>] :nfs:nfs4_recover_expired_lease+0x18/0x3d
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881f7342>] :nfs:nfs41_validate_state+0x2a/0x50
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881f9d45>] :nfs:_nfs4_proc_open+0x5f/0x1fe
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881fa047>] :nfs:nfs4_do_open+0x163/0x28c
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881fec41>] :nfs:nfs4_atomic_open+0xdb/0x1a2
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff81056128>] trace_hardirqs_on+0x116/0x13a
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff881e6b1f>] :nfs:nfs_atomic_lookup+0xb2/0x106
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff810a5c12>] __lookup_hash+0xe9/0x10d
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff810a8d6e>] open_namei+0x100/0x64d
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff810979fb>] check_object+0x156/0x205
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff8109d26d>] do_filp_open+0x1c/0x38
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff812498de>] _spin_unlock+0x17/0x20
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff8109cff2>] get_unused_fd_flags+0x111/0x11f
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff8109d2cf>] do_sys_open+0x46/0xc3
Mar 17 19:28:13 bh-testlin1 kernel: [<ffffffff8100c08a>] tracesys+0xdc/0xe1
What triggers this is the call to nfs41_validate_state from
_nfs4_proc_open after the cl_sem semaphore has already been acquired
(in nfs4_do_open, according to the report above).
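To make the report easier to read, here is a minimal toy model of the check that fires. It is only a sketch, not lockdep's actual implementation: the Task class, RecursiveLockError, and the two helper functions are hypothetical names made up for illustration, and only the lock class and function names in the strings come from the report above.

```python
# Toy model only: this is NOT how the kernel's lock validator is implemented.
# It just shows why taking a lock class that the task already holds is flagged.

class RecursiveLockError(Exception):
    """Raised when a task acquires a lock class it already holds."""


class Task:
    def __init__(self, name):
        self.name = name
        self.held = []  # (lock_class, where) pairs, in acquisition order

    def acquire(self, lock_class, where):
        # Flag the acquisition if the same lock class is already held.
        for held_class, held_where in self.held:
            if held_class == lock_class:
                raise RecursiveLockError(
                    f"{self.name} is trying to acquire {lock_class} at {where} "
                    f"but is already holding it from {held_where}"
                )
        self.held.append((lock_class, where))

    def release(self, lock_class):
        # Drop the most recent acquisition of this lock class.
        for i in range(len(self.held) - 1, -1, -1):
            if self.held[i][0] == lock_class:
                del self.held[i]
                return
        raise ValueError(f"{lock_class} is not held")


def recover_expired_lease(task):
    # Hypothetical stand-in for the inner acquisition seen in the report.
    task.acquire("&clp->cl_sem", "nfs4_recover_expired_lease")
    task.release("&clp->cl_sem")


def do_open(task):
    # Hypothetical stand-in for the outer acquisition seen in the report.
    task.acquire("&clp->cl_sem", "nfs4_do_open")
    try:
        # Reached via _nfs4_proc_open -> nfs41_validate_state in the trace.
        recover_expired_lease(task)
    finally:
        task.release("&clp->cl_sem")
```

Calling do_open(Task("test1/2756")) raises RecursiveLockError in this model, because the inner acquisition happens while the outer one is still held.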
When I started thinking about how to fix this, I began to question
what purpose it currently serves beyond our implicit recovery logic in
the validate_args RPC execution step.
Currently, the nfs41_validate_state function only calls
nfs4_recover_expired_lease in a loop (while nfs4_recover_expired_lease
returns a non-zero status).
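The loop described here can be sketched roughly as follows. This is an assumption-laden model, not the function's actual source, which is not shown in this message: validate_state and make_recover are hypothetical names, and the error value -5 is an arbitrary placeholder for "non-zero status".

```python
# Rough model of "call the recovery function in a loop while it returns
# a non-zero status". All names and values here are illustrative assumptions.

def validate_state(recover):
    """Call recover() until it returns 0; return how many calls were made."""
    attempts = 0
    while True:
        attempts += 1
        if recover() == 0:
            return attempts


def make_recover(failures, status=-5):
    """Build a stub that returns `status` for the first `failures` calls, then 0."""
    remaining = [failures]

    def recover():
        if remaining[0] > 0:
            remaining[0] -= 1
            return status
        return 0

    return recover
```

For example, validate_state(make_recover(2)) makes three calls: two that fail and one that succeeds. In this model the loop has no bound, so a recovery function that never returns zero would spin forever.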
One more note: simply removing all calls to NFS4_VALIDATE_STATE
did not work. Mount hangs after the server returns a stale clientid
error to CREATE_SESSION. It seems like exchange_id doesn't happen...
Benny