nfs client hanging

Guillaume Rousse Guillaume.Rousse at inria.fr
Tue Jun 17 08:14:00 EDT 2008


Guillaume Rousse a écrit :
> Trond Myklebust a écrit :
>> On Fri, 2008-06-06 at 13:03 +0200, Guillaume Rousse wrote:
>>> Hello list.
>>>
>>> I've got some strange NFS 4 hanging issues between a linux 2.6.24 
>>> client, using nfs-utils 1.1.1 (all nfs4 patches applied), and a netapp 
>>> filer. Here is a debug output produced with all nfs and rpc flags 
>>> applied on the client, when trying to access the mount point:
>> It looks as if you might need the following fix which is already in
>> 2.6.25:
>>
>>   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=c37dcd334c0b0a46a90cfa13b9f69e2aaa89bc09
>>
>> and possibly also
>>
>>   http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git&a=commitdiff&h=4e99a1ff3410c627a428d5ddb6cd2e7bc908a486
> Thanks, I'll try this.
> 
> But I'm more and more suspecting the filer itself, as it just crashed...
I'm still experiencing the issue, running a 2.6.24.5 kernel with those 
two patches applied. However, only the two volumes exported from the 
filer are affected, whereas the one from the linux server is not.

Here are the logs related to the attempt to reach a filesystem exported 
from the filer:
Jun 17 12:05:25 chatelet kernel: NFS: revalidating (0:16/100)
Jun 17 12:05:25 chatelet kernel: RPC:       new task initialized, 
procpid 722
Jun 17 12:05:25 chatelet kernel: RPC:       allocated task f7ce7800
Jun 17 12:05:25 chatelet kernel: RPC:     0 looking up UNIX cred
Jun 17 12:05:25 chatelet kernel: RPC:  2605 __rpc_execute flags=0x80
Jun 17 12:05:25 chatelet kernel: RPC:  2605 call_start nfs4 proc 1 (sync)
Jun 17 12:05:25 chatelet kernel: RPC:  2605 call_reserve (status 0)
Jun 17 12:05:25 chatelet kernel: RPC:       waiting for request slot
Jun 17 12:05:25 chatelet kernel: RPC:  2605 sleep_on(queue 
"xprt_backlog" time 7867047)
Jun 17 12:05:26 chatelet kernel: RPC:  2605 added to queue f7d28964 
"xprt_backlog"
Jun 17 12:05:26 chatelet kernel: RPC:  2605 sync task going to sleep
Jun 17 12:05:46 chatelet kernel: RPC:  2605 got signal
Jun 17 12:05:46 chatelet kernel: RPC:  2605 __rpc_wake_up_task (now 7869090)
Jun 17 12:05:46 chatelet kernel: RPC:  2605 disabling timer
Jun 17 12:05:46 chatelet kernel: RPC:  2605 removed from queue f7d28964 
"xprt_backlog"
Jun 17 12:05:46 chatelet kernel: RPC:       __rpc_wake_up_task done
Jun 17 12:05:46 chatelet kernel: RPC:  2605 sync task resuming
Jun 17 12:05:46 chatelet kernel: RPC:  2605 return -512, status -512
Jun 17 12:05:46 chatelet kernel: RPC:  2605 release task
Jun 17 12:05:46 chatelet kernel: RPC:  2605 releasing UNIX cred e4ce8600
Jun 17 12:05:46 chatelet kernel: RPC:       rpc_release_client(df59ad00)
Jun 17 12:05:46 chatelet kernel: nfs_revalidate_inode: (0:16/100) 
getattr failed, error=-512
Jun 17 12:05:46 chatelet kernel: NFS: revalidating (0:16/100)
Jun 17 12:05:46 chatelet kernel: RPC:       new task initialized, 
procpid 722
Jun 17 12:05:46 chatelet kernel: RPC:       allocated task f151fe00
Jun 17 12:05:46 chatelet kernel: RPC:     0 looking up UNIX cred
Jun 17 12:05:46 chatelet kernel: RPC:  2606 __rpc_execute flags=0x80
Jun 17 12:05:46 chatelet kernel: RPC:  2606 call_start nfs4 proc 1 (sync)
Jun 17 12:05:46 chatelet kernel: RPC:  2606 call_reserve (status 0)
Jun 17 12:05:46 chatelet kernel: RPC:       waiting for request slot
Jun 17 12:05:46 chatelet kernel: RPC:  2605 freeing task
Jun 17 12:05:46 chatelet kernel: RPC:  2606 sleep_on(queue 
"xprt_backlog" time 7869098)
Jun 17 12:05:46 chatelet kernel: RPC:  2606 added to queue f7d28964 
"xprt_backlog"
Jun 17 12:05:46 chatelet kernel: RPC:  2606 sync task going to sleep

I'm not an expert, but it seems the 'nfs_revalidate_inode: (0:16/100) 
getattr failed, error=-512', followed by 'added to queue f7d28964 
"xprt_backlog"' basically implies something get wrong, and is then put 
in a queue to be tried again later. Which seems to explain the hang.

Capturing network traffic on port 2049 shows that no packets are 
exchanged between the filer and the client during this test. nfsstat -l, 
run on the filer, tells the client is unknown (by-client stats have just 
been enabled, hence no prior stats). Basically, the client seems to be 
unable to dialog with the filer anymore.

Stopping and relaunching rpcbind, then rpc.idmapd doesn't help. 
Rebooting the client does, however.

The filer is known to suffer from an unknown problem (NetApp Case 
#3141642 for anyone with access), but the behaviour seems to imply a 
client issue here.
-- 
Guillaume Rousse
Moyens Informatiques - INRIA Futurs
Tel: 01 69 35 69 62


More information about the NFSv4 mailing list