nfs client hanging

Guillaume Rousse Guillaume.Rousse at inria.fr
Thu Jun 26 05:14:56 EDT 2008


Guillaume Rousse a écrit :
> And I keep getting those "RPC: failed to contact local rpcbind server 
> (errno 5)" error messages in the logs. I'm switching to regular portmap 
> instead of rpcbind to see if it helps.
Replacing rpcbind with portmap doesn't help, the problem still happens.

Here is a summary of the situation:
- the server is a netapp NAS 270 running DAT 7.2.5

- the client is a linux x86_64 host running kernel 2.6.24.5, with 
following patches applied
  * 46f8a64bae11f5c9b15b4401f6e9863281999b66
  * 4e99a1ff3410c627a428d5ddb6cd2e7bc908a486
  * 63c86716ea34ad94d52e5b0abbda152574dc42b5
  * 8e60029f403781b8a63b7ffdb7dc1faff6ca651e
  * c37dcd334c0b0a46a90cfa13b9f69e2aaa89bc09

- the client doesn't have CONFIG_SUNRPC_BIND34 enabled

- there is no filtering between the client and the server

- delegation is disabled

- the file system is mounted with the following options: 
rw,nosuid,nodev,hard,intr

- when the hang occurs, the server is disconnected (no connection found 
in netstat -t output)

- here is nfsstat output:
[root at node1 ~]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
635265     0          0

Client nfs v4:
null         read         write        commit       open 
open_conf
0         0% 55455     8% 95486    15% 0         0% 181671   28% 0 
    0%
open_noat    open_dgrd    close        setattr      fsinfo       renew 

0         0% 24018     3% 155455   24% 30389     4% 22        0% 2454 
    0%
setclntid    confirm      lock         lockt        locku        access 

362       0% 362       0% 0         0% 0         0% 0         0% 14413 
    2%
getattr      lookup       lookup_root  remove       rename       link 

55037     8% 13396     2% 11        0% 3764      0% 90        0% 40 
    0%
symlink      create       pathconf     statfs       readlink     readdir 

404       0% 46        0% 11        0% 21        0% 80        0% 2234 
    0%
server_caps  delegreturn  getacl       setacl       fs_locations
33        0% 0         0% 0         0% 0         0% 0         0%

- attached is a rrdtool graph produced by munin for the client

- trying 'ls /home/ad<TAB>' produce the following output, and is 
interruptible:
Jun 26 11:06:24 node1 kernel: NFS: revalidating (0:14/100)

- trying 'ls /home/adapt' produces the following output, and is not 
interruptible:
Jun 26 11:08:56 node1 kernel: NFS: revalidating (0:14/100)
Jun 26 11:08:56 node1 kernel: RPC:       new task initialized, procpid 25977
Jun 26 11:08:56 node1 kernel: RPC:       allocated task ffff81041ade5300
Jun 26 11:08:56 node1 kernel: RPC:     0 looking up UNIX cred
Jun 26 11:08:56 node1 kernel: RPC: 51881 __rpc_execute flags=0x80
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_start nfs4 proc 1 (sync)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_reserve (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 reserved req ffff8104219462a0 
xid 6856de4f
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_reserveresult (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_allocate (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 allocated buffer of size 972 at 
ffff81042cda5800
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_bind (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_connect xprt 
ffff81042805c000 is not connected
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_connect xprt 
ffff81042805c000 is not connected
Jun 26 11:08:56 node1 kernel: RPC: 51881 sleep_on(queue "xprt_pending" 
time 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 added to queue ffff81042805c1d0 
"xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC: 51881 setting alarm for 60000 ms
Jun 26 11:08:56 node1 kernel: RPC:       xs_connect scheduled xprt 
ffff81042805c000
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task going to sleep
Jun 26 11:08:56 node1 kernel: RPC:       xs_bind4 0.0.0.0:806: ok (0)
Jun 26 11:08:56 node1 kernel: RPC:       worker connecting xprt 
ffff81042805c000 to address: addr=193.55.250.11 port=2049 proto=tcp
Jun 26 11:08:56 node1 kernel: RPC:       ffff81042805c000 connect status 
115 connected 0 sock state 2
Jun 26 11:08:56 node1 kernel: RPC:       xs_tcp_state_change client 
ffff81042805c000...
Jun 26 11:08:56 node1 kernel: RPC:       state 1 conn 0 dead 0 zapped 1
Jun 26 11:08:56 node1 kernel: RPC: 51881 __rpc_wake_up_task (now 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 disabling timer
Jun 26 11:08:56 node1 kernel: RPC: 51881 removed from queue 
ffff81042805c1d0 "xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC:       __rpc_wake_up_task done
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task resuming
Jun 26 11:08:56 node1 kernel: RPC: 51881 deleting timer
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_connect_status: connection 
established
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_connect_status (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_transmit (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_prepare_transmit
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_encode (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 marshaling UNIX cred 
ffff81042c966480
Jun 26 11:08:56 node1 kernel: RPC: 51881 using AUTH_UNIX cred 
ffff81042c966480 to wrap rpc data
Jun 26 11:08:56 node1 kernel: encode_compound: tag=
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_transmit(140)
Jun 26 11:08:56 node1 kernel: RPC:       xs_tcp_send_request(140) = 140
Jun 26 11:08:56 node1 kernel: RPC: 51881 xmit complete
Jun 26 11:08:56 node1 kernel: RPC: 51881 sleep_on(queue "xprt_pending" 
time 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 added to queue ffff81042805c1d0 
"xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC: 51881 setting alarm for 60000 ms
Jun 26 11:08:56 node1 kernel: RPC:       wake_up_next(ffff81042805c180 
"xprt_resend")
Jun 26 11:08:56 node1 kernel: RPC:       wake_up_next(ffff81042805c130 
"xprt_sending")
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task going to sleep
Jun 26 11:08:56 node1 kernel: RPC:       xs_tcp_data_ready...
Jun 26 11:08:56 node1 kernel: RPC:       xs_tcp_data_recv started
Jun 26 11:08:56 node1 kernel: RPC:       reading TCP record fragment of 
length 228
Jun 26 11:08:56 node1 kernel: RPC:       reading XID (4 bytes)
Jun 26 11:08:56 node1 kernel: RPC:       reading reply for XID 6856de4f
Jun 26 11:08:57 node1 kernel: RPC:       XID 6856de4f read 224 bytes
Jun 26 11:08:57 node1 kernel: RPC:       xprt = ffff81042805c000, 
tcp_copied = 228, tcp_offset = 228, tcp_reclen = 228
Jun 26 11:08:57 node1 kernel: RPC: 51881 xid 6856de4f complete (228 
bytes received)
Jun 26 11:08:57 node1 kernel: RPC: 51881 __rpc_wake_up_task (now 4402577104)
Jun 26 11:08:57 node1 kernel: RPC: 51881 disabling timer
Jun 26 11:08:57 node1 kernel: RPC: 51881 removed from queue 
ffff81042805c1d0 "xprt_pending"
Jun 26 11:08:57 node1 kernel: RPC:       __rpc_wake_up_task done
Jun 26 11:08:57 node1 kernel: RPC:       xs_tcp_data_recv done
Jun 26 11:08:57 node1 kernel: RPC: 51881 sync task resuming
Jun 26 11:08:57 node1 kernel: RPC: 51881 deleting timer
Jun 26 11:08:57 node1 kernel: RPC: 51881 call_status (status 228)
Jun 26 11:08:57 node1 kernel: RPC: 51881 call_decode (status 228)
Jun 26 11:08:57 node1 kernel: RPC: 51881 validating UNIX cred 
ffff81042c966480
Jun 26 11:08:57 node1 kernel: RPC: 51881 using AUTH_UNIX cred 
ffff81042c966480 to unwrap rpc data
Jun 26 11:08:57 node1 kernel: decode_attr_type: type=02
Jun 26 11:08:57 node1 kernel: decode_attr_change: change 
attribute=1302663605281908128
Jun 26 11:08:57 node1 kernel: decode_attr_size: file size=4096
Jun 26 11:08:57 node1 kernel: decode_attr_fsid: fsid=(0x235d83/0x0)
Jun 26 11:08:57 node1 kernel: decode_attr_fileid: fileid=100
Jun 26 11:08:57 node1 kernel: decode_attr_fs_locations: fs_locations 
done, error = 0
Jun 26 11:08:57 node1 kernel: decode_attr_mode: file mode=0755
Jun 26 11:08:57 node1 kernel: decode_attr_nlink: nlink=18

Is there anything else I can do to get informations now, before 
rebooting ? And if already too late, as I suspect the real issue here is 
the connection loss, what would be the best setup to catch the problem 
after the reboot ? Setting all rpcdebug flags is way too verbose for a 
problem occuring once every 2-3 days only...
-- 
Guillaume Rousse
Moyens Informatiques - INRIA Futurs
Tel: 01 69 35 69 62
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1-nfs4_client-day.png
Type: image/png
Size: 54753 bytes
Desc: not available
Url : http://linux-nfs.org/pipermail/nfsv4/attachments/20080626/1ea1edcb/attachment.png 


More information about the NFSv4 mailing list