nfs client hanging
Guillaume Rousse
Guillaume.Rousse at inria.fr
Thu Jun 26 05:14:56 EDT 2008
Guillaume Rousse a écrit :
> And I keep getting those "RPC: failed to contact local rpcbind server
> (errno 5)" error messages in the logs. I'm switching to regular portmap
> instead of rpcbind to see if it helps.
Replacing rpcbind with portmap doesn't help, the problem still happens.
Here is a summary of the situation:
- the server is a netapp NAS 270 running DAT 7.2.5
- the client is a linux x86_64 host running kernel 2.6.24.5, with
following patches applied
* 46f8a64bae11f5c9b15b4401f6e9863281999b66
* 4e99a1ff3410c627a428d5ddb6cd2e7bc908a486
* 63c86716ea34ad94d52e5b0abbda152574dc42b5
* 8e60029f403781b8a63b7ffdb7dc1faff6ca651e
* c37dcd334c0b0a46a90cfa13b9f69e2aaa89bc09
- the client doesn't have CONFIG_SUNRPC_BIND34 enabled
- there is no filtering between the client and the server
- delegation is disabled
- the file system is mounted with the following options:
rw,nosuid,nodev,hard,intr
- when the hang occurs, the server is disconnected (no connection found
in netstat -t output)
- here is nfsstat output:
[root at node1 ~]# nfsstat
Client rpc stats:
calls retrans authrefrsh
635265 0 0
Client nfs v4:
null read write commit open
open_conf
0 0% 55455 8% 95486 15% 0 0% 181671 28% 0
0%
open_noat open_dgrd close setattr fsinfo renew
0 0% 24018 3% 155455 24% 30389 4% 22 0% 2454
0%
setclntid confirm lock lockt locku access
362 0% 362 0% 0 0% 0 0% 0 0% 14413
2%
getattr lookup lookup_root remove rename link
55037 8% 13396 2% 11 0% 3764 0% 90 0% 40
0%
symlink create pathconf statfs readlink readdir
404 0% 46 0% 11 0% 21 0% 80 0% 2234
0%
server_caps delegreturn getacl setacl fs_locations
33 0% 0 0% 0 0% 0 0% 0 0%
- attached is a rrdtool graph produced by munin for the client
- trying 'ls /home/ad<TAB>' produce the following output, and is
interruptible:
Jun 26 11:06:24 node1 kernel: NFS: revalidating (0:14/100)
- trying 'ls /home/adapt' produces the following output, and is not
interruptible:
Jun 26 11:08:56 node1 kernel: NFS: revalidating (0:14/100)
Jun 26 11:08:56 node1 kernel: RPC: new task initialized, procpid 25977
Jun 26 11:08:56 node1 kernel: RPC: allocated task ffff81041ade5300
Jun 26 11:08:56 node1 kernel: RPC: 0 looking up UNIX cred
Jun 26 11:08:56 node1 kernel: RPC: 51881 __rpc_execute flags=0x80
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_start nfs4 proc 1 (sync)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_reserve (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 reserved req ffff8104219462a0
xid 6856de4f
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_reserveresult (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_allocate (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 allocated buffer of size 972 at
ffff81042cda5800
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_bind (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_connect xprt
ffff81042805c000 is not connected
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_connect xprt
ffff81042805c000 is not connected
Jun 26 11:08:56 node1 kernel: RPC: 51881 sleep_on(queue "xprt_pending"
time 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 added to queue ffff81042805c1d0
"xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC: 51881 setting alarm for 60000 ms
Jun 26 11:08:56 node1 kernel: RPC: xs_connect scheduled xprt
ffff81042805c000
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task going to sleep
Jun 26 11:08:56 node1 kernel: RPC: xs_bind4 0.0.0.0:806: ok (0)
Jun 26 11:08:56 node1 kernel: RPC: worker connecting xprt
ffff81042805c000 to address: addr=193.55.250.11 port=2049 proto=tcp
Jun 26 11:08:56 node1 kernel: RPC: ffff81042805c000 connect status
115 connected 0 sock state 2
Jun 26 11:08:56 node1 kernel: RPC: xs_tcp_state_change client
ffff81042805c000...
Jun 26 11:08:56 node1 kernel: RPC: state 1 conn 0 dead 0 zapped 1
Jun 26 11:08:56 node1 kernel: RPC: 51881 __rpc_wake_up_task (now 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 disabling timer
Jun 26 11:08:56 node1 kernel: RPC: 51881 removed from queue
ffff81042805c1d0 "xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC: __rpc_wake_up_task done
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task resuming
Jun 26 11:08:56 node1 kernel: RPC: 51881 deleting timer
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_connect_status: connection
established
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_connect_status (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_transmit (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_prepare_transmit
Jun 26 11:08:56 node1 kernel: RPC: 51881 call_encode (status 0)
Jun 26 11:08:56 node1 kernel: RPC: 51881 marshaling UNIX cred
ffff81042c966480
Jun 26 11:08:56 node1 kernel: RPC: 51881 using AUTH_UNIX cred
ffff81042c966480 to wrap rpc data
Jun 26 11:08:56 node1 kernel: encode_compound: tag=
Jun 26 11:08:56 node1 kernel: RPC: 51881 xprt_transmit(140)
Jun 26 11:08:56 node1 kernel: RPC: xs_tcp_send_request(140) = 140
Jun 26 11:08:56 node1 kernel: RPC: 51881 xmit complete
Jun 26 11:08:56 node1 kernel: RPC: 51881 sleep_on(queue "xprt_pending"
time 4402577104)
Jun 26 11:08:56 node1 kernel: RPC: 51881 added to queue ffff81042805c1d0
"xprt_pending"
Jun 26 11:08:56 node1 kernel: RPC: 51881 setting alarm for 60000 ms
Jun 26 11:08:56 node1 kernel: RPC: wake_up_next(ffff81042805c180
"xprt_resend")
Jun 26 11:08:56 node1 kernel: RPC: wake_up_next(ffff81042805c130
"xprt_sending")
Jun 26 11:08:56 node1 kernel: RPC: 51881 sync task going to sleep
Jun 26 11:08:56 node1 kernel: RPC: xs_tcp_data_ready...
Jun 26 11:08:56 node1 kernel: RPC: xs_tcp_data_recv started
Jun 26 11:08:56 node1 kernel: RPC: reading TCP record fragment of
length 228
Jun 26 11:08:56 node1 kernel: RPC: reading XID (4 bytes)
Jun 26 11:08:56 node1 kernel: RPC: reading reply for XID 6856de4f
Jun 26 11:08:57 node1 kernel: RPC: XID 6856de4f read 224 bytes
Jun 26 11:08:57 node1 kernel: RPC: xprt = ffff81042805c000,
tcp_copied = 228, tcp_offset = 228, tcp_reclen = 228
Jun 26 11:08:57 node1 kernel: RPC: 51881 xid 6856de4f complete (228
bytes received)
Jun 26 11:08:57 node1 kernel: RPC: 51881 __rpc_wake_up_task (now 4402577104)
Jun 26 11:08:57 node1 kernel: RPC: 51881 disabling timer
Jun 26 11:08:57 node1 kernel: RPC: 51881 removed from queue
ffff81042805c1d0 "xprt_pending"
Jun 26 11:08:57 node1 kernel: RPC: __rpc_wake_up_task done
Jun 26 11:08:57 node1 kernel: RPC: xs_tcp_data_recv done
Jun 26 11:08:57 node1 kernel: RPC: 51881 sync task resuming
Jun 26 11:08:57 node1 kernel: RPC: 51881 deleting timer
Jun 26 11:08:57 node1 kernel: RPC: 51881 call_status (status 228)
Jun 26 11:08:57 node1 kernel: RPC: 51881 call_decode (status 228)
Jun 26 11:08:57 node1 kernel: RPC: 51881 validating UNIX cred
ffff81042c966480
Jun 26 11:08:57 node1 kernel: RPC: 51881 using AUTH_UNIX cred
ffff81042c966480 to unwrap rpc data
Jun 26 11:08:57 node1 kernel: decode_attr_type: type=02
Jun 26 11:08:57 node1 kernel: decode_attr_change: change
attribute=1302663605281908128
Jun 26 11:08:57 node1 kernel: decode_attr_size: file size=4096
Jun 26 11:08:57 node1 kernel: decode_attr_fsid: fsid=(0x235d83/0x0)
Jun 26 11:08:57 node1 kernel: decode_attr_fileid: fileid=100
Jun 26 11:08:57 node1 kernel: decode_attr_fs_locations: fs_locations
done, error = 0
Jun 26 11:08:57 node1 kernel: decode_attr_mode: file mode=0755
Jun 26 11:08:57 node1 kernel: decode_attr_nlink: nlink=18
Is there anything else I can do to get informations now, before
rebooting ? And if already too late, as I suspect the real issue here is
the connection loss, what would be the best setup to catch the problem
after the reboot ? Setting all rpcdebug flags is way too verbose for a
problem occuring once every 2-3 days only...
--
Guillaume Rousse
Moyens Informatiques - INRIA Futurs
Tel: 01 69 35 69 62
-------------- next part --------------
A non-text attachment was scrubbed...
Name: node1-nfs4_client-day.png
Type: image/png
Size: 54753 bytes
Desc: not available
Url : http://linux-nfs.org/pipermail/nfsv4/attachments/20080626/1ea1edcb/attachment.png
More information about the NFSv4
mailing list