critical kernel bug in encode_lookup

Gabriel Barazer gabriel at oxeva.fr
Tue Aug 14 20:45:17 EDT 2007


Hi there,

I have hit a bug under fairly heavy load (though only occasionally),
related to the encode_lookup function, which causes the NFSv4 client
to stop and block. Here is a copy of the kernel output:
RESERVE_SPACE(429) failed in function encode_lookup
------------[ cut here ]------------
kernel BUG at fs/nfs/nfs4xdr.c:849!
invalid opcode: 0000 [1] SMP
CPU 3
Pid: 1695, comm: httpd Not tainted 2.6.22.1 #1
RIP: 0010:[<ffffffff802cc949>]  [<ffffffff802cc949>] encode_lookup+0x34/0x5c
RSP: 0018:ffff8100585798d8  EFLAGS: 00010292
RAX: 0000000000000037 RBX: 00000000000001a5 RCX: ffffffff80561c68
RDX: ffffffff80561c68 RSI: 0000000000000092 RDI: ffffffff80561c60
RBP: 00000000000001ad R08: ffffffff80561c68 R09: ffff810058579678
R10: 0000000000000003 R11: 0000000000000000 R12: ffff81003ff5af00
R13: ffff81007f411248 R14: ffffffff802cddea R15: ffff81003ff5af00
FS:  000000004680c940(0063) GS:ffff81007f3a7240(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002af96acc2000 CR3: 000000007ca5d000 CR4: 00000000000006e0
Process httpd (pid: 1695, threadinfo ffff810058578000, task ffff81007893e7e0)
Stack:  ffff81007f411248 ffff810058579a38 ffff81003582a864 ffffffff802cde4c
 ffff81003582a88c ffff81007f411250 ffff81003582aa38 ffff81007f411250
 0000000400000000 0000000000000000 0000000000000000 ffff81003582a864
Call Trace:
 [<ffffffff802cde4c>] nfs4_xdr_enc_lookup+0x62/0x85
 [<ffffffff80474cec>] call_transmit+0x1c1/0x22d
 [<ffffffff8047a1bb>] __rpc_execute+0x7d/0x234
 [<ffffffff8047544c>] rpc_call_sync+0x75/0x9c
 [<ffffffff802c82c9>] nfs4_proc_lookup+0xe5/0x263
 [<ffffffff802b451a>] nfs_lookup+0xf6/0x269
 [<ffffffff802384ae>] lock_timer_base+0x26/0x4b
 [<ffffffff80238524>] try_to_del_timer_sync+0x51/0x5a
 [<ffffffff8048f8df>] _spin_lock_bh+0x9/0x19
 [<ffffffff803d3457>] do_ioat_dma_memcpy+0x1fd/0x215
 [<ffffffff8047b360>] rpcauth_lookup_credcache+0x12e/0x24a
 [<ffffffff803d2ac5>] dma_memcpy_to_iovec+0x19d/0x1f0
 [<ffffffff802b6538>] nfs_atomic_lookup+0x4b/0x18a
 [<ffffffff8027469c>] do_lookup+0xc4/0x1ae
 [<ffffffff802765c8>] __link_path_walk+0x878/0xd0a
 [<ffffffff80276ab2>] link_path_walk+0x58/0xe0
 [<ffffffff803dcde7>] move_addr_to_user+0x3a/0x4e
 [<ffffffff80276e1f>] do_path_lookup+0x1a0/0x1c3
 [<ffffffff802759c0>] getname+0x14c/0x190
 [<ffffffff8027764f>] __user_walk_fd+0x37/0x53
 [<ffffffff80270a2e>] vfs_stat_fd+0x1b/0x4a
 [<ffffffff803dcde7>] move_addr_to_user+0x3a/0x4e
 [<ffffffff80270bf9>] sys_newstat+0x19/0x31
 [<ffffffff8026e649>] vfs_read+0x11e/0x132
 [<ffffffff8026e98c>] sys_read+0x60/0x6e
 [<ffffffff80209d8e>] system_call+0x7e/0x83


Code: 0f 0b eb fe c7 00 00 00 00 0f 89 d8 48 8d 7a 08 0f c8 89 42
RIP  [<ffffffff802cc949>] encode_lookup+0x34/0x5c
 RSP <ffff8100585798d8>
-----

When the bug occurred, all 5 servers in the same HTTP cluster hit it
at the same time, which makes me think it may be related to the file
server. As I don't have a text console but only graphical KVM access,
a screenshot is attached. I wasn't able to capture the upper part of
the call trace in it, but I thought it could be useful because the call
trace isn't exactly the same (though it is the same on all 5 servers
except the one on which I have a text console).

This happened after several hours of service, on all servers at almost
exactly the same time (within a window of less than 5 seconds).

The NFSv4 clients all run Linux kernel 2.6.22.1 and the server runs
2.6.21.5. Neither GSS nor krb5 is in use, only Unix security.
I really need to keep these nodes stable, so I can't run many tests on
them, but I have a couple of free nodes available for testing, although
the bug is quite difficult to reproduce.

Any thoughts or ideas?

Gabriel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: screenshot.jpg
Type: image/jpeg
Size: 62294 bytes
Desc: not available
Url : http://linux-nfs.org/pipermail/nfsv4/attachments/20070815/5120b7af/attachment.jpg 

