[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES
Mc Carthy, Fergal
Fergal.McCarthy at hp.com
Wed Oct 4 07:31:00 EDT 2006
I'm working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base,
using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes.
I'm repeatedly seeing a problem whereby if I try to write any
significantly sized amount of data in one go, e.g. 1GB, the command
hangs forever, with a stack trace like the one shown below:
[root@zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
1024+0 records in
1024+0 records out
<hangs and never returns to prompt>
crash> set 3406
PID: 3406
COMMAND: "dd"
TASK: e0000040dbcd8000 [THREAD_INFO: e0000040dbcd8f00]
CPU: 0
STATE: TASK_INTERRUPTIBLE
crash> bt
PID: 3406 TASK: e0000040dbcd8000 CPU: 0 COMMAND: "dd"
#0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
#1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
#2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
#3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
#4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
#5 [BSP:e0000040dbcd91b8] nfs_wait_on_requests_locked at a000000201b95400
#6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
#7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
#8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
#9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
#10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
#11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
#12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
#13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
#14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
#15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
EFRAME: e0000040dbcdfe40
B0: 40000000000024a0 CR_IIP: a000000000010640
CR_IPSR: 00001213081a6018 CR_IFS: 0000000000000004
AR_PFS: c000000000000004 AR_RSC: 000000000000000f
AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
B6: 20000000001d0dc0 B7: 2000000000118350
PR: 000000000555a261 R1: 2000000000254238
R2: 2000000000279874 R3: 0000000000000015
R8: 0000000000000001 R9: 20000000002798d4
R10: 0000000000000073 R11: c000000000000289
R12: 60000fffff593700 R13: 20000000002c06a0
R14: 00000000fffff940 R15: 0000000000000405
R16: 00000000fbad8004 R17: 20000000002b7bb8
R18: 60000fffff593128 R19: 60000fffff5930a0
R20: 20000000002b7c28 R21: 60000fffff593178
R22: 20000000002b5728 R23: 00000000000020e8
R24: 20000000002b5728 R25: 200000000003d560
R26: 20000000002798c8 R27: 20000000002b5760
R28: 20000000002b5728 R29: 0000000000000025
R30: 400000000000a768 R31: 400000000000a768
F6: 1003ecccccccccccccccd F7: 1003e0000000000000000
F8: 000000000000000000000 F9: 000000000000000000000
F10: 000000000000000000000 F11: 000000000000000000000
#16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640
The inode in question here is verifiably the inode of the file on the
mounted PNFS/MDS fs.
The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem,
and if I write a 1GB file using a fill pattern of '@' and check the
stripe files on the DS filesystems I see the expected 512MB of densely
packed '@' characters in each stripe file:
[od -A x -a of DS0 stripe file]
000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
*
20000000
[od -A x -a of DS1 stripe file]
000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
*
20000000
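To make the dense packing arithmetic above concrete, here is a minimal sketch of how a file offset maps to a data server and an offset in its stripe file, and how many bytes each DS should end up holding. The names dense_loc, dense_map() and dense_ds_size() are hypothetical helpers, not taken from the pNFS code base, and the 64KB stripe unit used in the note below is an assumption, since the actual stripe unit is not stated here.

```c
#include <stdint.h>

/* Hypothetical helper type, for illustration only. */
struct dense_loc {
    uint32_t ds_index;   /* which data server holds the byte */
    uint64_t ds_offset;  /* offset within that server's stripe file */
};

/* Map a file offset to (DS index, DS offset) under dense packing. */
static struct dense_loc dense_map(uint64_t file_off, uint64_t stripe_unit,
                                  uint32_t num_ds)
{
    uint64_t stripe_no = file_off / stripe_unit;  /* global stripe unit number */
    struct dense_loc loc;

    loc.ds_index = (uint32_t)(stripe_no % num_ds);
    /* Dense: each DS stores its stripe units back to back, with no holes. */
    loc.ds_offset = (stripe_no / num_ds) * stripe_unit + file_off % stripe_unit;
    return loc;
}

/* Number of bytes of a file of length file_len that land on data server ds. */
static uint64_t dense_ds_size(uint64_t file_len, uint64_t stripe_unit,
                              uint32_t num_ds, uint32_t ds)
{
    uint64_t full_units = file_len / stripe_unit;  /* complete stripe units */
    uint64_t tail = file_len % stripe_unit;        /* partial last unit */
    uint64_t size = (full_units / num_ds) * stripe_unit;
    uint32_t rem = (uint32_t)(full_units % num_ds);

    if (ds < rem)
        size += stripe_unit;  /* one extra complete unit in the last round */
    else if (ds == rem)
        size += tail;         /* the partial unit, if any, lands here */
    return size;
}
```

For a 1GB file, 2 data servers and the assumed 64KB stripe unit, dense_ds_size() gives 0x20000000 bytes for both DS0 and DS1, which agrees with the final offset reported by od for each stripe file. The per-DS total is the same for any stripe unit that divides the file length into an even number of units.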
Checking in the kernel I find that there are no stalled RPCs queued on
any of the DS rpc_xprt queues, and in fact all (16) of the rqst/slot
buffers are back on the free list.
This suggests to me that completion of the PNFS I/O activity is not
being successfully notified upstream to the MDS inode, which in turn
suggests that the rpc_call_done callback passed to rpc_call_async() is
not doing the right thing. Tracing through the code, I believe I have
identified the likely routine, namely nfs_writeback_done_full(), which
is registered in the nfs_write_full_ops callbacks used for async NFSv4
writes. There is an interesting comment above
nfs_writeback_done_full() which, if accurate/uptodate, suggests that
there is a known potential race associated with this callback routine.
[extract from linux/fs/nfs/write.c]
/*
 * Handle a write reply that flushes a whole page.
 *
 * FIXME: There is an inherent race with invalidate_inode_pages and
 *        writebacks since the page->count is kept > 1 for as long
 *        as the page has a write request pending.
 */
static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
{
        struct nfs_write_data *data = calldata;

        if (nfs_writeback_done(task, data) != 0)
                return;
        nfs_writeback_full_page(data, task->tk_status);
}

static const struct rpc_call_ops nfs_write_full_ops = {
        .rpc_call_done = nfs_writeback_done_full,
        .rpc_release = nfs_writedata_release,
};
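To illustrate the dependency described above, here is a simplified userspace model of a completion callback table in the same shape as the extract. Every name in it (fake_req, fake_call_ops, fake_done_full() and so on) is hypothetical and it is not kernel code. It only shows the general mechanism: a waiter can make progress only if the done callback reaches the step that clears the request's busy state. The early return is included purely to show how a completion path that skips that step leaves the flag set; whether anything like this actually happens in the real code is not established here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page write request. */
struct fake_req {
    bool busy;  /* models the bit a waiter would sleep on */
};

/* Hypothetical stand-in for the per-call data handed to the callbacks. */
struct fake_write_data {
    struct fake_req *req;
    bool released;
};

/* Same shape as a done/release callback table, userspace only. */
struct fake_call_ops {
    void (*call_done)(int status, void *calldata);
    void (*release)(void *calldata);
};

static void fake_done_full(int status, void *calldata)
{
    struct fake_write_data *data = calldata;

    if (status != 0)
        return;                /* early exit: request is left busy */
    data->req->busy = false;   /* final step: lets any waiter proceed */
}

static void fake_release(void *calldata)
{
    struct fake_write_data *data = calldata;

    data->released = true;     /* resources are freed either way */
}

/* Models the completion path: done callback first, then release. */
static void fake_complete(const struct fake_call_ops *ops, int status,
                          void *calldata)
{
    if (ops->call_done)
        ops->call_done(status, calldata);
    if (ops->release)
        ops->release(calldata);
}

/* A waiter would keep sleeping for as long as this returns true. */
static bool fake_would_block(const struct fake_req *req)
{
    return req->busy;
}
```

In this model, completing with status 0 clears the busy flag and releases the data, whereas completing with a non-zero status still releases the data but leaves the request busy. That second state resembles what I observe: the transport slots are all free again, yet something is still waiting on the request.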
I did a quick check of the LKML but nothing obvious jumped out at me.
Therefore I'd appreciate any help/pointers on how to go about getting
past this problem... ;-)
Fergal.
--
Fergal.McCarthy at HP.com