[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES

Rahul Iyer rni at andrew.cmu.edu
Wed Feb 21 12:40:21 EST 2007


just resending as discussed in today's call


Mc Carthy, Fergal wrote:

> I’m working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base, 
> using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes.
>
> I’m repeatedly seeing a problem whereby if I try to write any 
> significantly sized amount of data in one go, e.g. 1GB, then the 
> command hangs forever, with a stack trace like the one shown below:
>
> # [root at zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
>
> 1024+0 records in
>
> 1024+0 records out
>
> <hangs and never returns to prompt>
>
> crash> set 3406
>
> PID: 3406
>
> COMMAND: "dd"
>
> TASK: e0000040dbcd8000 [THREAD_INFO: e0000040dbcd8f00]
>
> CPU: 0
>
> STATE: TASK_INTERRUPTIBLE
>
> crash> bt
>
> PID: 3406 TASK: e0000040dbcd8000 CPU: 0 COMMAND: "dd"
>
> #0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
>
> #1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
>
> #2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
>
> #3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
>
> #4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
>
> #5 [BSP:e0000040dbcd91b8] nfs_wait_on_requests_locked at a000000201b95400
>
> #6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
>
> #7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
>
> #8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
>
> #9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
>
> #10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
>
> #11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
>
> #12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
>
> #13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
>
> #14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
>
> #15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
>
> EFRAME: e0000040dbcdfe40
>
> B0: 40000000000024a0 CR_IIP: a000000000010640
>
> CR_IPSR: 00001213081a6018 CR_IFS: 0000000000000004
>
> AR_PFS: c000000000000004 AR_RSC: 000000000000000f
>
> AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
>
> AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
>
> LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
>
> B6: 20000000001d0dc0 B7: 2000000000118350
>
> PR: 000000000555a261 R1: 2000000000254238
>
> R2: 2000000000279874 R3: 0000000000000015
>
> R8: 0000000000000001 R9: 20000000002798d4
>
> R10: 0000000000000073 R11: c000000000000289
>
> R12: 60000fffff593700 R13: 20000000002c06a0
>
> R14: 00000000fffff940 R15: 0000000000000405
>
> R16: 00000000fbad8004 R17: 20000000002b7bb8
>
> R18: 60000fffff593128 R19: 60000fffff5930a0
>
> R20: 20000000002b7c28 R21: 60000fffff593178
>
> R22: 20000000002b5728 R23: 00000000000020e8
>
> R24: 20000000002b5728 R25: 200000000003d560
>
> R26: 20000000002798c8 R27: 20000000002b5760
>
> R28: 20000000002b5728 R29: 0000000000000025
>
> R30: 400000000000a768 R31: 400000000000a768
>
> F6: 1003ecccccccccccccccd F7: 1003e0000000000000000
>
> F8: 000000000000000000000 F9: 000000000000000000000
>
> F10: 000000000000000000000 F11: 000000000000000000000
>
> #16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640
>
> The inode in question here is verifiably the inode of the file on the 
> mounted PNFS/MDS fs.
>
> The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem, 
> and if I write a 1GB file using a fill pattern of ‘@’ and check the 
> stripe files on the DS filesystems I see the expected 512MB of densely 
> packed ‘@’ characters in each stripe file:
>
> [od -A x -a of DS0 stripe file]
>
> 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
>
> *
>
> 20000000
>
> [od -A x -a of DS1 stripe file]
>
> 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
>
> *
>
> 20000000
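
As a rough check on the 0x20000000 (512MB) end addresses above, here is a minimal sketch of how a dense round-robin layout could map a file offset to a data server and to an offset inside that server's stripe file. The stripe unit size is not stated in the message, so `stripe_unit`, `struct ds_location`, `dense_map()` and `dense_ds_size()` are all assumptions made for illustration; this is not the pNFS client code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which data server, and where in its stripe file. */
struct ds_location {
    uint32_t ds_index;   /* data server holding this byte */
    uint64_t ds_offset;  /* byte offset inside that server's stripe file */
};

/* Map a file offset under dense round-robin striping (illustrative only). */
static struct ds_location dense_map(uint64_t file_offset,
                                    uint64_t stripe_unit, uint32_t num_ds)
{
    assert(stripe_unit > 0 && num_ds > 0);

    uint64_t stripe_no = file_offset / stripe_unit;  /* global stripe unit number */
    struct ds_location loc;

    loc.ds_index = (uint32_t)(stripe_no % num_ds);
    /* Dense packing: each DS stores its units back to back, with no holes. */
    loc.ds_offset = (stripe_no / num_ds) * stripe_unit
                    + file_offset % stripe_unit;
    return loc;
}

/* Bytes that land on one data server when file_size bytes are written from 0. */
static uint64_t dense_ds_size(uint64_t file_size, uint64_t stripe_unit,
                              uint32_t num_ds, uint32_t ds_index)
{
    assert(stripe_unit > 0 && num_ds > 0 && ds_index < num_ds);

    uint64_t full_units = file_size / stripe_unit;  /* complete stripe units */
    uint64_t tail = file_size % stripe_unit;        /* trailing partial unit */
    uint64_t units = full_units / num_ds
                     + (ds_index < full_units % num_ds ? 1 : 0);
    uint64_t size = units * stripe_unit;

    /* The partial unit, if any, goes to the next server in rotation. */
    if (tail != 0 && full_units % num_ds == ds_index)
        size += tail;
    return size;
}
```

With two data servers and a 1GB file, `dense_ds_size()` gives 0x20000000 bytes per server for any stripe unit that divides 512MB evenly (64KB was used when checking this sketch), which agrees with the `od` output for both stripe files.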
>
> Checking in the kernel I find that there are no stalled RPCs queued on 
> any of the DS rpc_xprt queues and in fact all (16) of the rqst/slot 
> buffers are back on the free list.
>
> This suggests to me that the completion of the PNFS I/O activity 
> is not being successfully notified upstream to the MDS inode. That in 
> turn suggests that the rpc_call_done callback passed to 
> rpc_call_async() is not doing the right thing. Tracing through the 
> code, I believe I have identified the likely routine, namely 
> nfs_writeback_done_full(), which is registered in the 
> nfs_write_full_ops callbacks used for async NFSv4 writes. There is 
> also an interesting comment above nfs_writeback_done_full() which, if 
> accurate and up to date, suggests that there is a known potential race 
> associated with this callback routine.
>
> [extract from linux/fs/nfs/write.c]
>
> /*
>  * Handle a write reply that flushes a whole page.
>  *
>  * FIXME: There is an inherent race with invalidate_inode_pages and
>  *        writebacks since the page->count is kept > 1 for as long
>  *        as the page has a write request pending.
>  */
> static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
> {
>         struct nfs_write_data *data = calldata;
>
>         if (nfs_writeback_done(task, data) != 0)
>                 return;
>         nfs_writeback_full_page(data, task->tk_status);
> }
>
> static const struct rpc_call_ops nfs_write_full_ops = {
>         .rpc_call_done = nfs_writeback_done_full,
>         .rpc_release = nfs_writedata_release,
> };
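
To make the shape of the hypothesis concrete, here is a small userspace model of the quoted callback. It is only a sketch: every `model_*` name is hypothetical, the `busy` flag merely stands in for whatever bit the waiter in the backtrace sleeps on, and nothing here establishes that the real nfs_writeback_done() path in the pNFS tree behaves this way. It only shows that a completion handler with an early return can leave a request marked busy even though the I/O itself has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending write request; not the kernel's struct. */
struct model_request {
    bool busy;        /* models the bit a waiter would sleep on */
    int  early_exit;  /* non-zero makes the first-stage handler bail out */
};

/* Models a first-stage check whose non-zero result skips page handling. */
static int model_writeback_done(const struct model_request *req)
{
    return req->early_exit;
}

/* Models the page-level completion step that unlocks the request. */
static void model_writeback_full_page(struct model_request *req)
{
    req->busy = false;
}

/* Same shape as the quoted callback: bail out early, else finish the page. */
static void model_done_full(struct model_request *req)
{
    if (model_writeback_done(req) != 0)
        return;  /* the request stays busy on this path */
    model_writeback_full_page(req);
}

/* Number of requests a close()-time flush would still have to wait for. */
static size_t model_count_busy(const struct model_request *reqs, size_t n)
{
    size_t busy = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (reqs[i].busy)
            busy++;
    return busy;
}
```

In this model, running `model_done_full()` over a set of requests and then calling `model_count_busy()` reports how many would still block a waiter. A non-zero count after all callbacks have run corresponds to the symptom described: no RPCs outstanding, yet the flush at close never completes.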
>
> I did a quick check of the LKML but nothing obvious jumped out at me.
>
> Therefore I’d appreciate any help/pointers on how to go about getting 
> past this problem… ;-)
>
> Fergal.
>
> --
>
> Fergal.McCarthy at HP.com
>
> (The contents of this message and any attachments to it are 
> confidential and may be legally privileged. If you have received this 
> message in error you should delete it from your system immediately and 
> advise the sender. To any recipient of this message within HP, unless 
> otherwise stated, you should consider this message and attachments as 
> "HP CONFIDENTIAL".)
>
>------------------------------------------------------------------------
>
>_______________________________________________
>pNFS mailing list
>pNFS at linux-nfs.org
>http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
>

