[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES
Iyer, Rahul
Rahul.Iyer at netapp.com
Fri Oct 6 17:44:21 EDT 2006
Hi guys,
I don't see this behaviour on the NetApp server. I just tried it out.
Also, when I test, I typically use a 1MB file and have 2 data servers
set up, with a stripe size of 128K, so I always write more than 1
stripe to a file. The delayed first write that Dean describes is
therefore another behaviour I do not see.
Fergal, is this a uniprocessor client or a multiprocessor? In the past
we've had bugs that I saw on my dual processor but that weren't visible
on a uniprocessor. Maybe the reverse has happened?
Regards
Rahul
> -----Original Message-----
> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> Sent: Friday, October 06, 2006 2:27 PM
> To: Mc Carthy, Fergal
> Cc: pnfs at linux-nfs.org
> Subject: Re: [pnfs] Permanent inode access stall after large
> writes with LAYOUT_NFSV4_FILES
>
> Hi Fergal,
>
> I have never seen my client become totally hung (unless it
> crashes), but a few of us have noticed some similar behavior
> with regard to client writes. For some reason, the first time
> you write more than 1 stripe to a file, the client sometimes
> delays the first write by the value of its timeo setting. What
> value of timeo do you use on mount? Marc has noticed that if you
> perform a read before a write on the mounted file system,
> this behavior goes away.
>
> An additional behavior I noticed a while back was that if your
> underlying file system doesn't support layoutreturn,
> layoutreturn will fail and the client can sometimes get
> into a bad state. I checked in a fix a little while ago that made
> supporting layoutreturn optional. Not sure if a failed
> layoutreturn is still a problem though. You can turn client
> tracing on or use ethereal to see this (with Garth's patch).
>
> Are you sure none of your servers have experienced a problem?
> Can you Ctrl-C the dd command?
>
> Dean
>
> Mc Carthy, Fergal wrote:
> >
> > I'm working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base,
> > using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes.
> >
> > I'm repeatedly seeing a problem whereby if I try to write any
> > significantly sized amount of data in one go, e.g. 1GB, then the
> > command hangs forever, with a stack trace like the one shown below:
> >
> > # [root at zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
> >
> > 1024+0 records in
> >
> > 1024+0 records out
> >
> > <hangs and never returns to prompt>
> >
> > crash> set 3406
> >
> > PID: 3406
> >
> > COMMAND: "dd"
> >
> > TASK: e0000040dbcd8000 [THREAD_INFO: e0000040dbcd8f00]
> >
> > CPU: 0
> >
> > STATE: TASK_INTERRUPTIBLE
> >
> > crash> bt
> >
> > PID: 3406 TASK: e0000040dbcd8000 CPU: 0 COMMAND: "dd"
> >
> > #0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
> >
> > #1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
> >
> > #2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
> >
> > #3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
> >
> > #4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
> >
> > #5 [BSP:e0000040dbcd91b8] nfs_wait_on_requests_locked at a000000201b95400
> >
> > #6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
> >
> > #7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
> >
> > #8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
> >
> > #9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
> >
> > #10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
> >
> > #11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
> >
> > #12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
> >
> > #13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
> >
> > #14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
> >
> > #15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
> >
> > EFRAME: e0000040dbcdfe40
> >
> > B0: 40000000000024a0 CR_IIP: a000000000010640
> >
> > CR_IPSR: 00001213081a6018 CR_IFS: 0000000000000004
> >
> > AR_PFS: c000000000000004 AR_RSC: 000000000000000f
> >
> > AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
> >
> > AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
> >
> > LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
> >
> > B6: 20000000001d0dc0 B7: 2000000000118350
> >
> > PR: 000000000555a261 R1: 2000000000254238
> >
> > R2: 2000000000279874 R3: 0000000000000015
> >
> > R8: 0000000000000001 R9: 20000000002798d4
> >
> > R10: 0000000000000073 R11: c000000000000289
> >
> > R12: 60000fffff593700 R13: 20000000002c06a0
> >
> > R14: 00000000fffff940 R15: 0000000000000405
> >
> > R16: 00000000fbad8004 R17: 20000000002b7bb8
> >
> > R18: 60000fffff593128 R19: 60000fffff5930a0
> >
> > R20: 20000000002b7c28 R21: 60000fffff593178
> >
> > R22: 20000000002b5728 R23: 00000000000020e8
> >
> > R24: 20000000002b5728 R25: 200000000003d560
> >
> > R26: 20000000002798c8 R27: 20000000002b5760
> >
> > R28: 20000000002b5728 R29: 0000000000000025
> >
> > R30: 400000000000a768 R31: 400000000000a768
> >
> > F6: 1003ecccccccccccccccd F7: 1003e0000000000000000
> >
> > F8: 000000000000000000000 F9: 000000000000000000000
> >
> > F10: 000000000000000000000 F11: 000000000000000000000
> >
> > #16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640
> >
> > The inode in question here is verifiably the inode of the file on the
> > mounted PNFS/MDS fs.
> >
> > The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem,
> > and if I write a 1GB file using a fill pattern of '@' and check the
> > stripe files on the DS filesystems I see the expected 512MB of densely
> > packed '@' characters in each stripe file:
> >
> > [od -A x -a of DS0 stripe file]
> >
> > 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
> >
> > *
> >
> > 20000000
> >
> > [od -A x -a of DS1 stripe file]
> >
> > 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
> >
> > *
> >
> > 20000000
> >
> > Checking in the kernel, I find that there are no stalled RPCs queued
> > on any of the DS rpc_xprt queues, and in fact all 16 of the rqst/slot
> > buffers are back on the free list.
> >
> > This suggests to me that the completion of the PNFS I/O activity is
> > not being successfully notified upstream to the MDS inode, which in
> > turn suggests that the rpc_call_done callback passed to
> > rpc_call_async() is not doing the right thing. Tracing through the
> > code, I believe I have identified the likely routine, namely
> > nfs_writeback_done_full(), which is registered in the
> > nfs_write_full_ops callbacks used for async NFSv4 writes. There is
> > an interesting comment above nfs_writeback_done_full() which, if
> > accurate and up to date, suggests that there is a known potential
> > race associated with this callback routine.
> >
> > [extract from linux/fs/nfs/write.c]
> >
> > /*
> >  * Handle a write reply that flushes a whole page.
> >  *
> >  * FIXME: There is an inherent race with invalidate_inode_pages and
> >  * writebacks since the page->count is kept > 1 for as long
> >  * as the page has a write request pending.
> >  */
> > static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
> > {
> >         struct nfs_write_data *data = calldata;
> >
> >         if (nfs_writeback_done(task, data) != 0)
> >                 return;
> >         nfs_writeback_full_page(data, task->tk_status);
> > }
> >
> > static const struct rpc_call_ops nfs_write_full_ops = {
> >         .rpc_call_done = nfs_writeback_done_full,
> >         .rpc_release = nfs_writedata_release,
> > };
> >
> > I did a quick check of the LKML but nothing obvious jumped out at me.
> >
> > Therefore I'd appreciate any help/pointers on how to go about getting
> > past this problem... ;-)
> >
> > Fergal.
> >
> > --
> >
> > Fergal.McCarthy at HP.com
> >
> > _______________________________________________
> > pNFS mailing list
> > pNFS at linux-nfs.org
> > http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
>
> --
> Dean Hildebrand
> Ph.D. Candidate
> University of Michigan