[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES

Mc Carthy, Fergal Fergal.McCarthy at hp.com
Mon Oct 9 06:48:53 EDT 2006


Dean,

	Just to clarify - the client node doesn't hang in any fashion;
only accesses to the inode of the file I write > 1GB to will stall. So I
can read/write to files with different names, and ls explicit files in
the same directory, or directory contents in other directories, on the
same file system no problem. However any attempts to access the specific
file/inode on the fs, e.g. ls, will stall, waiting for pending NFS
operations on the inode to complete. This stall state persists forever
as far as I can demonstrate.
	None of my DS server nodes, or the exported file systems that
they provide, are in any way affected or unhealthy, and operation tracing
on the DS and MDS nodes shows that they have successfully completed every
request sent to them.

	I have seen the initial delay issue on the first write to a file
myself in my own test environment and traced that down to the fact that
if a write to a stripe hits a DS server that hasn't got the export entry
cached in-kernel (in the SUNRPC maintained in-kernel exports cache)
then, as with an NFSv4 mount compound operation, the initial PUTFH will
be failed with a DropIt error...
	However in the case of a mount the client rapidly
retries/resends the dropped request, and by that time the exports cache
entry should have been populated, whereas for the PNFS writes it takes
~60 secs before the client resends the request... So the easy way I
found to avoid that one is to warm up the DS nodes by mounting/umounting
the DS file systems to ensure that the relevant exports entries are
populated.

	I'm aware of the layout return issue; what I've seen in that
case is that if I delete a file on the client, because I've not yet
added anything to my test server to force a CB_LAYOUTRECALL to be
generated as part of the delete handling, subsequent accesses (re-using
the same filename for a new test run) to the now-deleted file path will
trigger client-side LAYOUTRETURN operations, which fail with ESTALE
because the file handle being returned is no longer valid.
	However that doesn't appear to be the source of my problem
because I can get the write stall on a completely clean client mount
that has never deleted any files previously, or done any I/O of any
sort.

	To better re-state things, via instrumentation of the code I
have pretty much proved to myself that all the PNFS writes have
completed successfully, with the registered rpc_call_done callback
(i.e. nfs_writeback_done_full) having been invoked successfully for
each generated DS write.
	However the nfs inode remains stalled waiting for notification
that the overall operation has completed, which isn't happening.
	Further I find that if I limit my file system to say 256MB, and
just write a steady stream of files this size, then I don't see the
problem. 
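
	As a rough illustration of the symptom described above (every
completion callback has run, yet the waiter never wakes), here is a toy
user-space C model. It is not the kernel code: struct toy_request,
toy_complete and toy_count_stuck are hypothetical names, and the busy
flag only stands in for whatever bit nfs_wait_on_request sleeps on. The
sketch assumes nothing about where the real fault lies; it only shows
the kind of check that separates "callback ran" from "request released".

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a per-page write request; NOT the kernel's structure. */
struct toy_request {
    int callback_ran;   /* set once the completion callback was invoked */
    int busy;           /* the flag a waiter sleeps on until it is cleared */
};

/* What a healthy completion path does: record the call and clear busy. */
static void toy_complete(struct toy_request *req)
{
    req->callback_ran = 1;
    req->busy = 0;
}

/*
 * Count requests whose callback ran but which are still marked busy.
 * A non-zero result means the I/O finished while a waiter on the busy
 * flag would still be blocked.
 */
static size_t toy_count_stuck(const struct toy_request *reqs, size_t n)
{
    size_t stuck = 0;

    for (size_t i = 0; i < n; i++)
        if (reqs[i].callback_ran && reqs[i].busy)
            stuck++;
    return stuck;
}
```

	If toy_count_stuck() returns non-zero after all callbacks have
fired, the completion path ran without releasing the request, which is
the pattern to look for when instrumenting the real code.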

	I'll answer Rahul's response next.

	Fergal.

--

Fergal.McCarthy at HP.com

(The contents of this message and any attachments to it are confidential
and may be legally privileged. If you have received this message in
error you should delete it from your system immediately and advise the
sender. To any recipient of this message within HP, unless otherwise
stated, you should consider this message and attachments as "HP
CONFIDENTIAL".)
 
-----Original Message-----
From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu] 
Sent: 06 October 2006 22:27
To: Mc Carthy, Fergal
Cc: pnfs at linux-nfs.org
Subject: Re: [pnfs] Permanent inode access stall after large writes with
LAYOUT_NFSV4_FILES

Hi Fergal,

I have never seen my client become totally hung (unless it crashes), but
a few of us have noticed some similar behavior with regards to client
writes. For some reason, sometimes the first time you write more than 1
stripe to a file, it delays the first write by the client timeo
variable. What value of timeo do you use on mount? Marc has noticed that
if you perform a read before a write on the mounted file system, this
behavior goes away.

An additional behavior I noticed a while back was that if your underlying
file system doesn't support layoutreturn, layoutreturn will fail and
sometimes the client can get into a bad state. I checked in a fix a
little while ago that made supporting layoutreturn optional. Not sure if
a failed layoutreturn is still a problem though. You can turn client
tracing on or use ethereal to see this (with Garth's patch).

Are you sure none of your servers have experienced a problem? Can you 
Ctrl-C the dd command?

Dean

Mc Carthy, Fergal wrote:
>
> I'm working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base, 
> using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes.
>
> I'm repeatedly seeing a problem whereby if I try to write any 
> significantly sized amount of data in one go, e.g. 1GB, then the 
> command hangs forever, with a stack trace like the one shown below:
>
> # [root at zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
>
> 1024+0 records in
>
> 1024+0 records out
>
> <hangs and never returns to prompt>
>
> crash> set 3406
>
> PID: 3406
>
> COMMAND: "dd"
>
> TASK: e0000040dbcd8000 [THREAD_INFO: e0000040dbcd8f00]
>
> CPU: 0
>
> STATE: TASK_INTERRUPTIBLE
>
> crash> bt
>
> PID: 3406 TASK: e0000040dbcd8000 CPU: 0 COMMAND: "dd"
>
> #0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
>
> #1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
>
> #2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
>
> #3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
>
> #4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
>
> #5 [BSP:e0000040dbcd9168] nfs_wait_on_requests_locked at a000000201b95400
>
> #6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
>
> #7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
>
> #8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
>
> #9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
>
> #10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
>
> #11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
>
> #12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
>
> #13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
>
> #14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
>
> #15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
>
> EFRAME: e0000040dbcdfe40
>
> B0: 40000000000024a0 CR_IIP: a000000000010640
>
> CR_IPSR: 00001213081a6018 CR_IFS: 0000000000000004
>
> AR_PFS: c000000000000004 AR_RSC: 000000000000000f
>
> AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
>
> AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
>
> LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
>
> B6: 20000000001d0dc0 B7: 2000000000118350
>
> PR: 000000000555a261 R1: 2000000000254238
>
> R2: 2000000000279874 R3: 0000000000000015
>
> R8: 0000000000000001 R9: 20000000002798d4
>
> R10: 0000000000000073 R11: c000000000000289
>
> R12: 60000fffff593700 R13: 20000000002c06a0
>
> R14: 00000000fffff940 R15: 0000000000000405
>
> R16: 00000000fbad8004 R17: 20000000002b7bb8
>
> R18: 60000fffff593128 R19: 60000fffff5930a0
>
> R20: 20000000002b7c28 R21: 60000fffff593178
>
> R22: 20000000002b5728 R23: 00000000000020e8
>
> R24: 20000000002b5728 R25: 200000000003d560
>
> R26: 20000000002798c8 R27: 20000000002b5760
>
> R28: 20000000002b5728 R29: 0000000000000025
>
> R30: 400000000000a768 R31: 400000000000a768
>
> F6: 1003ecccccccccccccccd F7: 1003e0000000000000000
>
> F8: 000000000000000000000 F9: 000000000000000000000
>
> F10: 000000000000000000000 F11: 000000000000000000000
>
> #16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640
>
> The inode in question here is verifiably the inode of the file on the 
> mounted PNFS/MDS fs.
>
> The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem, 
> and if I write a 1GB file using a fill pattern of '@' and check the 
> stripe files on the DS filesystems I see the expected 512MB of
> densely packed '@' characters in each stripe file:
>
> [od -A x -a of DS0 stripe file]
>
> 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
>
> *
>
> 20000000
>
> [od -A x -a of DS1 stripe file]
>
> 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
>
> *
>
> 20000000
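
The 0x20000000 (512MB) end offset reported by od for each stripe file is
what dense striping of a 1GB file across 2 DS nodes predicts. A minimal C
sketch of that arithmetic follows. It assumes round-robin placement
starting at DS0 and a stripe unit size that the post does not state; the
function name dense_stripe_file_size and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes that land on data server `ds` when a file of `file_size` bytes is
 * striped round-robin across `num_ds` servers in units of `stripe_unit`
 * bytes, starting at server 0 (an assumption). With dense packing the
 * stripe file stores these bytes back to back, so the result is also the
 * expected size of that server's stripe file.
 */
static uint64_t dense_stripe_file_size(uint64_t file_size, uint64_t stripe_unit,
                                       uint32_t num_ds, uint32_t ds)
{
    assert(stripe_unit > 0 && num_ds > 0 && ds < num_ds);

    uint64_t full_units = file_size / stripe_unit;  /* complete stripe units */
    uint64_t tail       = file_size % stripe_unit;  /* partial final unit */

    /* Every server receives one unit per complete round of num_ds units. */
    uint64_t size = (full_units / num_ds) * stripe_unit;

    /* Units left after the last complete round go to servers 0..rem-1. */
    uint64_t rem = full_units % num_ds;
    if (ds < rem)
        size += stripe_unit;
    else if (ds == rem)
        size += tail;                               /* partial unit lands here */

    return size;
}
```

With a 64KB stripe unit chosen purely as an example,
dense_stripe_file_size(1ULL << 30, 64 * 1024, 2, 0) and the same call
with ds = 1 both return 0x20000000, matching the od offsets above.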
>
> Checking in kernel I find that there are no stalled RPCs queued on 
> any of the DS rpc_xprt queues and in fact all (16) of the rqst/slot 
> buffers are back on the free list.
>
> This suggests to me that the completion of the PNFS I/O activity is
> not being successfully notified upstream to the MDS inode. This 
> suggests that the rpc_call_done callback passed to rpc_call_async()
> is not doing the right thing, and tracing through the code I have
> identified, I believe, the likely routine, namely 
> nfs_writeback_done_full(), which is registered in the 
> nfs_write_full_ops callbacks used for async NFSv4 writes. And there is
> an interesting comment about nfs_writeback_done_full() which, if 
> accurate/uptodate, suggests that there is a known potential race 
> associated with this callback routine.
>
> [extract from linux/fs/nfs/write.c]
>
> /*
>  * Handle a write reply that flushes a whole page.
>  *
>  * FIXME: There is an inherent race with invalidate_inode_pages and
>  *        writebacks since the page->count is kept > 1 for as long
>  *        as the page has a write request pending.
>  */
> static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
> {
>     struct nfs_write_data *data = calldata;
>
>     if (nfs_writeback_done(task, data) != 0)
>         return;
>     nfs_writeback_full_page(data, task->tk_status);
> }
>
> static const struct rpc_call_ops nfs_write_full_ops = {
>     .rpc_call_done = nfs_writeback_done_full,
>     .rpc_release = nfs_writedata_release,
> };
>
> I did a quick check of the LKML but nothing obvious jumped out at me.
>
> Therefore I'd appreciate any help/pointers on how to go about getting 
> past this problem... ;-)
>
> Fergal.
>

-- 
Dean Hildebrand
Ph.D. Candidate
University of Michigan


