[pnfs] Permanent inode access stall after large writes with LAYOUT_NFSV4_FILES

Mc Carthy, Fergal Fergal.McCarthy at hp.com
Mon Oct 9 07:47:56 EDT 2006


Rahul,

	See my response to Dean's reply as well; don't want to bore you
all by repeating all the relevant details again here.

	First off, I'm using the CITI PNFS reference implementation on
both client and server, though I have added some code to the server to
manage file creation and maintain file layouts.
	As a result, with the move to the 2.6.17-based PNFS code base I
was forced to make my stripe size 2MB. The SUNRPC implementation there
raised the RPCSVC_MAXPAYLOAD setting to this value, and the CITI
reference nfslayoutdriver doesn't implement the layout get_blocksize()
operation, so the pnfs_getiosize() routine returns 0. The PNFS
ds_{w,r}size values (and related page attributes) therefore get
initialised to the respective (i.e. MDS) server's settings by the
catch-all default initialisation near the end of the nfs_sb_init()
routine.
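
	The fallback described above can be modelled roughly as follows.
This is only a user-space sketch, not the kernel code: the struct, the
function pointer type and init_ds_sizes() are hypothetical names made up
for illustration, and the 16KB page size used in the usage note is an
assumption about the IA64 configuration rather than something taken from
the test rig.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the mount-time I/O settings.
 * Only the names wsize/ds_wsize/ds_wpages echo the message above. */
struct io_sizes {
	size_t wsize;     /* write size negotiated with the MDS */
	size_t ds_wsize;  /* data server write size, 0 = not yet set */
	size_t ds_wpages; /* ds_wsize expressed in pages */
};

/* Model of a layout driver block size query. A NULL pointer stands
 * for a driver that does not implement the operation at all. */
typedef size_t (*get_blocksize_fn)(void);

void init_ds_sizes(struct io_sizes *s, get_blocksize_fn get_blocksize,
		   size_t page_size)
{
	assert(s != NULL && page_size > 0);

	/* An unimplemented operation is treated as "returns 0". */
	s->ds_wsize = get_blocksize ? get_blocksize() : 0;

	/* Catch-all default: inherit the MDS write size. */
	if (s->ds_wsize == 0)
		s->ds_wsize = s->wsize;

	/* Round up to whole pages. */
	s->ds_wpages = (s->ds_wsize + page_size - 1) / page_size;
}
```

With wsize set to 2MB, no get_blocksize callback and an assumed 16KB
page size, this model ends up with ds_wsize = 2MB and ds_wpages = 128,
i.e. the data server settings simply mirror the MDS ones.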
	And since the nfs_coalesce_requests() call in the
nfs_flush_list() routine uses the ds_wpages setting (fetched via the
NFS_PROTO(inode)->wpages(inode) call), the stripe size has to be at
least RPCSVC_MAXPAYLOAD. Otherwise the
BUG_ON(dbg_stripe_idx != stripe_idx) check in nfs4_pnfs_dserver_get()
will fail, because the end of the region to be written is not on the
same DS device as the start. The exception is if you are unlucky enough
to have a striping pattern which brings you back to the same DS again
for the end of the region, e.g. 2MB of data for a 3 stripe/DS backed fs
with a 512KB stripe. In that case you would get data corruption
problems, because data that should be striped across 3 DS devices would
all be written to a single DS device.
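
	To make the 2MB / 3 DS / 512KB case concrete, here is a small
sketch of the arithmetic. It assumes plain round-robin striping and
compares only the DS index at the start and end of the region, which is
a simplification of the real check; ds_index(), same_ds_at_ends() and
stripes_touched() are hypothetical helpers, not functions from the PNFS
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the data server holding the byte at `offset`, assuming
 * simple round-robin striping (hypothetical helper). */
unsigned int ds_index(uint64_t offset, uint64_t stripe_size,
		      unsigned int num_ds)
{
	assert(stripe_size > 0 && num_ds > 0);
	return (unsigned int)((offset / stripe_size) % num_ds);
}

/* Does the region start and end on the same data server? This models
 * only a start/end comparison, nothing in between. */
bool same_ds_at_ends(uint64_t offset, uint64_t len,
		     uint64_t stripe_size, unsigned int num_ds)
{
	assert(len > 0);
	return ds_index(offset, stripe_size, num_ds) ==
	       ds_index(offset + len - 1, stripe_size, num_ds);
}

/* Number of stripe units the region touches. Anything above 1 means
 * that sending the whole region to one DS would misplace data. */
uint64_t stripes_touched(uint64_t offset, uint64_t len,
			 uint64_t stripe_size)
{
	assert(len > 0 && stripe_size > 0);
	return (offset + len - 1) / stripe_size - offset / stripe_size + 1;
}
```

For a 2MB region at offset 0 with a 512KB stripe and 3 DS devices,
same_ds_at_ends() is true even though stripes_touched() is 4, which is
the silent corruption case. With 2 DS devices the same region ends on a
different DS than it starts on, so a start/end comparison would trip.
With a 2MB stripe the region touches exactly one stripe unit.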

	Also, the problem doesn't occur for me until I reach sizes of
~1GB and greater for a single file that I'm writing; I can write
smaller files without problems. In fact, using a test app which just
creates multiple 256MB files in turn, I can write in a sustained fashion
for many tens of GB. And with a modified version of that app which
creates threads to manage the sync/close of the previously written files
I believe that I'm able to saturate the buffer cache with approx 2GB of
dirty file data at a time without seeing any sort of stall issues.

	As for how much gets written to a file at a time, well if you
are referring to the size of userspace I/O requests, that doesn't really
seem to matter to my problem, which is understandable since it is the
size of the page set that is generated by the nfs_coalesce_requests()
that really matters, as covered above.
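
	A rough way to see why the userspace request size drops out is
sketched below. It is a deliberately simplified model that assumes all
writes land in the page cache as one contiguous run of dirty pages
before the flush, and that coalescing just cuts that run into chunks of
at most wpages pages. The function names are hypothetical and the 16KB
page size in the usage note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Dirty pages left in the cache after `nwrites` sequential writes of
 * `write_size` bytes each (simplified: one contiguous run). */
uint64_t dirty_pages_after_writes(uint64_t nwrites, uint64_t write_size,
				  uint64_t page_size)
{
	assert(page_size > 0);
	return (nwrites * write_size + page_size - 1) / page_size;
}

/* Number of write requests when the run is cut into chunks of at
 * most `wpages` pages. */
uint64_t coalesced_requests(uint64_t dirty_pages, uint64_t wpages)
{
	assert(wpages > 0);
	return (dirty_pages + wpages - 1) / wpages;
}

/* Largest byte span covered by a single coalesced request. */
uint64_t max_request_bytes(uint64_t dirty_pages, uint64_t wpages,
			   uint64_t page_size)
{
	uint64_t pages = dirty_pages < wpages ? dirty_pages : wpages;

	return pages * page_size;
}
```

In this model, 1024 writes of 1MB and 262144 writes of 4KB both leave
65536 dirty 16KB pages, and with wpages = 128 either case is flushed as
512 requests of up to 2MB each. Only wpages and the page size determine
the span of a request.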

	This problem has shown up on my test rig, which consists of some
dual CPU IA64 boxes (HP rx2600 Integrity servers) - with relatively slow
local SCSI disks for DS storage - being used for both clients and
servers. I can reproduce it with a single server (MDS and DS on same
node) and single client... In fact I even just reproduced it while
writing this on a single node in my test env, i.e. mounting the client
fs on the MDS/DS node. 
	However it has also shown up on another test environment which
used 32 bit HP Proliant servers backed by fast hardware RAID storage
talking to single and dual CPU clients, though these clients were
similar IA64 nodes again.

	Fergal.

--

Fergal.McCarthy at HP.com

(The contents of this message and any attachments to it are confidential
and may be legally privileged. If you have received this message in
error you should delete it from your system immediately and advise the
sender. To any recipient of this message within HP, unless otherwise
stated, you should consider this message and attachments as "HP
CONFIDENTIAL".)
 

-----Original Message-----
From: Iyer, Rahul [mailto:Rahul.Iyer at netapp.com] 
Sent: 06 October 2006 22:44
To: Dean Hildebrand; Mc Carthy, Fergal
Cc: pnfs at linux-nfs.org
Subject: RE: [pnfs] Permanent inode access stall after large writes with
LAYOUT_NFSV4_FILES

Hi guys,
I don't see this behaviour on the Netapp server. I just tried it out.
Also, when I test, I typically use a 1MB file and have 2 data servers
set up, with a stripe size of 128K. So, I always write more than 1
stripe to a file. So, that's another behavior I do not see. 

Fergal, is this a uniprocessor client or a multiprocessor one? In the
past we've had bugs that I saw on my dual processor, but that weren't
visible on a uniprocessor. Maybe the reverse has happened??
Regards
Rahul


> -----Original Message-----
> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu] 
> Sent: Friday, October 06, 2006 2:27 PM
> To: Mc Carthy, Fergal
> Cc: pnfs at linux-nfs.org
> Subject: Re: [pnfs] Permanent inode access stall after large
> writes with LAYOUT_NFSV4_FILES
> 
> Hi Fergal,
> 
> I have never seen my client become totally hung (unless it
> crashes), but a few of us have noticed some similar behavior
> with regard to client writes. For some reason, sometimes the
> first time you write more than 1 stripe to a file, it delays
> the first write by the client timeo variable. What value of
> timeo do you use on mount? Marc has noticed that if you
> perform a read before a write on the mounted file system,
> this behavior goes away.
>
> An additional behavior I noticed a while back was that if your
> underlying file system doesn't support layoutreturn,
> layoutreturn will fail and the client can sometimes get
> into a bad state. I checked in a fix a little while ago that
> made supporting layoutreturn optional. Not sure if a failed
> layoutreturn is still a problem though. You can turn client
> tracing on or use ethereal to see this (with Garth's patch).
> 
> Are you sure none of your servers have experienced a problem? 
> Can you Ctrl-C the dd command?
> 
> Dean
> 
> Mc Carthy, Fergal wrote:
> >
> > I'm working with the PNFS 2.6.17-CITI_NFS4_ALL-pnfs-1-pnfs code base,
> > using LAYOUT_NFSV4_FILES with STRIPE_DENSE stripes.
> >
> > I'm repeatedly seeing a problem whereby if I try to write any
> > significantly sized amount of data in one go, e.g. 1GB, then the
> > command hangs forever, with a stack trace like the one shown below:
> >
> > # [root at zen12 ~]# dd if=/dev/zero of=/imports/test/dd.out count=1K bs=1M
> > 1024+0 records in
> > 1024+0 records out
> > <hangs and never returns to prompt>
> >
> > crash> set 3406
> > PID: 3406
> > COMMAND: "dd"
> > TASK: e0000040dbcd8000 [THREAD_INFO: e0000040dbcd8f00]
> > CPU: 0
> > STATE: TASK_INTERRUPTIBLE
> >
> > crash> bt
> > PID: 3406 TASK: e0000040dbcd8000 CPU: 0 COMMAND: "dd"
> > #0 [BSP:e0000040dbcd92e0] schedule at a000000100672010
> > #1 [BSP:e0000040dbcd92c0] nfs_wait_bit_interruptible at a000000201b88b90
> > #2 [BSP:e0000040dbcd9270] __wait_on_bit at a000000100674500
> > #3 [BSP:e0000040dbcd9238] out_of_line_wait_on_bit at a000000100674660
> > #4 [BSP:e0000040dbcd9208] nfs_wait_on_request at a000000201b88ca0
> > #5 [BSP:e0000040dbcd91b8] nfs_wait_on_requests_locked at a000000201b95400
> > #6 [BSP:e0000040dbcd9168] nfs_writepages at a000000201b9a840
> > #7 [BSP:e0000040dbcd9138] do_writepages at a0000001000dfe90
> > #8 [BSP:e0000040dbcd9100] __filemap_fdatawrite_range at a0000001000cfe00
> > #9 [BSP:e0000040dbcd90e0] filemap_fdatawrite at a0000001000cfe50
> > #10 [BSP:e0000040dbcd90b8] nfs_file_release at a000000201b744a0
> > #11 [BSP:e0000040dbcd9058] __fput at a00000010012e670
> > #12 [BSP:e0000040dbcd9038] fput at a00000010012e6c0
> > #13 [BSP:e0000040dbcd9008] filp_close at a00000010012aa70
> > #14 [BSP:e0000040dbcd8f88] sys_close at a00000010012abc0
> > #15 [BSP:e0000040dbcd8f88] ia64_ret_from_syscall at a00000010000b600
> > EFRAME: e0000040dbcdfe40
> > B0: 40000000000024a0 CR_IIP: a000000000010640
> > CR_IPSR: 00001213081a6018 CR_IFS: 0000000000000004
> > AR_PFS: c000000000000004 AR_RSC: 000000000000000f
> > AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
> > AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
> > LOADRS: 0000000000280000 AR_BSPSTORE: 60000fff7fffc160
> > B6: 20000000001d0dc0 B7: 2000000000118350
> > PR: 000000000555a261 R1: 2000000000254238
> > R2: 2000000000279874 R3: 0000000000000015
> > R8: 0000000000000001 R9: 20000000002798d4
> > R10: 0000000000000073 R11: c000000000000289
> > R12: 60000fffff593700 R13: 20000000002c06a0
> > R14: 00000000fffff940 R15: 0000000000000405
> > R16: 00000000fbad8004 R17: 20000000002b7bb8
> > R18: 60000fffff593128 R19: 60000fffff5930a0
> > R20: 20000000002b7c28 R21: 60000fffff593178
> > R22: 20000000002b5728 R23: 00000000000020e8
> > R24: 20000000002b5728 R25: 200000000003d560
> > R26: 20000000002798c8 R27: 20000000002b5760
> > R28: 20000000002b5728 R29: 0000000000000025
> > R30: 400000000000a768 R31: 400000000000a768
> > F6: 1003ecccccccccccccccd F7: 1003e0000000000000000
> > F8: 000000000000000000000 F9: 000000000000000000000
> > F10: 000000000000000000000 F11: 000000000000000000000
> > #16 [BSP:e0000040dbcd8f88] __kernel_syscall_via_break at a000000000010640
> >
> > The inode in question here is verifiably the inode of the file on the
> > mounted PNFS/MDS fs.
> >
> > The PNFS fs in this case has 2 DS nodes each serving 1 DS filesystem,
> > and if I write a 1GB file using a fill pattern of '@' and check the
> > stripe files on the DS filesystems I see the expected 512MB of densely
> > packed '@' characters in each stripe file:
> >
> > [od -A x -a of DS0 stripe file]
> > 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
> > *
> > 20000000
> >
> > [od -A x -a of DS1 stripe file]
> > 000000 @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @
> > *
> > 20000000
> >
> > Checking in kernel I find that there are no stalled RPCs queued on
> > any of the DS rpc_xprt queues and in fact all (16) of the rqst/slot
> > buffers are back on the free list.
> >
> > This suggests to me that the fact the PNFS I/O activity has completed
> > is not being successfully notified upstream to the MDS inode. This
> > in turn suggests that the rpc_call_done callback passed to
> > rpc_call_async() is not doing the right thing, and tracing through the
> > code I have identified, I believe, the likely routine, namely
> > nfs_writeback_done_full(), which is registered in the
> > nfs_write_full_ops callbacks used for async NFSv4 writes. And there is
> > an interesting comment about nfs_writeback_done_full() which, if
> > accurate/up to date, suggests that there is a known potential race
> > associated with this callback routine.
> >
> > [extract from linux/fs/nfs/write.c]
> >
> > /*
> >  * Handle a write reply that flushes a whole page.
> >  *
> >  * FIXME: There is an inherent race with invalidate_inode_pages and
> >  *        writebacks since the page->count is kept > 1 for as long
> >  *        as the page has a write request pending.
> >  */
> > static void nfs_writeback_done_full(struct rpc_task *task, void *calldata)
> > {
> > 	struct nfs_write_data *data = calldata;
> >
> > 	if (nfs_writeback_done(task, data) != 0)
> > 		return;
> >
> > 	nfs_writeback_full_page(data, task->tk_status);
> > }
> >
> > static const struct rpc_call_ops nfs_write_full_ops = {
> > 	.rpc_call_done = nfs_writeback_done_full,
> > 	.rpc_release = nfs_writedata_release,
> > };
> >
> > I did a quick check of the LKML but nothing obvious jumped out at me.
> >
> > Therefore I'd appreciate any help/pointers on how to go about getting
> > past this problem... ;-)
> >
> > Fergal.
> >
> >
> > _______________________________________________
> > pNFS mailing list
> > pNFS at linux-nfs.org
> > http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> 
> --
> Dean Hildebrand
> Ph.D. Candidate
> University of Michigan
> 
> 

