[pnfs] The Boundary issue
Iyer, Rahul
Rahul.Iyer at netapp.com
Thu Oct 12 18:28:05 EDT 2006
Hi all,
Just to recap the past few email exchanges on this issue and also
yesterday's call...
Dean suggests that the call to virtual_update_layout() be made in
nfs_sync_inode_wait(). This is where the flushing is started and then
the layoutcommit is done.
On the read path, we'll probably have to place the virtual_update_layout
call in nfs_readpages(). Is this fine by everyone?
The only concern with this implementation is that we'll add #ifdef
CONFIG_NFS_V4 around all these layoutget calls.
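One way to keep the #ifdefs out of the call sites is the usual stub
pattern: define the hook once under CONFIG_NFS_V4 and give it an empty
body otherwise. What follows is only a rough userspace sketch of that
idea, not code from the patch. The names sketch_inode,
sketch_update_layout and sketch_sync_inode_wait are made up for
illustration, and the "layoutget" is reduced to setting a flag.

```c
/* Hypothetical stand-in for the inode state this sketch cares about. */
struct sketch_inode {
    int have_layout;  /* set once a layout has been fetched */
    int flushed;      /* set once dirty data has been written out */
};

#ifdef CONFIG_NFS_V4
/* v4 build: a real hook would issue the layoutget here (assumption). */
static inline int sketch_update_layout(struct sketch_inode *inode)
{
    inode->have_layout = 1;
    return 0;
}
#else
/* Non-v4 build: the hook is a no-op, so callers need no #ifdef. */
static inline int sketch_update_layout(struct sketch_inode *inode)
{
    (void)inode;
    return 0;
}
#endif

/* Flush-path caller: layout first, then flush. The layoutcommit
 * would follow the flush and is left out of this sketch. */
static int sketch_sync_inode_wait(struct sketch_inode *inode)
{
    int err = sketch_update_layout(inode);

    if (err)
        return err;
    inode->flushed = 1;
    return 0;
}
```

With this shape the #ifdef lives in one header-like spot, and the
flush-path and read-path callers look the same in both builds.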
Regards
Rahul
> -----Original Message-----
> From: Iyer, Rahul
> Sent: Friday, September 29, 2006 2:24 PM
> To: Dean Hildebrand
> Cc: pnfs at linux-nfs.org
> Subject: RE: [pnfs] The Boundary issue
>
> Hi,
> Comments inline.
> Regards
> Rahul
>
> > -----Original Message-----
> > From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> > Sent: Friday, September 29, 2006 1:20 PM
> > To: Iyer, Rahul
> > Cc: Benny Halevy; pnfs at linux-nfs.org
> > Subject: Re: [pnfs] The Boundary issue
> >
> > Iyer, Rahul wrote:
> > > Hey Guys,
> > > In my patch, the virtual_update_layout() is done in
> > > pnfs_file_read/write(). At this point, we are aware of the byte
> > > range, so there's no need to guess.
> > One minor point is that pnfs_file_read/write() is the read/write
> > path, not the flush path, and therefore the byte range is not the
> > byte range of the actual flushed read/write.
>
> Yes, I understand this. My point was that all the data that
> is written will eventually need to be flushed. So, we could
> get the layouts earlier.
>
> >
> > > However, if the general consensus is to have the layoutget on
> > > the flush path, then I recommend putting it in the boundary
> > > calculation function (pnfs_getboundary), which is called from
> > > nfs_flush_list. Personally, I feel this is simpler, and in no
> > > way less effective, than having the requests be formed without
> > > any boundary information and then splitting them at a later
> > > point in time.
> > >
> > I would go higher in the call stack, as the boundary is just the
> > first attribute we come across; in the future there may be other
> > attributes that are needed even sooner.
> > nfs_sync_inode_wait seems like a good candidate since it calls
> > layoutcommit (the begin and commit of the layout).
> > Dean
>
> Yes, we can put it there, but again, we're facing the problem
> where we might "pollute" NFS code with pNFS code. We could
> place it within CONFIG_NFS_V4 constructs, I guess... Does that
> sound OK?
>
> Regards
> Rahul
>
>
>
>
> >
> > > Regards
> > > Rahul
> > >
> > >
> > >
> > >> -----Original Message-----
> > >> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> > >> Sent: Friday, September 29, 2006 7:42 AM
> > >> To: Benny Halevy
> > >> Cc: Iyer, Rahul; pnfs at linux-nfs.org
> > >> Subject: Re: [pnfs] The Boundary issue
> > >>
> > >> One main issue with your comment is that layoutget is
> > >> synchronous, so it either delays up front at write or later at
> > >> flush; either way there is a delay. The writeback cache can't
> > >> gather writes (reads) without the layout, so it must wait for a
> > >> result either way.
> > >>
> > >> For small I/O, deferring a layoutget until after the getattr in
> > >> the open compound makes sense, because then we will know the
> > >> threshold attribute and whether or not we require a layout. For
> > >> small I/O, writes/reads are performed synchronously (stable
> > >> writes), so sending the layoutget at write or flush (which turn
> > >> out to be the same thing) doesn't matter.
> > >>
> > >> For large I/O, the Linux VFS will flush data quite aggressively.
> > >> Therefore a layoutget on flush will not occur at the very end of
> > >> application writes/reads. Guessing the correct byte range at
> > >> write time seems unnecessary if the pNFS client can wait until
> > >> flush time and use accurate information (it can still do some
> > >> guessing at flush time as well).
> > >>
> > >> To me, layoutget should be at flush time, when the actual byte
> > >> range is known. As for sending a layoutget on open, if the
> > >> server can determine the layoutget is in the same compound as
> > >> the open and wishes to reject the layout request, that seems
> > >> like the right plan.
> > >>
> > >> As a bit of an aside, I remember Greg Ganger sent out an email a
> > >> long time ago talking about page gathering. I'm placing his
> > >> comments below.
> > >> I think he had the right idea. The Linux pNFS client doesn't
> > >> quite work the way he envisions, but using his comment as a
> > >> guide, the best way might be to not even bother with a boundary
> > >> in the general pNFS code.
> > >> We should just gather the biggest I/O request we can, then hand
> > >> it off to the layout driver and let it divide it up as it sees
> > >> fit. This is how the PVFS2 layout driver works.
> > >>
> > >> "This page gathering issue is one that I think can be simplified
> > >> by turning it around. Rather than trying to come up with a
> > >> generic guidance interface (block size = X, stripe size = Y,
> > >> etc.), the interface can have a simple callback for layout
> > >> drivers to request more pages on certain boundaries.
> > >> This should end up being simpler and more robust on both sides
> > >> of the interface (the rare win-win), if whatever the first guess
> > >> on guidance interfaces is not the right choice for all layout
> > >> drivers... and it's almost guaranteed not to be, if history is
> > >> any guide...
> > >> Greg"
> > >>
> > >> Dean
> > >>
> > >> Benny Halevy wrote:
> > >>
> > >>> Marc, the problem with waiting as long as possible with
> > >>> layoutget, until flush, is that it adds latency to flush, and
> > >>> this hurts applications that either block on fsync or close, or
> > >>> just write enough data that they fill the cache and then wait
> > >>> behind the syncer to free up memory so they can write more.
> > >>> In this case getting a layout on write vs. on sync when either
> > >>> you have enough dirty data for the file or you're low on memory
> > >>> can help establish a pipeline where you'll already have a
> > >>> layout in hand (or at least a layoutget on the wire) when you
> > >>> need to flush.
> > >>>
> > >>> More intelligence can be put into the byte range you ask for.
> > >>> The client can detect sequential access (for either read or
> > >>> write) and adjust the requested byte ranges accordingly. The
> > >>> layout driver or the file system can also adjust the layout
> > >>> according to the file striping.
> > >>> Benny
> > >>>
> > >>> Marc Eshel wrote:
> > >>>
> > >>>> Hi Rahul,
> > >>>> The point of not getting layout with file open is to wait and
> > >>>> see how much we have to write and only above some limit we get
> > >>>> the layout. I prefer the option of getting the layout at open
> > >>>> time but the other option of getting it at IO time should be
> > >>>> done as late as possible, so waiting for flush (like it is
> > >>>> done now) is a better approach. We have a better idea of how
> > >>>> much we need to write. Did you look at how complicated it will
> > >>>> be to break the write to multiple DSs if the write size is
> > >>>> above stripe size?
> > >>>> Marc.
> > >>>>
> > >>>> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
> > >>>>
> > >>>>
> > >>>>
> > >>>>> Hi Guys,
> > >>>>> I was looking at the boundary issue. It seems rather simple
> > >>>>> to fix in the read/write code paths. There seem to be 2
> > >>>>> solutions:
> > >>>>>
> > >>>>> 1. In both the read and write code paths,
> > >>>>> NFSPROTO(inode)->boundary() is called. This pointer points to
> > >>>>> pnfs_getboundary(). This function checks whether there is a
> > >>>>> layout. If there is one, it returns the stripesize present in
> > >>>>> the file (via a call to the layout driver's getstripesize
> > >>>>> function). In the absence of a layout pnfs_getboundary
> > >>>>> returns 0. In both the read and write cases,
> > >>>>> pnfs_getboundary() is called prior to making the request
> > >>>>> structs. So, we could call layoutget here and we'd be done.
> > >>>>>
> > >>>>> 2. The other approach is similar to what Benny mentioned. Get
> > >>>>> the layout on the write path (pnfs_file_write) rather than
> > >>>>> the flush path. Similarly for the read. In fact, this call is
> > >>>>> already being done to handle the non page cache I/O. All we
> > >>>>> need to do is move it further up in the function before the
> > >>>>> calls for the page cache based I/O.
> > >>>>>
> > >>>>> Regards
> > >>>>> Rahul
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> pNFS mailing list
> > >>>>> pNFS at linux-nfs.org
> > >>>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> > >>>>>
> > >>>>>
> > >> --
> > >> Dean Hildebrand
> > >> Ph.D. Candidate
> > >> University of Michigan
> > >>
> >
> > --
> > Dean Hildebrand
> > Ph.D. Candidate
> > University of Michigan
> >