[pnfs] The Boundary issue
Iyer, Rahul
Rahul.Iyer at netapp.com
Fri Sep 29 17:23:40 EDT 2006
Hi,
Comments inline.
Regards
Rahul
> -----Original Message-----
> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> Sent: Friday, September 29, 2006 1:20 PM
> To: Iyer, Rahul
> Cc: Benny Halevy; pnfs at linux-nfs.org
> Subject: Re: [pnfs] The Boundary issue
>
> Iyer, Rahul wrote:
> > Hey Guys,
> > In my patch, the virtual_update_layout() is done in
> > pnfs_file_read/write(). At this point, we are aware of the byte
> > range, so there's no need to guess.
> One minor point is that pnfs_file_read/write() is the
> read/write path, not the flush path, and therefore the byte
> range is not the byte range of the actual flushed read/write.
Yes, I understand this. My point was that all the data that is written
will eventually need to be flushed. So, we could get the layouts
earlier.
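
A minimal sketch of what I mean, in plain user-space C rather than
kernel code. The names here (layout_range, layout_covers) are
hypothetical and not taken from the patch; the only point is that on
the write path the offset and count are already known, so the check
for whether a layoutget is needed can be exact:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached layout: the byte range it covers (not from the patch). */
struct layout_range {
    bool     valid;   /* false until a layoutget has succeeded */
    uint64_t offset;  /* first byte covered */
    uint64_t length;  /* number of bytes covered */
};

/*
 * Return true if the cached layout already covers [off, off + count).
 * A caller on the write path would send a layoutget only when this
 * returns false. Overflow of off + count is ignored in this sketch.
 */
static bool layout_covers(const struct layout_range *lo,
                          uint64_t off, uint64_t count)
{
    if (!lo->valid)
        return false;
    return off >= lo->offset &&
           off + count <= lo->offset + lo->length;
}
```

The same check would work at flush time; the difference is only in
which byte range gets passed in.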
>
> > However, if the general consensus is to have the layoutget on the
> > flush path, then I recommend putting it in the boundary calculation
> > function (pnfs_getboundary), which is called from nfs_flush_list.
> > Personally, I feel this is simpler, and in no way less effective,
> > than having the requests be formed without any boundary information
> > and then splitting them at a later point in time.
> >
> I would go higher in the call stack, as the boundary is just
> the first attribute we come across; in the future there may
> be other attributes that are needed even sooner.
> nfs_sync_inode_wait seems like a good candidate since it calls
> layoutcommit (the begin and commit of the layout).
> Dean
Yes, we can put it there, but again, we're facing the problem where we
might "pollute" NFS code with pNFS code. We could place it within
CONFIG_NFSV4 constructs, I guess... Does that sound OK?
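
Roughly what I have in mind, as a user-space sketch of the usual
compile-time guard. pnfs_prepare_flush and generic_flush are made-up
names for illustration only; the real call site would be somewhere
like nfs_sync_inode_wait, and the real hook would do the layoutget:

```c
/* Hypothetical hook; names are illustrative, not from the pNFS tree. */
#ifdef CONFIG_NFSV4
static int pnfs_prepare_flush(void)
{
    /* pNFS-specific work (e.g. the layoutget) would go here. */
    return 1;
}
#else
/* Without NFSv4 support the hook compiles away to a no-op. */
static inline int pnfs_prepare_flush(void)
{
    return 0;
}
#endif

/* Stand-in for the generic NFS flush path; it needs no #ifdef itself. */
static int generic_flush(void)
{
    return pnfs_prepare_flush();
}
```

This keeps the #ifdef in one place, so the generic NFS code only sees
a single function call.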
Regards
Rahul
>
> > Regards
> > Rahul
> >
> >
> >
> >> -----Original Message-----
> >> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> >> Sent: Friday, September 29, 2006 7:42 AM
> >> To: Benny Halevy
> >> Cc: Iyer, Rahul; pnfs at linux-nfs.org
> >> Subject: Re: [pnfs] The Boundary issue
> >>
> >> One main issue with your comment is that layoutget is synchronous,
> >> so it either delays up front at write or later at flush; either
> >> way there is a delay. The writeback cache can't gather writes
> >> (reads) without the layout, so it must wait for a result either way.
> >>
> >> For small I/O, deferring a layoutget until after the getattr in
> >> the open compound makes sense, because then we will know the
> >> threshold attribute and whether or not we require a layout. For
> >> small I/O, writes/reads are performed synchronously (stable
> >> writes), so sending the layoutget at write or flush (which turn
> >> out to be the same thing) doesn't matter.
> >>
> >> For large I/O, the Linux VFS will flush data quite aggressively.
> >> Therefore a layoutget on flush will not occur at the very end of
> >> application writes/reads. Guessing the correct byte range at write
> >> time seems unnecessary if the pNFS client can wait until flush time
> >> and use accurate information (it can still do some guessing at
> >> flush time as well).
> >>
> >> To me, layoutget should be at flush time, when the actual byte
> >> range is known. As for sending a layoutget on open, if the server
> >> can determine the layoutget is in the same compound as the open and
> >> wishes to reject the layout request, that seems like the right plan.
> >>
> >> As a bit of an aside, I remember Greg Ganger sent out an email a
> >> long time ago talking about page gathering. I'm placing his
> >> comments below. I think he had the right idea. The Linux pNFS
> >> client doesn't quite work the way he envisions, but using his
> >> comment as a guide, the best way might be to not even bother with
> >> a boundary in the general pNFS code. We should just gather the
> >> biggest I/O request we can, then hand it off to the layout driver
> >> and let it divide it up as it sees fit. This is how the PVFS2
> >> layout driver works.
> >>
> >> "This page gathering issue is one that I think can be simplified
> >> by turning it around. Rather than trying to come up with a generic
> >> guidance interface (block size = X, stripe size = Y, etc.), the
> >> interface can have a simple callback for layout drivers to request
> >> more pages on certain boundaries.
> >> This should end up being simpler and more robust on both sides of
> >> the interface (the rare win-win), if whatever the first guess on
> >> guidance interfaces is not the right choice for all layout
> >> drivers... and it's almost guaranteed not to be, if history is any
> >> guide...
> >> Greg"
> >>
> >> Dean
> >>
> >> Benny Halevy wrote:
> >>
> >>> Marc, the problem with waiting as long as possible with layoutget,
> >>> until flush, is that it adds latency to flush, and this hurts
> >>> applications that either block on fsync or close, or just write
> >>> enough data that they fill the cache and then wait behind the
> >>> syncer to free up memory so they can write more.
> >>> In this case, getting a layout on write vs. on sync, when either
> >>> you have enough dirty data for the file or you're low on memory,
> >>> can help establish a pipeline where you'll already have a layout
> >>> in hand (or at least a layoutget on the wire) when you need to
> >>> flush.
> >>>
> >>> More intelligence can be put into the byte range you ask for. The
> >>> client can detect sequential access (for either read or write) and
> >>> adjust the requested byte ranges accordingly. The layout driver or
> >>> the file system can also adjust the layout according to the file
> >>> striping.
> >>>
> >>> Benny
> >>>
> >>> Marc Eshel wrote:
> >>>
> >>>> Hi Rahul,
> >>>> The point of not getting the layout at file open is to wait and
> >>>> see how much we have to write, and to get the layout only above
> >>>> some limit. I prefer the option of getting the layout at open
> >>>> time, but the other option of getting it at IO time should be
> >>>> done as late as possible, so waiting for flush (like it is done
> >>>> now) is a better approach. We have a better idea of how much we
> >>>> need to write. Did you look at how complicated it will be to
> >>>> break the write to multiple DSs if the write size is above the
> >>>> stripe size?
> >>>> Marc.
> >>>>
> >>>> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
> >>>>
> >>>>
> >>>>
> >>>>> Hi Guys,
> >>>>> I was looking at the boundary issue. It seems rather simple to
> >>>>> fix in the read/write code paths. There seem to be 2 solutions:
> >>>>>
> >>>>> 1. Both the read and write code paths call
> >>>>> NFSPROTO(inode)->boundary(). This pointer points to
> >>>>> pnfs_getboundary(). This function checks whether there is a
> >>>>> layout. If there is one, it returns the stripe size present in
> >>>>> the file (via a call to the layout driver's getstripesize
> >>>>> function). In the absence of a layout, pnfs_getboundary returns
> >>>>> 0. In both the read and write cases, pnfs_getboundary() is
> >>>>> called prior to making the request structs. So, we could call
> >>>>> layoutget here and we'd be done.
> >>>>>
> >>>>> 2. The other approach is similar to what Benny mentioned. Get
> >>>>> the layout on the write path (pnfs_file_write) rather than the
> >>>>> flush path. Similarly for the read. In fact, this call is
> >>>>> already being done to handle the non page cache I/O. All we
> >>>>> need to do is move it further up in the function before the
> >>>>> calls for the page cache based I/O.
> >>>>>
> >>>>> Regards
> >>>>> Rahul
> >>>>>
> >>>>> _______________________________________________
> >>>>> pNFS mailing list
> >>>>> pNFS at linux-nfs.org
> >>>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >>>>>
> >>>>>
> >>>
> >> --
> >> Dean Hildebrand
> >> Ph.D. Candidate
> >> University of Michigan
> >>
> >>
> >>
>
> --
> Dean Hildebrand
> Ph.D. Candidate
> University of Michigan
>