[pnfs] The Boundary issue

Iyer, Rahul Rahul.Iyer at netapp.com
Thu Oct 12 18:28:05 EDT 2006


Hi all,
Just to recap the past few email exchanges on this issue and also
yesterday's call...
Dean suggests that the call to virtual_update_layout() be made in
nfs_sync_inode_wait(). This is where the flushing is started and then
the layoutcommit is done. 

On the read path, we'll probably have to place the virtual_update_layout
call in nfs_readpages(). Is this fine by everyone?
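
To make the proposed placement concrete, here is a minimal user-space
sketch of the ordering only; it is not kernel code. The *_model functions
and struct inode_model are hypothetical stand-ins for
nfs_sync_inode_wait(), nfs_readpages() and virtual_update_layout(), whose
real signatures are not shown in this thread. Fetching the layout only
when none is held yet is an assumption made for the illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-inode state; not the kernel struct. */
struct inode_model {
    int has_layout;      /* non-zero once a layout has been obtained */
    char trace[128];     /* comma-separated record of the steps taken */
};

/* Append one step name to the trace so the ordering can be inspected. */
static void record(struct inode_model *ino, const char *step)
{
    size_t used = strlen(ino->trace);

    snprintf(ino->trace + used, sizeof(ino->trace) - used, "%s%s",
             used ? "," : "", step);
}

/* Models virtual_update_layout(): issue a layoutget only if needed. */
static int virtual_update_layout_model(struct inode_model *ino)
{
    if (!ino->has_layout) {
        record(ino, "layoutget");
        ino->has_layout = 1;
    }
    return 0;
}

/* Models the write/flush side: layout first, then flush, then commit. */
static int sync_inode_wait_model(struct inode_model *ino)
{
    int err = virtual_update_layout_model(ino);

    if (err)
        return err;
    record(ino, "flush");
    record(ino, "layoutcommit");
    return 0;
}

/* Models the read side: layout first, then the page reads. */
static int readpages_model(struct inode_model *ino)
{
    int err = virtual_update_layout_model(ino);

    if (err)
        return err;
    record(ino, "read");
    return 0;
}
```

Calling sync_inode_wait_model() on a zero-initialized inode_model records
the layoutget before the flush and the layoutcommit after it; a later
readpages_model() on the same inode reuses the layout already held.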

The only concern with this implementation is that we'll add #ifdef
CONFIG_NFS_V4 around all these layoutget calls.
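
One common way to keep the #ifdef out of the call sites is to define the
hook once, with an empty inline stub for the non-v4 build. The sketch
below is a user-space illustration under that assumption;
pnfs_update_layout_hook() and struct inode_model are hypothetical names,
not existing functions in the tree.

```c
#include <stdio.h>

/* Hypothetical stand-in for the inode; not the kernel struct. */
struct inode_model {
    int layout_valid;
};

#ifdef CONFIG_NFS_V4
/* pNFS build: the hook obtains the layout (modelled as setting a flag). */
static inline int pnfs_update_layout_hook(struct inode_model *ino)
{
    ino->layout_valid = 1;
    return 1;
}
#else
/* Non-v4 build: the hook compiles to nothing, so callers need no #ifdef. */
static inline int pnfs_update_layout_hook(struct inode_model *ino)
{
    (void)ino;
    return 0;
}
#endif

/* Generic sync path: the same source line works in both configurations. */
static int sync_inode_wait_model(struct inode_model *ino)
{
    int fetched = pnfs_update_layout_hook(ino);

    /* ... generic flush work would go here ... */
    return fetched;
}
```

Built without -DCONFIG_NFS_V4, sync_inode_wait_model() returns 0 and
leaves layout_valid untouched; built with it, the hook runs. The single
#ifdef then lives next to the hook definition rather than around every
layoutget call.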
Regards
Rahul
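
P.S. For anyone joining the thread late: the boundary handling described
in option 1 of the original message quoted at the bottom amounts to
cutting a byte range wherever it crosses a multiple of the stripe size,
and leaving it whole when the boundary is 0 (no layout). A rough
user-space sketch follows, assuming that behaviour; split_at_boundary()
and struct io_chunk are hypothetical names, and the 64 KiB stripe size in
the note after the code is only an example value.

```c
#include <stddef.h>

/* One contiguous piece of an I/O request (hypothetical type). */
struct io_chunk {
    unsigned long long offset;
    unsigned long long len;
};

/*
 * Split [offset, offset + len) so that no chunk crosses a multiple of
 * boundary. A boundary of 0 models "no layout": the range is returned
 * as a single chunk. Returns the number of chunks stored in out, which
 * is at most max_chunks.
 */
static size_t split_at_boundary(unsigned long long offset,
                                unsigned long long len,
                                unsigned long long boundary,
                                struct io_chunk *out, size_t max_chunks)
{
    size_t n = 0;

    while (len > 0 && n < max_chunks) {
        unsigned long long take = len;

        if (boundary != 0) {
            /* Bytes left before the next boundary multiple. */
            unsigned long long room = boundary - (offset % boundary);

            if (room < take)
                take = room;
        }
        out[n].offset = offset;
        out[n].len = take;
        n++;
        offset += take;
        len -= take;
    }
    return n;
}
```

With a boundary of 65536, a 131072-byte request starting at offset 4096
comes back as three chunks of 61440, 65536 and 4096 bytes; with a
boundary of 0 it comes back as one chunk.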


> -----Original Message-----
> From: Iyer, Rahul 
> Sent: Friday, September 29, 2006 2:24 PM
> To: Dean Hildebrand
> Cc: pnfs at linux-nfs.org
> Subject: RE: [pnfs] The Boundary issue
> 
> Hi,
> Comments inline.
> Regards
> Rahul 
> 
> > -----Original Message-----
> > From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> > Sent: Friday, September 29, 2006 1:20 PM
> > To: Iyer, Rahul
> > Cc: Benny Halevy; pnfs at linux-nfs.org
> > Subject: Re: [pnfs] The Boundary issue
> > 
> > Iyer, Rahul wrote:
> > > Hey Guys,
> > > In my patch, the virtual_update_layout() is done in
> > > pnfs_file_read/write(). At this point, we are aware of the byte
> > > range, so there's no need to guess.
> > One minor point is that pnfs_file_read/write() is the read/write
> > path, not the flush path, and therefore the byte range is not the
> > byte range of the actual flushed read/write.
> 
> Yes, I understand this. My point was that all the data that 
> is written will eventually need to be flushed. So, we could 
> get the layouts earlier. 
> 
> > 
> > > However, if the general consensus is to have the layoutget on the
> > > flush path, then I recommend putting it in the boundary calculation
> > > function (pnfs_getboundary), which is called from nfs_flush_list.
> > > Personally, I feel this is simpler, and in no way less effective,
> > > than having the requests be formed without any boundary information
> > > and then splitting them at a later point in time.
> > >   
> > I would go higher in the call stack, as the boundary is just the
> > first attribute we come across; in the future there may be other
> > attributes that are needed even sooner.
> > nfs_sync_inode_wait seems like a good candidate since it calls
> > layoutcommit (the begin and commit of the layout).
> > Dean
> 
> Yes, we can put it there, but again, we're facing the problem
> where we might "pollute" NFS code with pNFS code. We could
> place it within CONFIG_NFS_V4 constructs, I guess... Does that
> sound OK?
> 
> Regards
> Rahul
> 
> 
> 
> 
> > 
> > > Regards
> > > Rahul
> > >
> > >
> > >   
> > >> -----Original Message-----
> > >> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> > >> Sent: Friday, September 29, 2006 7:42 AM
> > >> To: Benny Halevy
> > >> Cc: Iyer, Rahul; pnfs at linux-nfs.org
> > >> Subject: Re: [pnfs] The Boundary issue
> > >>
> > >> One main issue with your comment is that layoutget is synchronous,
> > >> so it either delays up front at write or later at flush; either
> > >> way there is a delay.  The writeback cache can't gather writes
> > >> (reads) without the layout, so it must wait for a result either way.
> > >>
> > >> For small I/O, deferring a layoutget until after the getattr in
> > >> the open compound makes sense, because then we will know the
> > >> threshold attribute and whether or not we require a layout.  For
> > >> small I/O, writes/reads are performed synchronously (stable
> > >> writes), so sending the layoutget at write or flush (which turn
> > >> out to be the same thing) doesn't matter.
> > >>
> > >> For large I/O, the Linux VFS will flush data quite aggressively.
> > >> Therefore a layoutget on flush will not occur at the very end of
> > >> application writes/reads.  Guessing the correct byte range at
> > >> write time seems unnecessary if the pNFS client can wait until
> > >> flush time and use accurate information (it can still do some
> > >> guessing at flush time as well).
> > >>
> > >> To me, layoutget should be at flush time, when the actual byte
> > >> range is known.  As for sending a layoutget on open, if the server
> > >> can determine the layoutget is in the same compound as the open and
> > >> wishes to reject the layout request, that seems like the right plan.
> > >>
> > >> As a bit of an aside, I remember Greg Ganger sent out an email a
> > >> long time ago talking about page gathering.  I'm placing his
> > >> comments below.
> > >> I think he had the right idea.  The Linux pNFS client doesn't quite
> > >> work the way he envisions, but using his comment as a guide, the
> > >> best way might be to not even bother with a boundary in the general
> > >> pNFS code.
> > >> We should just gather the biggest I/O request we can then hand it
> > >> off to the layout driver and let it divide it up as it sees fit.
> > >> This is how the PVFS2 layout driver works.
> > >>
> > >> "This page gathering issue is one that I think can be simplified
> > >> by turning it around.  Rather than trying to come up with a generic
> > >> guidance interface (block size = X, stripe size = Y, etc.), the
> > >> interface can have a simple callback for layout drivers to request
> > >> more pages on certain boundaries.
> > >> This should end up being simpler and more robust on both sides of
> > >> the interface (the rare win-win), if whatever the first guess on
> > >> guidance interfaces is not the right choice for all layout
> > >> drivers... and it's almost guaranteed not to be, if history is any
> > >> guide...
> > >> Greg"
> > >>
> > >> Dean
> > >>
> > >> Benny Halevy wrote:
> > >>     
> > >>> Marc, the problem with waiting as long as possible with layoutget,
> > >>> until flush, is that it adds latency to flush, and this hurts
> > >>> applications that either block on fsync or close, or just write
> > >>> enough data so they fill the cache and then wait behind the syncer
> > >>> to free up memory so they can write more.
> > >>> In this case getting a layout on write vs. on sync when either you
> > >>> have enough dirty data for the file or you're low on memory can
> > >>> help establish a pipeline where you'll already have a layout in
> > >>> hand (or at least a layoutget on the wire) when you need to flush.
> > >>>
> > >>> More intelligence can be put into the byte range you ask for.  The
> > >>> client can detect sequential access (for either read or write) and
> > >>> adjust the requested byte ranges accordingly. The layout driver or
> > >>> the file system can also adjust the layout according to the file
> > >>> striping.
> > >>> Benny
> > >>>
> > >>> Marc Eshel wrote:
> > >>>       
> > >>>> Hi Rahul,
> > >>>> The point of not getting layout with file open is to wait and see
> > >>>> how much we have to write and only above some limit we get the
> > >>>> layout. I prefer the option of getting the layout at open time but
> > >>>> the other option of getting it at IO time should be done as late
> > >>>> as possible, so waiting for flush (like it is done now) is a
> > >>>> better approach. We have a better idea of how much we need to
> > >>>> write. Did you look at how complicated it will be to break the
> > >>>> write to multiple DSs if the write size is above stripe size?
> > >>>> Marc.
> > >>>>
> > >>>> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
> > >>>>
> > >>>>  
> > >>>>         
> > >>>>> Hi Guys,
> > >>>>> I was looking at the boundary issue. It seems rather simple to fix
> > >>>>> in the read/write code paths. There seem to be 2 solutions:
> > >>>>>
> > >>>>> 1. In both the read and write code paths,
> > >>>>> NFSPROTO(inode)->boundary() is called. This pointer points to
> > >>>>> pnfs_getboundary(). This function checks whether there is a
> > >>>>> layout. If there is one, it returns the stripe size of the file
> > >>>>> (via a call to the layout driver's getstripesize function). In
> > >>>>> the absence of a layout, pnfs_getboundary() returns 0. In both
> > >>>>> the read and write cases, pnfs_getboundary() is called prior to
> > >>>>> making the request structs. So, we could call layoutget here and
> > >>>>> we'd be done.
> > >>     
> > >>>>> 2. The other approach is similar to what Benny mentioned. Get the
> > >>>>> layout on the write path (pnfs_file_write) rather than the flush
> > >>>>> path. Similarly for the read. In fact, this call is already being
> > >>>>> done to handle the non page cache I/O. All we need to do is move
> > >>>>> it further up in the function before the calls for the page cache
> > >>>>> based I/O.
> > >>>>>
> > >>>>> Regards
> > >>>>> Rahul
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> pNFS mailing list
> > >>>>> pNFS at linux-nfs.org
> > >>>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> > >>>>>     
> > >>>>>           
> > >> --
> > >> Dean Hildebrand
> > >> Ph.D. Candidate
> > >> University of Michigan
> > >>
> > 
> > --
> > Dean Hildebrand
> > Ph.D. Candidate
> > University of Michigan
> > 

