[pnfs] The Boundary issue

Iyer, Rahul Rahul.Iyer at netapp.com
Fri Sep 29 17:23:40 EDT 2006


Hi,
Comments inline.
Regards
Rahul 

> -----Original Message-----
> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu] 
> Sent: Friday, September 29, 2006 1:20 PM
> To: Iyer, Rahul
> Cc: Benny Halevy; pnfs at linux-nfs.org
> Subject: Re: [pnfs] The Boundary issue
> 
> Iyer, Rahul wrote:
> > Hey Guys,
> > In my patch, the virtual_update_layout() is done in 
> > pnfs_file_read/write(). At this point, we are aware of the 
> byte range, 
> > so there's no need to guess.
> One minor point is that pnfs_file_read/write() is the 
> read/write path, not the flush path, and therefore the byte 
> range is not the byte range of the actual flushed read/write.

Yes, I understand this. My point was that all the data that is written
will eventually need to be flushed. So, we could get the layouts
earlier. 
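
To make that concrete, here is a minimal sketch of the idea, not the actual
patch. Every name in it (layout_sketch, layoutget_sketch, write_path_sketch,
flush_path_sketch) is hypothetical, and it assumes the server grants exactly
the byte range requested. The write path asks for a layout as soon as it
knows the byte range, so a later flush inside that range needs no further
layoutget:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached layout: one contiguous byte range per file. */
struct layout_sketch {
    bool     valid;
    uint64_t offset;
    uint64_t length;
};

/* Counts simulated LAYOUTGET round trips (illustration only). */
static int layoutget_calls;

/* True if the cached layout fully covers [off, off + len). */
static bool layout_covers(const struct layout_sketch *lo,
                          uint64_t off, uint64_t len)
{
    return lo->valid &&
           off >= lo->offset &&
           off + len <= lo->offset + lo->length;
}

/* Stand-in for a synchronous LAYOUTGET; assumes the exact range is granted. */
static void layoutget_sketch(struct layout_sketch *lo,
                             uint64_t off, uint64_t len)
{
    layoutget_calls++;
    lo->valid  = true;
    lo->offset = off;
    lo->length = len;
}

/* Write path: the byte range is known here, so fetch the layout early. */
static void write_path_sketch(struct layout_sketch *lo,
                              uint64_t off, uint64_t len)
{
    if (!layout_covers(lo, off, len))
        layoutget_sketch(lo, off, len);
}

/* Flush path: only pays for a LAYOUTGET if the range is not covered yet. */
static void flush_path_sketch(struct layout_sketch *lo,
                              uint64_t off, uint64_t len)
{
    if (!layout_covers(lo, off, len))
        layoutget_sketch(lo, off, len);
}
```

With this shape, the flush only waits on a layoutget when it touches a range
the write path never asked for.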

> 
> > However, if the general consensus is to have the layoutget 
> on the flush 
> > path, then I recommend putting it in the boundary 
> calculation function 
> > (pnfs_getboundary), which is called from nfs_flush_list. 
> Personally, I 
> > feel this is simpler, and in no way less effective, than having the 
> > requests be formed without any boundary information and 
> then splitting 
> > them at a later point in time.
> >   
> I would go higher in the call stack, as the boundary is just 
> the first attribute we come across; in the future there may 
> be other attributes that are needed even sooner.  
> nfs_sync_inode_wait seems like a good 
> candidate since it calls layoutcommit (the begin and commit 
> of the layout).
> Dean

Yes, we can put it there, but again, we're facing the problem where we
might "pollute" NFS code with pNFS code. We could place it within
CONFIG_NFSV4 constructs, I guess... Does that sound OK?
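
Roughly, the guard could look like the sketch below. This is only an
illustration of the #ifdef pattern, not proposed code: inode_sketch,
pnfs_prepare_flush_sketch and sync_inode_wait_sketch are hypothetical
stand-ins, and CONFIG_NFSV4 is defined by hand here whereas in the kernel it
comes from the build configuration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct inode. */
struct inode_sketch {
    bool has_layout;
};

/* Assumption: defined by hand for this sketch; Kconfig sets it in the kernel. */
#define CONFIG_NFSV4 1

#ifdef CONFIG_NFSV4
/* NFSv4 build: make sure a layout is in hand before flushing. */
static int pnfs_prepare_flush_sketch(struct inode_sketch *inode)
{
    if (!inode->has_layout)
        inode->has_layout = true;   /* stands in for a layoutget */
    return 0;
}
#else
/* Non-NFSv4 build: the hook compiles down to a no-op. */
static inline int pnfs_prepare_flush_sketch(struct inode_sketch *inode)
{
    (void)inode;
    return 0;
}
#endif

/* Stand-in for a generic sync entry point such as nfs_sync_inode_wait. */
static int sync_inode_wait_sketch(struct inode_sketch *inode)
{
    int err = pnfs_prepare_flush_sketch(inode);

    if (err)
        return err;
    /* ... the generic NFS flush work would continue here ... */
    return 0;
}
```

Keeping the conditional around the helper rather than inside the sync
function means the generic NFS path has a single unconditional call.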

Regards
Rahul




> 
> > Regards
> > Rahul
> >
> >
> >   
> >> -----Original Message-----
> >> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu]
> >> Sent: Friday, September 29, 2006 7:42 AM
> >> To: Benny Halevy
> >> Cc: Iyer, Rahul; pnfs at linux-nfs.org
> >> Subject: Re: [pnfs] The Boundary issue
> >>
> >> One main issue with your comment is that layoutget is 
> synchronous, so 
> >> it either delays up front at write or later at flush, either way 
> >> there is a delay.  The writeback cache can't gather writes (reads) 
> >> without the layout, so it must wait for a result either way.
> >>
> >> For small I/O, deferring a layoutget until after the 
> getattr in the 
> >> open compound makes sense, because then we will know the threshold 
> >> attribute and whether or not we require a layout.  For small I/O, 
> >> writes/reads are performed synchronously (stable writes), 
> so sending 
> >> the layoutget at write or flush (which turn out to be the 
> same thing) 
> >> doesn't matter.
> >>
> >> For large I/O, the Linux VFS will flush data quite aggressively.  
> >> Therefore a layoutget on flush will not occur at the very end of 
> >> application writes/reads.  Guessing the correct byte range 
> at write 
> >> time seems unnecessary if the pNFS client can wait until 
> flush time 
> >> and use accurate information (it can still do some 
> guessing at flush 
> >> time as well).
> >>
> >> To me, layoutget should be at flush time, when the actual 
> byte range 
> >> is known.  As for sending a layoutget on open, if the server can 
> >> determine the layoutget is in the same compound as the open and 
> >> wishes to reject the layout request, that seems like the 
> right plan.
> >>
> >> As a bit of an aside, I remember Greg Ganger sent out an 
> email a long 
> >> time ago talking about page gathering.  I'm placing his comments 
> >> below.
> >> I think he had the right idea.  The Linux pNFS client 
> doesn't quite 
> >> work the way he envisions, but using his comment as a 
> guide, the best 
> >> way might be to not even bother with a boundary in the 
> general pNFS 
> >> code.
> >> We should just gather the biggest I/O request we can then 
> hand it off 
> >> to the layout driver and let it divide it up as it sees 
> fit.  This is 
> >> how the PVFS2 layout driver works.
> >>
> >> "This page gathering issue is one that I think can be 
> simplified by 
> >> turning it around.  Rather than trying to come up with a generic 
> >> guidance interface (block size = X, stripe size = Y, etc.), the 
> >> interface can have a simple callback for layout drivers to request 
> >> more pages on certain boundaries.
> >> This should end up being simpler and more robust on both 
> sides of the 
> >> interface (the rare win-win), if whatever the first guess 
> on guidance 
> >> interfaces is not the right choice for all layout 
> drivers... and it's 
> >> almost guaranteed not to be, if history is any guide...
> >> Greg"
> >>
> >> Dean
> >>
> >> Benny Halevy wrote:
> >>     
> >>> Marc, the problem with waiting as long as possible with layoutget,
> >>> until flush, is that it adds latency to flush, and this hurts
> >>> applications that either block on fsync or close, or just write
> >>> enough data to fill the cache and then wait behind the syncer to
> >>> free up memory so they can write more.
> >>> In this case, getting a layout on write, rather than on sync (when
> >>> either you have enough dirty data for the file or you're low on
> >>> memory), can help establish a pipeline where you'll already have a
> >>> layout in hand (or at least a layoutget on the wire) when you need
> >>> to flush.
> >>>
> >>> More intelligence can be put into the byte range you ask for.  The
> >>> client can detect sequential access (for either read or write) and
> >>> adjust the requested byte ranges accordingly. The layout driver or
> >>> the file system can also adjust the layout according to the file
> >>> striping.
> >>> Benny
> >>>
> >>> Marc Eshel wrote:
> >>>       
> >>>> Hi Rahul,
> >>>> The point of not getting the layout at file open is to wait and see
> >>>> how much we have to write, and to get the layout only above some
> >>>> limit. I prefer the option of getting the layout at open time, but
> >>>> the other option, getting it at I/O time, should be done as late as
> >>>> possible, so waiting for flush (as is done now) is the better
> >>>> approach: by then we have a better idea of how much we need to
> >>>> write. Did you look at how complicated it will be to break the
> >>>> write across multiple DSs if the write size is above the stripe size?
> >>>> Marc.
> >>>>
> >>>> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
> >>>>
> >>>>  
> >>>>         
> >>>>> Hi Guys,
> >>>>> I was looking at the boundary issue. It seems rather simple to fix
> >>>>> in the read/write code paths. There seem to be 2 solutions:
> >>>>>
> >>>>> 1. In both the read and write code paths,
> >>>>> NFSPROTO(inode)->boundary() is called. This pointer points to
> >>>>> pnfs_getboundary(). This function checks whether there is a layout.
> >>>>> If there is one, it returns the file's stripe size (via a call to
> >>>>> the layout driver's getstripesize function). In the absence of a
> >>>>> layout, pnfs_getboundary() returns 0. In both the read and write
> >>>>> cases, pnfs_getboundary() is called prior to making the request
> >>>>> structs. So, we could call layoutget here and we'd be done.
> >>>>>
> >>>>> 2. The other approach is similar to what Benny mentioned. Get the
> >>>>> layout on the write path (pnfs_file_write) rather than the flush
> >>>>> path. Similarly for the read. In fact, this call is already being
> >>>>> done to handle the non page cache I/O. All we need to do is move it
> >>>>> further up in the function, before the calls for the page cache
> >>>>> based I/O.
> >>>>>
> >>>>> Regards
> >>>>> Rahul
> >>>>>
> >>>>> _______________________________________________
> >>>>> pNFS mailing list
> >>>>> pNFS at linux-nfs.org
> >>>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >> --
> >> Dean Hildebrand
> >> Ph.D. Candidate
> >> University of Michigan
> >>
> 
> --
> Dean Hildebrand
> Ph.D. Candidate
> University of Michigan
> 

