[pnfs] The Boundary issue

Iyer, Rahul Rahul.Iyer at netapp.com
Fri Sep 29 15:16:59 EDT 2006


Hey Guys,
In my patch, virtual_update_layout() is called from
pnfs_file_read/write(). At that point we already know the byte range,
so there's no need to guess. I also believe the real tradeoff is where we
want to block behind the synchronous layoutget: the write path or the
flush path. Wouldn't blocking on the flush path be less desirable, since
during a flush (for example, under memory pressure) you'd want the memory
to be cleaned ASAP?
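
To make the write-path ordering concrete, here is a minimal user-space
sketch, not code from the patch. The struct, the counter, and the helper
names (layout_range, update_layout, file_write) are hypothetical stand-ins
for virtual_update_layout() and pnfs_file_write(), and the "layoutget" is
only simulated by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached layout; not the kernel's type. */
struct layout_range {
    bool     valid;
    uint64_t offset;
    uint64_t length;
};

/* Counts simulated synchronous layoutget round trips. */
static int layoutget_calls;

/* True if the cached layout already covers [off, off + len). */
static bool layout_covers(const struct layout_range *lo,
                          uint64_t off, uint64_t len)
{
    return lo->valid && off >= lo->offset &&
           off + len <= lo->offset + lo->length;
}

/* Stand-in for virtual_update_layout(): fetch only when not covered. */
static void update_layout(struct layout_range *lo,
                          uint64_t off, uint64_t len)
{
    if (layout_covers(lo, off, len))
        return;
    layoutget_calls++;          /* the writer would block here */
    lo->valid  = true;
    lo->offset = off;           /* exact range, no guessing */
    lo->length = len;
}

/* Stand-in for pnfs_file_write(): the byte range is known on entry. */
static void file_write(struct layout_range *lo,
                       uint64_t pos, uint64_t count)
{
    assert(count > 0);
    update_layout(lo, pos, count);
    /* ... dirty the page cache; a later flush finds the layout held ... */
}
```

In this sketch a second write inside the already covered range costs no
extra round trip, and the flush path never has to wait for a layout.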

However, if the general consensus is to have the layoutget on the flush
path, then I recommend putting it in the boundary calculation function
(pnfs_getboundary), which is called from nfs_flush_list. Personally, I
feel this is simpler than, and in no way less effective than, forming the
requests without any boundary information and then splitting them at a
later point in time.
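
As a rough illustration of the flush-path variant, the sketch below folds
a simulated layoutget into a boundary function and then clamps a request
so it never crosses that boundary. It is a user-space sketch under stated
assumptions: file_layout, fetch_layout, get_boundary, and
clamp_to_boundary are hypothetical names, and the stripe size is passed in
rather than returned by a server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file layout state; not the kernel's type. */
struct file_layout {
    bool     have_layout;
    uint64_t stripe_size;
};

/* Counts simulated synchronous layoutget round trips. */
static int boundary_layoutgets;

/* Simulated layoutget; stripe_size stands in for the server's answer. */
static void fetch_layout(struct file_layout *fl, uint64_t stripe_size)
{
    assert(stripe_size > 0);
    boundary_layoutgets++;
    fl->have_layout = true;
    fl->stripe_size = stripe_size;
}

/* Stand-in for pnfs_getboundary() with the layoutget folded in. */
static uint64_t get_boundary(struct file_layout *fl, uint64_t server_stripe)
{
    if (!fl->have_layout)
        fetch_layout(fl, server_stripe);
    return fl->stripe_size;
}

/* Longest length starting at offset that stays within one stripe.
 * A boundary of 0 means "no boundary", so the length is unchanged. */
static uint64_t clamp_to_boundary(uint64_t offset, uint64_t len,
                                  uint64_t boundary)
{
    if (boundary == 0)
        return len;
    uint64_t room = boundary - (offset % boundary);
    return len < room ? len : room;
}
```

With the boundary known before the request structs are built, each request
is formed at the right size and nothing needs to be split afterwards.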
Regards
Rahul


> -----Original Message-----
> From: Dean Hildebrand [mailto:dhildebz at eecs.umich.edu] 
> Sent: Friday, September 29, 2006 7:42 AM
> To: Benny Halevy
> Cc: Iyer, Rahul; pnfs at linux-nfs.org
> Subject: Re: [pnfs] The Boundary issue
> 
> One main issue with your comment is that layoutget is 
> synchronous, so it delays either up front at write or later 
> at flush; either way there is a delay.  The writeback cache 
> can't gather writes (or reads) without the layout, so it must 
> wait for a result either way.
> 
> For small I/O, deferring a layoutget until after the getattr 
> in the open compound makes sense, because then we will know 
> the threshold attribute and whether or not we require a 
> layout.  For small I/O, writes/reads are performed 
> synchronously (stable writes), so sending the layoutget at 
> write or flush (which turn out to be the same thing) doesn't matter.
> 
> For large I/O, the Linux VFS will flush data quite aggressively.  
> Therefore a layoutget on flush will not occur at the very end 
> of application writes/reads.  Guessing the correct byte range 
> at write time seems unnecessary if the pNFS client can wait 
> until flush time and use accurate information (it can still 
> do some guessing at flush time as well).
> 
> To me, layoutget should be at flush time, when the actual 
> byte range is known.  As for sending a layoutget on open, if 
> the server can determine the layoutget is in the same 
> compound as the open and wishes to reject the layout request, 
> that seems like the right plan.
> 
> As a bit of an aside, I remember Greg Ganger sent out an 
> email a long time ago talking about page gathering.  I'm 
> placing his comments below.  I think he had the right idea.  
> The Linux pNFS client doesn't quite work the way he envisions, 
> but using his comment as a guide, the best way might be to not 
> even bother with a boundary in the general pNFS code.  We 
> should just gather the biggest I/O request we can, then hand 
> it off to the layout driver and let it divide it up as it 
> sees fit.  This is how the PVFS2 layout driver works.
> 
> "This page gathering issue is one that I think can be 
> simplified by turning it around.  Rather than trying to come 
> up with a generic guidance interface (block size = X, stripe 
> size = Y, etc.), the interface can have a simple callback for 
> layout drivers to request more pages on certain boundaries.
> This should end up being simpler and more robust on both 
> sides of the interface (the rare win-win), if whatever the 
> first guess on guidance interfaces is not the right choice 
> for all layout drivers... and it's almost guaranteed not to 
> be, if history is any guide...
> Greg"
> 
> Dean
> 
> Benny Halevy wrote:
> > Marc, the problem with deferring layoutget as long as possible,
> > until flush, is that it adds latency to the flush. This hurts
> > applications that block on fsync or close, or that simply write
> > enough data to fill the cache and then wait behind the syncer to
> > free up memory so they can write more.
> > In this case, getting the layout on write, rather than on sync (when
> > you either have enough dirty data for the file or are low on memory),
> > can help establish a pipeline in which you already have a layout in
> > hand (or at least a layoutget on the wire) when you need to flush.
> >
> > More intelligence can be put into the byte range you ask for.  The
> > client can detect sequential access (for either read or write) and
> > adjust the requested byte ranges accordingly. The layout driver or the
> > file system can also adjust the layout according to the file striping.
> >
> > Benny
> >
> > Marc Eshel wrote:
> >> Hi Rahul,
> >> The point of not getting the layout at file open is to wait and see
> >> how much we have to write, and to get the layout only above some
> >> limit. I prefer the option of getting the layout at open time, but the
> >> other option, getting it at I/O time, should be done as late as
> >> possible, so waiting for flush (as is done now) is the better
> >> approach: by then we have a better idea of how much we need to write.
> >> Did you look at how complicated it will be to break the write across
> >> multiple DSs if the write size is above the stripe size?
> >> Marc.
> >>
> >> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
> >>
> >>  
> >>> Hi Guys,
> >>> I was looking at the boundary issue. It seems rather simple to fix in
> >>> the read/write code paths. There seem to be 2 solutions:
> >>>
> >>> 1. In both the read and write code paths,
> >>> NFSPROTO(inode)->boundary() is called. This pointer points to
> >>> pnfs_getboundary(). This function checks whether there is a layout.
> >>> If there is one, it returns the stripe size of the file (via
> >>> a call to the layout driver's getstripesize function). In the
> >>> absence of a layout, pnfs_getboundary() returns 0. In both the read
> >>> and write cases, pnfs_getboundary() is called prior to making the
> >>> request structs. So, we could call layoutget here and we'd be done.
> >>>
> >>> 2.
> >>> The other approach is similar to what Benny mentioned. Get the
> >>> layout on the write path (pnfs_file_write) rather than the flush
> >>> path, and similarly for the read. In fact, this call is already being
> >>> made to handle the non-page-cache I/O. All we need to do is move it
> >>> further up in the function, before the calls for the page-cache-based
> >>> I/O.
> >>>
> >>> Regards
> >>> Rahul
> >>>
> >>> _______________________________________________
> >>> pNFS mailing list
> >>> pNFS at linux-nfs.org
> >>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >>>     
> >>
> >
> 
> --
> Dean Hildebrand
> Ph.D. Candidate
> University of Michigan
> 

