[pnfs] The Boundary issue
Dean Hildebrand
dhildebz at eecs.umich.edu
Fri Sep 29 10:41:37 EDT 2006
One main issue with your comment is that layoutget is synchronous, so it
delays either up front at write or later at flush; either way there is a
delay. The writeback cache can't gather writes (or reads) without the
layout, so it must wait for a result either way.
For small I/O, deferring a layoutget until after the getattr in the open
compound makes sense, because then we will know the threshold attribute
and whether or not we require a layout. For small I/O, writes/reads are
performed synchronously (stable writes), so sending the layoutget at
write or flush (which turn out to be the same thing) doesn't matter.
For large I/O, the Linux VFS will flush data quite aggressively.
Therefore a layoutget on flush will not occur at the very end of
application writes/reads. Guessing the correct byte range at write time
seems unnecessary if the pNFS client can wait until flush time and use
accurate information (it can still do some guessing at flush time as well).
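To make the flush-time argument concrete, here is a minimal sketch of how
the byte range for a layoutget could be derived from the pages that are
actually dirty at flush. It is Python for brevity, not the Linux client
code; the function name, the 4096-byte page size, and the list of dirty
page indices are all assumptions made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def layout_range_from_dirty_pages(dirty_pages, page_size=PAGE_SIZE):
    """Return the (offset, length) byte range covering every dirty page.

    dirty_pages is a hypothetical collection of page indices that are
    dirty at flush time. Returns None when there is nothing to flush.
    """
    if not dirty_pages:
        return None
    first = min(dirty_pages)
    last = max(dirty_pages)
    # One contiguous range from the first dirty page to the last one.
    return (first * page_size, (last - first + 1) * page_size)
```

At write time only a guess at this range is available; at flush time the
range falls out of the dirty pages themselves.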
To me, layoutget should be at flush time, when the actual byte range is
known. As for sending a layoutget on open, if the server can determine
the layoutget is in the same compound as the open and wishes to reject
the layout request, that seems like the right plan.
As a bit of an aside, I remember Greg Ganger sent out an email a long
time ago talking about page gathering. I'm placing his comments below.
I think he had the right idea. The Linux pNFS client doesn't quite work
the way he envisions, but using his comment as a guide, the best way
might be to not even bother with a boundary in the general pNFS code.
We should just gather the biggest I/O request we can, then hand it off to
the layout driver and let it divide it up as it sees fit. This is how
the PVFS2 layout driver works.
"This page gathering issue is one that I think can be simplified by turning
it around. Rather than trying to come up with a generic guidance interface
(block size = X, stripe size = Y, etc.), the interface can have a simple
callback for layout drivers to request more pages on certain boundaries.
This should end up being simpler and more robust on both sides of the
interface (the rare win-win), if whatever the first guess on guidance
interfaces is not the right choice for all layout drivers... and it's
almost guaranteed not to be, if history is any guide...
Greg"
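As a rough illustration of handing one large gathered request to the
layout driver and letting it divide the request itself, here is a minimal
sketch in Python. It is not the PVFS2 layout driver code; the function
name and the simple fixed-stripe model are assumptions for illustration.

```python
def split_on_stripes(offset, length, stripe_size):
    """Split [offset, offset + length) into (offset, length) chunks
    so that no chunk crosses a stripe boundary.

    stripe_size is whatever the (hypothetical) layout driver knows
    about its own layout; the generic code never needs to see it.
    """
    if stripe_size <= 0:
        raise ValueError("stripe_size must be positive")
    chunks = []
    end = offset + length
    while offset < end:
        # First stripe boundary strictly after the current offset.
        next_boundary = (offset // stripe_size + 1) * stripe_size
        chunk_end = min(end, next_boundary)
        chunks.append((offset, chunk_end - offset))
        offset = chunk_end
    return chunks
```

With this shape the generic pNFS code only gathers pages; the boundary
knowledge stays inside the driver that owns it.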
Dean
Benny Halevy wrote:
> Marc, the problem with waiting as long as possible with layoutget, until
> flush, is that it adds latency to flush, and this hurts applications that
> either block on fsync or close, or just write enough data to fill the
> cache and then wait behind the syncer to free up memory so they can write
> more. In this case getting a layout on write vs. on sync when either you
> have enough dirty data for the file or you're low on memory can help
> establish a pipeline where you'll already have a layout in hand (or at
> least a layoutget on the wire) when you need to flush.
>
> More intelligence can be put into the byte range you ask for. The client
> can detect sequential access (for either read or write) and adjust the
> requested byte ranges accordingly. The layout driver or the file system
> can also adjust the layout according to the file striping.
>
> Benny
>
> Marc Eshel wrote:
>> Hi Rahul,
>> The point of not getting layout with file open is to wait and see how
>> much we have to write and only above some limit we get the layout. I
>> prefer the option of getting the layout at open time but the other
>> option of getting it at IO time should be done as late as possible,
>> so waiting for flush (like it is done now) is a better approach. We
>> have a better idea of how much we need to write. Did you look at how
>> complicated it will be to break the write to multiple DSs if the
>> write size is above stripe size?
>> Marc.
>>
>> pnfs-bounces at linux-nfs.org wrote on 09/28/2006 03:43:56 PM:
>>
>>
>>> Hi Guys,
>>> I was looking at the boundary issue. It seems rather simple to fix in
>>> the read/write code paths. There seem to be 2 solutions:
>>>
>>> 1. In both the read and write code paths, NFSPROTO(inode)->boundary()
>>> is called. This pointer points to pnfs_getboundary(). This function
>>> checks whether there is a layout. If there is one, it returns the
>>> stripe size present in the file (via a call to the layout driver's
>>> getstripesize function). In the absence of a layout,
>>> pnfs_getboundary() returns 0. In both the read and write cases,
>>> pnfs_getboundary() is called prior to making the request structs. So,
>>> we could call layoutget here and we'd be done.
>>>
>>> 2. The other approach is similar to what Benny mentioned. Get the
>>> layout on the write path (pnfs_file_write) rather than the flush path.
>>> Similarly for the read. In fact, this call is already being done to
>>> handle the non page cache I/O. All we need to do is move it further up
>>> in the function before the calls for the page cache based I/O.
>>>
>>> Regards
>>> Rahul
>>>
>>> _______________________________________________
>>> pNFS mailing list
>>> pNFS at linux-nfs.org
>>> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
>>>
--
Dean Hildebrand
Ph.D. Candidate
University of Michigan