[pnfs] Need for coherent layout driver interface

William A. (Andy) Adamson andros at citi.umich.edu
Thu Mar 20 09:40:53 EDT 2008


On Wed, Mar 19, 2008 at 2:04 PM, Dean Hildebrand <seattleplus at gmail.com>
wrote:

>
>
> William A. (Andy) Adamson wrote:
> >
> >
> > On Fri, Mar 14, 2008 at 3:03 PM, Dean Hildebrand
> > <seattleplus at gmail.com <mailto:seattleplus at gmail.com>> wrote:
> >
> >     Excerpt from block patch:
> >     > --- a/include/linux/nfs4_pnfs.h
> >     > +++ b/include/linux/nfs4_pnfs.h
> >     > @@ -141,6 +141,10 @@ struct layoutdriver_io_operations {
> >     >       int (*write_begin) (struct pnfs_layout_segment *lseg,
> >     struct page *page,
> >     >                           loff_t pos, unsigned count,
> >     >                           struct pnfs_fsdata **fsdata);
> >     > +     /* Hook into nfs_create_request, for setting wb_private */
> >     > +     void (*new_request)(struct pnfs_layout_segment *lseg,
> >     > +                         struct nfs_page *req, loff_t pos,
> >     unsigned count,
> >     > +                         struct pnfs_fsdata *fsdata);
> >     > void (*free_request_data) (struct nfs_page *);
> >
> >     > void (*free_fsdata) (struct pnfs_fsdata *data);
> >
> >     I have a problem with these additions to the layout driver i/o
> >     interface.  This isn't a complaint necessarily at just these new
> >     ops for
> >     the block driver (although they make the problem much much worse),
> but
> >     our I/O interface in general.  I would like a freeze on the
> >     layoutdriver_io_operations struct (without the block patches) until
> we
> >     can figure it out.  If we don't do it now, it will become onerous to
> >     change the layout drivers later on.
> >
> >
> > It is a good idea to review the interface.
> >
> >
> >
> >
> >     Posing as a new layout driver implementor for a second, I have the
> >     following functions that I might want to implement to do writes:
> >     write_begin
> >     new_request
> >     flush_one
> >     free_request_data
> >     write_pagelist
> >     free_fsdata
> >     free_request_data
> >
> >     What is the motivation of each one?  Why would I implement one and
> not
> >     another?  Why are there names so generic, e.g., what type of new
> >     request?
> >
> >
> > Good questions. The file layout driver introduced a new private field
> > in struct nfs_page: wb_private. it stores the data server info during
> > it's write_pagelist() so that it's flush_one() does not have to look
> > it up again.
> I think you got the order that these functions are called mixed up.
>
>
Sorry,  I mixed up nfs_writepage and pnfs_writepagelist. You are correct -
flush_one ends up calling write_pagelist.



> flush_one stores the data server in wb_private and write_pagelist uses it.
>

as does filelayout_commit().


>
>
> The whole wb_private and flush_one code is file layout specific hacks to
> the layout driver and need to be removed.  I can see how it was useful
> just to get file layout up and running, but no one else requires it.
> Thankfully no one else has started using it, as it places too much
> layout driver specific data in common nfs code.


The  2.6.18 pnfs codebase added the wb_devid field to struct nfs_page. It
was used only by the file layout to avoid looking up the data server in both
filelayout_write_pagelist and in filelayout_commit(). wb_devid was set in
file layout_write_pagelist and used in filelayout_commit when commit to data
servers. The wb_private field performs the same function.

As far as filelayout_flush_one goes - I'm open to another way to get the job
done. It made the most sense to me to set the data server at coalesce time,
but perhaps we don't need to.

What filelayout_flush_one avoids is creating an nfs_write_data struct with a
struct nfs_page list containing pages going to multiple storage servers.

The pg_test interface is designed to coalesce struct nfs_page's up to the
dswrite size without crossing a stripe boundary. This should mean the
coalesced struct nfs_page list will all be going to the same storage server.
I think this code is still OK WRT draft-21 file layout striping.

If this is the case, or if we can make it so, filelayout_flush_one will only
ever see a list of struct nfs_pages going to a single Data Server and  using
nfs_flush_one plus getting the data sever for the stripe in
filelayout_write_pagelist instead of filelayout_flush_one will work.

I still vote to keep a mechanism such as wb_private so that
filelayout_commit to the data servers does not have to look up data servers
all over again, which would be a definite performance hit. I believe
wb_private is also used by the block layout driver.


>
> In addition, the file layout driver can't depend on the pages being in
> the cache when write_pagelist is called (O_DIRECT).  So requiring
> flush_one be called before write_pagelist is just plain broken.  With
> O_DIRECT, the logic of determining which page goes to which data server
> needs to be done, and not in filelayout_flush_one.  We need to revert
> file layout back to how every other layout driver works, determine which
> pages go to which data servers in write_pagelist.
>
> > the free_request_data() callout is called when the struct nfs_page is
> > un-allocated. The file layout driver doesn't need a new_request
> > callout because it sets wb_private in the layout driver.
>


>
> Again, this does not work.


sure it does.


> The file layout driver needs the ability to
> take in a series of pages, determine which data servers they are sent
> to, and write them to the data servers.  This is what write_pagelist is
> designed to do and will need to do again for file layout.


And in that case, wb_private is set in filelayout_write_pagelist and used in
filelayout_commit to data servers so that data server lookup is not
duplicated - just like wb_devid was used in the 2.6.18 kernel.


> >
> > So, the type of request is layout driver specific private data passed
> > in struct nfs_page.
> >
> >     Why is there no symmetry, e.g., write_begin/end exists but not
> >     read_begin/end?
> >
> >
> > there is no need for read begin/end because data is read in block
> > size.  the write_begin/end hooks for the block layout exist to fill
> > data (either read the data or place zero's) around a bit of a block
> > that is written, and to bail back to NFS if the block I/O fails.
> My point is that we are creating interfaces in a "as needed" manner,
> without any idea of the big picture.


Well, we are still getting the big picture - new draft, new kernel I/O path,
and the addition of the block layout.

Sure, at the moment write needs
> some extra work that read doesn't , but who says that is the way it will
> always be?  If you look at the VFS interface functions like
> read/write/fread/fwrite, they are created in a symmetric manner.  This
> is what I want for the layout driver.
>
> In summary though, the write_begin/end hooks are great general functions
> which I could see being very useful to other layout drivers.


how do you suggest the file layout driver use write_begin/end? does the
object driver use it?


> >
> > also,  gathering writes in memory to flush all at once asynchronously
> > means extra code, and means that the read and write path will not be
> > symmetric WRT the interface.
> The interface exists to hide implementation details like writeback
> caching and readahead.  Not all layout drivers will work the same way,
> and we can't design the interface for the implementation.  For example,
> pnfs_flush_one only exists as a performance optimization.

If Trond
> changes how the general pnfs i/o path works, it may become obsolete.
> write/read_pagelist are generic functions that, barring a few tweaks
> here and there, should be around for awhile.
> >
> >
> >
> >     ...now as me... We are beginning to create a completely incoherent
> >     interface.  We need to view this interface as an abstraction layer--
> >     have a goal of abstracting NFS logic out of the interface e.g., a
> >     single
> >     write operation instead of flush_one AND write_pagelist.
> >
> >
> > write_pagelist populates the nfs page cache. flush_one flushes the
> > page cache after (potentially) many write_pagelist calls. two calls
> > are needed for the two different jobs.
> I think you have your function names mixed up again.


yes. sorry.


>  But again, how
> come only file layout needs 2 calls and every other layout driver
> doesn't?  flush_one is a very useful hack, but a hack nonetheless, and
> needs to be removed.


> >
> >
> >     The functions
> >     in pnfs.c should be able to package up the i/o requests from the nfs
> >     layer and use a simplified interface to the layout driver.
> >
> >     We need a symmetric and coherent interface for both the read and
> write
> >     path.  (something simple like 3 functions for read and 3 functions
> for
> >     write: begin_read/write, read/write_pagelist, end_read/write)
> >
> >     What do people think?  Something to talk about at the next call?
>  What
> >     are the requirements of this I/O interface?  Based on your input,
> I'd
> >     like to start looking into what can be done.
> >
> >
> > I think it is a good idea.  Part of this discussion should be I/O
> > recovery through the MDS, which influences the interface.
> Definitely.  I encountered some issues with the fact that
> pnfs_try_to_read/write/commit functions return error codes.  As they are
> asynchronous functions, it seems like the rest of nfs is not designed to
> handle these error codes.
> >
> > -->Andy
> >
> >
> >     Dean
> >     _______________________________________________
> >     pNFS mailing list
> >     pNFS at linux-nfs.org <mailto:pNFS at linux-nfs.org>
> >     http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://linux-nfs.org/pipermail/pnfs/attachments/20080320/0364dcd1/attachment-0001.htm 


More information about the pNFS mailing list