[pnfs] nfsd layout cache design for layoutrecall

Benny Halevy bhalevy at panasas.com
Thu May 3 16:22:59 EDT 2007


Sorry, the full patch is attached now.

Benny Halevy wrote:
> My nfsd layout cache changes are finally in a good enough state
> for review.  The Connectathon tests pass with them, but no
> layout recalls are exercised yet (at least the get/return path is working).
> 
> I tried to capture the main ideas below.  I'm also attaching a full
> patch of what I have in my workspace so you can apply it in your
> branch to see the whole thing.
> 
> The change is pretty big and I will cut it into many pieces for
> submission of course.
> 
> * struct nfsd4_layout_seg added
> Embedded in the get, return, and recall structures, as well
> as used as a parameter to several internal functions.
> 
> struct nfsd4_layout_seg {
> 	u64                      lo_clid;
> 	u32                      lo_type;
> 	u32                      lo_iomode;
> 	u64                      lo_offset;
> 	u64                      lo_length;
> };
> 
> Since this simplified the internal design and implementation,
> I think it's worthwhile to define and use this type in the NFSv4.1
> protocol. Although this is a wire-format change, now is probably the
> last opportunity to make such a big change before last call.
> I'll raise this in the next pnfs spec review meeting...
> 
> * simplified the filesystem layout recall callback interface.
> The nfs4_cb_layout structure is broken into two parts.  The first holds
> the minimum set of "args" the filesystem fills in to request a layoutrecall:
> 
> struct nfsd4_pnfs_cb_layout {
> 	u32                     cbl_recall_type;	/* request */
> 	struct nfsd4_layout_seg cbl_seg;		/* request */
> 	u32                     cbl_layoutchanged;	/* request */
> };
> 
> The other (nfs4_layoutrecall) embeds the structure above and
> adds the internal state for tracking the outstanding layoutrecall.
> 
> * added a can_merge_layouts() method to the vfs api. called by the pnfsd
>   layoutget function to see if the layout segments can be merged in the
>   layout cache, otherwise each is kept separately, even if it overlaps
>   with another outstanding layout.
> 
> * added a cl_layoutrecalls list to struct nfs4_client for tracking
>   outstanding layout recalls
> 
> * changed cl_count in nfs4_client into a kref.
>   nfs4_{get,put}_client take and release a reference on the
>   client structure under the state lock.
> 
> * layoutrecall manipulation is also done under the state lock,
>   not under the recall lock.  Special care is taken when sending
>   the recalls to avoid a distributed deadlock (the client sending
>   layoutreturns while serving a cb_layoutrecall), described
>   in more detail below...
> 
> * struct nfs4_layout simplified:
> struct nfs4_layout {
> 	struct list_head        lo_perclnt;
> 	struct list_head        lo_perfile;
> 	struct nfs4_client     *lo_client;       /* backpointer */
> 	struct nfs4_file       *lo_file;         /* backpointer */
> 	struct nfsd4_layout_seg lo_seg;
> };
> 
> * struct nfs4_layoutrecall:
> enum {
> 	LAYOUTRECALL_FLAG_PENDING = 1<<0,
> 	LAYOUTRECALL_FLAG_DONE    = 1<<1,
> };
> 
> /* layoutrecall request (from exported filesystem) */
> struct nfs4_layoutrecall {
> 	struct nfsd4_pnfs_cb_layout     cb;         /* request */
> 	struct list_head                clr_perclnt;    /* on cl_layoutrecalls */
> 	struct nfs4_client             *clr_client;
> 	struct nfs4_file               *clr_file;
> 	int                             clr_status;
> 	unsigned                        clr_flags;
> 	struct timespec                 clr_time;       /* last activity */
> };
> 
> A layoutrecall is listed only on the cl_layoutrecalls list, but holds
> a back pointer (and a reference) to a file structure in the RECALL_FILE
> and RECALL_FSID cases.
> 
> The clr_flags field is used for a handshake between the thread sending the
> layoutrecall and the thread handling layoutreturns, as the layoutreturn can
> arrive during the layoutrecall rpc: either the client sends it before replying
> to the cb_layoutrecall, or the layoutreturn request simply reaches the server
> before the cb_layoutrecall response.
> 
> * struct nfs4_file:
> added fsid and filehandle to the structure.
> these are accessed in the get, return, and recall paths
> through a pointer to the nfs4_file structure
> rather than copying the data into each nfs4_layout and nfs4_layoutrecall
> instance.
> 
> +#ifdef CONFIG_PNFS
> +	/* used by layoutget / layoutrecall */
> +	struct nfs4_fsid	fi_fsid;
> +	u32			fi_fhlen;
> +	u32			fi_fhval[NFS4_FHSIZE];
> +#endif /* CONFIG_PNFS */
> 
> * added #define NFS4_LENGTH_EOF (~(u64)0) and a couple helper functions:
>   end_offset(start, len) and last_byte_offset(start, len)
>   to simplify length "arithmetic" (for layout segment matching)
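
For reference, a minimal userspace sketch of what the two helpers could look
like.  The bodies are my assumption (the post only names the functions);
uint64_t stands in for the kernel's u64, and a range whose end would wrap is
treated as running to end of file.

```c
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

/*
 * First offset past the range [start, start + len).
 * Returns NFS4_LENGTH_EOF when len is NFS4_LENGTH_EOF with start 0,
 * or when start + len wraps around.
 */
static inline uint64_t end_offset(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return end >= start ? end : NFS4_LENGTH_EOF;
}

/*
 * Offset of the last byte covered by the range (inclusive).
 * Assumes len > 0; a wrapping range is clamped to NFS4_LENGTH_EOF.
 */
static inline uint64_t last_byte_offset(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return end > start ? end - 1 : NFS4_LENGTH_EOF;
}
```

Clamping in one place means the segment-matching code can compare plain u64
values without special-casing "to end of file" at every call site.
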
> 
> * nfs4_layout
>   tracks outstanding layouts
>   listed on per-client and per-file lists
>   keeps reference on the respective client and file.
> 
>   * alloc_layout() added to allocate an uninitialized nfs4_layout structure.
>     The idea is to allocate it early in layoutget if the filesystem does not
>     support merging layouts or if it is the first (and possibly the only) layout
>     for the file.  In low memory conditions we would rather not get a layout
>     from the filesystem at all than get one, discover we can't allocate a layout
>     structure to track it, and have to return it to the filesystem.
> 
>   * alloc_init_layout() can initialize an existing layout structure (that
>     we allocated in advance).  It puts the initialized layout on the
>     client and file layout lists.
> 
>   * destroy_layout() removes the layout structure from the client and file
>     lists and frees it.
> 
> * nfs4_layoutrecall
>   tracks outstanding layoutrecalls
>   listed only per client
>   but keeps a reference both on the respective client and file
>   (if applicable, can be NULL in the RECALL_ALL case)
>   clr_flags field used for marking the layoutrecall done and deciding
>   who's freeing it.
> 
>   * alloc_init_layoutrecall() allocates a nfs4_layoutrecall structure.
>     Optionally, it can initialize it by cloning an existing nfs4_layoutrecall
>     structure.  This is used to make layoutrecall instances from a "master copy"
>     when doing a "wildcard" recall to multiple clients.
> 
>   * hash_layoutrecall() enlists the layoutrecall on the client's cl_layoutrecalls
>     list and takes a reference on the respective client and file.
> 
>   * destroy_layoutrecall() unhashes the layoutrecall structure and frees it.
> 
>   
> * nfs4_pnfs_get_layout()
>   * is_layout_recalled() looks for a layoutrecall matching a layoutget
>     (same client and layout_type, for RECALL_ALL or for the same FSID, or overlapping
>     in the RECALL_FILE case)
> 
>   * extend_layout() can merge a layout segment into an existing one
>     by extending its range
> 
>   * merge_layout() looks for an existing layout segment on a file
>     matching a new layout segment (lo_type, clid, iomode) and 
>     either overlapping with or adjacent to it.
> 
>   * algorithm:
>     if a matching recall exists
> 	return recallconflict error
> 
>     if filesystem can't merge or no layouts on file
> 	allocate an empty layout structure
> 
>     call filesystem layout_get, process status
> 
>     if can merge, try merging the layout into an existing one
>     and exit if succeeded, otherwise,
> 
>     allocate a new layout if needed or initialize the one allocated in advance
>     and exit if succeeded, otherwise,
> 
>     internally return the layout to the filesystem, return nfserr_layouttrylater;
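
The merge step above can be sketched in userspace roughly as follows.  This is
only an illustration under assumptions: struct layout_seg mirrors
nfsd4_layout_seg, a plain array stands in for the file's layout list, and
seg_end() and seg_can_merge() are hypothetical helper names that do not come
from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

struct layout_seg {		/* mirrors nfsd4_layout_seg */
	uint64_t clid;
	uint32_t type;
	uint32_t iomode;
	uint64_t offset;
	uint64_t length;
};

/* First offset past the segment, clamped to NFS4_LENGTH_EOF on wrap. */
static inline uint64_t seg_end(const struct layout_seg *s)
{
	uint64_t end = s->offset + s->length;

	return end >= s->offset ? end : NFS4_LENGTH_EOF;
}

/* Same client, type and iomode, and the ranges overlap or are adjacent. */
static inline bool seg_can_merge(const struct layout_seg *lo,
				 const struct layout_seg *new)
{
	if (lo->clid != new->clid || lo->type != new->type ||
	    lo->iomode != new->iomode)
		return false;
	return new->offset <= seg_end(lo) && lo->offset <= seg_end(new);
}

/* Grow lo so that it covers the union of both ranges. */
static inline bool extend_layout(struct layout_seg *lo,
				 const struct layout_seg *new)
{
	uint64_t start, end;

	if (!seg_can_merge(lo, new))
		return false;
	start = lo->offset < new->offset ? lo->offset : new->offset;
	end = seg_end(lo) > seg_end(new) ? seg_end(lo) : seg_end(new);
	lo->offset = start;
	lo->length = end == NFS4_LENGTH_EOF ? NFS4_LENGTH_EOF : end - start;
	return true;
}

/* Try every existing segment; the caller tracks a new layout on failure. */
static inline bool merge_layout(struct layout_seg *segs, size_t n,
				const struct layout_seg *new)
{
	for (size_t i = 0; i < n; i++)
		if (extend_layout(&segs[i], new))
			return true;
	return false;
}
```

When merge_layout() returns false, the caller falls through to the "allocate a
new layout or initialize the one allocated in advance" step of the algorithm.
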
> 
> * nfs4_pnfs_return_layout()
>   * trim_layout() trims a layout segment if part of it is returned.
>     note: splitting a layout structure is not supported; the remainder will
>     get freed on the final layoutreturn.  If a layoutreturn that would split
>     a layout is voluntary, the server may end up thinking the client holds a
>     layout it already returned, which is ok.
> 
>   * layoutrecall_done() marks a nfs4_layoutrecall as done and destroys it unless
>     it is marked pending (i.e. the layoutrecall kthread is still responsible for it)
> 
>   * recall_return_perfect_match() determines if a layoutreturn finalizes a layoutrecall.
>     the predicate is:
>     - same client and lo_type (implicit in the implementation), and
>     - same iomode, and
>     - same return/recall_type (FILE or FSID or ALL)
>       - if RECALL_FILE then same file and byte range, else
>       - if RECALL_FSID then same fsid, else
>       - if RECALL_ALL then nothing more to compare
> 
>   * recall_return_partial_match() determines if a layoutreturn matches
>     a layoutrecall (by return/recall types, fsid, and byte range)
>     it is called for tracking client layoutreturn activity.
>     the predicate is:
>     - same client and lo_type (implicit in the implementation), and
>     - same iomode, or either the recall or the return is for IOMODE_ANY, and
>     - either the recall or the return is for ALL, or
>       (both the recall and the return are FILE or FSID, and)
>       - either the recall or the return is for FSID and their fsids match, or
>       - both the recall and the return are for the same FILE
>         and their byte ranges overlap
> 
>   * algorithm:
>   call filesystem layout_return(), exit on error
> 
>   update layouts:
> 	for RETURN_FILE:
> 		look up matching layouts on the file's fi_layouts list
> 		(same client, lo_type, same iomode or return IOMODE_ANY,
> 		 and overlapping byte range)
> 
> 			if found, trim it, and destroy it if no bytes are left.
> 
> 	for RETURN_{FSID,ALL}:
> 		look up matching layouts on the client's cl_layouts list
> 		(same lo_type, same iomode or return IOMODE_ANY, and
> 		 same fsid for RETURN_FSID)
> 
> 			destroy any matching layout
> 
>   update layoutrecalls:
> 	look up matching layoutrecalls on the client's cl_layoutrecalls list
> 	(same lo_type)
> 		if perfect_match
> 			layoutrecall_done;
> 		else if layouts_found and partial_match
> 			update last activity timestamp;
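
The two match predicates used in the "update layoutrecalls" step could be
written along these lines.  Again a sketch under assumptions: struct lr_key is
a hypothetical flattened view of either a recall or a return, with integers
standing in for struct nfs4_fsid and the nfs4_file pointer, and the constant
values are illustrative.  The client and lo_type checks are left out since the
post says they are implicit in the implementation.

```c
#include <stdbool.h>
#include <stdint.h>

enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

struct lr_key {			/* hypothetical: one recall or one return */
	int      kind;		/* RECALL_FILE, RECALL_FSID or RECALL_ALL */
	uint32_t iomode;
	uint64_t fsid;		/* stand-in for struct nfs4_fsid */
	uint64_t file;		/* stand-in for the nfs4_file pointer */
	uint64_t offset;
	uint64_t length;
};

/* Inclusive last byte of the key's range, clamped on wrap-around. */
static inline uint64_t key_last_byte(const struct lr_key *k)
{
	uint64_t end = k->offset + k->length;

	return end > k->offset ? end - 1 : ~(uint64_t)0;
}

/* The return finalizes the recall: everything must be identical. */
static inline bool recall_return_perfect_match(const struct lr_key *rc,
					       const struct lr_key *rt)
{
	if (rc->iomode != rt->iomode || rc->kind != rt->kind)
		return false;
	switch (rc->kind) {
	case RECALL_FILE:
		return rc->file == rt->file && rc->offset == rt->offset &&
		       rc->length == rt->length;
	case RECALL_FSID:
		return rc->fsid == rt->fsid;
	case RECALL_ALL:
		return true;
	}
	return false;
}

/* The return is related to the recall: used only to track activity. */
static inline bool recall_return_partial_match(const struct lr_key *rc,
					       const struct lr_key *rt)
{
	if (rc->iomode != rt->iomode &&
	    rc->iomode != IOMODE_ANY && rt->iomode != IOMODE_ANY)
		return false;
	if (rc->kind == RECALL_ALL || rt->kind == RECALL_ALL)
		return true;
	/* A FILE key must also carry its fsid for this comparison. */
	if (rc->kind == RECALL_FSID || rt->kind == RECALL_FSID)
		return rc->fsid == rt->fsid;
	return rc->file == rt->file &&
	       rc->offset <= key_last_byte(rt) &&
	       rt->offset <= key_last_byte(rc);
}
```

A perfect match leads to layoutrecall_done(); a partial match only refreshes
the recall's last-activity timestamp.
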
> 
> * nfsd_layout_recall_cb()
>   serves as the filesystem layoutrecall callback API function
>   validate arguments
>   prepare a "master" nfs4_layoutrecall from the nfsd4_pnfs_cb_layout args
>   take reference on client and file (if not RECALL_ALL)
>   spawn a kernel thread to do the layoutrecall in the background
>   (kthread will free the master nfs4_layoutrecall structure when done)
> 
> * do_layout_recall() worker thread
>   if layoutrecall is for a single client, put on todo list, otherwise
>   under nfs4_state_lock(), look up every client having any layout matching the
>   recall type:
> 	RECALL_FILE: has any layout on the recalled file (not necessarily
> 		overlapping, for safety)
> 	RECALL_FSID: has any layout on a file in the recalled fsid
> 	RECALL_ALL: has a layout
>   for each matching client, clone the master layoutrecall structure and
>   put on todo list
> 	note: don't do the synchronous layoutrecall rpc under the state_lock.
> 	(in the future we may want to just issue async cb_layoutrecall
> 	rpcs, and wait for them to complete outside the state_lock)
>   release the state lock (nfs4_state_unlock())
>   at this point master nfs4_layoutrecall is destroyed (in the wildcard case)
> 
>   for each nfs4_layoutrecall on the todo list
> 	if there is no callback path to this client, just destroy the structure
> 	and don't send; otherwise,
> 	mark the layoutrecall as PENDING and enlist it on the client's cl_layoutrecalls
> 	list (taking refcounts on client and file).
> 	do the cb_layoutrecall rpc
> 	under state_lock: if the rpc succeeded and the layoutrecall is not marked DONE then
> 		unmark it as PENDING, otherwise (either the rpc failed or the layoutrecall is done),
> 		destroy_layoutrecall()
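
To make the PENDING/DONE handshake concrete, here is a small single-threaded
model of the two sides.  It is a sketch under assumptions: both functions are
assumed to be called with the state lock held, struct recall is a hypothetical
stand-in for nfs4_layoutrecall, recall_rpc_finished() is a hypothetical name
for the tail of the worker loop, and a boolean stands in for actually freeing
the structure.

```c
#include <stdbool.h>

enum {
	LAYOUTRECALL_FLAG_PENDING = 1 << 0,
	LAYOUTRECALL_FLAG_DONE    = 1 << 1,
};

struct recall {			/* hypothetical stand-in for nfs4_layoutrecall */
	unsigned flags;
	bool     destroyed;	/* models unhash + free */
};

static inline void destroy_layoutrecall(struct recall *clr)
{
	clr->destroyed = true;
}

/*
 * layoutreturn side: mark the recall done.  If the recall thread still
 * owns it (PENDING), leave the freeing to that thread.
 */
static inline void layoutrecall_done(struct recall *clr)
{
	clr->flags |= LAYOUTRECALL_FLAG_DONE;
	if (!(clr->flags & LAYOUTRECALL_FLAG_PENDING))
		destroy_layoutrecall(clr);
}

/*
 * recall thread side, after the cb_layoutrecall rpc returns.  On success
 * with no layoutreturn seen yet, hand ownership to the layoutreturn path
 * by clearing PENDING; otherwise this thread frees the structure.
 */
static inline void recall_rpc_finished(struct recall *clr, int rpc_status)
{
	if (rpc_status == 0 && !(clr->flags & LAYOUTRECALL_FLAG_DONE))
		clr->flags &= ~LAYOUTRECALL_FLAG_PENDING;
	else
		destroy_layoutrecall(clr);
}
```

Whichever side runs second ends up destroying the structure, so it is freed
exactly once whether the layoutreturn arrives before or after the
cb_layoutrecall reply.
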
> 
> 
> 


-- 
Benny Halevy
Panasas Inc.
Accelerating Time to Results(TM) with Clustered Storage

www.panasas.com
bhalevy at panasas.com
Tel/Fax: +972-3-647-8340
Mobile: +972-54-802-8340
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nfsd_layout_cache-full.patch
Url: http://linux-nfs.org/pipermail/pnfs/attachments/20070503/750fd04f/attachment-0001.diff 

