[pnfs] nfsd layout cache design for layoutrecall
Benny Halevy
bhalevy at panasas.com
Thu May 3 16:22:59 EDT 2007
Sorry, the full patch is attached now.
Benny Halevy wrote:
> My nfsd layout cache changes are finally in a good enough state
> for review. The Connectathon tests pass with them, but no
> layout recalls are exercised yet (at least the get/return path is working).
>
> I tried to capture the main ideas below. I'm also attaching a full
> patch of what I have in my workspace so you can apply it in your
> branch to see the whole thing.
>
> The change is pretty big and I will cut it into many pieces for
> submission of course.
>
> * struct nfsd4_layout_seg added
> Embedded in the get, return, and recall structures, as well
> as used as a parameter to several internal functions.
>
> struct nfsd4_layout_seg {
>     u64 lo_clid;
>     u32 lo_type;
>     u32 lo_iomode;
>     u64 lo_offset;
>     u64 lo_length;
> };
>
> Since this simplified the internal design and implementation
> I think it's worthwhile to define and use this type in the NFSv4.1
> protocol. Although this is a wire-format change, now is probably the
> last opportunity to make such a big change before last call.
> I'll raise this in the next pnfs spec review meeting...
>
> * simplification of the filesystem layout recall callback interface.
> The nfs4_cb_layout structure is broken into two parts. One holds
> the minimum set of "args" the filesystem fills in to request a layoutrecall.
>
> struct nfsd4_pnfs_cb_layout {
>     u32 cbl_recall_type; /* request */
>     struct nfsd4_layout_seg cbl_seg; /* request */
>     u32 cbl_layoutchanged; /* request */
> };
>
> The other (nfs4_layoutrecall) embeds the structure above and
> adds the internal state for tracking the outstanding layoutrecall.
>
> * added a can_merge_layouts() method to the vfs api. called by the pnfsd
> layoutget function to see if the layout segments can be merged in the
> layout cache, otherwise each is kept separately, even if it overlaps
> with another outstanding layout.
>
> * added a cl_layoutrecalls list to struct nfs4_client for tracking
> outstanding layout recalls
>
> * changed cl_count in nfs4_client into a kref.
> nfs4_{get,put}_client take and release a reference on the
> client structure under the state lock.
>
> * layoutrecall manipulation is also done under the state lock,
> not under the recall lock. Special care is taken when sending
> the recalls to avoid a distributed deadlock (the client sending
> layoutreturns while serving a cb_layoutrecall), described
> in more detail below...
>
> * struct nfs4_layout simplified:
> struct nfs4_layout {
>     struct list_head lo_perclnt;
>     struct list_head lo_perfile;
>     struct nfs4_client *lo_client; /* backpointer */
>     struct nfs4_file *lo_file; /* backpointer */
>     struct nfsd4_layout_seg lo_seg;
> };
>
> * struct nfs4_layoutrecall:
> enum {
>     LAYOUTRECALL_FLAG_PENDING = 1<<0,
>     LAYOUTRECALL_FLAG_DONE = 1<<1,
> };
>
> /* layoutrecall request (from exported filesystem) */
> struct nfs4_layoutrecall {
>     struct nfsd4_pnfs_cb_layout cb; /* request */
>     struct list_head clr_perclnt; /* on cl_layoutrecalls */
>     struct nfs4_client *clr_client;
>     struct nfs4_file *clr_file;
>     int clr_status;
>     unsigned clr_flags;
>     struct timespec clr_time; /* last activity */
> };
>
> A layoutrecall is listed only on the cl_layoutrecalls list but holds
> a back pointer (and a reference) to a file structure in the RECALL_FILE
> and RECALL_FSID cases.
>
> The clr_flags field is used for a handshake between the thread sending the
> layoutrecall and the thread handling layoutreturns, as the layoutreturn can
> arrive during the layoutrecall rpc: either the client sends it before replying
> to the cb_layoutrecall, or the layoutreturn request just gets to the server
> before the cb_layoutrecall response.
>
> * struct nfs4_file:
> added fsid and filehandle to the structure.
> these are accessed in the get, return, and recall paths
> through a pointer to the nfs4_file structure
> rather than copying the data into each nfs4_layout and nfs4_layoutrecall
> instance.
>
> +#ifdef CONFIG_PNFS
> +    /* used by layoutget / layoutrecall */
> +    struct nfs4_fsid fi_fsid;
> +    u32 fi_fhlen;
> +    u32 fi_fhval[NFS4_FHSIZE];
> +#endif /* CONFIG_PNFS */
>
> * added #define NFS4_LENGTH_EOF (~(u64)0) and a couple helper functions:
> end_offset(start, len) and last_byte_offset(start, len)
> to simplify length "arithmetic" (for layout segment matching)
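
A minimal sketch of what the two length helpers might look like, assuming that a range running past the end of the u64 space (including a length of NFS4_LENGTH_EOF) is treated as extending to end of file. Only the names end_offset(), last_byte_offset() and NFS4_LENGTH_EOF come from the description above; the bodies and the user-space u64 typedef are assumptions, not the code in the patch.

```c
#include <stdint.h>

typedef uint64_t u64;	/* user-space stand-in for the kernel type */

#define NFS4_LENGTH_EOF (~(u64)0)

/* First offset past the range; saturates at NFS4_LENGTH_EOF on wraparound. */
static u64 end_offset(u64 start, u64 len)
{
	u64 end = start + len;

	return end >= start ? end : NFS4_LENGTH_EOF;
}

/* Offset of the last byte in the range; assumes len != 0. */
static u64 last_byte_offset(u64 start, u64 len)
{
	u64 end = start + len;

	return end > start ? end - 1 : NFS4_LENGTH_EOF;
}
```

With helpers like these, two segments overlap when each one starts before the other's end_offset(), without special-casing open-ended lengths at every call site.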
>
> * nfs4_layout
> tracks outstanding layouts
> listed on per-client and per-file lists
> keeps reference on the respective client and file.
>
> * alloc_layout() added to allocate an uninitialized nfs4_layout structure.
> The idea is to allocate it early in layoutget if the filesystem does not support
> merging layouts or if it is the first (and possibly the only) layout for the
> file. In low memory conditions we want to avoid getting a layout from the
> filesystem at all, rather than getting one, discovering we can't allocate a
> layout structure to track it, and having to return it to the filesystem.
>
> * alloc_init_layout() can initialize an existing layout structure (that
> we allocated in advance). It puts the initialized layout on the
> client and file layout lists.
>
> * destroy_layout() removes the layout structure from the client and file
> lists and frees it.
>
> * nfs4_layoutrecall
> tracks outstanding layoutrecalls
> listed only per client
> but keeps a reference both on the respective client and file
> (if applicable, can be NULL in the RECALL_ALL case)
> clr_flags field used for marking the layoutrecall done and deciding
> who's freeing it.
>
> * alloc_init_layoutrecall() allocates a nfs4_layoutrecall structure.
> Optionally, it can initialize it by cloning an existing nfs4_layoutrecall
> structure. This is used to make layoutrecall instances from a "master copy"
> when doing a "wildcard" recall to multiple clients.
>
> * hash_layoutrecall() enlists the layoutrecall on the client's cl_layoutrecalls
> list and takes a reference on the respective client and file.
>
> * destroy_layoutrecall() unhashes the layoutrecall structure and frees it.
>
>
> * nfs4_pnfs_get_layout()
> * is_layout_recalled() looks for a layoutrecall matching a layoutget
> (same client and layout_type, for RECALL_ALL or for the same FSID, or overlapping
> in the RECALL_FILE case)
>
> * extend_layout() can merge a layout segment into an existing one
> by extending its range
>
> * merge_layout() looks for an existing layout segment on a file
> matching a new layout segment (lo_type, clid, iomode) and
> either overlapping with or adjacent to it.
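
A minimal sketch of the merge step, assuming the per-file layouts are held in a plain array rather than on the fi_layouts list. The helper names seg_end(), seg_mergeable(), extend_seg() and merge_seg() are hypothetical stand-ins for illustration; only struct nfsd4_layout_seg and the matching rule (same lo_type, clid and iomode, overlapping or adjacent ranges) are taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

#define NFS4_LENGTH_EOF (~(u64)0)

struct nfsd4_layout_seg {
	u64 lo_clid;
	u32 lo_type;
	u32 lo_iomode;
	u64 lo_offset;
	u64 lo_length;
};

/* First offset past the segment, saturating at NFS4_LENGTH_EOF. */
static u64 seg_end(const struct nfsd4_layout_seg *s)
{
	u64 end = s->lo_offset + s->lo_length;

	return end >= s->lo_offset ? end : NFS4_LENGTH_EOF;
}

/* Same client, type and iomode, and the ranges overlap or touch. */
static bool seg_mergeable(const struct nfsd4_layout_seg *a,
			  const struct nfsd4_layout_seg *b)
{
	if (a->lo_clid != b->lo_clid || a->lo_type != b->lo_type ||
	    a->lo_iomode != b->lo_iomode)
		return false;
	return a->lo_offset <= seg_end(b) && b->lo_offset <= seg_end(a);
}

/* Grow *lo so that it covers the union of both ranges. */
static void extend_seg(struct nfsd4_layout_seg *lo,
		       const struct nfsd4_layout_seg *new_seg)
{
	u64 start = lo->lo_offset < new_seg->lo_offset ?
		    lo->lo_offset : new_seg->lo_offset;
	u64 end = seg_end(lo) > seg_end(new_seg) ?
		  seg_end(lo) : seg_end(new_seg);

	lo->lo_offset = start;
	lo->lo_length = end == NFS4_LENGTH_EOF ? NFS4_LENGTH_EOF : end - start;
}

/* Merge new_seg into the first mergeable entry; NULL if none matches. */
static struct nfsd4_layout_seg *merge_seg(struct nfsd4_layout_seg *segs,
					  size_t n,
					  const struct nfsd4_layout_seg *new_seg)
{
	for (size_t i = 0; i < n; i++) {
		if (seg_mergeable(&segs[i], new_seg)) {
			extend_seg(&segs[i], new_seg);
			return &segs[i];
		}
	}
	return NULL;
}
```

A NULL result corresponds to the "otherwise" branch of layoutget, where a separate layout structure has to be tracked for the new segment.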
>
> * algorithm:
> if a matching recall exists
>     return recallconflict error
>
> if filesystem can't merge or no layouts on file
>     allocate an empty layout structure
>
> call filesystem layout_get, process status
>
> if can merge, try merging the layout into an existing one
> and exit if succeeded, otherwise,
>
> allocate a new layout if needed or initialize the one allocated in advance
> and exit if succeeded, otherwise,
>
> internally return the layout to the filesystem, return nfserr_layouttrylater;
>
> * nfs4_pnfs_return_layout()
> * trim_layout() trims a layout segment if part of it is returned.
> note: it does not support splitting a layout structure; the remains will
> get freed on the final layoutreturn. if a layoutreturn that would split the
> layout is voluntary, the server may end up thinking the client still has
> a layout it already returned, which is ok.
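
A minimal sketch of the trimming rule, assuming a segment is just an offset/length pair. The names struct seg, range_end() and trim_seg() are hypothetical. A return that covers the head or the tail shrinks the segment, a return that covers all of it empties it, and a return that falls strictly in the middle is left untracked, matching the no-split note above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

#define NFS4_LENGTH_EOF (~(u64)0)

struct seg {
	u64 offset;
	u64 length;
};

/* First offset past the range, saturating at NFS4_LENGTH_EOF. */
static u64 range_end(u64 off, u64 len)
{
	u64 end = off + len;

	return end >= off ? end : NFS4_LENGTH_EOF;
}

/*
 * Remove the returned range from *lo where that needs no split.
 * Returns true when no bytes are left and the layout can be destroyed.
 */
static bool trim_seg(struct seg *lo, u64 ret_off, u64 ret_len)
{
	u64 lo_end = range_end(lo->offset, lo->length);
	u64 ret_end = range_end(ret_off, ret_len);

	if (ret_end <= lo->offset || ret_off >= lo_end)
		return false;			/* no overlap */

	if (ret_off <= lo->offset && ret_end >= lo_end) {
		lo->length = 0;			/* fully returned */
		return true;
	}

	if (ret_off <= lo->offset) {		/* head returned */
		lo->length = lo_end == NFS4_LENGTH_EOF ?
			     NFS4_LENGTH_EOF : lo_end - ret_end;
		lo->offset = ret_end;
	} else if (ret_end >= lo_end) {		/* tail returned */
		lo->length = ret_off - lo->offset;
	}
	/* else: middle returned, would need a split, so keep as is */

	return false;
}
```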
>
> * layoutrecall_done() marks a nfs4_layoutrecall as done and destroys it if
> not marked pending (pending means the layoutrecall kthread is still
> responsible for it)
>
> * recall_return_perfect_match() determines if a layoutreturn finalizes a layoutrecall.
> the predicate is:
> - same client and lo_type (implicit in the implementation), and
> - same iomode, and
> - same return/recall_type (FILE or FSID or ALL)
> - if RECALL_FILE then same file and byte range, else
> - if RECALL_FSID then same fsid, else
> - if RECALL_ALL then no further condition
>
> * recall_return_partial_match() determines if a layoutreturn matches
> a layoutrecall (by return/recall types, fsid, and byte range)
> it is called for tracking client layoutreturn activity.
> the predicate is:
> - same client and lo_type (implicit in the implementation), and
> - same iomode, or either the recall or return are for IOMODE_ANY
> - either recall or return are for ALL, or
> (recall and return are either FILE or FSID)
> - either recall or return are for FSID and their fsid matches, or
> - both the recall and return are for the same FILE
> and their byte ranges are overlapping
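
A minimal sketch of the two predicates, assuming the recall and the return are both reduced to one descriptor type. The names struct lr_desc, perfect_match() and partial_match() are hypothetical, the file is identified by an opaque file_id instead of an nfs4_file pointer, and the enum values are assumptions. The client and lo_type checks are left out, as they are implicit in the description above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

#define NFS4_LENGTH_EOF (~(u64)0)

enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };	/* assumed values */
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };	/* assumed values */

/* Hypothetical descriptor used for both a layoutrecall and a layoutreturn. */
struct lr_desc {
	u32 type;	/* FILE, FSID or ALL */
	u32 iomode;
	u64 fsid;
	u64 file_id;	/* only meaningful for FILE */
	u64 offset;	/* only meaningful for FILE */
	u64 length;	/* only meaningful for FILE */
};

static u64 desc_end(const struct lr_desc *d)
{
	u64 end = d->offset + d->length;

	return end >= d->offset ? end : NFS4_LENGTH_EOF;
}

/* The return finalizes the recall. */
static bool perfect_match(const struct lr_desc *rc, const struct lr_desc *rt)
{
	if (rc->iomode != rt->iomode || rc->type != rt->type)
		return false;
	switch (rc->type) {
	case RECALL_FILE:
		return rc->file_id == rt->file_id &&
		       rc->offset == rt->offset && rc->length == rt->length;
	case RECALL_FSID:
		return rc->fsid == rt->fsid;
	case RECALL_ALL:
		return true;
	}
	return false;
}

/* The return is related to the recall and counts as client activity. */
static bool partial_match(const struct lr_desc *rc, const struct lr_desc *rt)
{
	if (rc->iomode != rt->iomode &&
	    rc->iomode != IOMODE_ANY && rt->iomode != IOMODE_ANY)
		return false;
	if (rc->type == RECALL_ALL || rt->type == RECALL_ALL)
		return true;
	if (rc->type == RECALL_FSID || rt->type == RECALL_FSID)
		return rc->fsid == rt->fsid;
	/* both are FILE: same file and overlapping byte ranges */
	return rc->file_id == rt->file_id &&
	       rc->offset < desc_end(rt) && rt->offset < desc_end(rc);
}
```

In this sketch a whole-file recall answered by a return of only the first 4096 bytes is a partial match, so it would refresh the recall's last-activity timestamp without completing it.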
>
> * algorithm:
> call filesystem layout_return(), exit on error
>
> update layouts:
>     for RETURN_FILE:
>         look up matching layouts on the file's fi_layouts list
>         (same client, lo_type, same iomode or return IOMODE_ANY,
>         and overlapping byte range)
>
>         if found, trim it, and destroy it if no bytes left.
>
>     for RETURN_{FSID,ALL}:
>         look up matching layouts on the client's cl_layouts list
>         (same lo_type, same iomode or return IOMODE_ANY, and
>         same fsid for RETURN_FSID)
>
>         destroy any matching layout
>
> update layoutrecalls:
>     look up matching layoutrecalls on client's cl_layoutrecalls list
>     (same lo_type)
>     if perfect_match
>         layoutrecall_done;
>     else if layouts_found and partial_match
>         update last activity timestamp;
>
> * nfsd_layout_recall_cb()
> serves as the filesystem layoutrecall callback API function
> validate arguments
> prepare a "master" nfs4_layoutrecall from the nfsd4_pnfs_cb_layout args
> take reference on client and file (if not RECALL_ALL)
> spawn a kernel thread to do the layoutrecall in the background
> (kthread will free the master nfs4_layoutrecall structure when done)
>
> * do_layout_recall() worker thread
> if layoutrecall is for a single client, put on todo list, otherwise
> under nfs4_state_lock(), look up any client having any layout matching the
> recall type:
>     RECALL_FILE: has any layout on the recalled file (not necessarily
>                  overlapping, for safety)
>     RECALL_FSID: has any layout on a file in the recalled fsid
>     RECALL_ALL: has a layout
> for each matching client, clone the master layoutrecall structure and
> put on todo list
> (note: don't do the synchronous layoutrecall rpc under the state_lock.
> in the future we may want to just issue async cb_layoutrecall
> rpcs, and wait for them to complete outside the state_lock)
> call nfs4_state_unlock()
> at this point master nfs4_layoutrecall is destroyed (in the wildcard case)
>
> for each nfs4_layoutrecall on the todo list
>     if no callback path to this client just destroy the structure, don't send
>     mark layoutrecall as PENDING and enlist it on the client's cl_layoutrecalls
>     list (take refcounts on client and file).
>     do the cb_layoutrecall rpc
>     under state_lock: if rpc succeeded and layoutrecall not marked DONE then
>     unmark it as PENDING, otherwise (either rpc failed or layoutrecall done),
>     destroy_layoutrecall()
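
A minimal sketch of the PENDING/DONE handshake between the recall thread and the layoutreturn path, assuming both functions are called with the state lock held. The names struct recall, recall_mark_done() and recall_rpc_finished() are hypothetical, and each function only reports whether its caller is the one that should destroy the structure, instead of freeing it.

```c
#include <stdbool.h>

enum {
	LAYOUTRECALL_FLAG_PENDING = 1 << 0,
	LAYOUTRECALL_FLAG_DONE    = 1 << 1,
};

/* Hypothetical cut-down layoutrecall: only the handshake flags. */
struct recall {
	unsigned clr_flags;
};

/*
 * Layoutreturn side: a perfectly matching return arrived.
 * Returns true if the caller should destroy the recall now; while the
 * recall is PENDING the recall thread still owns it and frees it later.
 */
static bool recall_mark_done(struct recall *clr)
{
	clr->clr_flags |= LAYOUTRECALL_FLAG_DONE;
	return !(clr->clr_flags & LAYOUTRECALL_FLAG_PENDING);
}

/*
 * Recall thread side: the cb_layoutrecall rpc has completed.
 * Returns true if the caller should destroy the recall (the rpc failed,
 * or the layoutreturn raced ahead of the rpc reply); otherwise the recall
 * stays listed, no longer PENDING, waiting for the layoutreturn.
 */
static bool recall_rpc_finished(struct recall *clr, bool rpc_ok)
{
	if (rpc_ok && !(clr->clr_flags & LAYOUTRECALL_FLAG_DONE)) {
		clr->clr_flags &= ~LAYOUTRECALL_FLAG_PENDING;
		return false;
	}
	return true;
}
```

Whichever side sees the other's flag is the one that destroys the structure, so it is freed exactly once regardless of whether the layoutreturn or the rpc reply reaches the server first.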
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> pNFS mailing list
> pNFS at linux-nfs.org
> http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
--
Benny Halevy
Panasas Inc.
Accelerating Time to Results(TM) with Clustered Storage
www.panasas.com
bhalevy at panasas.com
Tel/Fax: +972-3-647-8340
Mobile: +972-54-802-8340
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nfsd_layout_cache-full.patch
Url: http://linux-nfs.org/pipermail/pnfs/attachments/20070503/750fd04f/attachment-0001.diff