[pnfs] nfsd layout cache design for layoutrecall
Benny Halevy
bhalevy at panasas.com
Thu May 3 16:21:06 EDT 2007
My nfsd layout cache changes are finally in a good enough state
for review. The Connectathon tests pass with them, but no
layout recalls are exercised yet (at least the get/return path is working).
I tried to capture the main ideas below. I'm also attaching a full
patch of what I have in my workspace so you can apply it in your
branch to see the whole thing.
The change is pretty big and I will cut it into many pieces for
submission of course.
* struct nfsd4_layout_seg added
Embedded in the get, return, and recall structures, as well
as used as a parameter to several internal functions.
struct nfsd4_layout_seg {
	u64 lo_clid;
	u32 lo_type;
	u32 lo_iomode;
	u64 lo_offset;
	u64 lo_length;
};
Since this simplified the internal design and implementation,
I think it's worthwhile to define and use this type in the NFSv4.1
protocol. Although this is a wire-format change, now is probably the
last opportunity to make such a big change before last call.
I'll raise this in the next pnfs spec review meeting...
* simplification of the filesystem layout recall callback interface.
The nfs4_cb_layout structure is broken into two parts. One holds
the minimum set of "args" the filesystem fills in to request a layoutrecall.
struct nfsd4_pnfs_cb_layout {
	u32 cbl_recall_type; /* request */
	struct nfsd4_layout_seg cbl_seg; /* request */
	u32 cbl_layoutchanged; /* request */
};
The other (nfs4_layoutrecall) embeds the structure above and
adds the internal state for tracking the outstanding layoutrecall.
* added a can_merge_layouts() method to the vfs api. It is called by the pnfsd
layoutget function to see if the layout segments can be merged in the
layout cache; otherwise each is kept separately, even if it overlaps
with another outstanding layout.
* added a cl_layoutrecalls list to struct nfs4_client for tracking
outstanding layout recalls
* changed cl_count in nfs4_client into a kref.
nfs4_{get,put}_client take and release a reference on the
client structure under the state lock.
* layoutrecall manipulation is also done under the state lock,
not under the recall lock. Special care is taken when sending
the recalls to avoid a distributed deadlock (the client sending
layoutreturns while serving a cb_layoutrecall), described
in more detail below...
* struct nfs4_layout simplified:
struct nfs4_layout {
	struct list_head lo_perclnt;
	struct list_head lo_perfile;
	struct nfs4_client *lo_client; /* backpointer */
	struct nfs4_file *lo_file; /* backpointer */
	struct nfsd4_layout_seg lo_seg;
};
* struct nfs4_layoutrecall:
enum {
	LAYOUTRECALL_FLAG_PENDING = 1<<0,
	LAYOUTRECALL_FLAG_DONE = 1<<1,
};
/* layoutrecall request (from exported filesystem) */
struct nfs4_layoutrecall {
	struct nfsd4_pnfs_cb_layout cb; /* request */
	struct list_head clr_perclnt; /* on cl_layoutrecalls */
	struct nfs4_client *clr_client;
	struct nfs4_file *clr_file;
	int clr_status;
	unsigned clr_flags;
	struct timespec clr_time; /* last activity */
};
The layoutrecall is listed only on the cl_layoutrecalls list but holds
a back pointer (and a reference) to a file structure in the RECALL_FILE
and RECALL_FSID cases.
The clr_flags field is used for a handshake between the thread sending the
layoutrecall and the thread handling layoutreturns, since a layoutreturn can
arrive during the layoutrecall rpc: either the client sends it before replying
to the cb_layoutrecall, or the layoutreturn request simply reaches the server
before the cb_layoutrecall response.
* struct nfs4_file:
added fsid and filehandle to the structure.
these are accessed in the get, return, and recall paths
through a pointer to the nfs4_file structure
rather than copying the data into each nfs4_layout and nfs4_layoutrecall
instance.
+#ifdef CONFIG_PNFS
+ /* used by layoutget / layoutrecall */
+ struct nfs4_fsid fi_fsid;
+ u32 fi_fhlen;
+ u32 fi_fhval[NFS4_FHSIZE];
+#endif /* CONFIG_PNFS */
* added #define NFS4_LENGTH_EOF (~(u64)0) and a couple of helper functions:
end_offset(start, len) and last_byte_offset(start, len)
to simplify length "arithmetic" (for layout segment matching)
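
To illustrate the length arithmetic, here is a minimal userspace sketch. It is not taken from the patch: the overflow handling is an assumption about how NFS4_LENGTH_EOF ("to end of file") is meant to behave, uint64_t stands in for u64, and ranges_overlap() is a hypothetical helper added only to show how the two functions would be used for segment matching.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

/* First offset past the range; saturates to NFS4_LENGTH_EOF if
 * start + len wraps around (assumed behaviour). */
static inline uint64_t end_offset(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return end >= start ? end : NFS4_LENGTH_EOF;
}

/* Offset of the last byte covered; NFS4_LENGTH_EOF if the range is
 * empty or wraps around (assumed behaviour). */
static inline uint64_t last_byte_offset(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return end > start ? end - 1 : NFS4_LENGTH_EOF;
}

/* Hypothetical helper: do two byte ranges share at least one byte? */
static inline bool ranges_overlap(uint64_t s1, uint64_t l1,
				  uint64_t s2, uint64_t l2)
{
	return last_byte_offset(s1, l1) >= s2 &&
	       last_byte_offset(s2, l2) >= s1;
}
```

With these, a range of 10 bytes at offset 0 overlaps a range starting at offset 9 but not one starting at offset 10, and a range with length NFS4_LENGTH_EOF overlaps anything that reaches its start offset.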
* nfs4_layout
tracks outstanding layouts
listed on per-client and per-file lists
keeps a reference on the respective client and file.
* alloc_layout() added to allocate an uninitialized nfs4_layout structure.
The idea is to allocate it early in layoutget if the filesystem does not support
merging layouts or if it is the first (and possibly the only) layout for the
file. We want to avoid getting a layout from the filesystem in low memory
conditions, rather than getting one, discovering we can't allocate a layout
structure to track it, and having to return it to the filesystem.
* alloc_init_layout() can initialize an existing layout structure (that
we allocated in advance). It puts the initialized layout on the
client and file layout lists.
* destroy_layout() removes the layout structure from the client and file
lists and frees it.
* nfs4_layoutrecall
tracks outstanding layoutrecalls
listed only per client
but keeps a reference both on the respective client and file
(if applicable, can be NULL in the RECALL_ALL case)
clr_flags field used for marking the layoutrecall done and deciding
who's freeing it.
* alloc_init_layoutrecall() allocates a nfs4_layoutrecall structure.
Optionally, it can initialize it by cloning an existing nfs4_layoutrecall
structure. This is used to make layoutrecall instances from a "master copy"
when doing a "wildcard" recall to multiple clients.
* hash_layoutrecall() enlists the layoutrecall on the client's cl_layoutrecalls
list and takes a reference on the respective client and file.
* destroy_layoutrecall() unhashes the layoutrecall structure and frees it.
* nfs4_pnfs_get_layout()
* is_layout_recalled() looks for a layoutrecall matching a layoutget
(same client and layout type; any layoutget for RECALL_ALL, same fsid for
RECALL_FSID, or an overlapping byte range in the RECALL_FILE case)
* extend_layout() can merge a layout segment into an existing one
by extending its range
* merge_layout() looks for an existing layout segment on a file
matching a new layout segment (lo_type, clid, iomode) and
either overlapping with or adjacent to it.
* algorithm:
    if a matching recall exists
        return recallconflict error
    if filesystem can't merge or no layouts on file
        allocate an empty layout structure
    call filesystem layout_get, process status
    if can merge, try merging the layout into an existing one
        and exit if succeeded, otherwise,
    allocate a new layout if needed or initialize the one allocated in advance
        and exit if succeeded, otherwise,
    internally return the layout to the filesystem, return nfserr_layouttrylater;
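
The merge step described for extend_layout() and merge_layout() could look roughly like the userspace sketch below. It is an illustration only, not the code in the patch: struct layout_seg stands in for struct nfsd4_layout_seg, and seg_end(), seg_mergeable() and extend_seg() are hypothetical names. It assumes "matching" means the same clid, type and iomode, and that adjacent ranges are merged as well as overlapping ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

/* Userspace stand-in for struct nfsd4_layout_seg. */
struct layout_seg {
	uint64_t lo_clid;
	uint32_t lo_type;
	uint32_t lo_iomode;
	uint64_t lo_offset;
	uint64_t lo_length;
};

/* First offset past the segment, saturating at NFS4_LENGTH_EOF. */
static inline uint64_t seg_end(const struct layout_seg *s)
{
	uint64_t end = s->lo_offset + s->lo_length;

	return end >= s->lo_offset ? end : NFS4_LENGTH_EOF;
}

/* Same client, type and iomode, and ranges overlapping or adjacent. */
static inline bool seg_mergeable(const struct layout_seg *have,
				 const struct layout_seg *want)
{
	return have->lo_clid == want->lo_clid &&
	       have->lo_type == want->lo_type &&
	       have->lo_iomode == want->lo_iomode &&
	       want->lo_offset <= seg_end(have) &&
	       have->lo_offset <= seg_end(want);
}

/* Grow 'have' to the union of both ranges; false if not mergeable. */
static inline bool extend_seg(struct layout_seg *have,
			      const struct layout_seg *want)
{
	uint64_t start, end;

	if (!seg_mergeable(have, want))
		return false;
	start = have->lo_offset < want->lo_offset ?
		have->lo_offset : want->lo_offset;
	end = seg_end(have) > seg_end(want) ? seg_end(have) : seg_end(want);
	have->lo_offset = start;
	have->lo_length = end == NFS4_LENGTH_EOF ? NFS4_LENGTH_EOF : end - start;
	return true;
}
```

In this sketch a caller would walk the file's layout list and stop at the first segment for which extend_seg() returns true, falling back to a separately tracked layout otherwise.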
* nfs4_pnfs_return_layout()
* trim_layout() trims a layout segment if part of it is returned.
note: does not support splitting a layout structure; the remains will
get freed on the final layoutreturn. if a layoutreturn that splits a layout
is voluntary, the server may end up thinking the client has a layout it
already returned, which is ok.
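
A rough userspace sketch of the trimming rule follows. It is only an illustration under stated assumptions, not the patch code: struct byte_range, range_end() and trim_range() are hypothetical names, and leaving the segment untouched when the returned range falls strictly in the middle is my reading of "does not support splitting".

```c
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

/* Hypothetical: just the offset/length part of a layout segment. */
struct byte_range {
	uint64_t offset;
	uint64_t length;
};

/* First offset past the range, saturating at NFS4_LENGTH_EOF. */
static inline uint64_t range_end(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return end >= start ? end : NFS4_LENGTH_EOF;
}

/* Remove the returned range from 'lo'; length 0 means nothing is left. */
static inline void trim_range(struct byte_range *lo,
			      uint64_t ret_off, uint64_t ret_len)
{
	uint64_t lo_end = range_end(lo->offset, lo->length);
	uint64_t ret_end = range_end(ret_off, ret_len);

	if (ret_off >= lo_end || ret_end <= lo->offset)
		return;			/* no overlap */
	if (ret_off <= lo->offset) {
		if (ret_end >= lo_end) {
			lo->length = 0;	/* fully returned */
			return;
		}
		/* head returned: segment now starts where the return ends */
		lo->offset = ret_end;
		lo->length = lo_end == NFS4_LENGTH_EOF ?
			     NFS4_LENGTH_EOF : lo_end - ret_end;
	} else if (ret_end >= lo_end) {
		/* tail returned: segment now ends where the return starts */
		lo->length = ret_off - lo->offset;
	}
	/* else: middle returned; splitting is not supported, keep as is */
}
```

A caller would destroy the layout once the length drops to 0.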
* layoutrecall_done() marks a nfs4_layoutrecall as done and destroys it if
not marked pending (pending means the layoutrecall kthread is still
responsible for it)
* recall_return_perfect_match() determines if a layoutreturn finalizes a layoutrecall.
the predicate is:
- same client and lo_type (implicit in the implementation), and
- same iomode, and
- same return/recall_type (FILE or FSID or ALL)
- if RECALL_FILE then same file and byte range, else
- if RECALL_FSID then same fsid, else
- if RECALL_ALL then there is nothing further to match
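
The perfect-match predicate above can be written down as a small userspace sketch. This is an illustration, not the patch code: struct lr_key and perfect_match() are hypothetical names, the fsid is simplified to a single integer, the file is compared by pointer identity, and the same-client and same-lo_type checks are assumed to have been done by the caller, as noted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the recall/return type values. */
enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };

/* Hypothetical: the fields the predicate compares. */
struct lr_key {
	uint32_t type;		/* RECALL_FILE, RECALL_FSID or RECALL_ALL */
	uint32_t iomode;
	uint64_t fsid;		/* simplification: fsid as one integer */
	const void *file;	/* identity of the nfs4_file, NULL if none */
	uint64_t offset;
	uint64_t length;
};

/* Does this layoutreturn finalize this layoutrecall? */
static inline bool perfect_match(const struct lr_key *recall,
				 const struct lr_key *ret)
{
	if (recall->iomode != ret->iomode || recall->type != ret->type)
		return false;
	switch (recall->type) {
	case RECALL_FILE:
		return recall->file == ret->file &&
		       recall->offset == ret->offset &&
		       recall->length == ret->length;
	case RECALL_FSID:
		return recall->fsid == ret->fsid;
	case RECALL_ALL:
		return true;
	}
	return false;
}
```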
* recall_return_partial_match() determines if a layoutreturn matches
a layoutrecall (by return/recall types, fsid, and byte range)
it is called for tracking client layoutreturn activity.
the predicate is:
- same client and lo_type (implicit in the implementation), and
- same iomode, or either the recall or return are for IOMODE_ANY
- either recall or return are for ALL, or
(recall and return are either FILE or FSID)
- either recall or return are for FSID and their fsid matches, or
- both the recall and return are for the same FILE
and their byte ranges are overlapping
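
The partial-match predicate is looser; a userspace sketch of it, under the same caveats, is shown here. struct lr_key, last_byte() and partial_match() are hypothetical names, the fsid is simplified to a single integer, a FILE entry is assumed to carry the fsid of its file, and the iomode values are stand-ins. Same client and lo_type are again assumed to be checked by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NFS4_LENGTH_EOF (~(uint64_t)0)

/* Stand-ins for the recall/return type and iomode values. */
enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

/* Hypothetical: the fields the predicate compares. */
struct lr_key {
	uint32_t type;
	uint32_t iomode;
	uint64_t fsid;		/* simplification: fsid as one integer */
	const void *file;	/* identity of the nfs4_file, NULL if none */
	uint64_t offset;
	uint64_t length;
};

/* Offset of the last byte covered, NFS4_LENGTH_EOF if it wraps. */
static inline uint64_t last_byte(const struct lr_key *k)
{
	uint64_t end = k->offset + k->length;

	return end > k->offset ? end - 1 : NFS4_LENGTH_EOF;
}

/* Does this layoutreturn count as activity for this layoutrecall? */
static inline bool partial_match(const struct lr_key *recall,
				 const struct lr_key *ret)
{
	if (recall->iomode != ret->iomode &&
	    recall->iomode != IOMODE_ANY && ret->iomode != IOMODE_ANY)
		return false;
	if (recall->type == RECALL_ALL || ret->type == RECALL_ALL)
		return true;
	if (recall->type == RECALL_FSID || ret->type == RECALL_FSID)
		return recall->fsid == ret->fsid;
	/* both are FILE: same file and overlapping byte ranges */
	return recall->file == ret->file &&
	       last_byte(recall) >= ret->offset &&
	       last_byte(ret) >= recall->offset;
}
```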
* algorithm:
    call filesystem layout_return(), exit on error
    update layouts:
        for RETURN_FILE:
            look up matching layouts on the file's fi_layouts list
            (same client, lo_type, same iomode or return IOMODE_ANY,
            and overlapping byte range)
            if found, trim it, and destroy it if no bytes left.
        for RETURN_{FSID,ALL}:
            look up matching layouts on the client's cl_layouts list
            (same lo_type, same iomode or return IOMODE_ANY, and
            same fsid for RETURN_FSID)
            destroy any matching layout
    update layoutrecalls:
        look up matching layoutrecalls on the client's cl_layoutrecalls list
        (same lo_type)
        if perfect_match
            layoutrecall_done;
        else if layouts_found and partial_match
            update last activity timestamp;
* nfsd_layout_recall_cb()
serves as the filesystem layoutrecall callback API function
    validate arguments
    prepare a "master" nfs4_layoutrecall from the nfsd4_pnfs_cb_layout args
    take reference on client and file (if not RECALL_ALL)
    spawn a kernel thread to do the layoutrecall in the background
    (kthread will free the master nfs4_layoutrecall structure when done)
* do_layout_recall() worker thread
    if layoutrecall is for a single client, put on todo list, otherwise
        under nfs4_state_lock(), look up any client having any layout matching the
        recall type:
            RECALL_FILE: has any layout on the recalled file (not necessarily
                         overlapping, for safety)
            RECALL_FSID: has any layout on a file in the recalled fsid
            RECALL_ALL: has a layout
        for each matching client, clone the master layoutrecall structure and
        put on todo list
        (note: don't do the synchronous layoutrecall rpc under the state_lock.
        in the future we may want to just issue async cb_layoutrecall
        rpcs, and wait for them to complete outside the state_lock)
        release the state lock with nfs4_state_unlock()
        at this point master nfs4_layoutrecall is destroyed (in the wildcard case)
    for each nfs4_layoutrecall on the todo list
        if no callback path to this client just destroy the structure, don't send
        mark layoutrecall as PENDING and enlist it on the client's cl_layoutrecalls
        list (take refcounts on client and file).
        do the cb_layoutrecall rpc
        under state_lock: if rpc succeeded and layoutrecall not marked DONE then
        unmark it as PENDING, otherwise (either rpc failed or layoutrecall done),
        destroy_layoutrecall()
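
The PENDING/DONE handshake that decides who frees the nfs4_layoutrecall can be modelled in a few lines of userspace C. This is a sketch under stated assumptions rather than the patch code: struct layoutrecall is a cut-down stand-in, the 'destroyed' field stands in for calling destroy_layoutrecall(), layoutrecall_rpc_finished() is a hypothetical name for the last step of the worker thread, and both functions are assumed to run under the state lock.

```c
#include <stdbool.h>

enum {
	LAYOUTRECALL_FLAG_PENDING = 1 << 0,
	LAYOUTRECALL_FLAG_DONE = 1 << 1,
};

/* Cut-down stand-in for struct nfs4_layoutrecall. */
struct layoutrecall {
	unsigned clr_flags;
	bool destroyed;		/* stands in for destroy_layoutrecall() */
};

/* layoutreturn path: a perfect match was found for this recall. */
static inline void layoutrecall_done(struct layoutrecall *clr)
{
	clr->clr_flags |= LAYOUTRECALL_FLAG_DONE;
	if (!(clr->clr_flags & LAYOUTRECALL_FLAG_PENDING))
		clr->destroyed = true;	/* kthread already let go of it */
}

/* recall kthread: called after the cb_layoutrecall rpc returns. */
static inline void layoutrecall_rpc_finished(struct layoutrecall *clr,
					     bool rpc_ok)
{
	if (rpc_ok && !(clr->clr_flags & LAYOUTRECALL_FLAG_DONE))
		/* hand ownership over to the layoutreturn path */
		clr->clr_flags &= ~LAYOUTRECALL_FLAG_PENDING;
	else
		clr->destroyed = true;	/* rpc failed or return already seen */
}
```

Whichever side runs second frees the structure: if the layoutreturn arrives while the rpc is still outstanding, it only sets DONE and the kthread destroys the recall afterwards; if the rpc completes first, the kthread clears PENDING and the later layoutreturn destroys it.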
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nfsd_layout_cache-full.patch
Url: http://linux-nfs.org/pipermail/pnfs/attachments/20070503/af10f314/attachment.diff