[pnfs] nfsd layout cache design for layoutrecall

Benny Halevy bhalevy at panasas.com
Thu May 3 16:21:06 EDT 2007


My nfsd layout cache changes are finally in a good enough state
for review.  The Connectathon tests pass with them, but no
layout recalls are exercised yet (at least the get/return path is working).

I tried to capture the main ideas below.  I'm also attaching a full
patch of what I have in my workspace so you can apply it in your
branch to see the whole thing.

The change is pretty big and I will cut it into many pieces for
submission of course.

* struct nfsd4_layout_seg added
Embedded in the get, return, and recall structures, as well
as used as a parameter to several internal functions.

struct nfsd4_layout_seg {
	u64                      lo_clid;
	u32                      lo_type;
	u32                      lo_iomode;
	u64                      lo_offset;
	u64                      lo_length;
};

Since this simplified the internal design and implementation,
I think it's worthwhile to define and use this type in the NFSv4.1
protocol. Although this is a wire-format change, now is probably the
last opportunity to make such a big change before last call.
I'll raise this in the next pnfs spec review meeting...

* simplification of the filesystem layout recall callback interface.
the nfs4_cb_layout structure is broken into two parts.  One holds
the minimum set of "args" the filesystem fills in for doing a layoutrecall.

struct nfsd4_pnfs_cb_layout {
	u32                     cbl_recall_type;	/* request */
	struct nfsd4_layout_seg cbl_seg;		/* request */
	u32                     cbl_layoutchanged;	/* request */
};

The other (nfs4_layoutrecall) embeds the structure above and
adds the internal state used for tracking the outstanding
layoutrecall.

* added a can_merge_layouts() method to the vfs api. called by the pnfsd
  layoutget function to see if the layout segments can be merged in the
  layout cache, otherwise each is kept separately, even if it overlaps
  with another outstanding layout.

* added a cl_layoutrecalls list to struct nfs4_client for tracking
  outstanding layout recalls

* changed cl_count in nfs4_client into a kref.
  nfs4_{get,put}_client take and release a reference on the
  client structure under the state lock.

* layoutrecall manipulation is also done under the state lock,
  not under the recall lock.  Special care is taken when sending
  the recalls to avoid a distributed deadlock (the client sending
  layoutreturns while serving a cb_layoutrecall), described
  in more detail below...

* struct nfs4_layout simplified:
struct nfs4_layout {
	struct list_head        lo_perclnt;
	struct list_head        lo_perfile;
	struct nfs4_client     *lo_client;       /* backpointer */
	struct nfs4_file       *lo_file;         /* backpointer */
	struct nfsd4_layout_seg lo_seg;
};

* struct nfs4_layoutrecall:
enum {
	LAYOUTRECALL_FLAG_PENDING = 1<<0,
	LAYOUTRECALL_FLAG_DONE    = 1<<1,
};

/* layoutrecall request (from exported filesystem) */
struct nfs4_layoutrecall {
	struct nfsd4_pnfs_cb_layout     cb;         /* request */
	struct list_head                clr_perclnt;    /* on cl_layoutrecalls */
	struct nfs4_client             *clr_client;
	struct nfs4_file               *clr_file;
	int                             clr_status;
	unsigned                        clr_flags;
	struct timespec                 clr_time;       /* last activity */
};

A layoutrecall is only listed on the cl_layoutrecalls list but holds
a back pointer (and a reference) to a file structure in the RECALL_FILE
and RECALL_FSID cases.

The clr_flags field is used as a handshake between the thread sending the
layoutrecall and the thread handling layoutreturns, since the layoutreturn can
arrive during the layoutrecall rpc: either the client sends it before replying
to the cb_layoutrecall, or the layoutreturn request simply reaches the server
before the cb_layoutrecall response.

* struct nfs4_file:
added fsid and filehandle to the structure.
these are accessed in the get, return, and recall paths
through a pointer to the nfs4_file structure
rather than copying the data into each nfs4_layout and nfs4_layoutrecall
instance.

+#ifdef CONFIG_PNFS
+	/* used by layoutget / layoutrecall */
+	struct nfs4_fsid	fi_fsid;
+	u32			fi_fhlen;
+	u32			fi_fhval[NFS4_FHSIZE];
+#endif /* CONFIG_PNFS */

* added #define NFS4_LENGTH_EOF (~(u64)0) and a couple helper functions:
  end_offset(start, len) and last_byte_offset(start, len)
  to simplify length "arithmetic" (for layout segment matching)
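
To make the length "arithmetic" concrete, here is a minimal userspace
sketch of the two helpers.  The semantics are my assumption, not copied
from the patch: end_offset() is taken to return the first offset past
the range and last_byte_offset() the last offset inside it, both
saturating at NFS4_LENGTH_EOF.  The u64 typedef is a stand-in for the
kernel type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;	/* stand-in for the kernel type */

#define NFS4_LENGTH_EOF (~(u64)0)

/*
 * Assumed semantics: first offset past the range, saturating at
 * NFS4_LENGTH_EOF when the length means "to end of file" or when
 * start + len would wrap around.
 */
static u64 end_offset(u64 start, u64 len)
{
	u64 end = start + len;

	if (len == NFS4_LENGTH_EOF || end < start)
		return NFS4_LENGTH_EOF;
	return end;
}

/*
 * Assumed semantics: offset of the last byte covered by the range,
 * or NFS4_LENGTH_EOF for an open-ended range.
 */
static u64 last_byte_offset(u64 start, u64 len)
{
	u64 end;

	assert(len != 0);	/* an empty range has no last byte */
	end = start + len;
	if (len == NFS4_LENGTH_EOF || end < start)
		return NFS4_LENGTH_EOF;
	return end - 1;
}
```

With these, two segments overlap when each one starts before the
other's end_offset(), without special-casing open-ended lengths at
every call site.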

* nfs4_layout
  tracks outstanding layouts
  listed on per-client and per-file lists
  keeps a reference on the respective client and file.

  * alloc_layout() added to allocate an uninitialized nfs4_layout structure.
    The idea is to allocate it early in layoutget if the filesystem does not support
    merging layouts or if it is the first (and possibly the only) layout for the
    file.  In low memory conditions we want to fail before getting a layout from
    the filesystem, rather than getting one, discovering we can't allocate a layout
    structure to track it, and having to return it to the filesystem.

  * alloc_init_layout() can initialize an existing layout structure (that
    we allocated in advance).  It puts the initialized layout on the
    client and file layout lists.

  * destroy_layout() removes the layout structure from the client and file
    lists and frees it.
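
The allocate-early idea can be illustrated with a small userspace
sketch.  Everything here besides the alloc_layout() name is
hypothetical: the struct is reduced to one field, the filesystem call
is a function pointer, layoutget_prealloc() is a made-up wrapper, and
the simulate_enomem switch only exists to show the failure path.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-in for struct nfs4_layout. */
struct nfs4_layout {
	int initialized;
};

/* Hypothetical stand-in for the filesystem's layout_get method. */
typedef int (*layout_get_fn)(void);

static int fs_calls;	/* counts filesystem calls, for illustration */

static int fake_layout_get(void)
{
	fs_calls++;
	return 0;
}

/* Allocate an uninitialized tracking structure (may fail). */
static struct nfs4_layout *alloc_layout(int simulate_enomem)
{
	if (simulate_enomem)
		return NULL;
	return calloc(1, sizeof(struct nfs4_layout));
}

/*
 * Allocate the tracking structure first, so that an allocation
 * failure is reported before the filesystem hands out a layout that
 * would then have to be returned.
 */
static int layoutget_prealloc(layout_get_fn fs_get, int simulate_enomem,
			      struct nfs4_layout **out)
{
	struct nfs4_layout *lo;
	int status;

	assert(fs_get && out);
	*out = NULL;
	lo = alloc_layout(simulate_enomem);
	if (!lo)
		return -1;	/* fail before involving the filesystem */
	status = fs_get();
	if (status) {
		free(lo);	/* nothing was handed out, nothing to return */
		return status;
	}
	lo->initialized = 1;	/* alloc_init_layout() would list it here */
	*out = lo;
	return 0;
}
```

In the low-memory case the filesystem is never called, so there is no
layout to hand back.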

* nfs4_layoutrecall
  tracks outstanding layoutrecalls
  listed only per client
  but keeps a reference both on the respective client and file
  (if applicable, can be NULL in the RECALL_ALL case)
  clr_flags field used for marking the layoutrecall done and deciding
  who's freeing it.

  * alloc_init_layoutrecall() allocates a nfs4_layoutrecall structure.
    Optionally, it can initialize it by cloning an existing nfs4_layoutrecall
    structure.  This is used to make layoutrecall instances from a "master copy"
    when doing a "wildcard" recall to multiple clients.

  * hash_layoutrecall() enlists the layoutrecall on the client's cl_layoutrecalls
    list and takes a reference on the respective client and file.

  * destroy_layoutrecall() unhashes the layoutrecall structure and frees it.

  
* nfs4_pnfs_get_layout()
  * is_layout_recalled() looks for a layoutrecall matching a layoutget
    (same client and layout_type, for RECALL_ALL or for the same FSID, or overlapping
    in the RECALL_FILE case)

  * extend_layout() can merge a layout segment into an existing one
    by extending its range

  * merge_layout() looks for an existing layout segment on a file
    matching a new layout segment (lo_type, clid, iomode) and 
    either overlapping with or adjacent to it.

  * algorithm:
    if a matching recall exists
	return recallconflict error

    if filesystem can't merge or no layouts on file
	allocate an empty layout structure

    call filesystem layout_get, process status

    if can merge, try merging the layout into an existing one
    and exit if succeeded, otherwise,

    allocate a new layout if needed or initialize the one allocated in advance
    and exit if succeeded, otherwise,

    internally return the layout to the filesystem, return nfserr_layouttrylater;
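
The merge step above can be sketched in userspace as follows.  This is
only an illustration under assumptions: the real code walks the file's
layout list, while the sketch uses a plain array, and the function
signatures as well as the seg_end() and seg_mergeable() helpers are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;	/* stand-ins for the kernel types */
typedef uint32_t u32;

#define NFS4_LENGTH_EOF (~(u64)0)

struct nfsd4_layout_seg {
	u64 lo_clid;
	u32 lo_type;
	u32 lo_iomode;
	u64 lo_offset;
	u64 lo_length;
};

/* Hypothetical helper: first offset past the segment, saturating. */
static u64 seg_end(const struct nfsd4_layout_seg *seg)
{
	u64 end = seg->lo_offset + seg->lo_length;

	if (seg->lo_length == NFS4_LENGTH_EOF || end < seg->lo_offset)
		return NFS4_LENGTH_EOF;
	return end;
}

/* Same client, layout type and iomode, and overlapping or adjacent. */
static int seg_mergeable(const struct nfsd4_layout_seg *lo,
			 const struct nfsd4_layout_seg *nseg)
{
	if (lo->lo_clid != nseg->lo_clid || lo->lo_type != nseg->lo_type ||
	    lo->lo_iomode != nseg->lo_iomode)
		return 0;
	return nseg->lo_offset <= seg_end(lo) && lo->lo_offset <= seg_end(nseg);
}

/* Grow lo so that it covers the union of both ranges. */
static void extend_layout(struct nfsd4_layout_seg *lo,
			  const struct nfsd4_layout_seg *nseg)
{
	u64 start, end;

	assert(seg_mergeable(lo, nseg));
	start = lo->lo_offset < nseg->lo_offset ? lo->lo_offset : nseg->lo_offset;
	end = seg_end(lo) > seg_end(nseg) ? seg_end(lo) : seg_end(nseg);
	lo->lo_offset = start;
	lo->lo_length = (end == NFS4_LENGTH_EOF) ? NFS4_LENGTH_EOF : end - start;
}

/* Returns the index of the layout that absorbed nseg, or -1 if none did. */
static int merge_layout(struct nfsd4_layout_seg *layouts, int n,
			const struct nfsd4_layout_seg *nseg)
{
	int i;

	for (i = 0; i < n; i++) {
		if (seg_mergeable(&layouts[i], nseg)) {
			extend_layout(&layouts[i], nseg);
			return i;
		}
	}
	return -1;
}
```

A return of -1 corresponds to the "otherwise" branch of the algorithm,
where a separate layout structure has to be initialized for the new
segment.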

* nfs4_pnfs_return_layout()
  * trim_layout() trims a layout segment if part of it is returned.
    note: does not support splitting a layout structure; the remains will
    get freed on the final layoutreturn.  if a layoutreturn that would split
    a layout is voluntary, the server may end up thinking the client has a
    layout it already returned, which is ok.

  * layoutrecall_done() marks a nfs4_layoutrecall as done and destroys it if
    it is not marked pending (pending means the layoutrecall kthread is still
    responsible for it)

  * recall_return_perfect_match() determines if a layoutreturn finalizes a layoutrecall.
    the predicate is:
    - same client and lo_type (implicit in the implementation), and
    - same iomode, and
    - same return/recall_type (FILE or FSID or ALL)
      - if RECALL_FILE then same file and byte range, else
      - if RECALL_FSID then same fsid, else
      - if RECALL_ALL then nothing further to compare

  * recall_return_partial_match() determines if a layoutreturn matches
    a layoutrecall (by return/recall types, fsid, and byte range)
    it is called for tracking client layoutreturn activity.
    the predicate is:
    - same client and lo_type (implicit in the implementation), and
    - same iomode, or either the recall or return are for IOMODE_ANY
    - either recall or return are for ALL, or
      (recall and return are either FILE or FSID)
    - either recall or return are for FSID and their fsid matches, or
    - both the recall and return are for the same FILE
      and their byte ranges are overlapping

  * algorithm:
  call filesystem layout_return(), exit on error

  update layouts:
	for RETURN_FILE:
		look up matching layouts on the file's fi_layouts list
		(same client, lo_type, same iomode or return IOMODE_ANY,
		 and overlapping byte range)

			if found, trim it, and destroy it if no bytes left.

	for RETURN_{FSID,ALL}:
		lookup matching layouts on the client's cl_layouts list
		(same lo_type, same iomode or return IOMODE_ANY, and
		 same fsid for RETURN_FSID)

			destroy any matching layout

  update layoutrecalls:
	lookup matching layoutrecalls on client's cl_layoutrecalls list
	(same lo_type)
		if perfect_match
			layoutrecall_done;
		else if layouts_found and partial_match
			update last activity timestamp;
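
The two predicates can be written down roughly as below.  This is a
sketch under assumptions, not the patch code: struct lo_req is a
hypothetical flattening of a recall or a return into one record, fsid
and file are reduced to integers, the enum values are illustrative,
and the client and lo_type checks are left implicit as in the text.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;	/* stand-ins for the kernel types */
typedef uint32_t u32;

#define NFS4_LENGTH_EOF (~(u64)0)

/* Illustrative values only. */
enum { TYPE_FILE = 1, TYPE_FSID = 2, TYPE_ALL = 3 };
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

/* Hypothetical record for either a layoutrecall or a layoutreturn. */
struct lo_req {
	u32 kind;	/* TYPE_FILE, TYPE_FSID or TYPE_ALL */
	u32 iomode;
	u64 fsid;	/* simplified to one integer */
	u64 fileid;	/* stands in for the nfs4_file pointer */
	u64 offset;
	u64 length;
};

static u64 range_end(u64 off, u64 len)
{
	u64 end = off + len;

	return (len == NFS4_LENGTH_EOF || end < off) ? NFS4_LENGTH_EOF : end;
}

static int ranges_overlap(u64 off1, u64 len1, u64 off2, u64 len2)
{
	assert(len1 != 0 && len2 != 0);	/* empty ranges are not expected */
	return off1 < range_end(off2, len2) && off2 < range_end(off1, len1);
}

/* Does this layoutreturn finalize the layoutrecall? */
static int recall_return_perfect_match(const struct lo_req *clr,
				       const struct lo_req *ret)
{
	if (clr->iomode != ret->iomode || clr->kind != ret->kind)
		return 0;
	switch (clr->kind) {
	case TYPE_FILE:
		return clr->fileid == ret->fileid &&
		       clr->offset == ret->offset && clr->length == ret->length;
	case TYPE_FSID:
		return clr->fsid == ret->fsid;
	case TYPE_ALL:
		return 1;
	}
	return 0;
}

/* Does this layoutreturn count as activity toward the layoutrecall? */
static int recall_return_partial_match(const struct lo_req *clr,
				       const struct lo_req *ret)
{
	if (clr->iomode != ret->iomode &&
	    clr->iomode != IOMODE_ANY && ret->iomode != IOMODE_ANY)
		return 0;
	if (clr->kind == TYPE_ALL || ret->kind == TYPE_ALL)
		return 1;
	if (clr->kind == TYPE_FSID || ret->kind == TYPE_FSID)
		return clr->fsid == ret->fsid;
	/* both are TYPE_FILE */
	return clr->fileid == ret->fileid &&
	       ranges_overlap(clr->offset, clr->length, ret->offset, ret->length);
}
```

For example, a RECALL_FILE for bytes 0..8191 is finalized only by a
return of exactly that range, while a return of the first 4096 bytes
just refreshes the recall's activity timestamp.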

* nfsd_layout_recall_cb()
  serves as the filesystem layoutrecall callback API function
  validate arguments
  prepare a "master" nfs4_layoutrecall from the nfsd4_pnfs_cb_layout args
  take reference on client and file (if not RECALL_ALL)
  spawn a kernel thread to do the layoutrecall in the background
  (kthread will free the master nfs4_layoutrecall structure when done)

* do_layout_recall() worker thread
  if layoutrecall is for a single client, put on todo list, otherwise
  under nfs4_state_lock(), lookup any client having any layout matching the
  recall type:
	RECALL_FILE: has any layout on the recalled file (not necessarily
		overlapping, for safety)
	RECALL_FSID: has any layout on a file in the recalled fsid
	RECALL_ALL: has a layout
  for each matching client, clone the master layoutrecall structure and
  put on todo list
	note: don't do the synchronous layoutrecall rpc under the state_lock.
	(in the future we may want to just issue async cb_layoutrecall
	rpcs, and wait for them to complete outside the state_lock)
  release the state lock with nfs4_state_unlock()
  at this point master nfs4_layoutrecall is destroyed (in the wildcard case)

  for each nfs4_layoutrecall on the todo list
	if no callback path to this client just destroy the structure, don't send
	mark layoutrecall as PENDING and enlist it on the client's cl_layoutrecalls
	list (take refcounts on client and file).
	do the cb_layoutrecall rpc
	under state_lock: if rpc succeeded and layoutrecall not marked DONE then
		unmark it as PENDING, otherwise (either rpc failed or layoutrecall done),
		destroy_layoutrecall()
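
The PENDING/DONE handshake between the recall thread and the
layoutreturn path can be modeled in a few lines.  This is a
single-threaded userspace sketch: in the design above both sides run
under the state lock, which is not modeled here.  The
layoutrecall_send_prepare() and layoutrecall_send_complete() names and
the freed field are hypothetical, and the struct is cut down to what
the handshake needs.

```c
#include <assert.h>

enum {
	LAYOUTRECALL_FLAG_PENDING = 1 << 0,
	LAYOUTRECALL_FLAG_DONE    = 1 << 1,
};

/* Reduced stand-in for struct nfs4_layoutrecall. */
struct nfs4_layoutrecall {
	unsigned clr_flags;
	int      freed;	/* hypothetical: stands in for the actual free */
};

static void destroy_layoutrecall(struct nfs4_layoutrecall *clr)
{
	assert(!clr->freed);	/* exactly one side may destroy it */
	clr->freed = 1;
}

/* Recall thread, before the rpc: it owns the structure while PENDING. */
static void layoutrecall_send_prepare(struct nfs4_layoutrecall *clr)
{
	clr->clr_flags |= LAYOUTRECALL_FLAG_PENDING;
}

/* Layoutreturn path: a perfectly matching return arrived. */
static void layoutrecall_done(struct nfs4_layoutrecall *clr)
{
	clr->clr_flags |= LAYOUTRECALL_FLAG_DONE;
	if (!(clr->clr_flags & LAYOUTRECALL_FLAG_PENDING))
		destroy_layoutrecall(clr);
}

/* Recall thread, after the rpc returned. */
static void layoutrecall_send_complete(struct nfs4_layoutrecall *clr,
				       int rpc_status)
{
	if (rpc_status == 0 && !(clr->clr_flags & LAYOUTRECALL_FLAG_DONE))
		clr->clr_flags &= ~LAYOUTRECALL_FLAG_PENDING;
	else
		destroy_layoutrecall(clr);
}
```

Whichever side finishes last frees the structure: if the layoutreturn
arrives while the rpc is still outstanding, layoutrecall_done() only
sets DONE and the recall thread destroys the structure afterwards.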

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nfsd_layout_cache-full.patch
Url: http://linux-nfs.org/pipermail/pnfs/attachments/20070503/af10f314/attachment.diff 
