[pnfs] linux server layout management frame work patch
Benny Halevy
bhalevy at panasas.com
Wed Oct 25 10:22:04 EDT 2006
Andy, I like the basic design,
I have some comments and questions inline below...
Benny
>
>
> pnfsd and the underlying pnfs file system need to cooperate in layout
> management in much the same way that nfsd and cluster file systems
> cooperate in delegation management.
>
> note that since one of the pnfs fs implementations is a cluster file
> system, this code should support multiple MDS.
>
> the underlying pnfs file system will hand out layouts, and will inform
> pnfsd when a layout needs to be recalled.
>
> pnfsd will do the layout bookkeeping, and manage the recall process.
>
> this patch provides a framework for the pnfsd management of layouts, and
> proposes export_operations interfaces for layout management communication
> with the underlying pnfs file system.
>
> it is to be determined exactly what layout management functionality will
> be given to pnfsd vs. the underlying pnfs file system.
>
> ------------------------
> pnfsd layout bookkeeping
> ------------------------
>
> 1) a list of all layouts handed out to a client
> struct nfs4_client->cl_layouts
> which will be used to recall layouts when a client's state is reaped.
>
> 2) a list of all layouts on a file
> struct nfs4_file->fi_layouts
> which will be used to handle recalls. a recall can have a different byte
> range than the given layout, this needs to be remembered.
> the underlying pnfs file system will determine when a recall is to occur.
> see the pnfs file system interface section.
> TODO: this list should be ordered by increasing offset for searching.
>
> 3) a stand-alone lru list of layouts in recall
> layout_recall_lru
> modeled after the delegation lru, this list holds layouts in recall,
> waiting for clients to complete the layout return(s) that will complete
> the recall.
why do we need lru order here?
>
> 4) struct nfs4_layout
> adds a reference count, the above lists, an nfs4_file back pointer
> used to get file information given a layout, and a time that a recall
> started.
> it then includes the existing struct nfs4_cb_layout which holds layout
> parameters.
>
> 5) struct nfs4_layout slab based allocation and free routines as well as
> reference count put and get routines.
>
> 6) the pnfs_hash_cb_layout routine adds a layout to the layout_recall_lru
> what is not written is the management of recall layout byte ranges.
>
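For the TODO in 2), here is a minimal user-space sketch of keeping the per-file list ordered by increasing offset, so that an overlap search can stop early. The struct layout_seg type and the seg_* helpers are hypothetical stand-ins, not the nfs4_layout code in the patch, and a plain singly linked list replaces the kernel list_head. It also ignores offset + length overflow, which real code would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one layout segment on a file's list. */
struct layout_seg {
    uint64_t offset;
    uint64_t length;            /* assumed non-zero */
    struct layout_seg *next;
};

/* Insert seg so the list stays ordered by increasing offset. */
static void seg_insert_sorted(struct layout_seg **head, struct layout_seg *seg)
{
    struct layout_seg **pp = head;

    assert(seg->length > 0);
    while (*pp != NULL && (*pp)->offset <= seg->offset)
        pp = &(*pp)->next;
    seg->next = *pp;
    *pp = seg;
}

/* Half-open interval test: does seg overlap [offset, offset + length)? */
static int seg_overlaps(const struct layout_seg *seg,
                        uint64_t offset, uint64_t length)
{
    return seg->offset < offset + length &&
           offset < seg->offset + seg->length;
}

/*
 * Return the first segment overlapping the range, or NULL. Because the
 * list is ordered, the walk stops at the first segment that starts at
 * or past the end of the range.
 */
static struct layout_seg *seg_find_overlap(struct layout_seg *head,
                                           uint64_t offset, uint64_t length)
{
    struct layout_seg *seg;

    for (seg = head; seg != NULL; seg = seg->next) {
        if (seg->offset >= offset + length)
            break;
        if (seg_overlaps(seg, offset, length))
            return seg;
    }
    return NULL;
}
```

The early exit is the only thing the ordering buys in this sketch; whether that is worth the insert cost depends on how many layouts a file typically carries.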
> ---------------------------------------------
> pnfs file system layout management interface
> ---------------------------------------------
>
> these will be entries in the superblock's export_operations interface.
>
> existing interfaces:
> 1) int (*layout_get) (struct inode *inode, void *buf);
> called by pnfsd upon receiving a layoutget operation from a client.
> pnfs file system returns a layout in the void buf parameter.
>
> 2) int (*layout_return) (struct inode *inode, void *p);
> called by pnfsd upon receiving a layout return operation from a client.
>
> 3) callback from pnfs filesystem on MDS
> int (*cb_layout_recall) (struct super_block *sb, struct inode *inode,
> void *p);
> this identifies the layout to be recalled. this interface needs to be
> expanded - there is no byte range or iomode information.
right, and also layoutrecall_type (i.e. FILE/FSID/ALL)
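To make the missing parameters concrete, here is a rough user-space sketch of what a typed recall argument could carry in place of the opaque void *p, plus the check pnfsd would need to decide whether a held layout falls in the recalled scope. All names here (struct recall_args, struct held_layout, recall_matches, the enum constants) are hypothetical and the enum values are illustrative only; nothing below is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are illustrative only. */
enum recall_type { RECALL_FILE, RECALL_FSID, RECALL_ALL };
enum io_mode { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

/* What the fs might pass instead of an opaque void *p. */
struct recall_args {
    enum recall_type type;
    uint32_t layout_type;
    enum io_mode iomode;     /* IOMODE_ANY recalls both modes */
    uint64_t fsid;           /* used for RECALL_FSID and RECALL_FILE */
    uint64_t fileid;         /* used for RECALL_FILE only */
    uint64_t offset;         /* used for RECALL_FILE only */
    uint64_t length;         /* used for RECALL_FILE only */
};

/* A layout as pnfsd might remember it. */
struct held_layout {
    uint32_t layout_type;
    enum io_mode iomode;
    uint64_t fsid, fileid, offset, length;
};

/* Return 1 if the held layout falls inside the recalled scope. */
static int recall_matches(const struct recall_args *r,
                          const struct held_layout *lo)
{
    assert(lo->iomode != IOMODE_ANY); /* a held layout has a concrete mode */
    if (r->layout_type != lo->layout_type)
        return 0;
    if (r->iomode != IOMODE_ANY && r->iomode != lo->iomode)
        return 0;
    switch (r->type) {
    case RECALL_ALL:
        return 1;
    case RECALL_FSID:
        return r->fsid == lo->fsid;
    case RECALL_FILE:
        /* same file, and the half-open byte ranges overlap */
        return r->fsid == lo->fsid && r->fileid == lo->fileid &&
               r->offset < lo->offset + lo->length &&
               lo->offset < r->offset + r->length;
    }
    return 0;
}
```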
>
>
> behavior and issues
> --------------------
>
> pnfsd receives a LAYOUTGET operation from a client.
> 1) calls export_operations->layout_get.
> - pnfs fs checks for layout conflicts
> - pnfs fs checks for overlap with in-progress recall.
> pnfsd does not do this - another MDS could have an overlapping
> recall in progress.
> 2) if layout returned, pnfsd allocates an nfs4_layout struct and adds
> it to the nfs4_file and nfs4_client hashes.
> - locking on the hash lists - shared between pnfsd and pnfs fs???
yup, as I suggested at the bakeathon, pnfsd might want to create
a placeholder for the new layout in advance, insert it in the hash
(marked as in-progress) and pass a pointer to the placeholder to the
fs, while the hash is locked. Typically, the fs will succeed and
can fill in the new layout; otherwise, pnfsd can clean up after
the fs returns with an error.
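A small user-space sketch of the placeholder idea follows. Everything in it is hypothetical: struct layout_table, the fs_layout_get_t hook and the toy fs_get_ok/fs_get_fail hooks are stand-ins, and the "lock" is just a flag that asserts correct pairing. It keeps the table locked across the call into the fs, which is one reading of the suggestion above; dropping the lock around the fs call and relying on the in-progress mark would be another.

```c
#include <assert.h>
#include <stddef.h>

enum lo_state { LO_IN_PROGRESS, LO_VALID };

struct layout {
    enum lo_state state;
    unsigned long long offset, length;
    struct layout *next;
};

struct layout_table {
    int locked;            /* stand-in for the real hash lock */
    struct layout *head;
};

/* Hypothetical fs hook: fills in *lo and returns 0, or a negative error. */
typedef int (*fs_layout_get_t)(struct layout *lo);

static void table_lock(struct layout_table *t)
{
    assert(!t->locked);
    t->locked = 1;
}

static void table_unlock(struct layout_table *t)
{
    assert(t->locked);
    t->locked = 0;
}

/* Unlink lo from the table if it is present. */
static void table_remove(struct layout_table *t, struct layout *lo)
{
    struct layout **pp;

    for (pp = &t->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == lo) {
            *pp = lo->next;
            lo->next = NULL;
            return;
        }
    }
}

/* Insert a placeholder, let the fs fill it in, clean up on error. */
static int layoutget(struct layout_table *t, struct layout *lo,
                     fs_layout_get_t fs_get)
{
    int err;

    table_lock(t);
    lo->state = LO_IN_PROGRESS;
    lo->next = t->head;
    t->head = lo;
    err = fs_get(lo);      /* fs sees the placeholder, table still locked */
    if (err == 0)
        lo->state = LO_VALID;
    else
        table_remove(t, lo);
    table_unlock(t);
    return err;
}

/* Two toy fs hooks used to exercise both paths. */
static int fs_get_ok(struct layout *lo)
{
    lo->offset = 0;
    lo->length = 4096;
    return 0;
}

static int fs_get_fail(struct layout *lo)
{
    (void)lo;
    return -1;
}
```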
>
> pnfsd receives a CB_LAYOUT callback from the pnfs fs
>
> 1) pnfsd looks up layout using iomode, offset and range in
> nfs4_file->fi_layouts.
> - if exact match, remove layout
exact is too narrow, should grab everything in the recalled scope
<layout_type, layoutrecall_type, iomode, byte range>
Also, I'd consider keeping the layout on the fi_layouts list until
it's returned and just hang it also on the layout_recall_lru (with
a pointer to the layoutrecall control structure for bookkeeping)
> - if byte range subset, leave non-recalled portion on list.
> - if no overlap - return error.
I think we should let the client decide that and return an error to
CB_LAYOUTRECALL in case the server is confused.
> 2) place the cb_layout on the layout_recall_lru.
> 3) bump the layout reference and call the CB_LAYOUT operation to the
> client(s) that hold an overlapping layout.
> - some pnfs fs care about iomode, others don't - issue???
>
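For the "byte range subset" case, the non-recalled portion that stays on the list can be computed by subtracting the recalled range from the held one. The sketch below is a hedged illustration only: struct range and range_subtract are hypothetical names, ranges are half-open, and offset + length overflow is not handled.

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t offset, length;
};

/*
 * Subtract the recalled range from a held range. Writes up to two
 * leftover pieces to out[] and returns how many were written, or -1
 * if the ranges do not overlap (nothing is recalled).
 */
static int range_subtract(struct range held, struct range recall,
                          struct range out[2])
{
    uint64_t h_end = held.offset + held.length;
    uint64_t r_end = recall.offset + recall.length;
    int n = 0;

    assert(held.length > 0 && recall.length > 0);
    if (recall.offset >= h_end || held.offset >= r_end)
        return -1;                          /* no overlap */
    if (recall.offset > held.offset) {      /* piece before the recall */
        out[n].offset = held.offset;
        out[n].length = recall.offset - held.offset;
        n++;
    }
    if (r_end < h_end) {                    /* piece after the recall */
        out[n].offset = r_end;
        out[n].length = h_end - r_end;
        n++;
    }
    return n;
}
```

A return of 0 means the whole held layout is covered by the recall, 1 or 2 means part of it survives, and -1 is the "no overlap" case discussed above.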
> pnfsd receives a LAYOUTRETURN operation from a client.
>
> 1) pnfsd checks the layout_recall_lru for an exactly matching layout.
> if found, this signals a successful completion of a layout callback.
> - pnfsd needs to let pnfs fs know about successful callback - how???
we need to track the recall "meta-operation" and call the fs's asynchronous
callback function when the client calls layoutreturn with the exact scope
<layout_type, layoutrecall_type, iomode, byte range>. The layout_recall_lru
(again, why lru? :) can hang from this control structure.
> 2) pnfsd calls the export_operations->layout_return
> - does pnfsd need to pay attention to layoutreturns that match
> a layout on the layout_recall_lru in all aspects except matching
> byte-range, or just let the pnfs fs handle partial callback returns???
since we (pnfs-obj) care about partial match, the fs needs an opportunity
to check that.
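As a strawman for the recall control structure, something along these lines might do. This is only a user-space sketch under assumptions: struct scope, struct recall_ctl, recall_layoutreturn and the count_done hook are all hypothetical names, the caller is assumed to count each client once, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The recalled scope, as listed above. */
struct scope {
    uint32_t layout_type, recall_type, iomode;
    uint64_t offset, length;
};

/* One recall "meta-operation"; layouts in recall would hang from it. */
struct recall_ctl {
    struct scope scope;
    int clients_pending;                   /* clients that have not returned */
    void (*done)(struct recall_ctl *ctl);  /* hypothetical fs completion hook */
    int done_calls;                        /* bookkeeping for this sketch only */
};

static int scope_equal(const struct scope *a, const struct scope *b)
{
    return a->layout_type == b->layout_type &&
           a->recall_type == b->recall_type &&
           a->iomode == b->iomode &&
           a->offset == b->offset && a->length == b->length;
}

/*
 * Handle one layoutreturn against the recall. Returns 1 if it completed
 * this client's part of the recall, 0 if the scope did not match exactly
 * (a partial return, which is left for the fs to judge).
 */
static int recall_layoutreturn(struct recall_ctl *ctl,
                               const struct scope *returned)
{
    if (!scope_equal(&ctl->scope, returned))
        return 0;
    assert(ctl->clients_pending > 0);
    if (--ctl->clients_pending == 0 && ctl->done != NULL)
        ctl->done(ctl);        /* last client returned: notify the fs */
    return 1;
}

/* Toy completion hook that just counts how often it fires. */
static void count_done(struct recall_ctl *ctl)
{
    ctl->done_calls++;
}
```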
>
> pnfsd expires an nfs4_client due to lease expiration or admin intervention
> 1) pnfsd needs to reap state, so for layouts, should call the
> export_operations->layout_return for all layouts on the
> nfs4_client->cl_layouts list.
> - does the pnfs fs need a flag to indicate layout reap?
>
> ---
>
>
> diff -puN fs/nfsd/nfs4state.c~pnfsd-layout-mgnt fs/nfsd/nfs4state.c
> --- pnfs-old/fs/nfsd/nfs4state.c~pnfsd-layout-mgnt 2006-09-26 12:10:12.000000000 -0400
> +++ pnfs-old-andros/fs/nfsd/nfs4state.c 2006-09-26 12:28:35.000000000 -0400
> @@ -440,6 +440,7 @@ create_client(struct xdr_netobj name, ch
> INIT_LIST_HEAD(&clp->cl_strhash);
> INIT_LIST_HEAD(&clp->cl_openowners);
> INIT_LIST_HEAD(&clp->cl_delegations);
> + INIT_LIST_HEAD(&clp->cl_layouts);
> INIT_LIST_HEAD(&clp->cl_lru);
> out:
> return clp;
> @@ -1004,6 +1005,7 @@ alloc_init_file(struct inode *ino)
> INIT_LIST_HEAD(&fp->fi_hash);
> INIT_LIST_HEAD(&fp->fi_stateids);
> INIT_LIST_HEAD(&fp->fi_delegations);
> + INIT_LIST_HEAD(&fp->fi_layouts);
> list_add(&fp->fi_hash, &file_hashtbl[hashval]);
> fp->fi_inode = igrab(ino);
> fp->fi_id = current_fileid++;
> @@ -1031,6 +1033,9 @@ nfsd4_free_slabs(void)
> nfsd4_free_slab(&file_slab);
> nfsd4_free_slab(&stateid_slab);
> nfsd4_free_slab(&deleg_slab);
> +#if defined(CONFIG_PNFS)
> + nfsd4_free_slab(&pnfs_layout_slab);
> +#endif /* CONFIG_PNFS */
> }
>
> static int
> @@ -1052,6 +1057,13 @@ nfsd4_init_slabs(void)
> sizeof(struct nfs4_delegation), 0, 0, NULL, NULL);
> if (deleg_slab == NULL)
> goto out_nomem;
> +#if defined(CONFIG_PNFS)
> + pnfs_layout_slab = kmem_cache_create("pnfs layouts",
> + sizeof(struct nfs4_layout), 0, 0, NULL, NULL);
> + if (pnfs_layout_slab == NULL)
> + goto out_nomem;
> +#endif /* CONFIG_PNFS */
> +
> return 0;
> out_nomem:
> nfsd4_free_slabs();
> @@ -3348,6 +3360,102 @@ nfs4_reset_lease(time_t leasetime)
> #if defined(CONFIG_PNFS)
>
> /*
> + * Layout state - NFSv4.1 pNFS
> + */
> +static struct list_head layout_recall_lru;
> +static int num_layouts;
> +
> +static kmem_cache_t *pnfs_layout_slab = NULL;
> +
> +/* Called under the state lock. */
> +static void
> +pnfs_unhash_layout(struct nfs4_layout *lp)
> +{
> + list_del_init(&lp->lo_perfile);
> +}
> +
> +static void
> +free_nfs4_layout(struct kref *kref)
> +{
> + struct nfs4_layout *lp;
> +
> + lp = container_of(kref, struct nfs4_layout, lo_ref);
> + pnfs_unhash_layout(lp);
> + kmem_cache_free(pnfs_layout_slab, lp);
> +}
> +
> +static inline void
> +put_nfs4_layout(struct nfs4_layout *lp)
> +{
> + kref_put(&lp->lo_ref, free_nfs4_layout);
> +}
> +
> +static inline void
> +get_nfs4_layout(struct nfs4_layout *lp)
> +{
> + kref_get(&lp->lo_ref);
> +}
> +
> +static struct nfs4_layout *
> +pnfs_alloc_init_layout(struct nfs4_client *clp, struct svc_fh *current_fh, struct nfsd4_pnfs_layoutget *lg)
> +{
> + struct nfs4_layout *lp;
> + struct nfs4_file *fp = NULL;
> + struct nfs4_callback *cb = &clp->cl_callback;
> + struct inode *ino = current_fh->fh_dentry->d_inode;
> +
> + dprintk("NFSD alloc_init_layout\n");
> + lp = kmem_cache_alloc(pnfs_layout_slab, GFP_KERNEL);
> + if (lp == NULL)
> + return lp;
> +
> + kref_init(&lp->lo_ref);
> + INIT_LIST_HEAD(&lp->lo_perfile);
> + INIT_LIST_HEAD(&lp->lo_perclnt);
> + INIT_LIST_HEAD(&lp->lo_recall_lru);
> +
> + fp = find_file(ino);
> + if (!fp) {
> + fp = alloc_init_file(ino);
> + if (fp == NULL)
> + goto out_free;
> + } else
> + get_nfs4_file(fp);
> +
> + lp->lo_file = fp;
> + lp->lo_time = 0;
> + memset(&lp->lo_cb_layout, 0, sizeof(struct nfs4_cb_layout));
> + lp->lo_client = clp;
> + lp->lo_sb = ino->i_sb;
> + lp->lo_ident = cb->cb_ident;
> + lp->lo_fhlen = current_fh->fh_handle.fh_size;
> + memcpy(lp->lo_fhval, &current_fh->fh_handle.fh_base,
> + current_fh->fh_handle.fh_size);
> + lp->lo_layout_type = lg->lg_type;
> + lp->lo_iomode = lg->lg_iomode;
> + lp->lo_offset = lg->lg_offset;
> + lp->lo_length = lg->lg_length;
> + num_layouts++;
> + return lp;
> +
> +out_free:
> + put_nfs4_layout(lp);
> + return ERR_PTR(-ENOMEM);
> +}
> +
> +static void
> +pnfs_hash_layoutget(struct nfs4_layout *lp)
> +{
> + list_add(&lp->lo_perfile, &lp->lo_file->fi_layouts);
> + list_add(&lp->lo_perclnt, &lp->lo_client->cl_layouts);
> +}
> +
> +static void
> +pnfs_hash_cb_layout(struct nfs4_layout *lp)
> +{
> + list_add(&lp->lo_recall_lru, &layout_recall_lru);
> +}
> +/*
> * get_state() and cb_get_state() are
> */
> static void
> diff -puN include/linux/nfsd/state.h~pnfsd-layout-mgnt include/linux/nfsd/state.h
> --- pnfs-old/include/linux/nfsd/state.h~pnfsd-layout-mgnt 2006-09-26 12:10:12.000000000 -0400
> +++ pnfs-old-andros/include/linux/nfsd/state.h 2006-09-26 12:23:29.000000000 -0400
> @@ -123,6 +123,7 @@ struct nfs4_client {
> struct list_head cl_strhash; /* hash by cl_name */
> struct list_head cl_openowners;
> struct list_head cl_delegations;
> + struct list_head cl_layouts;
> struct list_head cl_lru; /* tail queue */
> struct xdr_netobj cl_name; /* id generated by client */
> char cl_recdir[HEXDIR_LEN]; /* recovery dir */
> @@ -158,6 +159,26 @@ struct nfs4_cb_layout {
> u32 cbl_fhval[NFS4_FHSIZE];
> };
>
> +struct nfs4_layout {
> + struct kref lo_ref;
> + struct list_head lo_perfile; /* hash by f_id */
> + struct list_head lo_perclnt; /* hash by clientid */
> + struct list_head lo_recall_lru; /* when in recall */
> + struct nfs4_file *lo_file; /* backpointer */
> + time_t lo_time; /* time recall started */
> + struct nfs4_cb_layout lo_cb_layout;
> +};
> +
> +#define lo_client lo_cb_layout.cbl_client
> +#define lo_sb lo_cb_layout.cbl_so
> +#define lo_ident lo_cb_layout.cbl_ident
> +#define lo_fhlen lo_cb_layout.cbl_fhlen
> +#define lo_fhval lo_cb_layout.cbl_fhval
> +#define lo_layout_type lo_cb_layout.cbl_layout_type
> +#define lo_iomode lo_cb_layout.cbl_iomode
> +#define lo_offset lo_cb_layout.cbl_offset
> +#define lo_length lo_cb_layout.cbl_length
> +
> /* struct nfs4_client_reset
> * one per old client. Populates reset_str_hashtbl. Filled from conf_id_hashtbl
> * upon lease reset, or from upcall to state_daemon (to read in state
> @@ -245,6 +266,7 @@ struct nfs4_file {
> struct list_head fi_hash; /* hash by "struct inode *" */
> struct list_head fi_stateids;
> struct list_head fi_delegations;
> + struct list_head fi_layouts;
> struct inode *fi_inode;
> u32 fi_id; /* used with stateowner->so_id
> * for stateid_hashtbl hash */
> _
>
William A.(Andy) Adamson wrote:
> marc, benny, and i met at the bakeathon to design the layout recall/return
> management architecture for the pnfsd used by the GPFS and Panasas pnfs file
> systems.
> here is the beginning framework which i'm passing onto benny and marc. i'm
> posting this to the list because it will eventually be added to the CVS tree,
> and for general interest.
>
> -->Andy
>
>
>