[pnfs] linux server layout management frame work patch
William A. (Andy) Adamson
andros at citi.umich.edu
Wed Oct 25 11:21:08 EDT 2006
hi benny
response in line below
-->Andy
On 10/25/06, Benny Halevy <bhalevy at panasas.com> wrote:
>
> Andy, I like the basic design,
> I have some comments and questions inline below...
>
> Benny
>
> >
> >
> > pnfsd and the underlying pnfs file system need to cooperate in layout
> > management in much the same way that nfsd and cluster file systems
> > cooperate in delegation management.
> >
> > note that since one of the pnfs fs implementations is a cluster file
> > system, this code should support multiple MDS.
> >
> > the underlying pnfs file system will hand out layouts, and will inform
> > pnfsd when a layout needs to be recalled.
> >
> > pnfsd will bookkeep layouts, and manage the recall process.
> >
> > this patch provides a framework for the pnfsd management of layouts, and
> > proposes export_operations interfaces for layout management communication
> > with the underlying pnfs file system.
> >
> > it is to be determined exactly what layout management functionality will
> > be given to pnfsd vs. the underlying pnfs file system.
> >
> > ------------------------
> > pnfsd layout bookkeeping
> > ------------------------
> >
> > 1) a list of all layouts handed out to a client
> > struct nfs4_client->cl_layouts
> > which will be used to recall layouts when a client's state is reaped.
> >
> > 2) a list of all layouts on a file
> > struct nfs4_file->fi_layouts
> > which will be used to handle recalls. a recall can have a different byte
> > range than the given layout, this needs to be remembered.
> > the underlying pnfs file system will determine when a recall is to occur.
> > see the pnfs file system interface section.
> > TODO: this list should be ordered by increasing offset for searching.
> >
> > 3) a standalone lru list of layouts in recall
> > layout_recall_lru
> > modeled after the delegation lru, this list holds layouts in recall,
> > waiting for clients to complete the layout return(s) that will complete
> > the recall.
>
> why do we need lru order here?
the client has lease time to give the layout return that matches the
callback request. we remember the time the call back was issued - the lru
allows the server to walk the layout list checking the times looking for
layouts to reap due to lease expiration. once the server encounters a time
that is within the lease it can stop looking.
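
The early-exit walk described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct recall_entry and count_expired() are hypothetical names, not part of the patch, and the list is assumed to be kept oldest-first so that the first entry still inside the lease ends the scan.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a layout sitting on the recall list. */
struct recall_entry {
    time_t recall_time;          /* when the recall callback was issued */
    struct recall_entry *next;   /* next, more recently recalled, entry */
};

/*
 * Count the entries whose recall has been outstanding for at least one
 * lease period. Entries are assumed to be ordered oldest first, so the
 * walk stops at the first entry that is still within the lease.
 */
size_t count_expired(const struct recall_entry *oldest, time_t now, time_t lease)
{
    size_t expired = 0;

    assert(lease >= 0);
    for (const struct recall_entry *e = oldest; e != NULL; e = e->next) {
        if (now - e->recall_time < lease)
            break;               /* everything after this is newer */
        expired++;
    }
    return expired;
}
```

A reaper would act on each expired entry instead of counting it; the point is only that ordering by recall time bounds the scan.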
>
> > 4) struct nfs4_layout
> > adds a reference count, the above lists, an nfs4_file back pointer
> > used to get file information given a layout, and a time that a recall
> > started.
> > it then includes the existing struct nfs4_cb_layout which holds layout
> > parameters.
> >
> > 5) struct nfs4_layout slab based allocation and free routines as well as
> > reference count put and get routines.
> >
> > 6) the pnfs_hash_cb_layout routine adds a layout to the layout_recall_lru.
> > what is not written is the management of recall layout byte ranges.
> >
> > ---------------------------------------------
> > pnfs file system layout management interface
> > ---------------------------------------------
> >
> > these will be entries in the superblock's export_operations interface.
> >
> > existing interfaces:
> > 1) int (*layout_get) (struct inode *inode, void *buf);
> > called by pnfsd upon receiving a layoutget operation from a client.
> > pnfs file system returns a layout in the void buf parameter.
> >
> > 2) int (*layout_return) (struct inode *inode, void *p);
> > called by pnfsd upon receiving a layout return operation from a client.
> >
> > 3) callback from pnfs filesystem on MDS
> > int (*cb_layout_recall) (struct super_block *sb, struct inode *inode, void *p);
> > this identifies the layout to be recalled. this interface needs to be
> > expanded - there is no byte range or iomode information.
>
> right, and also layoutrecall_type (i.e. FILE/FSID/ALL)
>
> >
> >
> > behavior and issues
> > --------------------
> >
> > pnfsd receives a LAYOUTGET operation from a client.
> > 1) calls export_operations->layoutget.
> > - pnfs fs checks for layout conflicts
> > - pnfs fs checks for overlap with in-progress recall.
> > pnfsd does not do this - another MDS could have an overlapping recall
> > in progress.
> > 2) if layout returned, pnfsd allocates an nfs4_layout struct and adds
> > it to the nfs4_file and nfs4_client hashes.
> > - locking on the hash lists - shared between pnfsd and pnfs fs???
>
> yup, as I suggested in the bakeathon pnfsd might want to create
> a placeholder for the new layout in advance, insert it in the hash
> (marked as in-progress) and pass a pointer to the placeholder to the
> fs, while the hash is locked. Typically, the fs will succeed and the
> fs can fill in the new layout; otherwise, pnfsd can clean up after
> the fs returns with an error.
ok
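
A rough userspace sketch of the placeholder idea, under stated assumptions: struct layout_slot, struct layout_table, layout_get_with_placeholder() and the two example callbacks are all hypothetical names used for illustration, and the locking that pnfsd would need around the table is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical placeholder for a layout that is being handed out. */
struct layout_slot {
    bool in_progress;                   /* true until the fs fills it in */
    unsigned long long offset, length;
    struct layout_slot *next;
};

struct layout_table {
    struct layout_slot *head;           /* a real server would lock this */
};

/* Stand-in for the fs layout_get call: returns 0 on success. */
typedef int (*fs_layout_get_fn)(struct layout_slot *slot);

struct layout_slot *layout_get_with_placeholder(struct layout_table *t,
                                                fs_layout_get_fn fs_get)
{
    struct layout_slot *slot;

    assert(t != NULL && fs_get != NULL);
    slot = calloc(1, sizeof(*slot));
    if (slot == NULL)
        return NULL;

    /* Insert the placeholder first (under the hash lock in a real server). */
    slot->in_progress = true;
    slot->next = t->head;
    t->head = slot;

    if (fs_get(slot) != 0) {
        /* The fs refused: unlink and free the placeholder. */
        struct layout_slot **pp = &t->head;
        while (*pp != slot)
            pp = &(*pp)->next;
        *pp = slot->next;
        free(slot);
        return NULL;
    }
    slot->in_progress = false;          /* the layout is now valid */
    return slot;
}

/* Example callbacks, for illustration only. */
int example_fs_grant(struct layout_slot *slot)
{
    slot->offset = 0;
    slot->length = 4096;
    return 0;
}

int example_fs_deny(struct layout_slot *slot)
{
    (void)slot;
    return -1;
}
```

Calling layout_get_with_placeholder(&table, example_fs_grant) leaves a filled-in slot on the table, while example_fs_deny leaves the table unchanged.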
>
> > pnfsd receives a CB_LAYOUT callback from the pnfs fs
> >
> > 1) pnfsd looks up layout using iomode, offset and range in
> > nfs4_file->fi_layouts.
> > - if exact match, remove layout
>
> exact is too narrow, should grab everything in the recalled scope
> <layout_type, layoutrecall_type, iomode, byte range>
you're right - the CB_LAYOUTRECALL is not complete until the full range
specified in the recall is returned in a LAYOUTRETURN. so, exact is too
narrow.
A pNFS client must
always return layout segments that comprise the full range specified
by the recall. Note, the full recalled layout range need not be
returned as part of a single operation, but may be returned in
segments.
> Also, I'd consider keeping the layout on the fi_layouts list until
> it's returned and just hang it also on the layout_recall_lru (with
> a pointer to the layoutrecall control structure for bookkeeping)
why? the purpose of moving it off the fi_layouts is so that it is not in
the search for legal layouts.
> > - if byte range subset, leave non-recalled portion on list.
> > - if no overlap - return error.
>
> I think we should let the client decide that and return an error to
> CB_LAYOUTRECALL
> in case the server is confused.
yeah, i don't know if pnfsd cares about portions of a recall being
returned - it should probably just care about getting the one full range
LAYOUTRETURN that allows us to complete the recall
> > 2) place the cb_layout on the layout_recall_lru.
> > 3) bump the layout reference and call the CB_LAYOUT operation to the
> > client(s) that hold an overlapping layout.
> > - some pnfs fs care about iomode, others don't - issue???
> >
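
The byte-range cases in step 1 above (full match, subset, no overlap) can be sketched as a range subtraction. This is a minimal illustration, not code from the patch: struct byte_range and range_subtract() are hypothetical names, and ranges are treated as half-open [offset, offset + length) with no special "to end of file" length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct byte_range {
    uint64_t offset;
    uint64_t length;
};

/*
 * Remove the recalled range from a held layout range.
 * Returns -1 if the two ranges do not overlap, otherwise the number of
 * leftover (non-recalled) pieces written to out[], which is 0, 1 or 2.
 */
int range_subtract(struct byte_range held, struct byte_range recall,
                   struct byte_range out[2])
{
    uint64_t h_end = held.offset + held.length;
    uint64_t r_end = recall.offset + recall.length;
    int n = 0;

    assert(out != NULL);
    if (recall.offset >= h_end || r_end <= held.offset)
        return -1;                       /* no overlap */

    if (recall.offset > held.offset) {   /* piece left before the recall */
        out[n].offset = held.offset;
        out[n].length = recall.offset - held.offset;
        n++;
    }
    if (r_end < h_end) {                 /* piece left after the recall */
        out[n].offset = r_end;
        out[n].length = h_end - r_end;
        n++;
    }
    return n;
}
```

Recalling bytes 25..74 from a layout covering bytes 0..99 leaves two pieces, [0,25) and [75,100), which would stay on the per-file list.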
> > pnfsd receives a LAYOUTRETURN operation from a client.
> >
> > 1) pnfsd checks the layout_recall_lru for an exactly matching layout.
> > if found, this signals a successful completion of a layout callback.
> > - pnfsd needs to let pnfs fs know about successful callback - how???
>
> we need to track the recall "meta-operation" and call fs asynchronous
> callback function when the client calls layoutreturn with the exact scope
> <layout_type, layoutrecall_type, iomode, byte range>. The layout_recall_lru
> (again, why lru? :) can hang from this control structure.
>
> > 2) pnfsd calls the export_operations->layoutreturn
> > - does pnfsd need to pay attention to layoutreturns that match
> > a layout on the layout_recall_lru in all aspects except matching
> > byte-range, or just let the pnfs fs handle partial callback returns???
>
> since we (pnfs-obj) care about partial match, the fs needs an opportunity
> to check that.
ok - so pnfsd passes all layoutreturns to fs, and only bookkeeps the
layoutreturn that allows pnfsd to complete the recall.
>
> > pnfsd expires an nfs4_client due to lease expiration or admin intervention
> > 1) pnfsd needs to reap state, so for layouts, should call the
> > export_operations->layoutreturn for all layouts on the
> > nfs4_client->cl_layouts list.
> > - does the pnfs fs need a flag to indicate layout reap?
> >
> > ---
> >
> >
> > diff -puN fs/nfsd/nfs4state.c~pnfsd-layout-mgnt fs/nfsd/nfs4state.c
> > --- pnfs-old/fs/nfsd/nfs4state.c~pnfsd-layout-mgnt 2006-09-26 12:10:12.000000000 -0400
> > +++ pnfs-old-andros/fs/nfsd/nfs4state.c 2006-09-26 12:28:35.000000000 -0400
> > @@ -440,6 +440,7 @@ create_client(struct xdr_netobj name, ch
> > INIT_LIST_HEAD(&clp->cl_strhash);
> > INIT_LIST_HEAD(&clp->cl_openowners);
> > INIT_LIST_HEAD(&clp->cl_delegations);
> > + INIT_LIST_HEAD(&clp->cl_layouts);
> > INIT_LIST_HEAD(&clp->cl_lru);
> > out:
> > return clp;
> > @@ -1004,6 +1005,7 @@ alloc_init_file(struct inode *ino)
> > INIT_LIST_HEAD(&fp->fi_hash);
> > INIT_LIST_HEAD(&fp->fi_stateids);
> > INIT_LIST_HEAD(&fp->fi_delegations);
> > + INIT_LIST_HEAD(&fp->fi_layouts);
> > list_add(&fp->fi_hash, &file_hashtbl[hashval]);
> > fp->fi_inode = igrab(ino);
> > fp->fi_id = current_fileid++;
> > @@ -1031,6 +1033,9 @@ nfsd4_free_slabs(void)
> > nfsd4_free_slab(&file_slab);
> > nfsd4_free_slab(&stateid_slab);
> > nfsd4_free_slab(&deleg_slab);
> > +#if defined(CONFIG_PNFS)
> > + nfsd4_free_slab(&pnfs_layout_slab);
> > +#endif /* CONFIG_PNFS */
> > }
> >
> > static int
> > @@ -1052,6 +1057,13 @@ nfsd4_init_slabs(void)
> > sizeof(struct nfs4_delegation), 0, 0, NULL, NULL);
> > if (deleg_slab == NULL)
> > goto out_nomem;
> > +#if defined(CONFIG_PNFS)
> > + pnfs_layout_slab = kmem_cache_create("pnfs layouts",
> > + sizeof(struct nfs4_layout), 0, 0, NULL, NULL);
> > + if (pnfs_layout_slab == NULL)
> > + goto out_nomem;
> > +#endif /* CONFIG_PNFS */
> > +
> > return 0;
> > out_nomem:
> > nfsd4_free_slabs();
> > @@ -3348,6 +3360,102 @@ nfs4_reset_lease(time_t leasetime)
> > #if defined(CONFIG_PNFS)
> >
> > /*
> > + * Layout state - NFSv4.1 pNFS
> > + */
> > +static struct list_head layout_recall_lru;
> > +static int num_layouts;
> > +
> > +static kmem_cache_t *pnfs_layout_slab = NULL;
> > +
> > +/* Called under the state lock. */
> > +static void
> > +pnfs_unhash_layout(struct nfs4_layout *lp)
> > +{
> > + list_del_init(&lp->lo_perfile);
> > +}
> > +
> > +static void
> > +free_nfs4_layout(struct kref *kref)
> > +{
> > + struct nfs4_layout *lp;
> > +
> > + lp = container_of(kref, struct nfs4_layout, lo_ref);
> > + pnfs_unhash_layout(lp);
> > + kmem_cache_free(pnfs_layout_slab, lp);
> > +}
> > +
> > +static inline void
> > +put_nfs4_layout(struct nfs4_layout *lp)
> > +{
> > + kref_put(&lp->lo_ref, free_nfs4_layout);
> > +}
> > +
> > +static inline void
> > +get_nfs4_layout(struct nfs4_layout *lp)
> > +{
> > + kref_get(&lp->lo_ref);
> > +}
> > +
> > +static struct nfs4_layout *
> > +pnfs_alloc_init_layout(struct nfs4_client *clp, struct svc_fh *current_fh, struct nfsd4_pnfs_layoutget *lg)
> > +{
> > + struct nfs4_layout *lp;
> > + struct nfs4_file *fp = NULL;
> > + struct nfs4_callback *cb = &clp->cl_callback;
> > + struct inode *ino = current_fh->fh_dentry->d_inode;
> > +
> > + dprintk("NFSD alloc_init_layout\n");
> > + lp = kmem_cache_alloc(pnfs_layout_slab, GFP_KERNEL);
> > + if (lp == NULL)
> > + return lp;
> > +
> > + kref_init(&lp->lo_ref);
> > + INIT_LIST_HEAD(&lp->lo_perfile);
> > + INIT_LIST_HEAD(&lp->lo_perclnt);
> > + INIT_LIST_HEAD(&lp->lo_recall_lru);
> > +
> > + fp = find_file(ino);
> > + if (!fp) {
> > + fp = alloc_init_file(ino);
> > + if (fp == NULL)
> > + goto out_free;
> > + } else
> > + get_nfs4_file(fp);
> > +
> > + lp->lo_file = fp;
> > + lp->lo_time = 0;
> > + memset(&lp->lo_cb_layout, 0, sizeof(struct nfs4_cb_layout));
> > + lp->lo_client = clp;
> > + lp->lo_sb = ino->i_sb;
> > + lp->lo_ident = cb->cb_ident;
> > + lp->lo_fhlen = current_fh->fh_handle.fh_size;
> > + memcpy(lp->lo_fhval, &current_fh->fh_handle.fh_base,
> > + current_fh->fh_handle.fh_size);
> > + lp->lo_layout_type = lg->lg_type;
> > + lp->lo_iomode = lg->lg_iomode;
> > + lp->lo_offset = lg->lg_offset;
> > + lp->lo_length = lg->lg_length;
> > + num_layouts++;
> > + return lp;
> > +
> > +out_free:
> > + put_nfs4_layout(lp);
> > + return ERR_PTR(-ENOMEM);
> > +}
> > +
> > +static void
> > +pnfs_hash_layoutget(struct nfs4_layout *lp)
> > +{
> > + list_add(&lp->lo_perfile, &lp->lo_file->fi_layouts);
> > + list_add(&lp->lo_perclnt, &lp->lo_client->cl_layouts);
> > +}
> > +
> > +static void
> > +pnfs_hash_cb_layout(struct nfs4_layout *lp)
> > +{
> > + list_add(&lp->lo_recall_lru, &layout_recall_lru);
> > +}
> > +/*
> > * get_state() and cb_get_state() are
> > */
> > static void
> > diff -puN include/linux/nfsd/state.h~pnfsd-layout-mgnt include/linux/nfsd/state.h
> > --- pnfs-old/include/linux/nfsd/state.h~pnfsd-layout-mgnt 2006-09-26 12:10:12.000000000 -0400
> > +++ pnfs-old-andros/include/linux/nfsd/state.h 2006-09-26 12:23:29.000000000 -0400
> > @@ -123,6 +123,7 @@ struct nfs4_client {
> > struct list_head cl_strhash; /* hash by cl_name */
> > struct list_head cl_openowners;
> > struct list_head cl_delegations;
> > + struct list_head cl_layouts;
> > struct list_head cl_lru; /* tail queue */
> > struct xdr_netobj cl_name; /* id generated by client */
> > char cl_recdir[HEXDIR_LEN]; /* recovery dir */
> > @@ -158,6 +159,26 @@ struct nfs4_cb_layout {
> > u32 cbl_fhval[NFS4_FHSIZE];
> > };
> >
> > +struct nfs4_layout {
> > + struct kref lo_ref;
> > + struct list_head lo_perfile; /* hash by f_id */
> > + struct list_head lo_perclnt; /* hash by clientid */
> > + struct list_head lo_recall_lru; /* when in recall */
> > + struct nfs4_file *lo_file; /* backpointer */
> > + time_t lo_time; /* time recall started */
> > + struct nfs4_cb_layout lo_cb_layout;
> > +};
> > +
> > +#define lo_client lo_cb_layout.cbl_client
> > +#define lo_sb lo_cb_layout.cbl_sb
> > +#define lo_ident lo_cb_layout.cbl_ident
> > +#define lo_fhlen lo_cb_layout.cbl_fhlen
> > +#define lo_fhval lo_cb_layout.cbl_fhval
> > +#define lo_layout_type lo_cb_layout.cbl_layout_type
> > +#define lo_iomode lo_cb_layout.cbl_iomode
> > +#define lo_offset lo_cb_layout.cbl_offset
> > +#define lo_length lo_cb_layout.cbl_length
> > +
> > /* struct nfs4_client_reset
> > * one per old client. Populates reset_str_hashtbl. Filled from conf_id_hashtbl
> > * upon lease reset, or from upcall to state_daemon (to read in state
> > @@ -245,6 +266,7 @@ struct nfs4_file {
> > struct list_head fi_hash; /* hash by "struct inode *" */
> > struct list_head fi_stateids;
> > struct list_head fi_delegations;
> > + struct list_head fi_layouts;
> > struct inode *fi_inode;
> > u32 fi_id; /* used with stateowner->so_id
> > * for stateid_hashtbl hash */
> > _
> >
>
>
> William A.(Andy) Adamson wrote:
> > marc, benny, and i met at the bakeathon to design the layout recall/return
> > management architecture for the pnfsd used by the GPFS and Panasas pnfs
> > file systems.
> > here is the beginning framework which i'm passing onto benny and marc. i'm
> > posting this to the list because it will eventually be added to the CVS
> > tree, and for general interest.
> > and for general interest.
> >
> > -->Andy
> >
> >
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > pNFS mailing list
> > pNFS at linux-nfs.org
> > http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
>
>