[pnfs] linux server layout management frame work patch

Benny Halevy bhalevy at panasas.com
Wed Oct 25 10:22:04 EDT 2006


Andy, I like the basic design.
I have some comments and questions inline below.

Benny

 >
 >
 > pnfsd and the underlying pnfs file system need to cooperate in layout
 > management
 > in much the same way that nfsd and cluster file systems cooperate in
 > delegation management.
 >
 > note that since one of the pnfs fs implementations is a cluster file
 > system, this code should support multiple MDS.
 >
 > the underlying pnfs file system will hand out layouts, and will inform
 > pnfsd when a layout needs to be recalled.
 >
 > pnfsd will do the layout bookkeeping, and manage the recall process.
 >
 > this patch provides a framework for the pnfsd management of layouts, and
 > proposes export_operation interfaces for layout management communication
 > with the underlying pnfs file system.
 >
 > it is to be determined exactly what layout management functionality will
 > be given to pnfsd vs. the underlying pnfs file system.
 >
 > ------------------------
 > pnfsd layout bookkeeping
 > ------------------------
 >
 > 1) a list of all layouts handed out to a client
 >     struct nfs4_client->cl_layouts
 > which will be used to recall layouts when a client's state is reaped.
 >
 > 2) a list of all layouts on a file
 >     struct nfs4_file->fi_layouts
 > which will be used to handle recalls. a recall can have a different byte
 > range than the given layout, this needs to be remembered.
 > the underlying pnfs file system will determine when a recall is to occur.
 > see the pnfs file system interface section.
 > TODO: this list should be ordered by increasing offset for searching.
 >
 > 3) a standalone lru list of layouts in recall
 >     layout_recall_lru
 > modeled after the delegation lru, this list holds layouts in recall, waiting
 > for clients to complete the layout return(s) that will complete the recall.

why do we need lru order here?

 >
 > 4) struct nfs4_layout
 >     adds a reference count, the above lists, an nfs4_file back pointer
 > used to get file information given a layout, and the time that a recall
 > started. it then includes the existing struct nfs4_cb_layout which holds
 > layout parameters.
 >
 > 5) struct nfs4_layout slab based allocation and free routines as well as
 > reference count put and get routines.
 >
 > 6) the pnfs_hash_cb_layout routine adds a layout to the layout_recall_lru.
 > what is not yet written is the management of recall layout byte ranges.
 >
 > ---------------------------------------------
 > pnfs file system layout management interface
 > ---------------------------------------------
 >
 > these will be entries in the superblock's export_operations interface.
 >
 > existing interfaces:
 > 1) int (*layout_get) (struct inode *inode, void *buf);
 > called by pnfsd upon receiving a layoutget operation from a client.
 > pnfs file system returns a layout in the void buf parameter.
 >
 > 2) int (*layout_return) (struct inode *inode, void *p);
 > called by pnfsd upon receiving a layout return operation from a client.
 >
 > 3) callback from pnfs filesystem on MDS
 > int (*cb_layout_recall) (struct super_block *sb, struct inode *inode, void *p);
 > this identifies the layout to be recalled. this interface needs to be
 > expanded - there is no byte range or iomode information.

right, and also layoutrecall_type (i.e. FILE/FSID/ALL)
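To make the missing fields concrete, here is a minimal userspace sketch of what a typed recall argument might carry in place of the opaque void *p. Every name in it (layout_recall_args, RECALL_*, IOMODE_*, LENGTH_EOF) is hypothetical and chosen for illustration; the numeric values and the "length of all ones means to end of file" convention are assumptions, not something the patch defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and values; none of this is in the posted patch. */
enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

#define LENGTH_EOF UINT64_MAX   /* assumed convention: "to end of file" */

/* What the fs could pass to cb_layout_recall instead of void *p. */
struct layout_recall_args {
    uint32_t recall_type;   /* RECALL_FILE, RECALL_FSID or RECALL_ALL */
    uint32_t layout_type;
    uint32_t iomode;
    uint64_t offset;        /* only meaningful for RECALL_FILE */
    uint64_t length;        /* only meaningful for RECALL_FILE */
};

/* Sanity check pnfsd could apply before it starts a recall. */
bool layout_recall_args_valid(const struct layout_recall_args *ra)
{
    assert(ra != NULL);
    if (ra->recall_type < RECALL_FILE || ra->recall_type > RECALL_ALL)
        return false;
    if (ra->iomode < IOMODE_READ || ra->iomode > IOMODE_ANY)
        return false;
    if (ra->recall_type != RECALL_FILE)
        return true;        /* byte range is ignored for FSID/ALL */
    if (ra->length == 0)
        return false;
    /* a bounded range must not wrap past the end of the offset space */
    if (ra->length != LENGTH_EOF && ra->offset > UINT64_MAX - ra->length)
        return false;
    return true;
}
```

Carrying the recall type in the same structure as the byte range lets one callback serve file, fsid, and whole-server recalls.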

 >
 >
 > behavior and issues
 > --------------------
 >
 > pnfsd receives a LAYOUTGET operation from a client.
 > 1) calls export_operations->layout_get.
 >     - pnfs fs checks for layout conflicts
 >     - pnfs fs checks for overlap with in-progress recall.
 >       pnfsd does not do this - another MDS could have an overlapping recall
 >       in progress.
 > 2) if layout returned, pnfsd allocates an nfs4_layout struct and adds
 > it to the nfs4_file and nfs4_client hashes.
 >     - locking on the hash lists - shared between pnfsd and pnfs fs???

yup, as I suggested in the bakeathon pnfsd might want to create
a placeholder for the new layout in advance, insert it in the hash
(marked as in-progress) and pass a pointer to the placeholder to the
fs, while the hash is locked. Typically, the fs will succeed and
can fill in the new layout; otherwise, pnfsd can clean up after
the fs returns with an error.
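A minimal userspace sketch of the placeholder idea, under stated assumptions: a plain singly linked list stands in for the per-file hash, locking is reduced to comments, and the fs hook is a function pointer. The names (struct layout, LO_IN_PROGRESS, layoutget, fs_get_ok, fs_get_fail) are all hypothetical, and whether the fs is called with the lock held is left open here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum lo_state { LO_IN_PROGRESS, LO_VALID };

struct layout {
    struct layout *next;
    enum lo_state state;
    uint32_t iomode;
    uint64_t offset, length;
};

struct file_layouts {
    struct layout *head;    /* the lock protecting the list would live here */
};

typedef int (*fs_layout_get_t)(struct layout *lo);

static void unlink_layout(struct file_layouts *fl, struct layout *lo)
{
    struct layout **pp;

    for (pp = &fl->head; *pp; pp = &(*pp)->next) {
        if (*pp == lo) {
            *pp = lo->next;
            return;
        }
    }
}

/* Insert a placeholder first, then let the fs fill it in or fail. */
int layoutget(struct file_layouts *fl, fs_layout_get_t fs_get, uint32_t iomode,
              uint64_t offset, uint64_t length, struct layout **out)
{
    struct layout *lo;
    int err;

    assert(out != NULL);
    *out = NULL;
    lo = calloc(1, sizeof(*lo));
    if (lo == NULL)
        return -ENOMEM;
    lo->state = LO_IN_PROGRESS;
    lo->iomode = iomode;
    lo->offset = offset;
    lo->length = length;

    /* lock list: the placeholder is visible to recalls from here on */
    lo->next = fl->head;
    fl->head = lo;

    err = fs_get(lo);           /* fs fills in the layout or returns an error */
    if (err) {
        unlink_layout(fl, lo);  /* clean up after the fs failed */
        free(lo);
    } else {
        lo->state = LO_VALID;
        *out = lo;
    }
    /* unlock list */
    return err;
}

/* Stand-ins for the fs hook, used only to exercise both paths. */
int fs_get_ok(struct layout *lo)   { (void)lo; return 0; }
int fs_get_fail(struct layout *lo) { (void)lo; return -EIO; }
```

The point of the in-progress marker is that a recall arriving while the fs is still working can see the pending layout instead of missing it.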

 >
 > pnfsd receives a CB_LAYOUT callback from the pnfs fs
 >
 > 1) pnfsd looks up layout using iomode, offset and range in
 > nfs4_file->fi_layouts.
 >     - if exact match, remove layout

exact is too narrow, should grab everything in the recalled scope
<layout_type, layoutrecall_type, iomode, byte range>
Also, I'd consider keeping the layout on the fi_layouts list until
it's returned and just hang it also on the layout_recall_lru (with
a pointer to the layoutrecall control structure for bookkeeping)
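As an illustration of "grab everything in the recalled scope", the check below tests whether one held layout falls inside a recall's <layout_type, layoutrecall_type, iomode, byte range>. It is a userspace sketch with hypothetical types (held_layout, recall_scope) and assumed conventions: IOMODE_ANY in the recall matches any iomode, and a zero-length range overlaps nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and values, for illustration only. */
enum { RECALL_FILE = 1, RECALL_FSID = 2, RECALL_ALL = 3 };
enum { IOMODE_READ = 1, IOMODE_RW = 2, IOMODE_ANY = 3 };

struct held_layout {
    uint64_t fsid, fileid;
    uint32_t layout_type, iomode;
    uint64_t offset, length;
};

struct recall_scope {
    uint32_t recall_type;
    uint64_t fsid, fileid;
    uint32_t layout_type, iomode;
    uint64_t offset, length;
};

/* Exclusive end of a range, saturating instead of wrapping. */
static uint64_t range_end(uint64_t offset, uint64_t length)
{
    return length > UINT64_MAX - offset ? UINT64_MAX : offset + length;
}

bool ranges_overlap(uint64_t o1, uint64_t l1, uint64_t o2, uint64_t l2)
{
    if (l1 == 0 || l2 == 0)
        return false;
    return o1 < range_end(o2, l2) && o2 < range_end(o1, l1);
}

/* True if the held layout must be recalled for this scope. */
bool layout_in_recall_scope(const struct held_layout *lo,
                            const struct recall_scope *rs)
{
    assert(lo != NULL && rs != NULL);
    if (lo->layout_type != rs->layout_type)
        return false;
    if (rs->iomode != IOMODE_ANY && rs->iomode != lo->iomode)
        return false;
    switch (rs->recall_type) {
    case RECALL_ALL:
        return true;
    case RECALL_FSID:
        return lo->fsid == rs->fsid;
    case RECALL_FILE:
        return lo->fsid == rs->fsid && lo->fileid == rs->fileid &&
               ranges_overlap(lo->offset, lo->length, rs->offset, rs->length);
    default:
        return false;
    }
}
```

Walking fi_layouts with a predicate like this collects every overlapping layout rather than only an exact match.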

 >     - if byte range subset, leave non-recalled portion on list.
 >     - if no overlap - return error.

I think we should let the client decide that and return an error to
CB_LAYOUTRECALL in case the server is confused.

 > 2) place the cb_layout on the layout_recall_lru.
 > 3) bump the layout reference and call the CB_LAYOUT operation to the
 > client(s) that hold an overlapping layout.
 >     - some pnfs fs care about iomode, others don't - issue???
 >
 > pnfsd receives a LAYOUTRETURN operation from a client.
 >
 > 1) pnfsd checks the layout_recall_lru for an exactly matching layout. if found,
 > this signals a successful completion of a layout callback.
 >     - pnfsd needs to let pnfs fs know about successful callback - how???

we need to track the recall "meta-operation" and call fs asynchronous
callback function when the client calls layoutreturn with the exact scope
<layout_type, layoutrecall_type, iomode, byte range>. The layout_recall_lru
(again, why lru? :) can hang from this control structure.
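A rough userspace sketch of such a control structure follows. It assumes one control structure per recall, a count of clients that still have to answer, and a completion hook supplied by the fs; all names (recall_ctl, recall_start, recall_layoutreturn, count_done) are hypothetical. The completion hook is called synchronously here for simplicity, where a real server would presumably defer it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct recall_scope {
    uint32_t layout_type, recall_type, iomode;
    uint64_t offset, length;
};

struct recall_ctl;
typedef void (*recall_done_fn)(struct recall_ctl *rc);

/* One per recall "meta-operation"; recalled layouts would hang off it. */
struct recall_ctl {
    struct recall_scope scope;
    unsigned int outstanding;   /* clients that have not returned yet */
    recall_done_fn done;        /* fs completion hook, may be NULL */
    void *fs_private;
};

static bool scope_equal(const struct recall_scope *a,
                        const struct recall_scope *b)
{
    return a->layout_type == b->layout_type &&
           a->recall_type == b->recall_type &&
           a->iomode == b->iomode &&
           a->offset == b->offset &&
           a->length == b->length;
}

void recall_start(struct recall_ctl *rc, const struct recall_scope *scope,
                  unsigned int nclients, recall_done_fn done, void *fs_private)
{
    assert(rc != NULL && scope != NULL && nclients > 0);
    rc->scope = *scope;
    rc->outstanding = nclients;
    rc->done = done;
    rc->fs_private = fs_private;
}

/* Returns true when this layoutreturn completed the recall. */
bool recall_layoutreturn(struct recall_ctl *rc,
                         const struct recall_scope *returned)
{
    if (rc->outstanding == 0 || !scope_equal(&rc->scope, returned))
        return false;           /* not the exact recalled scope */
    if (--rc->outstanding > 0)
        return false;           /* other clients still hold layouts */
    if (rc->done)
        rc->done(rc);           /* tell the fs the recall is finished */
    return true;
}

/* Example completion hook: counts invocations through fs_private. */
void count_done(struct recall_ctl *rc)
{
    ++*(int *)rc->fs_private;
}
```

With the scope and the counter in one place, the fs learns about completion exactly once, regardless of how many clients were recalled.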

 > 2) pnfsd calls the export_operations->layout_return
 >     - does pnfsd need to pay attention to layoutreturns that match
 >     a layout on the layout_recall_lru in all aspects except matching
 >     byte-range, or just let the pnfs fs handle partial callback returns???

since we (pnfs-obj) care about partial match, the fs needs an opportunity
to check that.
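To show what a partial match leaves behind, here is a small userspace sketch that subtracts a returned byte range from a held one and reports the pieces the client still holds. The names (struct range, range_subtract) are hypothetical, and the treatment of a length of all ones as "to end of file" is an assumed convention; which side (pnfsd or the fs) would run such a check is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t offset;
    uint64_t length;    /* UINT64_MAX is assumed to mean "to end of file" */
};

/* Exclusive end of a range, saturating instead of wrapping. */
static uint64_t end_of(struct range r)
{
    return r.length > UINT64_MAX - r.offset ? UINT64_MAX : r.offset + r.length;
}

/*
 * Compute what remains of 'held' after 'ret' is returned.
 * Writes 0, 1 or 2 pieces to out[] and returns how many.
 */
int range_subtract(struct range held, struct range ret, struct range out[2])
{
    uint64_t hs = held.offset, he = end_of(held);
    uint64_t rs = ret.offset, re = end_of(ret);
    int n = 0;

    assert(held.length > 0 && ret.length > 0);
    if (rs >= he || re <= hs) {         /* no overlap: nothing was returned */
        out[n++] = held;
        return n;
    }
    if (rs > hs) {                      /* piece in front of the returned range */
        out[n].offset = hs;
        out[n].length = rs - hs;
        n++;
    }
    if (re < he) {                      /* piece behind the returned range */
        out[n].offset = re;
        out[n].length = held.length == UINT64_MAX ? UINT64_MAX : he - re;
        n++;
    }
    return n;
}
```

A return value of 0 means the layout was returned in full; 1 or 2 means a partial return that the fs may want to inspect.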

 >
 > pnfsd expires an nfs4_client due to lease expiration or admin intervention
 > 1) pnfsd needs to reap state, so for layouts, should call the
 > export_operations->layout_return for all layouts on the
 > nfs4_client->cl_layouts list.
 >     - does the pnfs fs need a flag to indicate layout reap?
 >
 > ---
 >
 >
 > diff -puN fs/nfsd/nfs4state.c~pnfsd-layout-mgnt fs/nfsd/nfs4state.c
 > --- pnfs-old/fs/nfsd/nfs4state.c~pnfsd-layout-mgnt    2006-09-26 12:10:12.000000000 -0400
 > +++ pnfs-old-andros/fs/nfsd/nfs4state.c    2006-09-26 12:28:35.000000000 -0400
 > @@ -440,6 +440,7 @@ create_client(struct xdr_netobj name, ch
 >      INIT_LIST_HEAD(&clp->cl_strhash);
 >      INIT_LIST_HEAD(&clp->cl_openowners);
 >      INIT_LIST_HEAD(&clp->cl_delegations);
 > +    INIT_LIST_HEAD(&clp->cl_layouts);
 >      INIT_LIST_HEAD(&clp->cl_lru);
 >  out:
 >      return clp;
 > @@ -1004,6 +1005,7 @@ alloc_init_file(struct inode *ino)
 >          INIT_LIST_HEAD(&fp->fi_hash);
 >          INIT_LIST_HEAD(&fp->fi_stateids);
 >          INIT_LIST_HEAD(&fp->fi_delegations);
 > +        INIT_LIST_HEAD(&fp->fi_layouts);
 >          list_add(&fp->fi_hash, &file_hashtbl[hashval]);
 >          fp->fi_inode = igrab(ino);
 >          fp->fi_id = current_fileid++;
 > @@ -1031,6 +1033,9 @@ nfsd4_free_slabs(void)
 >      nfsd4_free_slab(&file_slab);
 >      nfsd4_free_slab(&stateid_slab);
 >      nfsd4_free_slab(&deleg_slab);
 > +#if defined(CONFIG_PNFS)
 > +    nfsd4_free_slab(&pnfs_layout_slab);
 > +#endif /* CONFIG_PNFS */
 >  }
 > 
 >  static int
 > @@ -1052,6 +1057,13 @@ nfsd4_init_slabs(void)
 >              sizeof(struct nfs4_delegation), 0, 0, NULL, NULL);
 >      if (deleg_slab == NULL)
 >          goto out_nomem;
 > +#if defined(CONFIG_PNFS)
 > +    pnfs_layout_slab = kmem_cache_create("pnfs layouts",
 > +            sizeof(struct nfs4_layout), 0, 0, NULL, NULL);
 > +    if (pnfs_layout_slab == NULL)
 > +        goto out_nomem;
 > +#endif /* CONFIG_PNFS */
 > +
 >      return 0;
 >  out_nomem:
 >      nfsd4_free_slabs();
 > @@ -3348,6 +3360,102 @@ nfs4_reset_lease(time_t leasetime)
 >  #if defined(CONFIG_PNFS)
 > 
 >  /*
 > + * Layout state - NFSv4.1 pNFS
 > + */
 > +static struct list_head layout_recall_lru;
 > +static int num_layouts;
 > +
 > +static kmem_cache_t *pnfs_layout_slab = NULL;
 > +
 > +/* Called under the state lock. */
 > +static void
 > +pnfs_unhash_layout(struct nfs4_layout *lp)
 > +{
 > +    list_del_init(&lp->lo_perfile);
 > +}
 > +
 > +static void
 > +free_nfs4_layout(struct kref *kref)
 > +{
 > +    struct nfs4_layout *lp;
 > +
 > +    lp = container_of(kref, struct nfs4_layout, lo_ref);
 > +    pnfs_unhash_layout(lp);
 > +    kmem_cache_free(pnfs_layout_slab, lp);
 > +}
 > +
 > +static inline void
 > +put_nfs4_layout(struct nfs4_layout *lp)
 > +{
 > +    kref_put(&lp->lo_ref, free_nfs4_layout);
 > +}
 > +
 > +static inline void
 > +get_nfs4_layout(struct nfs4_layout *lp)
 > +{
 > +    kref_get(&lp->lo_ref);
 > +}
 > +
 > +static struct nfs4_layout *
 > +pnfs_alloc_init_layout(struct nfs4_client *clp, struct svc_fh *current_fh, struct nfsd4_pnfs_layoutget *lg)
 > +{
 > +    struct nfs4_layout *lp;
 > +    struct nfs4_file *fp = NULL;
 > +    struct nfs4_callback *cb = &clp->cl_callback;
 > +    struct inode *ino = current_fh->fh_dentry->d_inode;
 > +
 > +    dprintk("NFSD alloc_init_layout\n");
 > +    lp = kmem_cache_alloc(pnfs_layout_slab, GFP_KERNEL);
 > +    if (lp == NULL)
 > +        return ERR_PTR(-ENOMEM);
 > +
 > +    kref_init(&lp->lo_ref);
 > +    INIT_LIST_HEAD(&lp->lo_perfile);
 > +    INIT_LIST_HEAD(&lp->lo_perclnt);
 > +    INIT_LIST_HEAD(&lp->lo_recall_lru);
 > +
 > +    fp = find_file(ino);
 > +    if (!fp) {
 > +        fp = alloc_init_file(ino);
 > +        if (fp == NULL)
 > +            goto out_free;
 > +    } else
 > +        get_nfs4_file(fp);
 > +
 > +    lp->lo_file = fp;
 > +    lp->lo_time = 0;
 > +    memset(&lp->lo_cb_layout, 0, sizeof(struct nfs4_cb_layout));
 > +    lp->lo_client = clp;
 > +    lp->lo_sb = ino->i_sb;
 > +    lp->lo_ident = cb->cb_ident;
 > +    lp->lo_fhlen = current_fh->fh_handle.fh_size;
 > +    memcpy(lp->lo_fhval, &current_fh->fh_handle.fh_base,
 > +                current_fh->fh_handle.fh_size);
 > +    lp->lo_layout_type = lg->lg_type;
 > +    lp->lo_iomode = lg->lg_iomode;
 > +    lp->lo_offset = lg->lg_offset;
 > +    lp->lo_length = lg->lg_length;
 > +    num_layouts++;
 > +    return lp;
 > +
 > +out_free:
 > +    put_nfs4_layout(lp);
 > +    return ERR_PTR(-ENOMEM);
 > +}
 > +
 > +static void
 > +pnfs_hash_layoutget(struct nfs4_layout *lp)
 > +{
 > +    list_add(&lp->lo_perfile, &lp->lo_file->fi_layouts);
 > +    list_add(&lp->lo_perclnt, &lp->lo_client->cl_layouts);
 > +}
 > +
 > +static void
 > +pnfs_hash_cb_layout(struct nfs4_layout *lp)
 > +{
 > +    list_add(&lp->lo_recall_lru, &layout_recall_lru);
 > +}
 > +/*
 >   * get_state() and cb_get_state() are
 >   */
 >  static void
 > diff -puN include/linux/nfsd/state.h~pnfsd-layout-mgnt include/linux/nfsd/state.h
 > --- pnfs-old/include/linux/nfsd/state.h~pnfsd-layout-mgnt    2006-09-26 12:10:12.000000000 -0400
 > +++ pnfs-old-andros/include/linux/nfsd/state.h    2006-09-26 12:23:29.000000000 -0400
 > @@ -123,6 +123,7 @@ struct nfs4_client {
 >      struct list_head    cl_strhash;     /* hash by cl_name */
 >      struct list_head    cl_openowners;
 >      struct list_head    cl_delegations;
 > +    struct list_head    cl_layouts;
 >      struct list_head        cl_lru;         /* tail queue */
 >      struct xdr_netobj    cl_name;     /* id generated by client */
 >      char                    cl_recdir[HEXDIR_LEN]; /* recovery dir */
 > @@ -158,6 +159,26 @@ struct nfs4_cb_layout {
 >      u32            cbl_fhval[NFS4_FHSIZE];
 >  };
 > 
 > +struct nfs4_layout {
 > +    struct kref        lo_ref;
 > +    struct list_head    lo_perfile; /* hash by f_id */
 > +    struct list_head    lo_perclnt; /* hash by clientid */
 > +    struct list_head    lo_recall_lru; /* when in recall */
 > +    struct nfs4_file        *lo_file; /* backpointer */
 > +    time_t                  lo_time; /* time recall started */
 > +    struct nfs4_cb_layout    lo_cb_layout;
 > +};
 > +
 > +#define lo_client       lo_cb_layout.cbl_client
 > +#define lo_sb           lo_cb_layout.cbl_so
 > +#define lo_ident        lo_cb_layout.cbl_ident
 > +#define lo_fhlen        lo_cb_layout.cbl_fhlen
 > +#define lo_fhval        lo_cb_layout.cbl_fhval
 > +#define lo_layout_type  lo_cb_layout.cbl_layout_type
 > +#define lo_iomode       lo_cb_layout.cbl_iomode
 > +#define lo_offset       lo_cb_layout.cbl_offset
 > +#define lo_length       lo_cb_layout.cbl_length
 > +
 >  /* struct nfs4_client_reset
 >   * one per old client. Populates reset_str_hashtbl. Filled from conf_id_hashtbl
 >   * upon lease reset, or from upcall to state_daemon (to read in state
 > @@ -245,6 +266,7 @@ struct nfs4_file {
 >      struct list_head        fi_hash;    /* hash by "struct inode *" */
 >      struct list_head        fi_stateids;
 >      struct list_head    fi_delegations;
 > +    struct list_head    fi_layouts;
 >      struct inode        *fi_inode;
 >      u32                     fi_id;      /* used with stateowner->so_id
 >                           * for stateid_hashtbl hash */
 > _
 >


William A.(Andy) Adamson wrote:
> marc, benny, and i met at the bakeathon to design the layout recall/return
> management architecture for the pnfsd used by the GPFS and Panasas pnfs file
> systems.
> here is the beginning framework which i'm passing onto benny and marc. i'm 
> posting this to the list because it will eventually be added to the CVS tree, 
> and for general interest.
>
> -->Andy