[pnfs] nfsd layout cache design for layoutrecall
Benny Halevy
bhalevy at panasas.com
Fri May 4 03:04:54 EDT 2007
J. Bruce Fields wrote:
> Random remarks:
>
> On Thu, May 03, 2007 at 11:22:59PM +0300, Benny Halevy wrote:
>>> * changed cl_count in nfs4_client into a kref.
>>> nfs4_{get,put}_client take and release a reference on the
>>> client structure under the state lock.
>>>
>>> * layoutrecall manipulation is also done under the state lock,
>>> not under the recall lock. Special care taken when sending
>>> the recalls to avoid distributed deadlock - client sending
>>> layoutreturns while serving a cb_layoutrecall, described
>>> in more details below...
>
> We've never been totally happy with the way everything in nfsd4 is done
> under the big state lock. Not that it's up to you to fix that, but it's
> just something to keep in mind....
My impression is that the state_lock mutex is an overkill in many cases
that you'd be perfectly covered by a lightweight spinlock.
In the layout cache case there's no sense in using a separate
recall lock since you want both the client and file layouts
and layoutrecalls lists locked at the same time.
With a reference count on the client and file we may want to
consider taking the state_lock and when creating and destroying
nfs4_{client,file} as well hashing them in the globally, while
protecting pnfs state with a separate spinlock.
However, this will require taking twice as many locks in the path.
I'm worried less about contention when not blocking under the state_lock
but rather about the bus locking operations overhead.
>
>>> * struct nfs4_file:
>>> added fsid and filehandle to the structure.
>>> these are accessed in the get, return, and recall paths
>>> through a pointer to the nfs4_file structure
>>> rather than copying the data into each nfs4_layout and nfs4_layoutrecall
>>> instance.
>>>
>>> +#ifdef CONFIG_PNFS
>>> + /* used by layoutget / layoutrecall */
>>> + struct nfs4_fsid fi_fsid;
>>> + u32 fi_fhlen;
>>> + u32 fi_fhval[NFS4_FHSIZE];
>
> NFS4_FHSIZE is in bytes, isn't it? (Oops, I think we made that mistake
> in the v4.0 code too.)
Yeah, copy/paste :-/
would you rather define it as u8 fh_fhval[NFS4_FHSIZE] or
u32 fh_fhval[NFS4_FHSIZE/sizeof(u32)]? (same for cbr_fhval)
Thanks for catching this as I realized I didn't initialize these
fields when allocating nfs4_file.
+find_alloc_file(struct inode *ino)
+find_alloc_file(struct inode *ino, struct svc_fh *current_fh)
-alloc_init_file(struct inode *ino)
+alloc_init_file(struct inode *ino, struct svc_fh *current_fh)
@@ -1397,6 +1416,14 @@ alloc_init_file(struct inode *ino)
+#ifdef CONFIG_PNFS
+ fp->fi_fsid.major = current_fh->fh_export->ex_fsid;
+ fp->fi_fsid.minor = 0;
+ fp->fi_fhlen = current_fh->fh_handle.fh_size;
+ BUG_ON(fp->fi_fhlen > sizeof(fp->fi_fhval));
+ memcpy(fp->fi_fhval, ¤t_fh->fh_handle.fh_base,
+ fp->fi_fhlen);
+#endif /* CONFIG_PNFS */
>
>>> * nfs4_layoutrecall
>>> tracks outstanding layoutrecalls
>>> listed only per client
>>> but keeps a reference both on the respective client and file
>>> (if applicable, can be NULL in the RECALL_ALL case)
>>> clr_flags field used for marking the layoutrecall done and deciding
>>> who's freeing it.
>
> Is there a chance it would be possible to use reference counting for that?
Sure, refcounting does seem simpler. BTW, it's not strictly needed, as
the flags are manipulated under the state_lock.
More information about the pNFS
mailing list