[RFC,PATCH 0/4] Dynamic Pseudo Root
Neil Brown
neilb at suse.de
Mon Dec 10 17:34:05 EST 2007
Being the one who originally pushed for this problem to be left to
user space, maybe I should provide some background to my thinking.
Suppose we were to take the Solaris approach where non-exported
filesystems are mostly invisible to NFS, but just those parts needed
to walk the exported filesystems are visible.
LOOKUP/READDIR would need to be able to inspect each dentry that was
returned to place it in one of three classes:
- below a mountpoint in this filesystem
- above a mountpoint in this filesystem
- neither of the above.
In the first case, it is fully visible. In the second it is partially
visible (you would probably set the owner/mode to root/755 or
similar). In the third it is invisible.
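To make the three classes concrete, here is a minimal sketch that works on plain path strings rather than dentries. That simplification, the use of export points as the reference paths, and the names `Visibility` and `classify` are assumptions made for illustration, not anything nfsd provides.

```python
from enum import Enum
from pathlib import PurePosixPath


class Visibility(Enum):
    FULL = "full"            # case 1: at or below an export point
    PARTIAL = "partial"      # case 2: on the way down to an export point
    INVISIBLE = "invisible"  # case 3: neither


def classify(path, export_points):
    """Place a path in one of the three classes (illustrative only)."""
    p = PurePosixPath(path)
    exports = [PurePosixPath(e) for e in export_points]
    # Case 1: the path is an export point or lies beneath one.
    if any(p == e or e in p.parents for e in exports):
        return Visibility.FULL
    # Case 2: the path is a strict ancestor of some export point.
    if any(p in e.parents for e in exports):
        return Visibility.PARTIAL
    # Case 3: unrelated to every export point.
    return Visibility.INVISIBLE
```

With `/home/neilb` as the only export point, `/home` comes out partially visible and `/etc` invisible.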
Detecting the first case is fairly simple: it is largely inherited
from the parent, and where it isn't, it is simply resolved by a
lookup in the export table.
Differentiating between the second and third is a little more awkward.
The most straightforward approach would be to add a flag to the
dentry. When you export a filesystem you could set this flag on all
dentries on the path leading to the export point. Then simply test it
to decide between cases 2 and 3.
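As a rough sketch of the flag idea, using a toy `Dentry` class in place of the kernel's dentry: the class, the flag name `on_export_path`, and the choice to flag only the ancestors (not the export point itself, which is already case 1) are all assumptions for illustration.

```python
class Dentry:
    """Toy stand-in for a dentry: a name, a parent link, and one flag."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.on_export_path = False  # hypothetical flag, not a real dcache flag


def mark_export(export_root):
    """Set the flag on every dentry on the path leading to the export point."""
    d = export_root.parent
    while d is not None:
        d.on_export_path = True
        d = d.parent


def is_partially_visible(dentry):
    """Decide between case 2 (True) and case 3 (False) by testing the flag."""
    return dentry.on_export_path
```

The test at LOOKUP/READDIR time is then a single flag check; all the cost sits in keeping the flags correct.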
I didn't pursue this option for a number of reasons:
- adding a dentry flag is not something to be done lightly, and nfsd
is in many ways a niche user of the dcache. I expected a bit of
push-back and didn't think I would be able to push coherently for
it, partly due to the next point.
- A directory rename above the export point would lead to some of
these flags being wrong, so we would need a mechanism to repair
this. Any such mechanism would probably be racy. Creating an
in-kernel interface that is not reliable and predictable is not a
good idea.
- Every export point would need to be given to the kernel before it
could make this three-way decision accurately and my focus at the
time was to just keep a cache of useful information in the kernel
rather than all information. Possibly the focus on just-caching
was excessive - I'm not sure. So this point is offered as an
explanation rather than a justification.
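The rename problem in the second point can be shown with a small toy model. This is only a sketch: `Dentry`, `flag_path`, `rename` and `stale` are hypothetical names, and a real dcache is far more involved.

```python
class Dentry:
    """Toy dentry with a parent link and a 'leads to an export' flag."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.flagged = False


def ancestors(dentry):
    """Yield every dentry strictly above the given one."""
    d = dentry.parent
    while d is not None:
        yield d
        d = d.parent


def flag_path(export_root):
    """Flag all dentries on the path leading to the export point."""
    for d in ancestors(export_root):
        d.flagged = True


def rename(dentry, new_parent):
    """Move a directory; note that no flags are repaired here."""
    dentry.parent = new_parent


def stale(dentries, export_roots):
    """Return the dentries whose flag no longer matches the tree."""
    want = {d for e in export_roots for d in ancestors(e)}
    return [d for d in dentries if d.flagged != (d in want)]
```

If `/a/sub/exp` is exported and `sub` is then renamed to `/b/sub`, `a` keeps a flag it should no longer have and `b` lacks one it now needs. Any repair step would have to run between the rename and the next lookup, which is where the race comes from.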
The alternate approach (which I think someone did suggest) was to
build a parallel data structure which described just that part of the
filesystem that we wanted to export and have LOOKUP/READDIR check each
dentry against this data structure.
My thought was that if we were to do this, then why not use existing
technology and build that data structure in a tmpfs filesystem (if you
want something that looks like a filesystem, it is often best to
make it actually be a filesystem).
Given that design direction it was obvious that the filesystem could
be built from userspace, so that is the direction I recommended.
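A minimal sketch of what building that skeleton from user space could look like. It only creates directories under a scratch directory; mounting a tmpfs there and attaching the real filesystems at the leaves are left out, and the function name `build_skeleton` is hypothetical.

```python
import os
from pathlib import PurePosixPath


def build_skeleton(pseudo_root, export_points):
    """Create, under pseudo_root, only the directories needed to reach
    each export point. Everything else on the server stays absent."""
    leaves = []
    for export in export_points:
        # Strip the leading "/" so the path can be re-rooted.
        rel = PurePosixPath(export).relative_to("/")
        target = os.path.join(pseudo_root, *rel.parts)
        os.makedirs(target, exist_ok=True)
        leaves.append(target)
    return leaves
```

For exports `/home/neilb` and `/srv/data`, the pseudo root ends up containing just `home/neilb` and `srv/data`, so a client walking it sees nothing else.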
On the topic of handling thousands of exports, I'm not completely
sure I understand the details of the ZFS situation, but I assume that
each home directory appears like a separate filesystem with its own
mountpoint so you get thousands of mountpoints and hence thousands of
export points.
I see a real possibility for an upcall storm when someone does
ls -l /home
or similar.
There seem to be two interesting cases.
In the first case all the mountpoints have the same export options
(same hosts, same flags, same security options). In that case we
could arrange for the kernel to auto-inherit the export options,
though I would rather avoid that if possible.
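A toy model of the upcall count, to show the scale of the problem and what auto-inheriting would buy. It assumes one upcall per uncached mountpoint and a placeholder options string; `ExportCache` and `ls_l` are hypothetical names, not the kernel's export cache.

```python
class ExportCache:
    """Toy export cache that counts upcalls to user space."""

    def __init__(self, inherit=False):
        self.inherit = inherit
        self.cache = {}
        self.upcalls = 0

    def _upcall(self, path):
        # Stand-in for asking user space about this path.
        self.upcalls += 1
        return "placeholder-options"

    def options_for(self, path, parent=None):
        if path in self.cache:
            return self.cache[path]
        if self.inherit and parent in self.cache:
            # Auto-inherit: reuse the parent's options, no upcall.
            opts = self.cache[parent]
        else:
            opts = self._upcall(path)
        self.cache[path] = opts
        return opts


def ls_l(cache, parent, children):
    """Model 'ls -l parent': touch the parent, then every child."""
    cache.options_for(parent)
    for child in children:
        cache.options_for(f"{parent}/{child}", parent=parent)
    return cache.upcalls
```

With 1000 home directories, the first listing costs 1001 upcalls without inheritance and 1 with it; a repeat listing is served from the cache either way.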
In the second case, each home is exported with different options to
different hosts, e.g. my home is only exported to my host, yours only
to yours.
This would require sending lots of detailed information to the kernel
to be able to handle that "ls -l /home" and seems like a pain no
matter how you approach it.
It reminds me a lot of the auto-mounter's dilemma. When someone does
"ls -l" they probably only want to know the names and whether they are
directories or not - information that the automounter can return
without actually mounting anything. But you don't actually know what
is wanted, so you have to return everything - or fake it and risk the
repercussions.
Our case is similar. Do I really need full export options just to
return a GETATTR of the top level directory?
It might be nice to allow a GETATTR of the top level directory of any
filesystem reached by crossing a 'crossmnt' to be answered using the
same options as the parent, but to require full option checking for
any other operation on the root. However, I don't know if NFSv4
allows us to return NFSERR_WRONGSEC for some operations but not
others.
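The rule being floated could be written down as a small decision function. This is only a sketch of the idea, with a hypothetical name and string labels; whether the protocol permits such per-operation treatment is exactly the open question.

```python
def option_source(op, is_fs_root, via_crossmnt):
    """Say which export options should govern a request.

    Returns "parent" when the parent's options are good enough,
    and "full" when the filesystem's own options must be checked.
    """
    # Only a GETATTR on the top-level directory of a filesystem
    # reached by crossing a 'crossmnt' gets the cheap answer.
    if op == "GETATTR" and is_fs_root and via_crossmnt:
        return "parent"
    return "full"
```

Under this rule the "ls -l /home" case needs no per-home export lookup, while a LOOKUP or READDIR inside any one home still does.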
But maybe I've misunderstood the ZFS case -- requiring each
home directory to be a separate mountpoint does seem a little harsh. I
see lots of value in keeping home directories conceptually very
separate (no hard links, independent available space) but that
shouldn't require separate mountpoints or separate export points.
NeilBrown