[RFC,PATCH 0/4] Dynamic Pseudo Root

Brent Callaghan brentc at apple.com
Mon Dec 10 18:14:22 EST 2007


Hi Neil,

Forgive my lack of familiarity with the Linux implementation, but
I can add some further info on the Solaris in-kernel pseudo filesystem.

When a Solaris filesystem is exported, the kernel does a little walk up
the file tree to the root, to make sure that all the filesystems along
the path are exported.  For any that are not, a special "transit only"
export is done.  It's a real export, but it has a bit set to indicate
that it's special.
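
A minimal sketch of that walk, in Python rather than kernel C and purely
for illustration.  The mount table, the exports dictionary, and the names
owning_mount and export_with_transit are all hypothetical; none of them
are Solaris or Linux interfaces, and the sketch assumes "/" is always in
the mount table.

```python
import posixpath

def owning_mount(path, mounts):
    """Return the mountpoint of the filesystem containing path
    (longest matching prefix; assumes "/" is in mounts)."""
    best = "/"
    for m in mounts:
        if (path == m or path.startswith(m + "/")) and len(m) > len(best):
            best = m
    return best

def export_with_transit(path, mounts, exports):
    """Export the filesystem containing path, then walk up to the root
    and add a "transit only" export for every filesystem on the way
    that is not already exported.  Returns the transit exports added."""
    fs = owning_mount(path, mounts)
    exports[fs] = {"transit_only": False}   # the real export
    added = []
    while fs != "/":
        # Step to the filesystem that holds this mountpoint's parent.
        fs = owning_mount(posixpath.dirname(fs), mounts)
        if fs not in exports:
            exports[fs] = {"transit_only": True}   # the "special" bit
            added.append(fs)
    return added

# Hypothetical layout: four nested filesystems, nothing exported yet.
mounts = {"/", "/export", "/export/home", "/export/home/alice"}
exports = {}
added = export_with_transit("/export/home/alice", mounts, exports)
# added now lists the ancestors that received transit-only exports.
```

Existing real exports are left untouched by the walk, so exporting a
second home directory later would add no further transit exports.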

The "transit only" exports each have a list of visible inodes.  So a
LOOKUP will return an inode only if it's in the visible list.  Same with
READDIR.  Typically, the visible list isn't very big - probably not much
longer than the number of filesystems exported.  If you add or delete
exports, you may need to add/delete inodes in the visible list.

So I don't think you really need to mess with the dentry structure.

Also, the list is just inode numbers.  Rename has no effect.
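
To make the visible-list idea concrete, here is a rough sketch in Python.
It models a directory as a plain name-to-inode-number mapping, which is an
assumption made for illustration; the TransitExport class and its methods
are hypothetical and not taken from the Solaris source.

```python
class TransitExport:
    """Toy model of a "transit only" export: a set of visible inode numbers."""

    def __init__(self, visible=()):
        self.visible = set(visible)

    def lookup(self, directory, name):
        """Return the inode number for name, or None if the name is
        absent or its inode is not in the visible list."""
        ino = directory.get(name)
        return ino if ino in self.visible else None

    def readdir(self, directory):
        """Return only the names whose inodes are in the visible list."""
        return sorted(n for n, ino in directory.items() if ino in self.visible)

# Hypothetical root directory: only inode 12 leads towards a real export.
root = {"export": 12, "etc": 13, "var": 14}
transit = TransitExport(visible={12})

before = transit.lookup(root, "export")   # visible
hidden = transit.lookup(root, "etc")      # filtered out
listing = transit.readdir(root)

# Rename the directory: the name changes, the inode number does not,
# so the visible list needs no repair.
root["exports"] = root.pop("export")
after = transit.lookup(root, "exports")
```

Because the check is made against the inode number and not the name or
the dentry, the rename in the last step leaves the result unchanged.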

I think you could get this to work with the "on demand" exports
in Linux; you just need to make sure that you do the file tree
walk to check for reachability and add any "transit" exports if
needed, or adjust their visible inode lists.

I can't offer any useful suggestions as to scaling up to hundreds
or thousands of ZFS exports.  Though I think the numbers need to
get pretty big before current export schemes are challenged.
Note that for ZFS you also need to worry about snapshots/clones.
Each of these is a new filesystem that must be exported separately
and each home directory may have a good number of snapshots.
Hopefully, you're not tempted to advocate "nohide" for these :-)

	Brent


On Dec 10, 2007, at 2:34 PM, Neil Brown wrote:

>
> Being the one who originally pushed for this problem to be left to
> user space, maybe I should provide some background to my thinking.
>
>
> Suppose we were to take the Solaris approach where non-exported
> filesystems are mostly invisible to NFS, but just those parts needed
> to walk the exported filesystems are visible.
>
> LOOKUP/READDIR would need to be able to inspect each dentry that was
> returned to place it in one of three classes:
>   - below a mountpoint in this filesystem
>   - above a mountpoint in this filesystem
>   - neither of the above.
>
> In the first case, it is fully visible.  In the second it is partially
> visible (You would probably set the owner/mode to root/755 or
> similar).  In the third it is invisible.
>
> Detecting the first case is fairly simple: it is largely inherited
> from the parent, and in the cases where it isn't, it is simply resolved
> from a lookup in the export table.
>
> Differentiating between the second and third is a little more awkward.
> The most straightforward approach would be to add a flag to the
> dentry.  When you export a filesystem you could set this flag on all
> dentries on the path leading to the export point.  Then simply test it
> to decide between cases 2 and 3.
>
> I didn't pursue this option for a number of reasons:
>
> - adding a dentry flag is not something to be done lightly, and nfsd
>   is in many ways a niche user of the dcache.  I expected a bit of
>   push-back and didn't think I would be able to push coherently for
>   it, partly due to the next point.
> - A directory rename above the export point would lead to some of
>   these flags being wrong, so we would need a mechanism to repair
>   this.  Any such mechanism would probably be racy.  Creating an
>   in-kernel interface that is not reliable and predictable is not a
>   good idea.
> - Every export point would need to be given to the kernel before it
>   could make this three-way decision accurately and my focus at the
>   time was to just keep a cache of useful information in the kernel
>   rather than all information.  Possibly the focus on just-caching
>   was excessive -  I'm not sure.  So this point is offered as an
>   explanation rather than a justification.
>
> The alternate approach (which I think someone did suggest) was to
> build a parallel data structure which described just that part of the
> filesystem that we wanted to export and have LOOKUP/READDIR check each
> dentry against this data structure.
> My thought was that if we were to do this, then why not use existing
> technology and build that data structure in a tmpfs filesystem (if you
> want something that looks like a filesystem - then it is often best to
> make it actually be a filesystem).
>
> Given that design direction it was obvious that the filesystem could
> be built from userspace, so that is the direction I recommended.
>
>
> On the topic of handling thousands of exports, I'm not completely
> sure I understand the details of the ZFS situation, but I assume that
> each home directory appears like a separate filesystem with its own
> mountpoint so you get thousands of mountpoints and hence thousands of
> export points.
>
> I see a real possibility for an upcall storm when someone does
>   ls -l /home
> or similar.
>
> There seem to be two interesting cases.
> In the first case all the mountpoints have the same export options
> (same hosts, same flags, same security options).  In that case we
> could arrange for the kernel to auto-inherit the export options,
> though I would rather avoid that if possible.
>
> In the second case, different homes are exported to
> different hosts, e.g. my home is only exported to my host, yours only
> to yours.
> This would require sending lots of detailed information to the kernel
> to be able to handle that "ls -l /home" and seems like a pain no
> matter how you approach it.
>
> It reminds me a lot of the auto-mounter's dilemma.  When someone does
> "ls -l" they probably only want to know the names and whether they are
> directories or not - information that the automounter can return
> without actually mounting anything.  But you don't actually know what
> is wanted, so you have to return everything - or fake it and risk the
> repercussions.
> Our case is similar.  Do I really need full export options just to
> return a GETATTR of the top level directory?
>
> It might be nice to allow a GETATTR of the top level directory of any
> filesystem reached by crossing a 'crossmnt' to be answered using the
> same options as the parent, but to require full option checking for
> any other operation on the ROOT.... However I don't know if NFSv4
> allows us to return NFSERR_WRONGSEC for some operations but not
> others...
>
>
> But maybe I've misunderstood the ZFS case -- requiring each
> home directory to be a separate mountpoint does seem a little harsh.
> I see lots of value in keeping home directories conceptually very
> separate (no hard links, independent available space) but that
> shouldn't require separate mountpoints or separate export points.
>
>
> NeilBrown
>
> _______________________________________________
> NFSv4 mailing list
> NFSv4 at linux-nfs.org
> http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4


