[pnfs] [mfasheh at suse.com: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2]

Glasgow_Jason at emc.com Glasgow_Jason at emc.com
Fri Jun 27 16:57:04 EDT 2008


Yes, I agree with Benny that this looks particularly interesting for
block based user-space pNFS servers.

-Jason


> -----Original Message-----
> From: pnfs-bounces at linux-nfs.org [mailto:pnfs-bounces at linux-nfs.org] On
> Behalf Of Benny Halevy
> Sent: Friday, June 27, 2008 3:31 AM
> To: J. Bruce Fields
> Cc: pnfs at linux-nfs.org; Pete Wyckoff
> Subject: Re: [pnfs] [mfasheh at suse.com: [PATCH 0/4] Fiemap, an extent
> mapping ioctl - round 2]
> 
> On Jun. 26, 2008, 21:13 +0300, "J. Bruce Fields" <bfields at fieldses.org>
> wrote:
> > Has anyone been paying attention to this?  (Would it be any use to a
> > pnfs server implementation?)
> 
> Sounds interesting.
> With additional synchronization primitives that would allow a pnfs
> back end to be notified when the extent map changes I think it could
> be useful for block-based back-ends.
> 
> For files and objects, one might imagine using it to pass the layout
> from the file system to the (s)pnfs server by describing it as the
> extent map.
> 
> Benny
> 
> P.S. It's also instrumental for implementing the OSD READ MAP command
> for usermode resident OSD targets (though that's a tangential topic,
> unrelated to pnfs...)
> 
> 
> >
> > http://marc.info/?t=121443243900005&r=1&w=2
> >
> > --b.
> >
> > ----- Forwarded message from Mark Fasheh <mfasheh at suse.com> -----
> >
> > Date: Wed, 25 Jun 2008 15:18:35 -0700
> > From: Mark Fasheh <mfasheh at suse.com>
> > To: linux-fsdevel at vger.kernel.org
> > Cc: Andreas Dilger <adilger at shaw.ca>, Kalpak Shah <Kalpak.Shah at sun.com>,
> > 	Eric Sandeen <sandeen at redhat.com>, Josef Bacik <jbacik at redhat.com>
> > Subject: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2
> >
> > Hello,
> >
> > 	The following patches are the latest attempt at implementing a
> > fiemap ioctl, which can be used by userspace software to get extent
> > information for an inode in an efficient manner.
> >
> > 	These patches are against 2.6.26-rc3, though they probably apply
> > fine against Linus' latest tree. The fs patches are much more complete
> > this time around, and the vfs patch has been trimmed down.
> >
> > 	An updated version of my ioctl wrapper test program is available at:
> >
> > http://www.kernel.org/pub/linux/kernel/people/mfasheh/fiemap/tests/
> >
> > 	A couple of notes regarding the VFS patch:
> >
> > 	Firstly, most behavior-changing fm_flags have been removed. We're
> > left with SYNC and XATTR now. This is a very good thing because frankly,
> > I think fiemap should be targeted as a straightforward and relatively
> > uncomplicated API for exposing extents as they appear on disk. Think
> > "one notch above extent-based FIBMAP replacement". There's a flip side
> > to this - 'complicated' file systems should be free to implement their
> > own complementary ioctls where there is a unique need that FIEMAP does
> > not address. Things like non-trivial device mappings and encryption
> > specifics (beyond 'this extent is encrypted') don't belong here.
> >
> > 	The passing back of unknown flags still exists - it was part of the
> > original patches I took over. I'm not quite sure I'm happy about this
> > feature. On one hand, it's nice to cleanly manage minor api revisions,
> > but on the other hand I wonder if we're actually going to use this
> > functionality. The good news is that any app which simply doesn't care
> > about micro-managing the set of passed flags can ignore the whole thing
> > and simply just key on the error code.
> >
> >
> > Full changelog from my last posting:
> >
> > * Removed FIEMAP_FLAG_HSM_READ and FIEMAP_FLAG_LUN_ORDER.
> >
> > * FIEMAP_FLAG_NUM_EXTENTS also no longer exists. Instead, users simply
> >   pass an fm_extent_count of zero. This now functions much like getxattr.
> >
> > * Updated doc describing the interface changes.
> >
> > * Added fiemap_check_flags() which checks against the current set of
> >   *understood* flags. File systems have been updated to use this.
> >
> > * Added key for use with block based file systems, FIEMAP_EXTENT_MERGED.
> >
> > * Whether FIEMAP_EXTENT_SECONDARY sets FIEMAP_EXTENT_UNKNOWN is now up
> >   to the file system.
> >
> > * Added myself to authors line in fiemap.h.
> >
> > * Ext4 and generic block based implementations are both updated. Thanks
> >   to Eric and Josef for doing this.
> >
> >
> > Below this I will include the contents of fiemap.txt to make it
> > convenient for folks to get details on the API.
> >
> > In the meantime, let the circus begin...
> > 	--Mark
> >
> >
> > ============
> > Fiemap Ioctl
> > ============
> >
> > The fiemap ioctl is an efficient method for userspace to get file
> > extent mappings. Instead of block-by-block mapping (such as bmap),
> > fiemap returns a list of extents.
> >
> >
> > Request Basics
> > --------------
> >
> > A fiemap request is encoded within struct fiemap:
> >
> > struct fiemap {
> > 	__u64	fm_start;	 /* logical offset (inclusive) at
> > 				  * which to start mapping (in) */
> > 	__u64	fm_length;	 /* logical length of mapping which
> > 				  * userspace cares about (in) */
> > 	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request
> > 				  * (in/out) */
> > 	__u32	fm_mapped_extents; /* number of extents that were
> > 				    * mapped (out) */
> > 	__u32	fm_extent_count; /* size of fm_extents array (in) */
> > 	__u32	fm_reserved;
> > 	struct fiemap_extent fm_extents[0]; /* array of mapped extents
> > 					     * (out) */
> > };
> >
> >
> > fm_start and fm_length specify the logical range within the file
> > which the process would like mappings for. Extents returned mirror
> > those on disk - that is, the logical offset of the 1st returned extent
> > may start before fm_start, and the range covered by the last returned
> > extent may end after fm_start + fm_length. All offsets and lengths are
> > in bytes.
> >
> > Certain flags to modify the way in which mappings are looked up can be
> > set in fm_flags. If the kernel doesn't understand some particular
> > flags, it will return EBADR and the contents of fm_flags will contain
> > the set of flags which caused the error. If the kernel is compatible
> > with all flags passed, the contents of fm_flags will be unmodified.
> > It is up to userspace to determine whether rejection of a particular
> > flag is fatal to its operation. This scheme is intended to allow the
> > fiemap interface to grow in the future but without losing
> > compatibility with old software.
> >
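As an illustration of the flag-rejection scheme described above, here is a minimal userspace sketch of how a caller might react to EBADR. The helper name demo_retry_flags() and the notion of a caller-chosen "required" mask are assumptions made for this example; neither appears in the posted patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide how to proceed after a fiemap call failed with EBADR.
 *
 * requested: flags originally passed in fm_flags
 * rejected:  contents of fm_flags after the failed call
 * required:  flags the caller cannot do without (caller policy)
 *
 * Returns 0 and stores the flags to retry with in *retry, or -1 when
 * a required flag was rejected. Hypothetical helper, not part of the
 * posted patches.
 */
static int demo_retry_flags(uint32_t requested, uint32_t rejected,
                            uint32_t required, uint32_t *retry)
{
	assert(retry != NULL);

	if (rejected & required)
		return -1;		/* rejection is fatal for this caller */

	*retry = requested & ~rejected;	/* drop only the rejected flags */
	return 0;
}
```

A caller that treats every flag as optional would pass required = 0 and simply reissue the request with the returned mask.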
> > fm_extent_count specifies the number of elements in the fm_extents[]
> > array that can be used to return extents.  If fm_extent_count is zero,
> > then the fm_extents[] array is ignored (no extents will be returned),
> > and the fm_mapped_extents count will hold the number of extents needed
> > in fm_extents[] to hold the file's current mapping.  Note that there is
> > nothing to prevent the file from changing between calls to FIEMAP.
> >
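The count-then-allocate pattern described above could be sketched roughly as follows. The demo_* structures are local mirrors of the layouts quoted in this posting (an assumption for the sake of a self-contained example; real code would use the kernel header), and demo_fiemap_size() and demo_fiemap_alloc() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirrors of the structures quoted in the posting. */
struct demo_fiemap_extent {
	uint64_t fe_logical, fe_physical, fe_length;
	uint32_t fe_flags, fe_device;
};

struct demo_fiemap {
	uint64_t fm_start, fm_length;
	uint32_t fm_flags, fm_mapped_extents, fm_extent_count, fm_reserved;
	struct demo_fiemap_extent fm_extents[];	/* flexible array member */
};

/* Bytes needed for a request that can hold nr_extents extents. */
static size_t demo_fiemap_size(uint32_t nr_extents)
{
	return sizeof(struct demo_fiemap) +
	       (size_t)nr_extents * sizeof(struct demo_fiemap_extent);
}

/* Allocate a zeroed request for the byte range start..start+length. */
static struct demo_fiemap *demo_fiemap_alloc(uint64_t start, uint64_t length,
                                             uint32_t nr_extents)
{
	struct demo_fiemap *fm = calloc(1, demo_fiemap_size(nr_extents));

	if (fm == NULL)
		return NULL;
	fm->fm_start = start;
	fm->fm_length = length;
	fm->fm_extent_count = nr_extents;
	return fm;
}
```

The idea would be to issue a first request built with nr_extents = 0, read fm_mapped_extents, and then allocate a second request with that count. Since the file can change between the two calls, the second call may still come back full, so the caller should be prepared to loop.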
> > Currently, there are two flags which can be set in fm_flags:
> >
> > * FIEMAP_FLAG_SYNC
> > If this flag is set, the kernel will sync the file before mapping
> > extents.
> >
> > * FIEMAP_FLAG_XATTR
> > If this flag is set, the extents returned will describe the inode's
> > extended attribute lookup tree, instead of its data tree.
> >
> >
> > Extent Mapping
> > --------------
> >
> > Extent information is returned within the embedded fm_extents array
> > which userspace must allocate along with the fiemap structure. The
> > number of elements in the fm_extents[] array should be passed via
> > fm_extent_count. The number of extents mapped by the kernel will be
> > returned via fm_mapped_extents. If the number of fiemap_extent
> > structures allocated is less than would be required to map the
> > requested range, the maximum number of extents that can be mapped in
> > the fm_extents[] array will be returned and fm_mapped_extents will be
> > equal to fm_extent_count. In that case, the last extent in the array
> > will not complete the requested range and will not have the
> > FIEMAP_EXTENT_LAST flag set (see the next section on extent flags).
> >
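To make the "array too small" case concrete, here is a rough sketch of how userspace might decide whether another call is needed and where it should start. The bit value chosen for DEMO_EXTENT_LAST is an assumption (the posting names the flag but not its value), and demo_mapping_done() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EXTENT_LAST 0x1u	/* assumed bit value, not given in the posting */

/* Local mirror of struct fiemap_extent as quoted in the posting. */
struct demo_fiemap_extent {
	uint64_t fe_logical, fe_physical, fe_length;
	uint32_t fe_flags, fe_device;
};

/*
 * Inspect the nr extents returned by one call for a request ending at
 * byte offset range_end. Returns 1 when no further call is needed:
 * nothing was returned, the last returned extent carries the LAST
 * flag, or the returned extents already reach range_end. Otherwise
 * returns 0 and stores the offset to resume from in *next_start.
 */
static int demo_mapping_done(const struct demo_fiemap_extent *ext,
                             uint32_t nr, uint64_t range_end,
                             uint64_t *next_start)
{
	assert(next_start != NULL);

	if (nr == 0)
		return 1;

	const struct demo_fiemap_extent *last = &ext[nr - 1];

	if (last->fe_flags & DEMO_EXTENT_LAST)
		return 1;

	*next_start = last->fe_logical + last->fe_length;
	return *next_start >= range_end;
}
```

When this returns 0, the caller would set fm_start to *next_start, shrink fm_length accordingly, and call the ioctl again.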
> > Each extent is described by a single fiemap_extent structure as
> > returned in fm_extents.
> >
> > struct fiemap_extent {
> > 	__u64	fe_logical;  /* logical offset in bytes for the start of
> > 			      * the extent */
> > 	__u64	fe_physical; /* physical offset in bytes for the start
> > 			      * of the extent */
> > 	__u64	fe_length;   /* length in bytes for the extent */
> > 	__u32	fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
> > 	__u32	fe_device;   /* device number for extent */
> > };
> >
> > All offsets and lengths are in bytes and mirror those on disk.  It is
> > valid for an extent's logical offset to start before the request or its
> > logical length to extend past the request.  Unless
> > FIEMAP_EXTENT_NOT_ALIGNED is returned, fe_logical, fe_physical, and
> > fe_length will be aligned to the block size of the file system.  With
> > the exception of extents flagged as FIEMAP_EXTENT_MERGED, adjacent
> > extents will not be merged.
> >
> > The fe_flags field contains flags which describe the extent returned.
> > A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in
> > the file so that the process making fiemap calls can determine when no
> > more extents are available, without having to call the ioctl again.
> >
> > Some flags are intentionally vague and will always be set in the
> > presence of other more specific flags. This way a program looking for
> > a general property does not have to know all existing and future flags
> > which imply that property.
> >
> > For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
> > are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
> > for inline or tail-packed data can key on the specific flag. Software
> > which simply cares not to try operating on non-aligned extents
> > however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
> > worry about all present and future flags which might imply unaligned
> > data. Note that the opposite is not true - it would be valid for
> > FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
> >
> > * FIEMAP_EXTENT_LAST
> > This is the last extent in the file. A mapping attempt past this
> > extent will return nothing.
> >
> > * FIEMAP_EXTENT_UNKNOWN
> > The location of this extent is currently unknown. This may indicate
> > the data is stored on an inaccessible volume or that no storage has
> > been allocated for the file yet.
> >
> > * FIEMAP_EXTENT_DELALLOC
> >   - This will also set FIEMAP_EXTENT_UNKNOWN.
> > Delayed allocation - while there is data for this extent, its
> > physical location has not been allocated yet.
> >
> > * FIEMAP_EXTENT_NO_DIRECT
> > Direct access to the data in this extent is illegal or will have
> > undefined results.
> >
> > * FIEMAP_EXTENT_SECONDARY
> > The data for this extent is in secondary storage (e.g. HSM).  If the
> > data is not also in the filesystem, then FIEMAP_EXTENT_NO_DIRECT
> > should also be set.
> >
> > * FIEMAP_EXTENT_NET
> >   - This will also set FIEMAP_EXTENT_NO_DIRECT
> > The data for this extent is not stored in a locally-accessible device.
> >
> > * FIEMAP_EXTENT_DATA_COMPRESSED
> >   - This will also set FIEMAP_EXTENT_NO_DIRECT
> > The data in this extent has been compressed by the file system.
> >
> > * FIEMAP_EXTENT_DATA_ENCRYPTED
> >   - This will also set FIEMAP_EXTENT_NO_DIRECT
> > The data in this extent has been encrypted by the file system.
> >
> > * FIEMAP_EXTENT_NOT_ALIGNED
> > Extent offsets and length are not guaranteed to be block aligned.
> >
> > * FIEMAP_EXTENT_DATA_INLINE
> >   - This will also set FIEMAP_EXTENT_NOT_ALIGNED
> > Data is located within a meta data block.
> >
> > * FIEMAP_EXTENT_DATA_TAIL
> >   - This will also set FIEMAP_EXTENT_NOT_ALIGNED
> > Data is packed into a block with data from other files.
> >
> > * FIEMAP_EXTENT_UNWRITTEN
> > Unwritten extent - the extent is allocated but its data has not been
> > initialized.  This indicates the extent's data will be all zero.
> >
> > * FIEMAP_EXTENT_MERGED
> > This will be set when a file does not support extents, i.e., it uses a
> > block based addressing scheme.  Since returning an extent for each
> > block back to userspace would be highly inefficient, the kernel will
> > try to merge most adjacent blocks into 'extents'.
> >
> >
> > VFS -> File System Implementation
> > ---------------------------------
> >
> > File systems wishing to support fiemap must implement a ->fiemap
> > callback on their inode_operations structure. The fs ->fiemap call is
> > responsible for defining its set of supported fiemap flags, and calling
> > a helper function on each discovered extent:
> >
> > struct inode_operations {
> >        ...
> >
> >        int (*fiemap)(struct inode *, struct fiemap_extent_info *,
> >                      u64 start, u64 len);
> >        ...
> > };
> >
> > ->fiemap is passed struct fiemap_extent_info which describes the
> > fiemap request:
> >
> > struct fiemap_extent_info {
> > 	unsigned int fi_flags;		/* Flags as passed from user */
> > 	unsigned int fi_extents_mapped;	/* Number of mapped extents */
> > 	unsigned int fi_extents_max;	/* Size of fiemap_extent array */
> > 	struct fiemap_extent *fi_extents_start;	/* Start of fiemap_extent
> > 						 * array */
> > };
> >
> > It is intended that the file system should not need to access any of
> > this structure directly.
> >
> >
> > Flag checking should be done at the beginning of the ->fiemap callback
> > via the fiemap_check_flags() helper:
> >
> > int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> >
> > The struct fiemap_extent_info should be passed in as received from
> > ioctl_fiemap(). The set of fiemap flags which the fs understands should
> > be passed via fs_flags. If fiemap_check_flags() finds invalid user
> > flags, it will place the bad values in fieinfo->fi_flags and return
> > -EBADR. If the file system gets -EBADR from fiemap_check_flags(), it
> > should immediately exit, returning that error back to ioctl_fiemap().
> >
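The flag check described above could look roughly like the following userspace mock. It is only a sketch of the documented behaviour, not the code from the patch: the demo_* names, the reduced info structure, the flag bit values, and the example callback are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EBADR
#define EBADR 53	/* Linux value; fallback for platforms lacking it */
#endif

/* Assumed bit values; the posting names the flags but not their values. */
#define DEMO_FLAG_SYNC  0x1u
#define DEMO_FLAG_XATTR 0x2u

/* Reduced stand-in for struct fiemap_extent_info. */
struct demo_extent_info {
	uint32_t fi_flags;	/* flags as passed from the user */
};

/* On failure, leave only the flags the fs does not understand. */
static int demo_check_flags(struct demo_extent_info *fieinfo,
                            uint32_t fs_flags)
{
	uint32_t bad = fieinfo->fi_flags & ~fs_flags;

	if (bad) {
		fieinfo->fi_flags = bad;
		return -EBADR;
	}
	return 0;
}

/* Hypothetical ->fiemap callback of a fs that only understands SYNC. */
static int demo_fs_fiemap(struct demo_extent_info *fieinfo)
{
	int ret = demo_check_flags(fieinfo, DEMO_FLAG_SYNC);

	if (ret)
		return ret;	/* exit immediately, as the text asks */

	/* ... walk extents and report each one here ... */
	return 0;
}
```

With this shape, a request carrying SYNC|XATTR would fail with -EBADR and leave only XATTR in fi_flags, which matches the userspace-visible behaviour described for fm_flags earlier in the document.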
> >
> > For each extent in the request range, the file system should call
> > the helper function, fiemap_fill_next_extent():
> >
> > int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > 			    u64 phys, u64 len, u32 flags, u32 dev);
> >
> > fiemap_fill_next_extent() will use the passed values to populate the
> > next free extent in the fm_extents array. 'General' extent flags will
> > automatically be set from specific flags on behalf of the calling file
> > system so that the userspace API is not broken.
> >
> > fiemap_fill_next_extent() returns 0 on success, and 1 when the
> > user-supplied fm_extents array is full. If an error is encountered
> > while copying the extent to user memory, -EFAULT will be returned.
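
For readers trying to picture the helper, here is a userspace mock of the fill behaviour and of the general-flag expansion listed in the flags section. It is a sketch under several assumptions: the demo_* names and all bit values are invented for the example, "returns 1 when full" is interpreted as "the array has no free slot after this call", and there is no -EFAULT path because the mock writes to ordinary memory rather than copying to user space.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values; the posting names the flags but not their values. */
#define DEMO_EXTENT_UNKNOWN     0x0002u
#define DEMO_EXTENT_DELALLOC    0x0004u
#define DEMO_EXTENT_NO_DIRECT   0x0008u
#define DEMO_EXTENT_NET         0x0010u
#define DEMO_EXTENT_COMPRESSED  0x0020u
#define DEMO_EXTENT_ENCRYPTED   0x0040u
#define DEMO_EXTENT_NOT_ALIGNED 0x0080u
#define DEMO_EXTENT_INLINE      0x0100u
#define DEMO_EXTENT_TAIL        0x0200u

struct demo_extent {
	uint64_t logical, phys, len;
	uint32_t flags, dev;
};

struct demo_info {
	unsigned int mapped;		/* extents stored so far */
	unsigned int max;		/* capacity of the array */
	struct demo_extent *start;	/* caller-supplied array */
};

/* Add the general flags implied by the specific ones. */
static uint32_t demo_general_flags(uint32_t flags)
{
	if (flags & DEMO_EXTENT_DELALLOC)
		flags |= DEMO_EXTENT_UNKNOWN;
	if (flags & (DEMO_EXTENT_NET | DEMO_EXTENT_COMPRESSED |
		     DEMO_EXTENT_ENCRYPTED))
		flags |= DEMO_EXTENT_NO_DIRECT;
	if (flags & (DEMO_EXTENT_INLINE | DEMO_EXTENT_TAIL))
		flags |= DEMO_EXTENT_NOT_ALIGNED;
	return flags;
}

/* Returns 0 if more extents fit, 1 once the array is full. */
static int demo_fill_next_extent(struct demo_info *info, uint64_t logical,
                                 uint64_t phys, uint64_t len,
                                 uint32_t flags, uint32_t dev)
{
	if (info->mapped >= info->max)
		return 1;	/* no room: nothing is stored */

	struct demo_extent *e = &info->start[info->mapped++];

	e->logical = logical;
	e->phys = phys;
	e->len = len;
	e->flags = demo_general_flags(flags);
	e->dev = dev;
	return info->mapped == info->max ? 1 : 0;
}
```

A file system's extent walk would call this once per extent and stop as soon as it sees a non-zero return.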
> >
> > ----- End forwarded message -----
> > _______________________________________________
> > pNFS mailing list
> > pNFS at linux-nfs.org
> > http://linux-nfs.org/cgi-bin/mailman/listinfo/pnfs
> 


