Commit Graph

26 Commits

Author SHA1 Message Date
Brian Behlendorf
9878a89d7a Add explicit MAXNAMELEN check
It turns out that the Linux VFS doesn't strictly handle all cases
where a component path name exceeds MAXNAMELEN.  It does however
appear to correctly handle MAXPATHLEN for us.

The right way to handle this appears to be to add an explicit
check to the zpl_lookup() function.  Several in-tree filesystems
handle this case the same way.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1279
2013-02-12 10:27:39 -08:00
Brian Behlendorf
09a661e960 Fix zpl_revalidate() NULL deref
In zpl_revalidate() it's possible for the nameidata to be NULL
for kernels which still accept the parameter.  In particular,
lookup_one_len() calls d_revalidate() with a NULL nameidata.

Resolve the issue by checking for a NULL nameidata in which case
just set the flags to 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1226
2013-01-22 09:38:17 -08:00
Brian Behlendorf
ee93035378 Use sb->s_d_op default dentry operations
As of Linux 2.6.37 the right way to register custom dentry
operations is to use the super block's ->s_d_op field.
For older kernels they should be registered as part of the
lookup operation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1223
2013-01-18 15:04:23 -08:00
Brian Behlendorf
7b3e34ba5a Fix 'zfs rollback' on mounted file systems
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL.  The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.

Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back.  Then a new inode with the same inode number would
be create and collide with the existing cached inode.  Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.

Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.

The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly.  If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.

This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled.  The inode
will then only be accessible from the cached dentries.  Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated.  Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.

Special care is taken for negative dentries since they do not
reference any inode.  These dentires will be invalidate based
on when they were added to the dentry cache.  Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.

Two nice side effects of this fix are:

* Removes the dependency on spl_invalidate_inodes(), it can now
  be safely removed from the SPL when we choose to do so.

* zfs_znode_alloc() no longer requires a dentry to be passed.
  This effectively reverts this portition of the code to its
  upstream counterpart.  The dentry is not instantiated more
  correctly in the Linux ZPL layer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
2013-01-17 09:51:20 -08:00
Brian Behlendorf
e89260a1c8 Directory xattr znodes hold a reference on their parent
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent.  Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it.  Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.

This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.

The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX.  Otherwise we
run the risk of deadlocking on the open create TX.

Ideally the security xattr should be fully constructed
before the new inode is unlocked.  However, doing so would
require far more extensive changes to ZFS.

This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #671
2012-12-03 12:10:46 -08:00
Yuxuan Shui
558ef6d080 Linux 3.6 compat, iops->create()
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create.  Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 14:42:25 -07:00
Yuxuan Shui
8f195a908f Linux 3.6 compat, iops->lookup()
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup.  Instead
only the inamedata->flags are passed.

ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
2012-10-14 13:06:54 -07:00
Richard Yao
ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao
0a6b03d3b8 Fix build failures on PaX/GRSecurity patched kernels
Gentoo Hardened kernels include the PaX/GRSecurity patches. They use a
dialect of C that relies on a GCC plugin. In particular, struct
file_operations has been marked do_const in the PaX/GRSecurity dialect,
which causes GCC to consider all instances of it as const. This caused
failures in the autotools checks and the ZFS source code.

To address this, we modify the autotools checks to take into account
differences between the PaX C dialect and the regular C dialect. We also
modify struct zfs_acl's z_ops member to be a pointer to a function
pointer table. Lastly, we modify zpl_put_link() to address a PaX change
to the function prototype of nd_get_link().  This avoids compiler errors
in the PaX/GRSecurity dialect.

Note that the change in zpl_put_link() causes a warning that becomes a
build failure when debugging is enabled. Fixing that warning requires
ryao/spl@5ca50ef459.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #484
2012-07-17 09:22:43 -07:00
Brian Behlendorf
b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Etienne Dechamps
cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps
5cb63a57f8 Implement the truncate_range() inode operation.
This operation allows "hole punching" in ZFS files. On Solaris this
is done via the vop_space() system call, which maps to the zfs_space()
function. So we just need to write zpl_truncate_range() as a wrapper
around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission
checking, although this isn't that big of a deal since truncate_range()
isn't even callable from userspace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:32 -08:00
Brian Behlendorf
f31b3ebe6e Allow xattrs on symlinks
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call.  Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.

The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes.  This was done simply to be consistent with the Solaris
behavior.

Upon futher reflection I believe this should be allowed under
Linux.  The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system.  This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #272
2011-11-29 10:24:24 -08:00
Brian Behlendorf
6f2255ba8a Set mtime on symbolic links
Register the setattr/getattr callbacks for symlinks.  Without these
the generic inode_setattr() and generic_fillattr() functions will
be used.  In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.

The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #412
2011-10-18 15:49:31 -07:00
Brian Behlendorf
9fd91daeef Honor setgit bit on directories
Newly created files were always being created with the fsuid/fsgid
in the current users credentials.  This is correct except in the
case when the parent directory sets the 'setgit' bit.  In this
case according to posix the newly created file/directory should
inherit the gid of the parent directory.  Additionally, in the
case of a subdirectory it should also inherit the 'setgit' bit.

Finally, this commit performs a little cleanup of the vattr_t
initialization by moving it to a common helper function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #262
2011-07-20 14:07:13 -07:00
Brian Behlendorf
057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Ned A. Bass
aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf
c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf
81e97e2187 Linux 2.6.29 compat, credentials
As of Linux 2.6.29 a clean credential API was added to the Linux kernel.
Previously the credential was embedded in the task_struct.  Because the
SPL already has considerable support for handling this API change the
ZPL code has been updated to use the Solaris credential API.
2011-03-22 12:15:54 -07:00
Brian Behlendorf
53cf50e081 Set stat->st_dev and statfs->f_fsid
Filesystems like ZFS must use what the kernel calls an anonymous super
block.  Basically, this is just a filesystem which is not backed by a
single block device.  Normally this block device's dev_t is stored in
the super block.  For anonymous super blocks a unique reserved dev_t
is assigned as part of get_sb().

This sb->s_dev must then be set in the returned stat structures as
stat->st_dev.  This allows userspace utilities to easily detect the
boundries of a specific filesystem.  Tools such as 'du' depend on this
for proper accounting.

Additionally, under OpenSolaris the statfs->f_fsid is set to the device
id.  To preserve consistency with OpenSolaris we also set the fsid to
the device id.  Other Linux filesystem (ext) set the fsid to a unique
value determined by the filesystems uuid.  This value is unique but
maintains no relationship to the device id.  This may be desirable
when exporting NFS filesystem because it minimizes to chance of a
client observing the same fsid from two different servers.

Closes #140
2011-03-07 16:06:22 -08:00
Brian Behlendorf
5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.

This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.

Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00
Brian Behlendorf
5095000169 Use -zfs_readlink() error
The zfs_readlink() function returns a Solaris positive error value
and that needs to be converted to a Linux negative error value.
While in this case nothing would actually go wrong, it's still
incorrect and should be fixed if for no other reason than clarity.
2011-02-17 09:48:06 -08:00
Brian Behlendorf
8b4f9a2d55 Fix readlink(2)
This patch addresses three issues related to symlinks.

1) Revert the zfs_follow_link() function to a modified version
of the original zfs_readlink().  The only changes from the
original OpenSolaris version relate to using Linux types.
For the moment this means no vnode's and no zfsvfs_t.  The
caller zpl_follow_link() was also updated accordingly.  This
change was reverted because it was slightly gratuitious.

2) Update zpl_follow_link() to use local variables for the
link buffer.  I'd forgotten that iov.iov_base is updated by
uiomove() so after the call to zfs_readlink() it can not longer
be used.  We need our own private copy of the link pointer.

3) Allocate MAXPATHLEN instead of MAXPATHLEN+1.  By default
MAXPATHLEN is 4096 bytes which is a full page, adding one to
it pushes it slightly over a page.  That means you'll likely
end up allocating 2 pages which is wasteful of memory and
possibly slightly slower.
2011-02-16 15:54:55 -08:00
Brian Behlendorf
a6695d83b7 Add get/setattr, get/setxattr hooks
While the attr/xattr hooks were already in place for regular
files this hooks can also apply to directories and special files.
While they aren't typically used in this way, it should be
supported.  This patch registers these additional callbacks
for both directory and special inode types.
2011-02-16 09:55:53 -08:00
Brian Behlendorf
ee154f01bf Add Hooks for Linux Inode Operations
The Linux specific inode operations have all been located in the
file zpl_inode.c.  These functions primarily rely on the reworked
zfs_* functions to do their job.  They are also responsible for
converting the possible Solaris style error codes to negative
Linux errors.
2011-02-10 09:27:21 -08:00