For portability reasons it's handy to be able to create a root
znode and basic filesystem components without requiring the full
cooperation of the VFS. We are committing to this to simplify the
filesystem creation code.
This code is used for snapshots and heavily leverages Solaris
functionality we do not want to reimplement. These files have
been removed, including references to them, and will be replaced
by a zfs_snap.c/zpl_snap.c implementation which handles snapshots.
Minor update to ensure zfs_sync() is disabled if a kernel oops/panic
is triggered. As the comment says 'data integrity is job one'. This
change could have been done by defining panicstr to oops_in_progress
in the SPL. But I felt it was better to use the native Linux API
here to be clear.
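A minimal userspace sketch of the guard described above. The
oops_in_progress variable below is a local stand-in for the kernel
flag of the same name, and zfs_sync_sketch() and sync_calls are
hypothetical names used only for illustration:

```c
/* Local stand-in for the Linux kernel's oops_in_progress flag. */
static int oops_in_progress = 0;

/* Hypothetical counter so the effect of the guard can be observed. */
static int sync_calls = 0;

/*
 * Sketch of a sync entry point which refuses to do any work once an
 * oops/panic has been triggered. Data integrity is job one.
 */
static int
zfs_sync_sketch(void)
{
	if (oops_in_progress)
		return (0);

	sync_calls++;	/* the real sync work would happen here */
	return (0);
}
```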
This flag does not need to be supported under Linux. As the comment
says it was only there to support fsflush() for old filesystems like
UFS. This is not needed under Linux.
Mount option parsing is still very Linux specific and will be
handled above this zfs filesystem layer. Honoring those mount
options once set is of course the responsibility of the lower
layers.
This variable was used to ensure that the ZFS module is never
removed while the filesystem is mounted. Once again the generic
Linux VFS handles this case for us so it can be removed.
The functions zfs_mount_label_policy(), zfs_mountroot(), zfs_mount()
will not be needed because most of what they do is already handled
by the generic Linux VFS layer. They all call zfs_domount() which
creates the actual dataset. The caller of this library call, which
will be in the zpl layer, is responsible for what's left.
Under Linux we don't need to reserve a major or minor number for
the filesystem. We can rely on the VFS to handle collisions without
this being handled by the lower ZFS layers.
Additionally, there is no need to keep a zfsfstype around. We are
not limited on Linux by the OpenSolaris infrastructure which needed
this. The upper zpl layer can specify the filesystem type.
The ZFS code is being restructured to act as a library and a stand
alone module. This allows us to leverage most of the existing code
with minimal modification. It also means we need to drop the Solaris
vfs/vnode functions. They will be replaced by Linux equivalents and
updated to be Linux friendly.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_inode_update() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient, it's a breeding
ground for bugs.
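As a rough illustration of the workaround, the update amounts to
copying the values ZFS tracks in the znode into the inode the VFS
looks at. The structures and field names below are simplified
stand-ins chosen for this sketch, not the real znode_t or struct
inode:

```c
#include <stdint.h>

/* Simplified stand-ins; the real structures carry many more fields. */
struct inode_sketch {
	uint64_t	i_size;
	uint32_t	i_mode;
	uint32_t	i_nlink;
};

struct znode_sketch {
	uint64_t	z_size;
	uint32_t	z_mode;
	uint32_t	z_links;
};

/* Copy the values ZFS keeps in the znode into the VFS visible inode. */
static void
zfs_inode_update_sketch(const struct znode_sketch *zp,
    struct inode_sketch *ip)
{
	ip->i_size = zp->z_size;
	ip->i_mode = zp->z_mode;
	ip->i_nlink = zp->z_links;
}
```

Every field copied here is a second copy of the same data, which is
exactly the duplication the text wants to eliminate.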
Under Linux the convention for filesystem specific data structures
is to embed them along with the generic vfs data structure. This
differs significantly from Solaris.
Since we want to integrate as cleanly with the Linux VFS as possible,
this change modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t but returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
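The embedding can be sketched in plain userspace C. The types below
are simplified stand-ins, alloc_inode_sketch() only plays the role of
an sb->alloc_inode() handler, and the ITOZ()/ZTOI() conversion macros
are written here in the usual container_of style as an assumption
about how the two pointers are related:

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic VFS inode. */
struct inode_sketch {
	unsigned long	i_ino;
};

/* Filesystem specific structure with the generic inode embedded. */
typedef struct znode_sketch {
	unsigned long long	z_id;
	struct inode_sketch	z_inode;
} znode_sketch_t;

/* Convert between the embedded inode and its enclosing znode. */
#define	ITOZ(ip) \
	((znode_sketch_t *)((char *)(ip) - offsetof(znode_sketch_t, z_inode)))
#define	ZTOI(zp)	(&(zp)->z_inode)

/*
 * Allocate enough memory for the whole znode but hand back only the
 * pointer to the embedded inode, which is all the VFS knows about.
 */
static struct inode_sketch *
alloc_inode_sketch(void)
{
	znode_sketch_t *zp = calloc(1, sizeof (znode_sketch_t));

	return (zp != NULL ? ZTOI(zp) : NULL);
}

/* Free the enclosing znode given only the inode pointer. */
static void
destroy_inode_sketch(struct inode_sketch *ip)
{
	free(ITOZ(ip));
}
```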
Basic compilation of the bulk of zfs_znode.c has been enabled. After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces. The following commits
will systematically update the required interfaces. There
are of course pros and cons to this decision.
Pros:
* This simplifies integration with Linux in the long term. There is
  no longer any need to manage vnodes which are a foreign concept to
  the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.
Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
Lay the initial ground work for an include/linux/ compatibility
directory. This was less critical in the past because the bulk
of the ZFS code consumes the Solaris API via the SPL. This API
was stable and the bulk of the Linux API differences were handled in
the SPL.
However, with the addition of a full POSIX layer written directly
against the Linux APIs we are going to need more compatibility
code. It makes sense that all this code should be cleanly located
in one place. Subsequent patches should move the existing zvol
and vdev_disk compatibility code into this directory.
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs are updated.
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common(). Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static functions to this file. This way I don't
need to reimplement this functionality from scratch in the SPL.
Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed. If
that turns out to be the case they can then be removed.
Remove unneeded bootfs functions. This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
Certain NFS/SMB share functionality is not yet in place. These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled. I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL. So I'm just
renaming the wrapper here to HAVE_SHARE. They still won't be
compiled until all the share issues are worked through. Share
support is the last missing piece from zfs_ioctl.c.
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode. So while it's true that fragmentation
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches. Until proven that this is a problem it's
just added complexity we don't need.
Roll the version forward to 0.6.0. While no major changes
really warrant this I want to keep the version in step with
ZFS for now which is the only SPL consumer.
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios. However, this
patch re-adds them with the idea they can just be reworked and
the uio bits dropped.
ZFS even under Solaris does not strictly require libshare to be
available. The current implementation attempts to dlopen() the
library to access the needed symbols. If this fails libshare
support is simply disabled.
This means that on Linux we only need the most minimal libshare
implementation. In fact just enough to prevent the build from
failing. Longer term we can decide if we want to implement a
libshare library like Solaris. At best this would be an abstraction
layer between ZFS and NFS/SMB. Alternately, we can drop libshare
entirely and directly integrate ZFS with Linux's NFS/SMB.
Finally the bare bones user-libshare.m4 test was dropped. If we
do decide to implement libshare at some point it will surely be
as part of this package so the check is not needed.
By design the zfs utility is supposed to handle mounting and unmounting
a zfs filesystem. We could allow zfs to do this directly. There are
system calls available to mount/umount a filesystem. And there are
library calls available to manipulate /etc/mtab. But there are a
couple very good reasons not to take this approach... for now.
Instead of directly calling the system and library calls to (u)mount
the filesystem we fork and exec a (u)mount process. The principal
reason for this is to delegate the responsibility for locking and
updating /etc/mtab to (u)mount(8). This ensures maximum portability
and ensures the right locking scheme for your version of (u)mount
will be used. If we didn't do this we would have to resort to an
autoconf test to determine what locking mechanism is used.
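A minimal sketch of the fork and exec approach, using only standard
POSIX calls. The function name run_process_sketch() is hypothetical
and the error handling is deliberately simple:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork and exec the given program. Returns the exit status of the
 * child, or -1 if it could not be created or did not exit normally.
 */
static int
run_process_sketch(const char *path, char *const argv[])
{
	pid_t pid;
	int status;

	pid = fork();
	if (pid < 0)
		return (-1);

	if (pid == 0) {
		execv(path, argv);
		_exit(127);	/* only reached if the exec failed */
	}

	/* Wait for the child, retrying if interrupted by a signal. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return (-1);
	}

	if (WIFEXITED(status))
		return (WEXITSTATUS(status));

	return (-1);
}
```

A caller would pass the path of the system mount binary along with an
argv such as { "mount", "-t", "zfs", dataset, mountpoint, NULL }. The
exact path and arguments are assumptions here. All mtab locking then
happens inside the exec'd (u)mount process.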
The downside to using mount(8) instead of mount(2) is that we lose
the exact errno which was returned by the kernel. The return code
from mount(8) provides some insight into what went wrong but it is
not quite as good. For the moment this is translated as a best
guess into an errno for the higher layers of zfs.
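The best guess translation could look something like the following.
The bit values are the exit codes documented in mount(8). The macro
names and the particular errno chosen for each bit are assumptions
made for this sketch:

```c
#include <errno.h>

/* Exit code bits documented in mount(8); the names are hypothetical. */
#define	MOUNT_USAGE	0x01	/* incorrect invocation or permissions */
#define	MOUNT_SYSERR	0x02	/* system error (out of memory, ...) */
#define	MOUNT_SOFTWARE	0x04	/* internal mount bug */
#define	MOUNT_USER	0x08	/* user interrupt */
#define	MOUNT_FILEIO	0x10	/* problems writing or locking mtab */
#define	MOUNT_FAIL	0x20	/* mount failure */
#define	MOUNT_SOMEOK	0x40	/* some mount succeeded */

/* Translate a mount(8) exit status into a best guess errno. */
static int
mount_status_to_errno(int rc)
{
	if (rc == 0)
		return (0);
	if (rc & MOUNT_FILEIO)
		return (EIO);
	if (rc & MOUNT_USER)
		return (EINTR);
	if (rc & MOUNT_SOFTWARE)
		return (EPIPE);
	if (rc & MOUNT_SYSERR)
		return (EAGAIN);
	if (rc & MOUNT_USAGE)
		return (EINVAL);

	return (ENXIO);	/* generic failure, nothing more specific known */
}
```

Several distinct kernel errors collapse into the same exit bit, which
is why this can never be more than a guess.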
In the long term a shared library called libmount is under development
which provides a common API to address the locking and errno issues.
Once the standard mount utility has been updated to use this library
we can then leverage it. Until then this is the only safe solution.
http://www.kernel.org/pub/linux/utils/util-linux/libmount-docs/index.html
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
  cv_broadcast();
  cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
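The idea can be modeled in userspace with a waiter count. This is
only a sketch under stated assumptions: the type and function names
are hypothetical, C11 atomics stand in for the kernel primitives, and
the real implementation sleeps rather than yielding in a loop:

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical condition variable which tracks its active waiters. */
typedef struct kcondvar_sketch {
	atomic_int	cv_waiters;
	int		cv_destroyed;
} kcondvar_sketch_t;

static void
cv_init_sketch(kcondvar_sketch_t *cv)
{
	atomic_init(&cv->cv_waiters, 0);
	cv->cv_destroyed = 0;
}

/* Called by a waiter before it blocks and after it has been woken. */
static void
cv_wait_enter(kcondvar_sketch_t *cv)
{
	atomic_fetch_add(&cv->cv_waiters, 1);
}

static void
cv_wait_exit(kcondvar_sketch_t *cv)
{
	atomic_fetch_sub(&cv->cv_waiters, 1);
}

/* Destruction is only safe once no waiter still references the cv. */
static int
cv_destroy_ready(kcondvar_sketch_t *cv)
{
	return (atomic_load(&cv->cv_waiters) == 0);
}

/* Block until every waiter has been woken, scheduled, and has left. */
static void
cv_destroy_sketch(kcondvar_sketch_t *cv)
{
	while (!cv_destroy_ready(cv))
		sched_yield();

	cv->cv_destroyed = 1;
}
```

With this in place a cv_broadcast() immediately followed by a
cv_destroy() simply waits for the woken threads to drain instead of
tripping an ASSERT.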
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying to avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Recently helper functions were added to libzfs_util to load a kernel
module or execute a process. Initially this functionality was limited
to libzfs but it has become clear there will be other consumers. This
change opens up the interface so it may be used where appropriate.
For the moment, the only advantage in registering a umount helper
would be to automatically unshare a zfs filesystem. Since under
Linux this would be unexpected (but nice) behavior there is no
harm in disabling it.
This is desirable because the 'zfs unmount' path invokes the system
umount. This is done to ensure correct mtab locking but has the
side effect that the umount.zfs helper would be called if it exists.
By default this helper calls back in to zfs to do the unmount on
Solaris which we don't want under Linux.
Once libmount is available and we have a safe way to correctly
lock and update the /etc/mtab file we can reconsider the need
for a umount helper. Using libmount is the preferred solution.
While not strictly required to mount a zfs filesystem using a
mount helper has certain advantages.
First, we need it if we want to honor the mount behavior as found
on Solaris. As part of the mount we need to validate that the
dataset has the legacy mount property set if we are using 'mount'
instead of 'zfs mount'.
Secondly, by using a mount helper we can automatically load the
zpl kernel module. This way you can just issue a 'mount' or
'zfs mount' and it will just work.
Finally, it gives us a common hook in user space to add any zfs
specific mount options we might want. At the moment we don't
have any but now the infrastructure is at least in place.
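The first point, validating the legacy setting, might be sketched as
below. The function name, the from_zfs_mount flag, and the choice of
EINVAL are all assumptions for illustration. How the helper actually
learns which command invoked it is not covered here:

```c
#include <errno.h>
#include <string.h>

/*
 * Decide whether the mount helper should proceed. A plain 'mount'
 * is only honored when the dataset's mountpoint property is set to
 * legacy, while 'zfs mount' is always allowed through.
 */
static int
mount_helper_check(const char *mountpoint_prop, int from_zfs_mount)
{
	if (from_zfs_mount)
		return (0);

	if (strcmp(mountpoint_prop, "legacy") == 0)
		return (0);

	return (EINVAL);	/* assumed error code for this sketch */
}
```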
If libselinux is detected on your system at configure time link
against it. This allows us to use a library call to detect if
selinux is enabled and if it is to pass the mount option:
  "context=\"system_u:object_r:file_t:s0\""
For now this is required because none of the existing selinux
policies are aware of the zfs filesystem type. Because of this
they do not properly enable xattr based labeling even though
zfs supports all of the required hooks.
Until distros add zfs as a known xattr friendly fs type we
must use mntpoint labeling. Alternately, end users could modify
their existing selinux policy with a little guidance.
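A sketch of how the option might be appended to the mount options.
The function name is hypothetical and the selinux state is passed in
as a plain flag so the example does not depend on libselinux. In
practice the flag would come from a call such as libselinux's
is_selinux_enabled(). The closing escaped quote in the context string
is an assumption:

```c
#include <stdio.h>
#include <string.h>

#define	SELINUX_CONTEXT_OPT	"context=\"system_u:object_r:file_t:s0\""

/*
 * Append the mount point labeling option to an existing NUL
 * terminated option string when selinux is enabled. Returns 0 on
 * success or -1 if the buffer is too small, in which case the
 * original options are left untouched.
 */
static int
append_selinux_context(char *mntopts, size_t len, int selinux_enabled)
{
	size_t used = strlen(mntopts);
	int n;

	if (!selinux_enabled)
		return (0);

	n = snprintf(mntopts + used, len - used, "%s%s",
	    used > 0 ? "," : "", SELINUX_CONTEXT_OPT);
	if (n < 0 || (size_t)n >= len - used) {
		mntopts[used] = '\0';	/* undo any partial write */
		return (-1);
	}

	return (0);
}
```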
Simply add the policy function wrappers. They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and integrated
as appropriate with Linux.
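For illustration, a non-functional wrapper of this kind is little
more than the following. The function and type names are
hypothetical and do not correspond to any specific policy interface:

```c
#include <stddef.h>

/* Opaque stand-in for a credential type. */
struct cred_sketch;

/*
 * Non-functional policy check. The arguments are ignored and the
 * caller is always told that everything is OK (0).
 */
static inline int
policy_check_sketch(const struct cred_sketch *cr, int mode)
{
	(void) cr;
	(void) mode;

	return (0);
}
```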
Dependent packages require the following missing headers to
simplify compilation. The headers are basically just stubbed
out with minimal content required.
The following flags are used to get the proper mask when getting
and setting ACLs. I'm hopeful this can all largely go away at
some point.
We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
For Linux the maximum uid can vary depending on how your kernel
is built. The Linux kernel still can be compiled with 16 bit uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?). Given that caveat it is reasonably
safe to define the MAXUID as 2147483647.
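In code this is a single define. The range check helper below is
hypothetical and only shows how such a limit would typically be
consumed:

```c
#include <stdint.h>

#define	MAXUID	2147483647

/* The chosen limit is the largest signed 32 bit value. */
_Static_assert(MAXUID == INT32_MAX, "MAXUID should equal INT32_MAX");

/* Hypothetical helper: is the given id within the supported range? */
static int
uid_in_range_sketch(uint64_t id)
{
	return (id <= MAXUID);
}
```

Any 16 bit uid (at most 65535) falls well inside this range, so the
define stays safe even on a kernel built with the smaller type.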
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers. It also makes some minor fid_t
additions for compatibility.
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be too large and performance is
much better for a kmem cache, force them to be kmem backed.
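The difference between the old and new behavior can be sketched as a
small selection function. The flag values and the size threshold are
hypothetical and chosen only to make the example self-contained:

```c
#include <stddef.h>

/* Hypothetical flag values and threshold for this sketch. */
#define	KMC_KMEM		0x1
#define	KMC_VMEM		0x2
#define	SKETCH_KMEM_MAX_SIZE	8192

/*
 * Pick the backing store for a cache. An explicit flag wins,
 * otherwise the choice falls back to the size of the object.
 */
static int
cache_backing_sketch(size_t obj_size, int flags)
{
	if (flags & KMC_KMEM)
		return (KMC_KMEM);
	if (flags & KMC_VMEM)
		return (KMC_VMEM);

	return (obj_size > SKETCH_KMEM_MAX_SIZE ? KMC_VMEM : KMC_KMEM);
}
```

Passing the kmem flag explicitly, as this change does for both
caches, takes the size based guess out of the picture.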