Under Linux we don't need to reserve a major or minor number for
the filesystem. We can rely on the VFS to handle colisions without
this being handled by the lower ZFS layers.
Additionally, there is no need to keep a zfsfstype around. We are
not limited on Linux by the OpenSolaris infrastructure which needed
this. The upper zpl layer can specify the filesystem type.
The ZFS code is being restructured to act as a library and a stand
alone module. This allows us to leverage most of the existing code
with minimal modification. It also means we need to drop the Solaris
vfs/vnode functions they will be replaced by Linux equivilants and
updated to be Linux friendly.
For the moment we have left ZFS unchanged and it updates many values
as part of the znode. However, some of these values should be set
in the inode. For the moment this is handled by adding a function
called zfs_inode_update() which updates the inode based on the znode.
This is considered a workaround until we can systematically go
through the ZFS code and have it directly update the inode. At
which point zfs_update_inode() can be dropped entirely. Keeping
two copies of the same data isn't only inefficient it's a breeding
ground for bugs.
Under Linux the convention for filesystem specific data structure is
to embed it along with the generic vfs data structure. This differs
significantly from Solaris.
Since we want to integrates as cleanly with the Linux VFS as possible.
This changes modifies zfs_znode_alloc() to allocate a znode with an
embedded inode for use with the generic VFS. This is done by calling
iget_locked() which will allocate a new inode if needed by calling
sb->alloc_inode(). This function allocates enough memory for a
znode_t by returns a pointer to the inode structure for Linux's VFS.
This function is also responsible for setting the callback
znode->z_set_ops_inodes() which is used to register the correct
handlers for the inode.
Basic compilation of the bulk of zfs_znode.c has been enabled. After
much consideration it was decided to convert the existing vnode based
interfaces to more friendly Linux interfaces. The following commits
will systematically replace update the requiter interfaces. There
are of course pros and cons to this decision.
Pros:
* This simplifies intergration with Linux in the long term. There is
no longer any need to manage vnodes which are a foreign concept to
the Linux VFS.
* Improved long term maintainability.
* Minor performance improvements by removing vnode overhead.
Cons:
* Added work in the short term to modify multiple ZFS interfaces.
* Harder to pull in changes if we ever see any new code from Solaris.
* Mixed Solaris and Linux interfaces in some ZFS code.
This code originates in OpenSolaris and was modified by KQ Infotech
to be compatible with Linux. While supporting uios in the short
term is useful to get something working this is not an abstraction
we want to keep. This code is expected to be short lived and
removed as soon as all the remaining uio based APIs and updated.
The zfs acl code makes use of the two OpenSolaris helper functions
acl_trivial_access_masks() and ace_trivial_common(). Since they are
only called from zfs_acl.c I've brought them over from OpenSolaris
and added them as static function to this file. This way I don't
need to reimplement this functionality from scratch in the SPL.
Long term once I take a more careful look at the acl implementation
it may be the case that these functions really aren't needed. If
that turns out to be the case they can then be removed.
Remove unneeded bootfs functions. This support shouldn't be required
for the Linux port, and even if it is it would need to be reworked
to integrate cleanly with Linux.
Certain NFS/SMB share functionality is not yet in place. These
functions used to be wrapped with the generic HAVE_ZPL to prevent
them from being compiled. I still don't want them compiled but
I'm working toward eliminating the use of HAVE_ZPL. So I'm just
renaming the wrapper here to HAVE_SHARE. They still won't be
compiled until all the share issues are worked through. Share
support is the last missing piece from zfs_ioctl.c.
The zfs_check_global_label() function is part of the HAVE_MLSLABEL
support which was previously commented out by a HAVE_ZPL check.
Since we're still deciding what to do about mls labels wrap it
with the preexisting macro to keep it compiled out.
Unlike Solaris the Linux implementation embeds the inode in the
znode, and has no use for a vnode. So while it's true that fragmention
of the znode cache may occur it should not be worse than any of the
other Linux FS inode caches. Until proven that this is a problem it's
just added complexity we don't need.
These functions were dropped originally because I felt they would
need to be rewritten anyway to avoid using uios. However, this
patch readds then with they dea they can just be reworked and
the uio bits dropped.
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters. However, I've now seen several instances in
OpenSolaris code where they do the following:
cv_broadcast();
cv_destroy();
This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT. This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request. But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.
Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled. This may take some time because each waiter must
acquire the mutex.
This change may have an impact on performance for frequently
created and destroyed condition variables. That however is a price
worth paying it avoid crashing your system. If performance issues
are observed they can be addressed by the caller.
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be to large and performce is
much better for a kmem cache for them to be kmem backed.
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking. This causes processes
to go in the D state which increases the load average. The load
average is the summation of processes in D state and run queue.
To avoid this it can be desirable to sleep interruptibly. These
processes do not count against the load average but may be woken by
a signal. It is up to the caller to determine why the process
was woken it may be for one of three reasons.
1) cv_signal()/cv_broadcast()
2) the timeout expired
3) a signal was received
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Create spl_inode_lock/spl_inode_unlock compability macros to simply
access to the inode mutex/sem. This avoids the need to have to ugly
up the code with the required #define's at every call site. At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
During a rename we need to be careful to destroy and create a
new minor for the ZVOL _only_ if the rename succeeded. The previous
code would both destroy you minor device unconditionally, it would
also fail to create the new minor device on success.
These compiler warnings were introduced when code which was
previously #ifdef'ed out by HAVE_ZPL was re-added for use
by the posix layer. All of the following changes should be
obviously correct and will cause no semantic changes.
The issue is that cv_timedwait() sleeps uninterruptibly to block signals
and avoid waking up early. Under Linux this counts against the load
average keeping it artificially high. This change allows the arc to
sleep interruptibly which mean it may be woken up early due to a signal.
Normally this means some extra care must be taken to handle a potential
signal. But for the arcs usage of cv_timedwait() there is no harm in
waking up before the timeout expires so no extra handling is required.
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test. This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.
The test will first create 32 keys via tsd_create() and register
a common destructor. Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread. Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.
The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys. They then immediately
exit. This is is designed to verify tsd_exit() which will be
called via thread_exit(). This must result in all registered
destructors being run and the memory for the tsd being free'd.
After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd. At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.
If this all goes off without a hitch the test passes. To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
Thread specific data has implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels. This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.
The majority of the entries in the hash table are for specific tsd
entries. These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable properly that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful to never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.
The hash table contains two additional type of entries. They first
type is entry is called a 'key' entry and it is added to the hash during
tsd_create(). It is used to store the address of the destructor function
and it is used as an anchor point. All tsd entries which use the same
key will be linked to this entry. This is used during tsd_destory() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.
The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process set a key. The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it. This
list is using during tsd_exit() to ensure all registered destructors
are run for the process. The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid. Note that tsd_exit() is called by thread_exit()
so if your using the Solaris thread API you should not need to call
tsd_exit() directly.
For debugging purposes the condition varaibles keep track of the
mutex used during a wait. The idea is to validate that all callers
always use the same mutex. Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe. My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex. However, there is overhead in doing this and it does
appear to be allowed under Solaris.
To accomidate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped. This ensures that while the condition variable
is in use the incorrect mutex case is detected. It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.
Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant. The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex. It turns out that was not the case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit fixes a sign extension bug affecting l2arc devices. Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to
attempt to access beyond end of device
sdbi1: rw=14, want=36028797014862705, limit=125026959
The unwanted sign extension occurrs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater. To avoid this, we store
the offset in a uint64_t.
This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future. We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.
Closes#66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the ZFS code only contains support
back to 2.6.18 vintage kernels, I'm not adding an autoconf check
for this and simply moving everything to use fops->unlocked_ioctl().
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
Commit 3ee56c292b changed an ENOTSUP return value
in one location to ENOTSUPP to fix user programs seeing an invalid ioctl()
error code. However, use of ENOTSUP is widespread in the zfs module. Instead
of changing all of those uses, we fixed the ENOTSUP definition in the SPL to be
consistent with user space. The changed return value in the above commit is
therefore no longer needed, so this commit reverses it to maintain consistency.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops()->ioctl() has also been removed. The
replacement hook is fops->unlocked_ioctl() which has existed in
kernel since 2.6.12. Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t. If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't. So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used. We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.
module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
enumerated type ‘krw_type_t’
Support for rolling back datasets require a functional ZPL, which we currently
do not have. The zfs command does not check for ZPL support before attempting
a rollback, and in preparation for rolling back a zvol it removes the minor
node of the device. To prevent the zvol device node from disappearing after a
failed rollback operation, this change wraps the zfs_do_rollback() function in
an #ifdef HAVE_ZPL and returns ENOSYS in the absence of a ZPL. This is
consistent with the behavior of other ZPL dependent commands such as mount.
The orginal error message observed with this bug was rather confusing:
internal error: Unknown error 524
Aborted
This was because zfs_ioc_rollback() returns ENOTSUP if we don't HAVE_ZPL, but
Linux actually has no such error code. It should instead return EOPNOTSUPP, as
that is how ENOTSUP is defined in user space. With that we would have gotten
the somewhat more helpful message
cannot rollback 'tank/fish': unsupported version
This is rather a moot point with the above changes since we will no longer make
that ioctl call without a ZPL. But, this change updates the error code just in
case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Increasing the default zio_wr_int thread count from 8 to 16 improves
write performence by 13% on large systems. More testing need to be
done but I suspect the ideal tuning here is ZTI_BATCH() with a minimum
of 8 threads.
Linux kernel thread names are expected to be short. This change shortens
the zio thread names to 10 characters leaving a few chracters to append
the /<cpuid> to which the thread is bound. For example: z_wr_iss/0.
On some older kernels, i.e. 2.6.18, zvol_ioctl_by_inode() may get passed a NULL
file pointer if the user tries to mount a zvol without a filesystem on it.
This change adds checks to prevent a null pointer dereference.
Closes#73.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of linux-2.6.35 the shrinker callback API now takes an additional
argument. The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it. This removes the need to always use global state for the
shrinker.
To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API. Then we simply setup a callback
function with the correct number of arguments. For now we do not make
use of the new 3rd argument.
It turns out that 'zpool events' over 1024 bytes in size where being
silently dropped. This was discovered while writing the zfault.sh
tests to validate common failure modes.
This could occur because the zfs interface for passing an arbitrary
size nvlist_t over an ioctl() is to provide a buffer for the packed
nvlist which is usually big enough. In this case 1024 byte is the
default. If the kernel determines the buffer is to small it returns
ENOMEM and the minimum required size of the nvlist_t. This was
working properly but in the case of 'zpool events' the event stream
was advanced dispite the error. Thus the retry with the bigger
buffer would succeed but it would skip over the previous event.
The fix is to pass this size to zfs_zevent_next() and determine
before removing the event from the list if it will fit. This was
preferable to checking after the event was returned because this
avoids the need to rewind the stream.
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occured. Unfortunately,
even with FAILFAST set we may retry and request and not get an
error. This behavior is strongly dependant on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histrogram. This histogram can then be
reported as part of the 'zpool status' command when given an command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice is turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set it you may get retries in the low level
drivers decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernels patchs there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before a ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.