2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#939
This reverts commit d135245791.
Since feature flags have now been merged we can apply the real
upstream fix from Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #997
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure. Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().
Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.
To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
mountall in Debian depends on being able to pass the -f parameter to
mount, which specifies a fake mount and just updates the mtab. Currently
mount.zfs will fail such a request if it is not passed with -o zfsutil.
This patch allows a fake mount on a non-legacy filesystem to succeed in
the same manner as a -o remount does, thus enabling mountall to work
correctly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1167
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c
} else {
int i;
ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
/* the blkptrs we are losing better be unallocated */
for (i = dn->dn_next_nblkptr[txgoff];
i < dnp->dn_nblkptr; i++)
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning. To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
at most 3 block pointers this overhead should be negligible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#950
Currently ZFS doesn't show any I/O time in eg "top" wait% or in
/proc/$pid/stat's blkio_ticks. Using io_schedule() instead of
schedule() in zio_wait()'s cv_wait() is the correct way to fix
this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1158Closes#1175
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.
The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device. This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.
This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().
The additional debugging information which was added to the
delay event has been intentionally left in place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting. The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#206
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock(). Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#612
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The AUTHORS file was getting stale. Refresh its contents
using the authors listed in the git commit logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The AUTHORS file was getting stale. Refresh its contents
using the authors listed in the git commit logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ChangeLog was retired long ago, the git commit logs are
authoritative. To avoid any confusion remove the ChangeLog.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ChangeLog was retired long ago, the git commit logs are
authoritative. To avoid any confusion remove the ChangeLog.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit(). This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.
However, under Linux I'm not seeing the same contention that
was originally described. Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD. If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.
This just leaves one small ZFS TSD consumer. If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#174
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute. To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/zfs#699Closeszfsonlinux/zfs#859
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1149
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from use a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1131
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=433946
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix setting/getting users/groups in quota properties through
numeric identifier. This support was accidentally disabled
in the original port by applying the HAVE_IDMAP wrapper macro
too broadly.
Fix obtained by moving #ifdef HAVE_IDMAP to exclude only
the part of code that really needs IDMAP. Now zfs (get|set)
(user|group)quota@1000 works as expected.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1147
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=433946
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764Closes#1140
Commit 2957f38 renamed 60-vdev.rules to 69-vdev.rules but failed
to update the .gitignore file to reflect this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#816
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers. While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.
I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel. But frankly
the compatibility code to use this kernel interface has always
been problematic.
Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1053
All consumers of the kernel delayed work queues have been shifted
over to rely on the taskq implementation. This compatibility code
can now be removed. Any new callers which need this functionality
should use the taskq interfaces for delayed work items.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernels delayed work queue interface
and all the compatibility code it requires.
This code never actually used the delay functionality it was just
done this way to leverage the existing compatibility code. All that
is required is a thread context to perform the allocation in. The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces. This allows us to abandon the kernels
delayed work queue interface and all the compatibility code it
requires.
However, the delayed taskq interface does not allow us to schedule
a task for a specfic cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function. The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu. It then blocks
waiting for them to complete and promptly reclaims any free slabs.
When restructing the code wasn't the primary goal I think the
new code is far more understable and maintainable. It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context. In
practice there are no callers which ever do this. This makes sense
since a cache is by design a long lived data structure.
To prevent abuse of this function going forward I'm removing the
code which is supported to handle an atomic context. All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:front splat test over frame size
limit. To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.
In function 'splat_taskq_test6_impl'
error: the frame size of 1648 bytes is larger than 1024 bytes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:order splat test over frame size
limit. To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.
In function 'splat_taskq_test5_impl'
error: the frame size of 1680 bytes is larger than 1024 bytes
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a test case for taskq_cancel_id() to verify it is working
properly. Just like taskq:delay we start by dispatching 100
tasks. However this time 1/3 of the tasks use taskq_dispatch()
and will be run immediately, and 2/3 use taskq_dispatch_delay().
The idea is to create a busy taskq with both active, pending,
and delayed tasks.
After all the items have been successfully dispatched the test
begins randomly canceling known task ids. It will do this for
5 seconds randomly canceling a task id and then sleeping for a
few milliseconds. The task being canceled may have already run,
still be on the pending list, or may be currently being executed
by a worker thread. The idea is to ensure we catch any subtle
race conditions.
Once all the non-canceled tasks have completed we cross check
the number of tasks which ran with the number of tasks which
were successfully canceled. Additionally, we verify that the
taskq_cancel_id() function never blocks longer than needed.
This time is bounded by the longest run time of the task which
was dispatched.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add a test case for taskq_dispatch_delay() to verify it is working
properly. The test dispatchs 100 tasks to a taskq with random
expiration times spread over 5 seconds. As each task expires and
gets executed by a worker thread it verifies that it was run at
the correct time. Once all the delayed tasks have been executed
we double check that all the dispatched tasks were successful.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the ability to dispatch a delayed task to a taskq. The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached. To achieve this two
new functions were added.
* taskq_dispatch_delay() -
This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument. The caller should pass the
desired time the task should be executed as an absolute value in
jiffies. The task is guarenteed not to run before this time, it
may run slightly latter if all the worker threads are busy.
* taskq_cancel_id() -
Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task. There are three possible
return values.
0 - The task was found and canceled before it was executed.
ENOENT - The task was not found, either it was already run or an
invalid task id was supplied by the caller.
EBUSY - The task is currently executing any may not be canceled.
This function will block until the task has been completed.
* taskq_wait_all() -
The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior. It is only curreny used by
the splat taskq regression tests.
* taskq_wait_id() -
Historically, the only difference between this function and
taskq_wait() was that you passed the task id. In both functions you
would block until ALL lower task ids which executed. This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.
To better accomidate the delay tasks this function was reimplemnted.
It will now only block until the passed task id has been completed.
This is actually a fairly low risk change for a few reasons.
* Only new ZFS callers will make use of the new interfaces and
very little common code was changed to support the new functions.
* The existing taskq_wait() implementation was not changed just
slightly refactored.
* The newly optimized taskq_wait_id() implementation was never
used by ZFS we can't accidentally introduce a new bug there.
NOTE: This functionality does not exist in the Illumos taskqs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the taskq implementation was originally written I wrapped all
the API functions in #define's. This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.
However, in practice the taskq symbols never conflicted. The only
major conflicts occured with the kmem cache API. Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the taskq implementation to conform with the style used
throughout the rest of the code. There are no functional
changes in this commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ensure the test thread blocks until the shrinker has completed its
work. This is done by putting the test thread to sleep and waking
it each time the shrinker callback runs. Once the shrinker size
drops to zero or we time out the test is allowed to proceed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#96Closes#125Closes#182
The splat command takes a verbose option which when set prints
the internal debug log for every test. This is helpful when
tracking down a common failure, but for a rare failure the
volume of log data is distracting.
Therefore, the verbose option has been adjusted to allow only
printing the debug log on failure. The legacy behavior is still
available by specifying the verbose option twice. For example:
$ splat -t all:all # Never print the debug log
$ splat -v -t all:all # Only print debug log on failure
$ splat -vv -t all:all # Always print the debug log
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning. See commit
b8d06fca08 for additional details.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1138
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.
This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently. This effectively prevents
the deadlock condition from ever occuring.
Therefore we can safely revert back to the upstream synchronous
cleanup code. This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#457
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().
But more importantly than preserving the exact time add the missing
call to sa_bulk_update(). This ensures that the znode modifications
are written to disk as part of the transaction. Without this the
inode may mistaken rollback to the previous on-disk znode state.
Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode. This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.
Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764
The taskq:front test has a race condition where task 4 and 8
race to complete, due to an incorrectly calculated set of delay
"factors" (T). If task 4 wins and actually finishes first, the
verification of the order of completion will fail.
The delays calculated to order task completion do not take into
account the terminal line in the table, and so are all off by
a factor of 1. This causes all the tasks in all queues to finish
sooner than expected and the accumulated error is the root cause
of tasks 4 and 8 racing to complete first. Before the change the
"actual" table looks like I commented in #130.
I changed:
* the table in the comment to correctly reflect the test and the
factor timings needed.
* the individual task delay factors of T so that ONLY 1 task will
every 2T. (on average)
* 1T was reduced from 100ms to 50ms. This halves the duration of
the test and makes any remaining raciness more likely to cause
failures, but it did not cause the test to fail.
* simplified the delay factor logic by using a table look-up
instead of a switch.
* Added a "task started" message so that with -v it is possible
to see the order tasks are started.
* Moved the "task completed" message inside the spinlock so that
with -v the message truly reflects the absolute order of
completion as guaranteed by the spinlock.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#130