3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@25345e4666https://www.illumos.org/issues/3349
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#939
This reverts commit d135245791.
Since feature flags have now been merged we can apply the real
upstream fix from Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #997
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
mountall in Debian depends on being able to pass the -f parameter to
mount, which specifies a fake mount and just updates the mtab. Currently
mount.zfs will fail such a request if it is not passed with -o zfsutil.
This patch allows a fake mount on a non-legacy filesystem to succeed in
the same manner as a -o remount does, thus enabling mountall to work
correctly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1167
In a debug build, certain GCC versions flag an array bounds warning in
the below code from dnode_sync.c
} else {
int i;
ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
/* the blkptrs we are losing better be unallocated */
for (i = dn->dn_next_nblkptr[txgoff];
i < dnp->dn_nblkptr; i++)
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
This usage is in fact safe, since the ASSERT ensures the index does
not exceed to maximum possible number of block pointers. However gcc
can't determine that the assignment 'i = dn->dn_next_nblkptr[txgoff];'
falls within the array bounds so it issues a warning. To avoid this,
initialize i to zero to make gcc happy but skip the elements before
dn->dn_next_nblkptr[txgoff] in the loop body. Since a dnode contains
at most 3 block pointers this overhead should be negligible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#950
Currently ZFS doesn't show any I/O time in eg "top" wait% or in
/proc/$pid/stat's blkio_ticks. Using io_schedule() instead of
schedule() in zio_wait()'s cv_wait() is the correct way to fix
this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1158Closes#1175
This reverts commit 9dcb971983
which was originally introduced to debug occasional slow I/Os.
These I/Os would complete eventually but were observed to take
several 100 seconds.
The root cause of this issue was the CFQ scheduler which can,
under certain conditions, excessively delay an I/O from being
issued to the device. This issue was mitigated somewhat by
commit 84daaddedb which ensures
the I/O elevator gets changed even for DM style devices.
This change isn't in any way harmful but it does conflict with
a required change to properly account from I/O wait time.
Because Linux does not export the io_schedule_timeout() function
we must instead rely on io_schedule() via cv_wait_io().
The additional debugging information which was added to the
delay event has been intentionally left in place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In all but one case the spa_namespace_lock is taken before the
bdev->bd_mutex lock. But Linux __blkdev_get() function calls
fops->open() with the bdev->bd_mutex lock held and we must
somehow still safely acquire the spa_namespace_lock.
To avoid a potential lock inversion deadlock we preemptively
try to take the spa_namespace_lock(). Normally it will not
be contended and this is safe because spa_open_common() handles
the case where the caller already holds the spa_namespace_lock.
When it is contended we risk a lock inversion if we were to
block waiting for the lock. Luckily, the __blkdev_get()
function allows us to return -ERESTARTSYS which will result in
bdev->bd_mutex being dropped, reacquired, and fops->open() being
called again. This process can be repeated safely until both
locks are acquired.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#612
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The AUTHORS file was getting stale. Refresh its contents
using the authors listed in the git commit logs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The ChangeLog was retired long ago, the git commit logs are
authoritative. To avoid any confusion remove the ChangeLog.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit(). This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.
However, under Linux I'm not seeing the same contention that
was originally described. Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD. If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.
This just leaves one small ZFS TSD consumer. If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#174
The current state of udev and devicer-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1149
During the original ZoL port the vdev_uses_zvols() function was
disabled until it could be properly implemented. This prevented
a zpool from use a zvol for its slog device.
This patch implements that missing functionality by adding a
zvol_is_zvol() function to zvol.c. Given the full path to a
device it will lookup the device and verify its major number
against the registered zvol major number for the system. If
they match we know the device is a zvol.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1131
Fix setting/getting users/groups in quota properties through
numeric identifier. This support was accidentally disabled
in the original port by applying the HAVE_IDMAP wrapper macro
too broadly.
Fix obtained by moving #ifdef HAVE_IDMAP to exclude only
the part of code that really needs IDMAP. Now zfs (get|set)
(user|group)quota@1000 works as expected.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1147
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.
https://bugs.gentoo.org/show_bug.cgi?id=433946
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed. That change
accidentally resulted in unexpected ctime updates which upset tools
like git. That was always a horrible hack and I'm happy it will
never make it in to a tagged release.
The right fix is something I initially resisted doing because I
was worried about the additional overhead. However, in hindsight
the overhead isn't as bad as I feared.
This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied. We leverage this
callback to keep the znode SAs strictly in sync with the inode.
However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime. This will cover the callpath of most concern to us.
->filemap_page_mkwrite->file_update_time->update_time->
mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764Closes#1140
Commit 2957f38 renamed 60-vdev.rules to 69-vdev.rules but failed
to update the .gitignore file to reflect this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#816
Linux kernel commit d8e794d accidentally broke the delayed work
APIs for non-GPL callers. While the APIs to schedule a delayed
work item are still available to all callers, it is no longer
possible to initialize the delayed work item.
I'm cautiously optimistic we could get the delayed_work_timer_fn
exported for all callers in the upstream kernel. But frankly
the compatibility code to use this kernel interface has always
been problematic.
Therefore, this patch abandons direct use the of the Linux
kernel interface in favor of the new delayed taskq interface.
It provides roughly the same functionality as delayed work queues
but it's a stable interface under our control.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1053
When writes to zvols invoke ZIL, zfs_range_new_proxy() is called,
which allocates memory using KM_SLEEP, triggering a warning.
Switch to KM_PUSHPAGE to silence that warning. See commit
b8d06fca08 for additional details.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1138
This reverts commit b00131d43c which
is no longer needed due to e89260a1c8.
This change forces all xattr znodes to hold a reference on their
parent which ensures prune_icache() will never attempt to evict
both the parent and child concurrently. This effectively prevents
the deadlock condition from ever occuring.
Therefore we can safely revert back to the upstream synchronous
cleanup code. This is nice because it keeps our code base closer
to upstream and resolves the performance issues introduced by the
original deadlock fix.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#457
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().
But more importantly than preserving the exact time add the missing
call to sa_bulk_update(). This ensures that the znode modifications
are written to disk as part of the transaction. Without this the
inode may mistaken rollback to the previous on-disk znode state.
Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode. This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.
Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#764
Incorrect syntax should never cause a segfault. In this case
listing multiple comma delimited options after '-o' triggered
the problem. For example:
zpool create -o ashift=12,listsnaps=on
This patch resolves the issue by wrapping the calls which use
hdr with a NULL test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1118
Add a vdev_id feature to map device names based on already defined
udev device links. To increase the odds that vdev_id will run after
the rules it depends on, increase the vdev.rules rule number from 60
to 69. With this change, vdev_id now provides functionality analogous
to zpool_id and zpool_layout, paving the way to retire those tools.
A defined alias takes precedence over a topology-derived name, but the
two naming methods can otherwise coexist. For example, one might name
drives in a JBOD with the sas_direct topology while naming an internal
L2ARC device with an alias.
For example, the following lines in vdev_id.conf will result in the
creation of links /dev/disk/by-vdev/{d1,d2}, each pointing to the same
target as the device link specified in the third field.
# by-vdev
# name fully qualified or base name of device link
alias d1 /dev/disk/by-id/wwn-0x5000c5002de3b9ca
alias d2 wwn-0x5000c5002def789e
Also perform some minor vdev_id cleanup, such as removal of the unused
-s command line option.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#981
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent. Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it. Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.
This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.
The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX. Otherwise we
run the risk of deadlocking on the open create TX.
Ideally the security xattr should be fully constructed
before the new inode is unlocked. However, doing so would
require far more extensive changes to ZFS.
This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#671
Add the initial support for the 'smbshare' option using the
existing libshare infrastructure. Because this implementation
relies on usershares samba version 3.0.23 is required.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#493
Commit df83110856 missed update to
getopt() call, while delivering all the rest. This commit adds
"o" to getopt().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #566
Add the missing error handling to load_nvlist(). There's no good
reason this needs to be fatal. All callers of load_nvlist() do
correctly handle an error condition and it is preferable that an
error be returned. This will allow 'zpool import -FX' to safely
attempt to rollback through previous txgs looking for a good one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1120
Commit 57a4edd allows the bootfs property to be set on any pool.
However, many of the zpool commands still prevent you from using
EFI labeled devices for the root pool. For example:
# zpool attach rpool /dev/sda /dev/sdb
cannot label 'sdb': EFI labeled devices are not supported on
root pools. on root devices.
For non-Solaris builds such as Linux disable this error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1077
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings. The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101
Previously this check was only performed when ./configure was
attempting to autodetect your kernel source directory. But we
should also handle the case where --with-linux was provided
and is obviously wrong. This way we catch the error before
invoking make and compiling the source with an incorrect
autoconf results.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/spl#162
While expanding positional parameters shell requires non-single
digits to be enclosed in braces. When the SAS topology is
non-trivial the number of positional parameters generated internally
by vdev_id script (using set -- ...) easily crosses single digit limit
and vdev_id fails to generate links.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1119
Full bash may not be available in all environments where udev helpers
run, such as in an initial ramdisk. To avoid breakage in this case,
remove use of bash-specific features such as variable arrays and the
`declare' keyword from the vdev_id script.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#870
If zvol_alloc() fails zv will be set to NULL and dereferenced
in out_dmu_objset_disown. To avoid this entirely the zv->objset
line is moved up in to the success block.
Original-patch-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1109
Increasing this limit costs us 6144 bytes of memory per mounted
filesystem, but this is small price to pay for accomplishing
the following:
* Allows for up to 256-way concurreny when performing lookups
which helps performance when there are a large number of
processes.
* Minimizes the likelyhood of encountering the deadlock
described in issue #1101. Because vmalloc() won't strictly
honor __GFP_FS there is still a very remote chance of a
deadlock. See the zfsonlinux/spl@043f9b57 commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1101
When a zvol with snapshots is renamed the device files under
/dev/zvol/ are not renamed. This patch resolves the problem
by destroying and recreating the minors with the new name so
the links can be recreated bu udev.
Original-patch-by: Suman Chakravartula <schakrava@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#408
Canonicalize the mount point passed to the mount.zfs helper.
This way a clean path is always added to mtab which ensures
the umount can properly locate and remove the entry.
Test case:
$ mkdir /mnt/foo
$ mount -t zfs zpool/foo /mnt/../mnt/foo////
$ umount /mnt/foo
$ cat /etc/mtab | grep zpool/foo
zpool/foo /mnt/../mnt/foo//// zfs rw 0 0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#573
This branch adds some overdue ashift improvements.
* Add '-o ashift' to 'zpool add' and 'zpool attach'
* Improve AF hard disk detection
* Allow 'zpool import' to handle increases in ashift
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When adding devices to an existing pool "ashift" property is
auto-detected. However, if this property was overridden at
the pool creation time (i.e. zpool create -o ashift=12 tank ...)
this may not be what the user wants. This commit lets the user
specify the value of "ashift" property to be used with newly
added drives. For example,
zpool add -o ashift=12 tank disk1
zpool attach -o ashift=12 tank disk1 disk2
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#566
Use the bdev_physical_block_size() interface to determine the
minimize write size which can be issued without incurring a
read-modify-write operation. This is used to set the ashift
correctly to prevent a performance penalty when using AF hard
disks.
Unfortunately, this interface isn't entirely reliable because
it's not uncommon for disks to misreport this value. For this
reason you may still need to manually set your ashift with:
zpool create -o ashift=12 ...
The solution to this in the upstream Illumos source was to add
a white list of known offending drives. Maintaining such a list
will be a burden, but it still may be worth doing if we can
detect a large number of these drives. This should be considered
as future work.
Reported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#916
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Refererces to Illumos issue:
https://www.illumos.org/issues/2671
This patch has been slightly modified from the upstream Illumos
version. In the upstream implementation a warning message is
logged to the console. To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.
The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool. Depending on your workload
this may impact pool performance. Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.
NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#955
The udev data directory was hard coded in 60-vdev.rules.in. That causes
a problem when a distribution changes the location of the directory.
This was not an issue in the past because virtually all distributions
used the same path, but that is beginning to change following a decision
by the systemd developers to change the directory location to reflect
their take-over of udev maintainership. The testing branch of Gentoo
Linux adopted this change, which enabled the hardcoded directory
location to trigger a regression.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1085