This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.
* The errata warning has been made less scary for users without
any encrypted datasets.
* The errata warning now clears itself without a pool reimport if
the bookmark_v2 feature is enabled and no encrypted datasets
exist.
* It is no longer possible to create new encrypted datasets without
enabling the bookmark_v2 feature, thus helping to ensure that the
errata is resolved.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue ##8308
Closes#8504
Update zed.rc values reflect their default value. This helps
avoid confusion if a user expects functionality to be enabled.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes#8498
The ksh 'echo -n' behavior on Illumos and Linux differs. For
compatibility with others platforms switch to "printf '%s' ".
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8501
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, supposedly to reduce number of
head seeks for spinning disks. But for SSDs it makes little to no
sense, especially on FreeBSD, where due to MAXPHYS limitation device
will likely still see bunch of 128KB I/Os instead of one large.
Having more strict aggregation limit for SSDs allows to avoid
allocation of large memory buffer and copy to/from it, that is a
serious problem when throughput reaches gigabytes per second.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8494
Currently, the verbose output of zstreamdump includes new line
characters within some individual records. Presumably, this was
originally done to keep the output from getting too wide to fit
on a terminal. However, since new flags and struct members have
been added, these rules have not been maintained consistently. In
addition, these newlines can make it hard to grep the output in
some scenarios. This patch simply removes these newlines, making
the output easier to grep and removing the inconsistency.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8493
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.
This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.
This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8308
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix that for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.
For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8168Closes#8487
This is a simple fix for a typo ("perfetch" rather than "prefetch")
in arc_summary3.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Jason Cohen <jwittlincohen@gmail.com>
Closes#8499
By default, when multihost is enabled for a pool, the pool is
suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write. This is the recommended
configuration.
The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.
There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds. This
in turn caused MMP to suspend the pool.
In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O. The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.
Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.
Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor. This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7709Closes#8495
Most of the zfs_arc_* module parameters do not have their values used by
the ARC code directly. Instead, there is a function, arc_tuning_update,
which is called during module initialization and periodically
thereafter, whose job is to fetch the module parameter values, clamp/
limit them appropriately, and then assign those values to a separate set
of internal variables that are actually referenced by the ARC code.
Commit 3ec34e55 featured an overhaul of arc_reclaim_thread, which is the
former location where the post-init-time calls to arc_tuning_update
would occur. The rework split the work previously done by the
arc_reclaim_thread into a pair of replacement threads; and
unfortunately, the call to arc_tuning_update fell through the cracks and
was lost in the reorganization.
This meant that changing almost any ARC-related zfs module parameter via
/sys/module/zfs/parameters/ would result in the module parameter value
itself appearing to change; however the modification would not actually
propagate to the ARC code and have any real effect.
This commit reinstates the post-init-time call to arc_tuning_update. It
is now called during arc_adjust_cb_check; this should be equivalent to
its former call location in arc_reclaim_thread.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#8405Closes#8463
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#8077
Resolve a vdev_initialize crash uncovered by ztest. Similar
to when starting a new initialization verify that a removal
is not in progress. Additionally, do not restart when the
thread already exists. This check is now congruent with the
POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl().
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8477
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes. This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.
The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.
For example, in a pool like this:
root-vdev
+------+---------------+
vdev1 vdev2
| |
| +------+-----+-----+----+
disk1 disk2 disk3 disk4 disk5 disk6
vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1
is chosen, disk1 will be chosen. However, every time vdev2 is chosen,
disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as
many MMP writes as disk2.
This may create wear issues in the case of SSDs. It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.
The new code maintains a list of leaf vdevs in the pool. MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached. It stops when a
suitable leaf is found or all leaves have been examined.
Added a test to verify MMP write distribution is even.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7953
The linux kernel's nfsd implementation use RWF_SYNC to determine if the
write is synchronous or not. This flag is used to set the kernel's I/O
control block flags. Unfortunately, ZFS was not updated to inspect these
flags so NFS sync writes were not being honored.
This change maps the IOCB_* flags to the ZFS equivalent.
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#8474Closes#8452Closes#8486
Commit torvalds/linux@736706bee has removed the get_fs() function
as a bit of cleanup. It has been defined as KERNEL_DS on all
architectures for all supported kernels. Replace get_fs() with
KERNEL_DS as was done in the kernel.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8479
This patch fixes a few issues when detecting which kernel_fpu functions
are available.
- Use kernel_fpu_begin() if it's exported on newer kernels.
- Use ZFS_LINUX_TRY_COMPILE_SYMBOL() to choose the right kernel_fpu
function when using --enable-linux-builtin.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#8259Closes#8363
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#7674Closes#7675Closes#7908
Before allowing new allocations to the metaslab we need to ensure
that any issued initializing writes have been synced. Otherwise,
it's possible for metaslab_block_alloc() to allocate a range which
is about to be overwritten by an initializing IO.
Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8461
Improve the autoconf code for finding libtirpc and do not assume the
headers are in /usr/include/tirpc.
Also remove this assumption from the `rpc/xdr.h` header in libspl and
use the same `#include_next` mechanism that is used for other libspl
headers.
Include pkg.m4 from pkg-config in config/ for PKG_CHECK_MODULES(), the
file license allows this.
Include ax_save_flags.m4 and ax_restore_flags.m4 from autoconf-archive,
the file licenses are compatible. Use the 2012 versions so as not rely
on a more recent autoconf feature AS_VAR_COPY(), which breaks some build
slaves.
Add new macro library `config/find_system_library.m4` which defines the
FIND_SYSTEM_LIBRARY() macro which is a convenience wrapper over using
PKG_CHECK_MODULES() with a fallback to standard library locations and
some sanity checks.
The parameters are:
```
FIND_SYSTEM_LIBRARY(VARIABLE-PREFIX, MODULE, HEADER, HEADER-PREFIXES,
LIBRARY, FUNCTIONS, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
```
`HEADER-PREFIXES` and `FUNCTIONS` are comma-separated m4 lists.
For libtirpc we are using:
```
FIND_SYSTEM_LIBRARY(LIBTIRPC, [libtirpc], [rpc/xdr.h], [tirpc], [tirpc],
[xdrmem_create], [], [...])
```
The headers are first checked for without the prefixes and then with.
This system works with pkg-config and falls back on checking standard
header/library locations, it can be easily overridden by the user by
setting the `PREFIX_CFLAGS` and `PREFIX_LIBS` variables which are
automatically added to the `./configure --help` output.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes#7422Closes#8313
Fix indentation of code in ifdef's.
Remove obsolete comment.
Make if/else statements more readable by adding braces.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8459
When multihost is enabled, and a pool is suspended, return
EINVAL in response to "zpool clear <pool>". The pool
may have been imported on another host while I/O was suspended.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6933Closes#8460
Improve the man page text to warn the user about the risk of adding
the same device to multiple pools via simultaneous "zpool create",
"zpool add", "zpool replace", etc.
State that MMP/multihost does not protect against these scenarios.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6473Closes#8457
abd_alloc() normally does scatter allocations, thus solving the problem
that ABD originally set out to: the bulk of ZFS's allocations are single
pages, which are faster to allocate and free, and don't suffer from
internal fragmentation (and the inability to reclaim memory because some
buffers in the slab are still allocated).
However, the current code does linear allocations for 4KB and smaller
allocations, defeating the purpose of ABD.
Scatter ABD's use at least one page each, so sub-page allocations waste
some space when allocated as scatter (e.g. 2KB scatter allocation wastes
half of each page). Using linear ABD's for small allocations means that
they will be put on slabs which contain many allocations. This can
improve memory efficiency, but it also makes it much harder for ARC
evictions to actually free pages, because all the buffers on one slab
need to be freed in order for the slab (and underlying pages) to be
freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so
it's possible for them to actually waste more memory than scatter (one
page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting
15/16th).
Spill blocks are typically 512B and are heavily used on systems running
selinux with the default dnode size and the `xattr=sa` property set.
By default we will use linear allocations for 512B and 1KB, and scatter
allocations for larger (1.5KB and up).
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8455
Debian has a panic() function which makes it possible to disable shell
access in initramfs by setting the panic kernel parameter. Use it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#8448
When feeding a replication stream to `zstreamdump -d` (raw dump mode),
it does not print the raw data for DRR_WRITE_EMBEDDED records.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes#8430
The spa_txg_history_init_io() and spa_txg_history_fini_io() were
mistakenly taking SCL_ALL when only SCL_CONFIG is required to
access the vdev stats. This could result in a deadlock which
was observed when running ztest.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8445
The "-t" argument to "zfs program" specifies a limit on the number of
LUA instructions that can be executed. The zfs.8 manpage has the wrong
description. It should be updated to match what's in zfs-program.8
Also fix the formatting of the zfs help message.
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8410
Preferentially sort by the full path name instead of GUID when determining
which device links to use. This helps ensure that the pool vdevs are named
consistently when multiple links for a device appear in the same directory.
For example, the /dev/disk/by-id/scsi* and /dev/disk/by-id/wwn* links.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes#8108Closes#8440
Reorder the `zfs create` error messages in order to return the most
specific one first. If none of them apply then an expanded version of
the invalid name message is used.
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Damian Wojsław <damian@wojslaw.pl>
Closes#8155Closes#8352
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Closes#4660Closes#8423
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8444
The issue is caused by a small discrepancy in how userland creates the
partition layout and the kernel estimates available space:
* zpool command: subtract 9M from the usable device size, then align
to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
EFI "reserved" partition.
* kernel module: subtract 10M from the device size. 10M is the sum of
1M "start" partition alignment + 1m "end" partition alignment + 8M
EFI "reserved" partition.
For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:
Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F
Device Start End Sectors Size Type
/dev/sda1 2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
/dev/sda9 2078720 2095103 16384 8M Solaris reserved 1
When the kernel module vdev_open() the device its max_asize ends up
being slightly smaller than asize: this results in a huge number (16E)
reported by metaslab_class_expandable_space().
This change prevents bdev_max_capacity() from returing a size smaller
than bdev_capacity().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#1468Closes#8391
Soft lockups could happen when multiple threads trying
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.
Don't loop from beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes#8433
In
-T u|d Display a time stamp. Specify -u for a printed
representation of the internal representation of time.
See time(2). Specify -d for standard date format.
See date(1).
'Specify u' and 'Specify d' should be used instead. `zpool list -T -u`
does not work.
Bring the descriptions in `zpool list` and `zpool status` in sync with
`zpool iostat`.
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Anatoly Borodin <anatoly.borodin@gmail.com>
Closes#8438
The SCST driver (SCSI target driver implementation) and possibly
others may issue read bio's with a length of zero bytes. Although
this is unusual, such bio's issued under certain condition can cause
kernel oops, due to how rangelock is implemented.
rangelock_add_reader() is not made to handle overlap of two (or more)
ranges from read bio's with the same offset when one of them has size
of 0, even though they conceptually overlap. Allowing them to enter
rangelock results in kernel oops by dereferencing invalid pointer,
or assertion failure on AVL tree manipulation with debug enabled
kernel module.
For example, this happens when read bio whose (offset, size) is
(0, 0) enters rangelock followed by another read bio with (0, 4096)
when (0, 0) rangelock is still locked, when there are no pending
write bio's. It can also happen with reverse order, which is (0, N)
followed by (0, 0) when (0, N) is still locked. More details
mentioned in #8379.
Kernel Oops on ->make_request_fn() of ZFS volume
https://github.com/zfsonlinux/zfs/issues/8379
Prevent this by returning bio with size 0 as success without entering
rangelock. This has been done for write bio after checking flusher
bio case (though not for the same reason), but not for read bio.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes#8379Closes#8401
The cmp and diff utilities are required at configure time. Add
a dependency on diffutils to ensure they are installed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes#5205Closes#8428
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslab's
because these metaslab's are currently syncing and haven't
gone through metaslab_sync_done() yet.
The histograms help when we decide whether to load an unloaded
metaslab in-order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram. The problem is
that we are not guaranteed to be able to allocated that
segment when we load the metaslab because it may still be at
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that have would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.
In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes#8358
Trying to mount a dataset from a readonly pool could inadvertently start
the user accounting upgrade task, leading to the following failure:
VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
[<0>] ? dump_stack+0x5d/0x78
[<0>] ? spl_panic+0xc9/0x110 [spl]
[<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
[<0>] ? dmu_object_next+0x77/0x130 [zfs]
[<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
[<0>] ? txg_wait_synced+0x91/0x220 [zfs]
[<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
[<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
[<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
[<0>] ? wake_up_state+0x10/0x10
[<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
[<0>] ? kthread+0xbd/0xe0
[<0>] ? kthread_create_on_node+0x180/0x180
[<0>] ? ret_from_fork+0x58/0x90
[<0>] ? kthread_create_on_node+0x180/0x180
This patch updates both functions responsible for checking if we can
perform user accounting to verify the pool is not readonly.
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#8424
Missing copyright notices were noticed during the Illumos
RTI process. Add LLNS 2016 copyright based on original merge
date.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#8435
We have to use umem_free() instead of free() if we are using
umem_zalloc()
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8402
During the cleanup function of this test, an attempt to destroy a volume
can fail because the volume is busy. This leaves the system with
unexpected datasets which in turn causes subsequent failures.
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#8422