We recently had a case where our operators replaced a bad
multipathed disk, only to see it fail to autoreplace. The
zed logs showed that the multipath replacement disk did not pass
the 'is_dm' test in zfs_process_add() even though it should have.
is_dm is set if there exists a sysfs entry for the
underlying /dev/sd* paths for the multipath disk. It's
possible this path didn't exist due to a race condition where
the sysfs paths weren't created at the time the udev event came
in to zed, but this was never verified.
This patch updates the check to look for udev properties that
indicate if the new autoreplace disk is an empty multipath disk,
rather than looking for the underlying sysfs entries. It also
adds in additional logging, and fixes a bug where zed allowed
you to use an already zfs-formatted disk from another pool
as a multipath auto-replacement disk.
Furthermore, while testing this patch, I also ran across a case
where a force-faulted disk did not have a ZPOOL_CONFIG_PHYS_PATH
entry in its config. This prevented it from being autoreplaced.
I added additional logic to derive the PHYS_PATH from the PATH if
the PATH was a /dev/disk/by-vdev/ path. For example, if PATH
was /dev/disk/by-vdev/L28, then PHYS_PATH would be L28. This is
safe since by-vdev paths represent physical locations and do not
change between boots.
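As a rough illustration of the PATH to PHYS_PATH derivation described above (the real logic lives in zed's C code), the rule might be sketched as follows. The function name `derive_phys_path` is hypothetical and is not taken from the patch.

```python
from typing import Optional

BY_VDEV_PREFIX = "/dev/disk/by-vdev/"


def derive_phys_path(path: str) -> Optional[str]:
    """Return the by-vdev alias to use as PHYS_PATH.

    Returns None when PATH is not a /dev/disk/by-vdev/ path, in which
    case nothing can be derived.
    """
    if not path.startswith(BY_VDEV_PREFIX):
        return None
    alias = path[len(BY_VDEV_PREFIX):]
    # An empty alias (PATH is just the directory) gives nothing to derive.
    return alias or None


print(derive_phys_path("/dev/disk/by-vdev/L28"))
```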
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #13023
Fix `zfs-dkms` installation on Debian-derived distributions by
aligning the directory detection logic with #13096.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11449
Closes #13141
- Kmemleak `clear` is invoked right before every test case run.
- Kmemleak `scan` is requested right after each test case is finished.
- Kmemleak instrumentation is not used for
setup/cleanup/pretest/posttest/failsafe stages to shorten the test
case execution time.
- Kmemleak periodic scan is disabled (`scan=0`) before the test suite
run to avoid interfering with the on-demand scan results.
- There are unavoidable potential false positives coming from kernel
areas other than the OpenZFS module.
- With kmemleak enabled, the ZTS run takes ~50% longer.
Example run
```
Running Time: 07:12:13
Percent passed: 98.3%
unreferenced object 0xffff9da82aea5410 (size 80):
comm "kworker/u32:10", pid 942206, jiffies 4296749716 (age 2615.516s)
hex dump (first 32 bytes):
00 30 30 00 00 00 00 00 ff 8f 30 00 00 00 00 00 .00.......0.....
51 e6 77 05 a8 9d ff ff 00 00 00 00 00 00 00 00 Q.w.............
backtrace:
[<000000005cf1fea2>] alloc_extent_state+0x1d/0xb0 [btrfs]
[<0000000083f78ae5>] set_extent_bit+0x2ff/0x670 [btrfs]
[<00000000de29249e>] lock_extent_bits+0x6b/0xa0 [btrfs]
[<00000000b241f424>] lock_and_cleanup_extent_if_need+0xaf/0x1c0 [btrfs]
[<0000000093ca72b5>] btrfs_buffered_write+0x297/0x7d0 [btrfs]
[<000000002c2938c8>] btrfs_file_write_iter+0x127/0x390 [btrfs]
[<00000000b888f720>] do_iter_readv_writev+0x152/0x1b0
[<00000000320f0bcc>] do_iter_write+0x7c/0x1c0
[<000000000b5a8fe0>] lo_write_bvec+0x62/0x150 [loop]
[<000000009aa03c73>] loop_process_work+0x250/0xbd0 [loop]
[<00000000c7487d8a>] process_one_work+0x1f1/0x390
[<000000000b236831>] worker_thread+0x53/0x3e0
[<0000000023cb3e57>] kthread+0x127/0x150
[<000000002d48676a>] ret_from_fork+0x22/0x30
```
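The per-test-case ordering from the list above (clear, run, scan) could be sketched like this. It is only an illustration, not the ZTS runner code: `run_with_kmemleak` and `kmemleak_send` are hypothetical names, and the control file is assumed to be at the usual debugfs location `/sys/kernel/debug/kmemleak`.

```python
from typing import Callable

KMEMLEAK_CTL = "/sys/kernel/debug/kmemleak"  # assumed debugfs mount point


def kmemleak_send(command: str, path: str = KMEMLEAK_CTL) -> None:
    """Write a control command such as 'clear', 'scan' or 'scan=0'."""
    with open(path, "w") as ctl:
        ctl.write(command)


def run_with_kmemleak(test: Callable[[], None],
                      send: Callable[[str], None] = kmemleak_send) -> None:
    """Bracket a single test case with kmemleak clear and scan."""
    send("clear")   # drop leaks recorded before this test case
    test()
    send("scan")    # request an on-demand scan once the test is done


# Dry run with a recording stand-in, so no root access or debugfs is needed.
events = []
run_with_kmemleak(lambda: events.append("test"), events.append)
print(events)
```

Passing the sender as a parameter keeps the sketch runnable on machines without kmemleak. Setup and cleanup stages would simply not be wrapped.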
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #13084
Linux 5.11 changed kernel_fpu_begin() to an inlined function and
moved the functionality to kernel_fpu_begin_mask(). This breaks the
existing detection mechanism since it checks if kernel_fpu_begin is
an exported kernel symbol, which isn't the case for an inlined
function.
To avoid assumptions about internal implementation, replace
ZFS_LINUX_TEST_RESULT_SYMBOL with ZFS_LINUX_TEST_RESULT,
which already makes sure kernel_fpu_{begin,end}() is usable by us.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #13147
There are no specific synchronous semantics defined for xattrs.
With xattr=on, however, xattr changes are logged to the ZIL and
zil_commit() is called if sync=always is set on the dataset. This
provides sync semantics for xattr=on with sync=always.
The xattr=sa implementation does not log to the ZIL, so even with
sync=always, xattrs are not guaranteed to be synced before the xattr
call returns to the caller. An xattr can therefore be lost if the
system crashes before the txg carrying the xattr transaction is synced.
This change adds ZIL logging for xattr=sa on xattr create/remove/update,
and xattrs are synced to the ZIL (zil_commit() is called) for sync=always.
This makes xattr=sa behavior similar to xattr=on.
Implementation notes:
The actual logging is fairly straightforward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:
1. The feature must be enabled. Otherwise, we don't log the new
record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and the dataset is not a zvol, any append to
the ZIL chain will activate the feature for the dataset. Likewise
for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.
We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the feature requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.
We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.
One more piece on feature activation, since it's spread across
multiple functions:
zil_commit()
  zil_process_commit_list()
    if lwb == NULL // first zil_commit since zil_open
      zil_create()
        if no log block pointer in ZIL header:
          if feature enabled and not active:
            // CASE 1
            enable, COALESCE txg wait with dmu_tx that allocated the
            log block
        else // log block was allocated earlier than this zil_open
          if feature enabled and not active:
            // CASE 2
            enable, EXPLICIT txg wait
    else // already have an in-DRAM LWB
      if feature enabled and not active:
        // this happens when we enable the feature after zil_create
        // CASE 3
        enable, EXPLICIT txg wait
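The three cases in the call tree above can be condensed into a small decision function. This is only a reading aid: `activation_case` and its boolean parameters are hypothetical stand-ins for the state the C code inspects, not names from the patch.

```python
from typing import Optional


def activation_case(have_lwb: bool, header_has_log_block: bool,
                    enabled: bool, active: bool) -> Optional[str]:
    """Return which activation case applies, or None if nothing is done."""
    if not enabled or active:
        return None  # feature disabled, or already active for this dataset
    if not have_lwb:
        # First zil_commit since zil_open, so zil_create() runs.
        if not header_has_log_block:
            return "CASE 1: txg wait coalesced with the log block allocation"
        return "CASE 2: explicit txg wait"
    # An in-DRAM LWB already exists: the feature was enabled after zil_create.
    return "CASE 3: explicit txg wait"


print(activation_case(have_lwb=False, header_has_log_block=False,
                      enabled=True, active=False))
```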
[1] da6c28aaf6
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #8768
Closes #9078
Systemd units do not read @initconfdir@ but refer to variables defined
there. Also includes a minor fixup in the zfs-scrub service file.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Krzysztof Piecuch <piecuch@kpiecuch.pl>
Closes #12946
As explained by the disclaimer in the test case,
"This test can fail since nothing guarantees that old
MOS blocks aren't overwritten."
This behavior is expected and correct, but results in a
flaky test case which is problematic for the CI. The best
we can do to resolve this is to retry the sub-test which
failed when the MOS blocks have clearly been overwritten.
When testing, failures were rare enough that a single retry
should normally be sufficient. However, we allow up to
five for good measure.
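A minimal sketch of the retry policy described above, assuming the sub-test and the "MOS blocks were overwritten" check are available as callables. The actual test is a ZTS shell script; `run_with_retries`, `subtest` and `overwritten` are hypothetical names.

```python
from typing import Callable


def run_with_retries(subtest: Callable[[], bool],
                     overwritten: Callable[[], bool],
                     max_retries: int = 5) -> bool:
    """Run subtest(), retrying only when the failure is the expected kind.

    A failure is retried only if overwritten() reports that the MOS
    blocks were clearly overwritten, and at most max_retries times.
    """
    retries = 0
    while True:
        if subtest():
            return True
        if retries >= max_retries or not overwritten():
            return False  # a genuine failure, or out of retries
        retries += 1


# Example: a sub-test that fails twice before passing.
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), lambda: True))
```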
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13119
A new `zfs_type_t` value, `ZFS_TYPE_INVALID`, is introduced.
This makes it possible to initialize variables, which keeps GCC happy.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12167
Closes #13103
When a dataset is in the process of being received it gets marked as
inconsistent and should not be used. We should check for this when
opening a dataset handle in libzfs and return with an appropriate error
set, rather than hitting an abort because of the incomplete data.
zfs_open() passes errno to zfs_standard_error() after observing
make_dataset_handle() fail, which ends up aborting if errno is 0.
Set errno before returning where we know it has not been set already.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org>
Closes #13077
The default behavior, where serious ZFS errors cause the FS thread to
get stuck, is very bad for some production scenarios.
In some production scenarios (Linux), it is recommended to trigger a real
kernel PANIC, so the system can be rebooted by a watchdog or the kernel itself.
This patch enables coherent handling of the spl_panic_halt parameter.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Wojciech Nizinski <w.nizinski@grinn-global.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #12120
Closes #13109
Which produces a warning since uints are, by definition, >=0
Reviewed-by: Alejandro Colomar <alx.manpages@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13110
When attaching a vdev to a mirror, wait for the resilver to complete
before invoking `zdb` to inspect the pool. This ensures the pool is
essentially idle which allows `zdb` to open the imported pool reliably.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13112
Closes #6935
- Replaced instances of `dracut_install` with `inst_simple`
- Removed calls to `test -x mark_hostonly` because the function is a
built-in dracut function
- Removed redundant installation of `systemd-ask-password` and
`systemd-tty-ask-password-agent` because they are already installed by
the systemd module. There is no need to install them again
- Removed multiple calls to the `mark_hostonly` function because
`inst_simple` has a command-line switch for it
- Cleaned up the installation of the `zpool.cache`, `vdev_id.conf` and
`hostid` files to make the logic easier to follow
- Cleaned up and simplified the systemd service installation logic by
invoking systemctl instead of creating symlinks manually
- Replaced various hard-coded paths with dracut equivalents to better
conform with expected dracut behaviour
- Removed redundant call to `mkdir` (`inst_simple` creates the parent
directory if it does not exist on the destination initrd)
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Andrew J. Hesford <ajh@sideband.org>
Signed-off-by: Savyasachee Jha <hi@savyasacheejha.com>
Closes #13010
Since dracut functions can locate both udev rules and binaries, there is
no point in keeping absolute paths in the module setup script. It also
breaks the --sysroot option in dracut. This commit removes references to
absolute paths for binaries and udev rules.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Andrew J. Hesford <ajh@sideband.org>
Signed-off-by: Savyasachee Jha <hi@savyasacheejha.com>
Closes #13010
Dracut will now fail in initramfs generation if essential files cannot
be installed.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Andrew J. Hesford <ajh@sideband.org>
Signed-off-by: Savyasachee Jha <hi@savyasacheejha.com>
Closes #13010
Setting up the module involves multiple redundant calls to a bunch of
dracut functions which can be combined into one. Additionally, the mass
of code required to load libgcc_s.so* can be replaced with one dracut
function. This has the additional effect of removing errors involving
the non-installation of libgcc_s.so* which are seen on Debian bullseye
when using version 2.1.2-1~bpo11+1 from the backports repository.
The systemd binaries are separated out into their own `dracut_install`
function call so they do not get pulled in when dracut does not load the
systemd module.
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Andrew J. Hesford <ajh@sideband.org>
Signed-off-by: Savyasachee Jha <hi@savyasacheejha.com>
Closes #13010
Most modern Linux distributions have separate locations for bare
source and prebuilt ("build") files. Additionally, there are `source`
and `build` symlinks in `/lib/modules/$(KERNEL_VERSION)` pointing to
them. The order of directory search is now:
- `configure` command line values if both `--with-linux` and
  `--with-linux-obj` were defined
- If only `--with-linux` was defined, `--with-linux-obj` is assumed
  to have the same value as `--with-linux`
- If neither `--with-linux` nor `--with-linux-obj` was defined,
  autodetection is used:
  - `/lib/modules/$(uname -r)/{source,build}` respectively, if they exist
  - The first directory in `/lib/modules` with the highest version
    number according to `sort -V` which contains `source` and `build`
    symlinks/directories
  - The first directory matching `/usr/src/kernels/*` and
    `/usr/src/linux-*` with the highest version number according to
    `sort -V`. Here the source and prebuilt directories are assumed
    to be the same.
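The precedence described in the list above could be sketched as follows. The real logic is in the autoconf macros; every function name here is hypothetical, and only the first autodetection step is shown (the `sort -V` based fallbacks are left out).

```python
import os
from typing import Callable, Optional, Tuple

Dirs = Tuple[str, str]  # (source directory, build/object directory)


def autodetect_running_kernel(
        release: str,
        exists: Callable[[str], bool] = os.path.exists) -> Optional[Dirs]:
    """First autodetection step: /lib/modules/<release>/{source,build}."""
    source = f"/lib/modules/{release}/source"
    build = f"/lib/modules/{release}/build"
    if exists(source) and exists(build):
        return source, build
    return None  # later fallbacks are not sketched here


def resolve_kernel_dirs(
        with_linux: Optional[str],
        with_linux_obj: Optional[str],
        autodetect: Callable[[], Optional[Dirs]]) -> Optional[Dirs]:
    """Apply the precedence: explicit pair, then source only, then autodetect."""
    if with_linux and with_linux_obj:
        return with_linux, with_linux_obj
    if with_linux:
        return with_linux, with_linux
    # Assumption: --with-linux-obj given on its own is not covered by the
    # description, so this sketch falls through to autodetection.
    return autodetect()


print(resolve_kernel_dirs("/usr/src/linux", None, lambda: None))
```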
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #9935
Closes #13096
Raw sending from pool1/encrypted with ashift=9 to pool2/encrypted with
ashift=12 results in a failure when mounting pool2/encrypted (Input/Output
error). Notably, the opposite, raw sending from a greater ashift to a
lower one, does not fail.
This happens because zio_compress_write() falsely checks only
ZIO_FLAG_RAW_COMPRESS and not ZIO_FLAG_RAW_ENCRYPT which is also set in
encrypted raw send streams. In this case it rounds up the psize and if
not equal to the zio->io_size it modifies the block by zeroing out
the extra bytes. Because this happens in an SA attribute registration object
(type=46), the decryption fails upon mounting the filesystem, and zpool
status falsely reports an error.
Fix this by checking both ZIO_FLAG_RAW_COMPRESS and ZIO_FLAG_RAW_ENCRYPT
before deciding whether to zero-pad a block.
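One way to read the fix, sketched with made-up flag values: the raw-compressed path pads a block up to the rounded size only when the block is not also raw-encrypted. The bit values and the names `may_zero_pad` and `pad_block` are hypothetical, and the real code operates on zio structures in C.

```python
RAW_COMPRESS = 1 << 0  # hypothetical bit, stands in for ZIO_FLAG_RAW_COMPRESS
RAW_ENCRYPT = 1 << 1   # hypothetical bit, stands in for ZIO_FLAG_RAW_ENCRYPT


def may_zero_pad(flags: int) -> bool:
    """Pad raw-compressed blocks only when they are not raw-encrypted."""
    return bool(flags & RAW_COMPRESS) and not (flags & RAW_ENCRYPT)


def pad_block(data: bytes, rounded_size: int, flags: int) -> bytes:
    """Zero-pad data to rounded_size when padding is allowed."""
    if may_zero_pad(flags) and len(data) < rounded_size:
        return data + bytes(rounded_size - len(data))
    return data  # encrypted payloads must stay byte-for-byte identical


block = b"\x01" * 512
print(len(pad_block(block, 4096, RAW_COMPRESS)))
print(len(pad_block(block, 4096, RAW_COMPRESS | RAW_ENCRYPT)))
```

Modifying an encrypted block changes what the receiver later tries to authenticate and decrypt, which is why the encrypted case is left untouched in this sketch.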
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #13067
Closes #13074
The stored ABI files are for the x86_64 architecture.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes #11345
Closes #13104
Provide two digits of precision when reporting send/receive
times. Tiny snapshots may take significantly less than a second
and rounding up to a full second can introduce a significant error.
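To illustrate why the extra precision matters, here is a small sketch comparing the two reporting styles. `format_elapsed` and `format_elapsed_whole` are hypothetical helpers and do not mirror the actual libzfs output code.

```python
import math


def format_elapsed_whole(seconds: float) -> str:
    """Old style: round up to a full second (at least one)."""
    return str(max(1, math.ceil(seconds)))


def format_elapsed(seconds: float) -> str:
    """New style: two digits after the decimal point."""
    return f"{seconds:.2f}"


elapsed = 0.037  # a tiny snapshot that finishes well under a second
print(format_elapsed_whole(elapsed), format_elapsed(elapsed))
```

For a tiny snapshot like this, reporting a full second overstates the elapsed time by more than an order of magnitude, which also skews any throughput derived from it.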
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #13047
ZFS on Linux originally implemented xattr namespaces in a way that is
incompatible with other operating systems. On illumos, xattrs do not
have namespaces. Every xattr name is visible. FreeBSD has two
universally defined namespaces: EXTATTR_NAMESPACE_USER and
EXTATTR_NAMESPACE_SYSTEM. The system namespace is used for protected
FreeBSD-specific attributes such as MAC labels and pnfs state. These
attributes have the namespace string "freebsd:system:" prefixed to the
name in the encoding scheme used by ZFS. The user namespace is used
for general purpose user attributes and obeys normal access control
mechanisms. These attributes have no namespace string prefixed, so
xattrs written on illumos are accessible in the user namespace on
FreeBSD, and xattrs written to the user namespace on FreeBSD are
accessible by the same name on illumos.
Linux has several xattr namespaces. On Linux, ZFS encodes the
namespace in the xattr name for every namespace, including the user
namespace. As a consequence, an xattr in the user namespace with the
name "foo" is stored by ZFS with the name "user.foo" and therefore
appears on FreeBSD and illumos to have the name "user.foo" rather than
"foo". Conversely, none of the xattrs written on FreeBSD or illumos
are accessible on Linux unless the name happens to be prefixed with one
of the Linux xattr namespaces, in which case the namespace is stripped
from the name. This makes xattrs entirely incompatible between Linux
and other platforms.
We want to make the encoding of user namespace xattrs compatible across
platforms. A critical requirement of this compatibility is for xattrs
from existing pools from FreeBSD and illumos to be accessible by the
same names in the user namespace on Linux. It is also necessary that
existing pools with xattrs written by Linux retain access to those
xattrs by the same names on Linux. Making user namespace xattrs from
Linux accessible by the correct names on other platforms is important.
The handling of other namespaces is not required to be consistent.
Add a fallback mechanism for listing and getting xattrs to treat xattrs
as being in the user namespace if they do not match a known prefix.
Do not allow setting or getting xattrs with a name that is prefixed
with one of the namespace names used by ZFS on supported platforms.
Allow choosing between legacy illumos and FreeBSD compatibility and
legacy Linux compatibility with a new tunable. This facilitates
replication and migration of pools between hosts with different
compatibility needs.
The tunable controls whether or not to prefix the namespace to the
name. If the xattr is already present with the alternate prefix,
remove it so only the new version persists. By default the platform's
existing convention is used.
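The fallback for listing and getting xattrs can be pictured with a small classifier over stored names. This is a sketch under assumptions: `classify_xattr` is a hypothetical name, the Linux prefixes are the four standard namespaces, and the treatment of "freebsd:system:" names as their own namespace is inferred from the description rather than taken from the code.

```python
from typing import Tuple

LINUX_NAMESPACES = ("user", "trusted", "system", "security")
FREEBSD_SYSTEM_PREFIX = "freebsd:system:"


def classify_xattr(stored_name: str) -> Tuple[str, str]:
    """Split a stored xattr name into (namespace, name).

    Names without a known prefix fall back to the user namespace, so
    xattrs written on illumos or FreeBSD keep their names.
    """
    for namespace in LINUX_NAMESPACES:
        prefix = namespace + "."
        if stored_name.startswith(prefix):
            return namespace, stored_name[len(prefix):]
    if stored_name.startswith(FREEBSD_SYSTEM_PREFIX):
        return "freebsd:system", stored_name[len(FREEBSD_SYSTEM_PREFIX):]
    return "user", stored_name  # fallback: no known prefix


for stored in ("user.foo", "foo", "freebsd:system:mac"):
    print(stored, "->", classify_xattr(stored))
```

With this reading, "foo" written on illumos and "user.foo" written by legacy Linux both surface as "foo" in the user namespace.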
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #11919
These were all folded into a single kstat at
/proc/spl/kstat/kcf/NONAME_provider_stats
with no way to know which one it actually was,
and only the AES and SHA (so not Skein) ones were ever updated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901
They don't do anything except clog up the AVL tree.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #12901